WorldTaxAI – Tax Chatbot (Spain) with RAG & Hybrid Search

Simon Rochwerg
October 19, 2025
4 min read
Client

WorldTaxAI

Duration

8 weeks

Budget

€6,600

“WorldTaxAI lets us query tens of thousands of Spanish tax documents with reliable, cited answers. Concept search works very well, and the team is responsive to adjustments.” — WorldTax Team

Project summary

WorldTaxAI is a multilingual (EN↔ES) tax chatbot dedicated to Spanish tax law and administration. It uses a robust RAG pipeline that combines semantic search (pgvector) with lexical search (tsvector) to deliver fast, reliable, cited answers.

  1. Corpus: ~60,000 documents (BOE, DGT, AEAT, PwC), unified from heterogeneous PDFs.
  2. Quality: OCR, cleaning, Unicode normalization, 400–600 token chunking.
  3. Indexing: PostgreSQL + pgvector (HNSW) + FTS tsvector (GIN).
  4. App: FastAPI backend, React/Next front-end (streaming), chat history & deletion.
  5. Multilingual: questions in English, cited results from an ES-heavy corpus.

Objectives

  1. Fast access to regulations, rulings, notices, commentary.
  2. Reliable, cited answers (passages + page/section deep-links).
  3. Multilingual EN↔ES (robust to terminology variation).
  4. Cost control (low-cost embeddings, unified Postgres infra).
  5. Chat features: history, resume, delete, export.

Why hybrid retrieval?

  1. Semantic (embeddings): captures conceptual proximity; robust to multilingual queries (EN↔ES) and paraphrases.
  2. Lexical (tsvector): anchors exact references (e.g., modelo 190, art. 10, acronyms).
  3. Score fusion (+ domain boosts, MMR): maximizes recall & precision without hard filters, aligned to legal intent (see the fusion sketch below).
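
As a rough illustration of the fusion step, the sketch below normalizes both score lists and blends them with optional per-chunk boosts. The weights (0.6/0.4) and the boost values are illustrative placeholders, not the production settings.

    # Hedged sketch: normalized fusion of semantic (pgvector) and lexical (tsvector) scores.
    # Weights and boosts are illustrative; boosts would be derived from topic/type metadata.

    def min_max(scores):
        """Normalize a dict of raw scores to [0, 1]."""
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    def fuse(vector_hits, lexical_hits, boosts=None, w_vec=0.6, w_lex=0.4):
        """Combine per-chunk vector and full-text scores, plus optional domain boosts."""
        vec, lex = min_max(vector_hits), min_max(lexical_hits)
        boosts = boosts or {}
        fused = {}
        for chunk_id in set(vec) | set(lex):
            fused[chunk_id] = (w_vec * vec.get(chunk_id, 0.0)
                               + w_lex * lex.get(chunk_id, 0.0)
                               + boosts.get(chunk_id, 0.0))
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)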

Architecture & pipeline

1) Ingestion & OCR

  1. Text extraction from born-digital PDFs & scans (OCR).
  2. Language detection, Unicode normalization, removal of headers/footers and page numbers.
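
A minimal sketch of this step, assuming pypdf for born-digital PDFs, pdf2image + pytesseract for scanned pages, and langdetect for language detection; the actual ingestion stack and cleanup rules may differ, and the header regex is only an example.

    # Hedged sketch of ingestion: extraction, OCR fallback, Unicode normalization,
    # and header/page-number stripping. Library choices are assumptions.
    import re
    import unicodedata

    from pypdf import PdfReader              # born-digital PDFs
    from pdf2image import convert_from_path
    import pytesseract                        # OCR for scanned pages
    from langdetect import detect

    def extract_text(path: str) -> str:
        reader = PdfReader(path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        if len(text.strip()) < 100:           # likely a scan: fall back to OCR
            images = convert_from_path(path)
            text = "\n".join(pytesseract.image_to_string(img, lang="spa+eng") for img in images)
        return text

    def clean(text: str) -> tuple[str, str]:
        text = unicodedata.normalize("NFKC", text)
        text = re.sub(r"(?m)^\s*\d+\s*$", "", text)                       # bare page numbers
        text = re.sub(r"(?m)^Boletín Oficial del Estado.*$", "", text)    # repeated header (example)
        return text, detect(text)                                         # (cleaned text, language code)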

2) Structuring & chunking

  1. 400–600 token segments with section path and page numbers.
  2. Quality checks, deduplication, timestamps.
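
The chunking step might look like the sketch below; tiktoken is an assumed tokenizer choice, and the Chunk fields mirror what the list above describes (section path, page range, token count).

    # Hedged sketch of chunking: ~400-600 token segments carrying section path and pages.
    from dataclasses import dataclass
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    @dataclass
    class Chunk:
        text: str
        section_path: str   # e.g. "Título II > Art. 10"
        page_start: int
        page_end: int
        n_tokens: int

    def chunk_section(paragraphs, section_path, target_tokens=500):
        """Greedily pack (paragraph, page) pairs into roughly 400-600 token chunks."""
        chunks, buf, pages = [], [], []
        for text, page in paragraphs:
            buf.append(text)
            pages.append(page)
            if len(enc.encode(" ".join(buf))) >= target_tokens:
                joined = " ".join(buf)
                chunks.append(Chunk(joined, section_path, pages[0], pages[-1], len(enc.encode(joined))))
                buf, pages = [], []
        if buf:  # flush the remainder as a final, shorter chunk
            joined = " ".join(buf)
            chunks.append(Chunk(joined, section_path, pages[0], pages[-1], len(enc.encode(joined))))
        return chunks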

3) Document enrichment

  1. Metadata: jurisdiction, document_type, doc_title, source_name, year.
  2. Topics: primary_topic (closed list), secondary_topics (3–5 controlled tags).
  3. Short abstract for previews + optional doc-level embedding.
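
A sketch of the enrichment record attached to each document, shown here as a Pydantic model; the field names follow the list above, while the topic values are purely illustrative.

    # Hedged sketch of per-document metadata. Field names follow the case study;
    # the closed topic list below is an illustrative placeholder.
    from pydantic import BaseModel, Field

    PRIMARY_TOPICS = ["IRPF", "IS", "IVA", "non-resident taxation", "tax procedure"]  # illustrative

    class DocMetadata(BaseModel):
        jurisdiction: str                        # e.g. "ES"
        document_type: str                       # law, ruling, notice, commentary, ...
        doc_title: str
        source_name: str                         # BOE, DGT, AEAT, PwC
        year: int
        primary_topic: str                       # one value from the closed list
        secondary_topics: list[str] = Field(default_factory=list)  # 3-5 controlled tags
        abstract: str = ""                       # short abstract used for previews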

4) Embeddings & index

  1. Model: text-embedding-3-small (OpenAI).
  2. Storage: PostgreSQL with pgvector vector(1536) and an HNSW index.
  3. Parallel lexical search via tsvector with a GIN index.
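
A sketch of the storage layer, with the DDL executed through SQLAlchemy; the table and column names (and the connection string) are illustrative, but the types (vector(1536), tsvector) and index kinds (HNSW, GIN) match the setup described above.

    # Hedged sketch: pgvector HNSW index for embeddings plus a generated tsvector
    # column with a GIN index. Table/column names and DSN are placeholders.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg://user:pass@localhost/worldtax")  # placeholder DSN

    DDL = """
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE IF NOT EXISTS chunks (
        id           bigserial PRIMARY KEY,
        doc_id       bigint NOT NULL,
        section_path text,
        page_start   int,
        content      text NOT NULL,
        embedding    vector(1536),
        content_tsv  tsvector GENERATED ALWAYS AS (to_tsvector('spanish', content)) STORED
    );

    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops);
    CREATE INDEX IF NOT EXISTS chunks_tsv_gin
        ON chunks USING gin (content_tsv);
    """

    with engine.begin() as conn:
        for stmt in DDL.split(";"):
            if stmt.strip():
                conn.execute(text(stmt))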

5) Hybrid retrieval & ranking

  1. Vector search (EN↔ES) combined with exact-match lexical search.
  2. Normalized score fusion, domain boosts by topic/type, MMR for diversity, light re-rank (see the MMR sketch below).
  3. Select 5–8 passages with citations (title, section, page).
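
The diversity step can be implemented as classic MMR over the fused candidates, as in the sketch below; the lambda weight and the candidate format are assumptions.

    # Hedged sketch of MMR selection: pick 5-8 passages that stay relevant
    # while avoiding near-duplicate chunks. Candidate tuples are illustrative.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def mmr_select(candidates, k=8, lambda_=0.7):
        """candidates: list of (chunk, embedding, fused_score), sorted by fused score."""
        selected, remaining = [], list(candidates)
        while remaining and len(selected) < k:
            def mmr_score(item):
                _, emb, score = item
                redundancy = max((cosine(emb, s[1]) for s in selected), default=0.0)
                return lambda_ * score - (1 - lambda_) * redundancy
            best = max(remaining, key=mmr_score)
            selected.append(best)
            remaining.remove(best)
        return [chunk for chunk, _, _ in selected]  # each chunk carries title/section/page for citation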

6) API & Front-end

  1. FastAPI: auth, sessions, pagination, purge.
  2. React/Next (streaming): i18n, copy cited snippets, export conversations.
  3. Speech-to-text endpoint (optional).
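
A minimal sketch of the streaming chat endpoint; the route, request model, and the placeholder answer_stream() generator are illustrative, not the production API.

    # Hedged sketch of a FastAPI chat endpoint streaming tokens to the front-end.
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        session_id: str
        question: str

    async def answer_stream(question: str):
        """Placeholder generator: the real version runs hybrid retrieval, then streams LLM tokens."""
        for token in ["Según ", "el ", "artículo ", "10 ", "..."]:   # dummy tokens
            yield token

    @app.post("/chat")
    async def chat(req: ChatRequest):
        # Tokens are sent to the React/Next front-end as they are generated.
        return StreamingResponse(answer_stream(req.question), media_type="text/plain")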

7) Ops & security

  1. CI/CD, metrics (latency, tokens, Recall@k), budgets.
  2. Privacy: API data not used to train ChatGPT; optional zero retention.
  3. RLS/read-only permissions, minimal logs, per-segment traceability.
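
One of the tracked metrics, Recall@k, can be computed as in the short sketch below; the gold set in the example is made up.

    # Hedged sketch of the Recall@k metric: fraction of annotated gold passages
    # that appear among the top-k retrieved chunks.
    def recall_at_k(retrieved_ids: list[int], gold_ids: set[int], k: int = 8) -> float:
        if not gold_ids:
            return 0.0
        hits = len(set(retrieved_ids[:k]) & gold_ids)
        return hits / len(gold_ids)

    # Example: 2 of 3 annotated passages retrieved in the top 8 -> ~0.67
    print(recall_at_k([12, 7, 99, 3], {7, 3, 55}, k=8))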

User experience

  1. Per-user history: resume, rename, delete.
  2. Clickable citations to the source (with context).
  3. Fact-based answers: the model does not invent content; it grounds every answer in the retrieved snippets.
  4. Multilingual: ask in EN, results in ES, optional EN summary.
  5. Streaming answers; voice input (optional).

Results & impact

  1. Access in seconds to content that was previously scattered across sources.
  2. High recall on technical queries (retención dividendos — modelo 190, establecimiento permanente, etc.).
  3. Significant time savings for analysts & attorneys.
  4. Ready foundation for monitoring & analytics (trends, clustering, dedup detection).

Tech stack

  1. Backend: Python, FastAPI, SQLAlchemy.
  2. Database: PostgreSQL, pgvector, tsvector, Docker.
  3. LLM & embeddings: OpenAI (chat: gpt-4o-mini; embeddings: text-embedding-3-small).
  4. Front-end: React/Next, streaming, i18n.
  5. Ops: CI/CD, monitoring, budgets.

Timeline & budget

  1. Phase 1 — scraping & preparation: delivered, ~2–3 weeks.
  2. Phase 2 — RAG + API + front + deployment: ~4 weeks.
  3. Phase 2 quote: €6,600 excl. VAT (21 days @ €300/day).
  4. Recurring API costs: low (one-shot indexing, lightweight queries).

Want a POC on your corpus (EN/ES) or an extension to your internal portal? Get a demo.


Client testimonial

“WorldTaxAI lets us query tens of thousands of Spanish tax documents with cited, trustworthy answers. Concept-based search works well, and the team is responsive to adjustments.”
— WorldTax Team


FAQ

Does the chatbot “learn” from our data?

No. API data is not used to train ChatGPT. Embeddings and texts remain private.

Can I ask in English if the corpus is in Spanish?

Yes. Multilingual embeddings combined with bilingual full-text search (language routing) handle EN↔ES queries.

How do you ensure quality?

Answers are always cited, and retrieval quality is continuously evaluated (Recall@k, Precision@k, faithfulness).

How does it scale?

pgvector comfortably handles hundreds of thousands of vectors, HNSW/IVFFlat indexes keep queries fast, and the stateless API scales horizontally.

Security & compliance?

Private PostgreSQL, read-only permissions (RLS), history deletion, auditable CI/CD, minimal logs.


Resources & next steps

Sample technical queries

  1. “Retención sobre dividendos no residentes — modelo 190 (EN→ES)”
  2. “DGT criteria on digital permanent establishment (EN→ES)”
  3. “Double tax relief exemption art. 10 — BOE 202x (ES)”

Product extension ideas

  1. Stronger re-ranking (cross-encoder), query rewriting (expansion).
  2. Quality dashboard (coverage, freshness, terminology drift).
  3. Monitoring: incremental ingestion (BOE/DGT/AEAT webhooks), topic alerts.

Simon Rochwerg

AI enthusiast, maker, data scientist.