A year of production RAG
Retrieval-Augmented Generation has moved from research concept to mainstream enterprise deployment faster than almost any AI pattern in recent memory. In 2024 we deployed RAG systems for clients in BFSI, healthcare, manufacturing, and professional services — answering questions over financial reports, clinical guidelines, maintenance manuals, and legal contracts. This is what we learned about what actually works, what fails in ways the benchmarks don't predict, and what we'd do differently.
What works well
Structured document retrieval over bounded corpora. RAG performs consistently well when the document corpus is relatively bounded (under 100,000 pages), the documents are structurally homogeneous (all PDFs, all similar formatting), and the queries are factual rather than inferential. Financial report Q&A, policy document lookup, product specification retrieval — these work reliably in production with well-tuned chunking and embedding strategies.
Hybrid retrieval (dense + sparse). Pure vector search underperforms on queries that include specific codes, identifiers, technical terms, or proper nouns — terms that are rare in the embedding model's training data and therefore poorly represented in embedding space, yet highly specific in meaning. Combining dense vector retrieval with BM25 sparse retrieval, then reranking the merged results, consistently outperforms either approach alone. This is uncontroversial in the research literature, yet it remains underimplemented in production systems.
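One simple way to combine the two result lists is reciprocal rank fusion, which needs only the rank order from each retriever, not comparable scores. The sketch below is illustrative — the document IDs and retriever outputs are hypothetical, and a production system would fuse real dense and BM25 results before a cross-encoder reranking stage:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each ranking lists doc IDs best-first. The constant k (60 is a
    common default) dampens the influence of lower-ranked items.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from the two retrievers for one query.
dense_hits = ["doc_a", "doc_b", "doc_c"]   # vector search
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # BM25

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that both retrievers agree on ("doc_a" here) rise to the top, which is exactly the behaviour you want when one retriever is strong on semantics and the other on exact terms.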
What fails in production
Multi-hop reasoning. "What is the risk exposure of our APAC portfolio given the regulatory changes in the Q3 report compared to the position described in the board memo from September?" — this query requires the system to retrieve multiple documents, understand their relationship, and synthesise a comparative answer. Current RAG architectures handle this poorly. The retrieval step finds individual relevant passages; the synthesis step loses track of the relational logic. We handle this with explicit query decomposition and multi-step retrieval, but it requires significant engineering effort.
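The decomposition approach can be sketched as follows. This is a deliberately naive illustration — in practice the decomposition is produced by an LLM call, and the `retrieve` and `synthesise` callbacks here are hypothetical stand-ins for real pipeline stages:

```python
def decompose(query):
    # Naive illustration: split a comparative query on the
    # connective into independent sub-queries. A real system
    # would use an LLM to produce the decomposition.
    if " compared to " in query:
        left, right = query.split(" compared to ", 1)
        return [left.strip(), right.strip()]
    return [query]

def multi_step_answer(query, retrieve, synthesise):
    """Retrieve evidence per sub-query, then synthesise once,
    so the relational logic survives into the final step."""
    sub_queries = decompose(query)
    evidence = {q: retrieve(q) for q in sub_queries}
    return synthesise(query, evidence)

# Toy usage with stub retrieval and synthesis.
corpus = {
    "risk exposure in the Q3 report": ["q3_report.pdf"],
    "position in the September board memo": ["board_memo.pdf"],
}
answer = multi_step_answer(
    "risk exposure in the Q3 report compared to "
    "position in the September board memo",
    retrieve=lambda q: corpus.get(q, []),
    synthesise=lambda q, ev: sorted(d for docs in ev.values() for d in docs),
)
```

The point of the structure is that each sub-query gets its own retrieval pass, so both documents reach the synthesis step — a single retrieval over the compound query typically surfaces only one side of the comparison.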
Hallucination with authoritative-sounding sources. One of the more dangerous failure modes in enterprise RAG is when the model generates a confident, well-formatted answer that is factually incorrect but cites a real document. The citation creates a false impression of grounding. Users trust the answer because they can see the source. We mitigate this with strict attribution — the system only outputs verbatim extracts with page references, and uses the LLM only to format and contextualise, not to synthesise facts.
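A minimal sketch of this strict-attribution pattern, assuming a hypothetical `Extract` record and an LLM callback that is only allowed to supply framing text — the quoted passages pass through verbatim and the model never gets the chance to restate a fact:

```python
from dataclasses import dataclass

@dataclass
class Extract:
    text: str       # verbatim passage from the source document
    document: str
    page: int

def render_answer(extracts, contextualise):
    """Emit only verbatim extracts with page references.

    `contextualise` is the LLM's sole contribution: framing
    text around the quotes, never the quotes themselves.
    """
    quotes = [f'"{e.text}" ({e.document}, p. {e.page})' for e in extracts]
    return contextualise(extracts) + "\n" + "\n".join(quotes)

answer = render_answer(
    [Extract("Net exposure rose 4% quarter on quarter.", "Q3 Report", 12)],
    contextualise=lambda ex: f"Found {len(ex)} relevant extract(s):",
)
```

Because the facts in the output are string-copied from retrieved passages, any incorrect answer is a retrieval failure the user can verify at the cited page, not an invention hiding behind a real-looking citation.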
The infrastructure decisions that matter most
Chunking strategy matters more than most teams expect. Fixed-size chunking with overlap is simple to implement and adequate for homogeneous documents. For heterogeneous corpora — mixing contracts, emails, reports, and presentations — semantic chunking that respects document structure (sections, clauses, paragraphs) consistently outperforms fixed chunking, at the cost of more complex preprocessing.
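The two strategies can be contrasted in a few lines. This sketch assumes documents with numbered section headings ("1. ", "2. ", …); real semantic chunking handles richer structure (clauses, nested sections, slide boundaries), which is where the extra preprocessing cost comes from:

```python
import re

def fixed_chunks(text, size=500, overlap=50):
    """Fixed-size character chunks with overlap: simple to
    implement, but may cut across clause boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def semantic_chunks(text):
    """Split *before* each numbered heading so every chunk is a
    complete section (assumes '1. ' style headings)."""
    parts = re.split(r"(?m)(?=^\d+\. )", text)
    return [p.strip() for p in parts if p.strip()]

contract = (
    "1. Scope\nThis agreement covers support services.\n"
    "2. Term\nTwelve months from signature."
)
sections = semantic_chunks(contract)
```

Each semantic chunk carries its own heading, so the retriever sees "2. Term" alongside the clause text — context a fixed-size window sliced mid-sentence would lose.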
Embedding model selection is a distant second concern compared to retrieval pipeline design. We've seen teams spend weeks evaluating embedding models and achieve a 2% improvement in retrieval recall, while leaving obvious wins on the table in their reranking and context assembly logic. Get the pipeline right first; optimise the model second.
What we'd do differently
We would invest earlier in evaluation infrastructure. The single biggest bottleneck in RAG development is the absence of a reliable, automated way to evaluate whether the system is getting better or worse as you iterate. Building a golden dataset of representative queries with expert-validated answers — before you start tuning the system — pays for itself many times over. Without it, you are flying blind.
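Even a crude automated metric over the golden dataset beats flying blind. The sketch below scores retrieval hit rate against hypothetical (query, expected source) pairs; a production harness would also grade the generated answer text against the expert-validated answers:

```python
def retrieval_hit_rate(system, golden_set):
    """Fraction of golden queries whose expected source appears
    in the system's retrieved results — a crude but automatable
    regression signal to run on every pipeline change."""
    hits = sum(
        1 for query, expected in golden_set
        if expected in system(query)
    )
    return hits / len(golden_set)

# Toy golden set and a stub retriever standing in for the real system.
golden = [
    ("What is the notice period?", "contract.pdf"),
    ("What was Q3 revenue?", "q3_report.pdf"),
]
stub = lambda q: ["contract.pdf"] if "notice" in q else ["other.pdf"]
score = retrieval_hit_rate(stub, golden)
```

Run before and after every chunking, embedding, or reranking change, a number like this turns "does it feel better?" into a regression test.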