Building Production RAG Systems: Lessons from 50,000 Legal Documents


AI Engineering · Mar 2025


Sariph Shrestha · 14 min read

The Problem With Naive RAG

Most RAG tutorials show you the same pattern: split a PDF into 512-token chunks, embed them with OpenAI's text-embedding-ada-002, store them in a vector database, and retrieve the top-k results. This works for demos. It does not work when a partner at a law firm needs to trust your system's output before sending it to a client worth $200 million.

When we started building LEXIS AI — our legal contract analysis platform — we quickly discovered that the gap between "RAG that works in a notebook" and "RAG that works in production" is enormous. Over 18 months and 50,000 legal documents, we developed a pipeline that achieves 96.3% risk detection accuracy on blind tests against senior associates. This post documents what we learned.

Chunking Strategy: Clause-Level, Not Paragraph-Level

The single most impactful decision in our entire pipeline was how we chunk documents. Legal contracts are not like Wikipedia articles. Meaning in a contract is clause-atomic — a single sentence can create or eliminate liability. Splitting mid-clause destroys semantic integrity.

We moved through three chunking strategies:

V1: Fixed-size chunks (512 tokens, 50-token overlap). Fast to implement, terrible results. Chunks routinely split clauses, and the overlap was insufficient to preserve cross-reference context. Retrieval precision was around 61%.

V2: Recursive character splitting with section awareness. We used regex to detect section headers (Article I, Section 2.1, etc.) and split on those boundaries. Better — precision improved to 74% — but many sections were too long (2,000+ tokens) and contained multiple independent obligations.

V3: Clause-level semantic chunking. We trained a lightweight classifier (fine-tuned DeBERTa-v3-base) to detect clause boundaries in legal text. Each chunk represents exactly one contractual obligation, definition, or condition. Average chunk size dropped to 180 tokens. Retrieval precision jumped to 89%.

The classifier was trained on 3,200 manually annotated clauses across M&A agreements, credit facilities, and commercial leases. Training took 4 hours on a single A100. The model adds ~12ms per document at indexing time — negligible for a batch pipeline.

```python
# Simplified clause boundary detection
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

class ClauseChunker:
    def __init__(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForTokenClassification.from_pretrained(model_path)
        self.model.eval()

    def chunk(self, text: str) -> list[str]:
        inputs = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=4096
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # Label 1 = clause boundary token
        predictions = torch.argmax(logits, dim=-1).squeeze().tolist()
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
        clauses, current = [], []
        for token, pred in zip(tokens, predictions):
            if pred == 1 and current:
                clauses.append(self.tokenizer.convert_tokens_to_string(current))
                current = [token]
            else:
                current.append(token)
        if current:
            clauses.append(self.tokenizer.convert_tokens_to_string(current))
        return clauses
```

Embedding Model Selection

We benchmarked five embedding models on a legal retrieval test set (1,200 queries, 50,000 candidate chunks, manually labeled relevance):

| Model | NDCG@10 | Latency (p99) | Dimensions |
|---|---|---|---|
| text-embedding-ada-002 | 0.71 | 45ms | 1536 |
| text-embedding-3-large | 0.78 | 52ms | 3072 |
| Cohere embed-v3 | 0.76 | 38ms | 1024 |
| BGE-large-en-v1.5 | 0.74 | 18ms | 1024 |
| Custom fine-tuned E5-large | 0.84 | 22ms | 1024 |

The fine-tuned E5-large model won decisively. We fine-tuned it on 15,000 (query, positive_clause, hard_negative) triplets generated from our annotated legal corpus. The hard negatives were critical — we specifically mined clauses that shared terminology with the query but addressed different legal concepts (e.g., "indemnification" clauses vs. "limitation of liability" clauses that reference indemnification).

Fine-tuning took 6 hours on 2x A100s using the sentence-transformers library with MultipleNegativesRankingLoss. The model is served via a dedicated FastAPI endpoint with ONNX Runtime for inference.
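The training itself uses sentence-transformers' `MultipleNegativesRankingLoss`; conceptually, each positive clause in a batch acts as a negative for every other query. A NumPy sketch of that loss for intuition (illustrative only — not the production training code, which also feeds in the mined hard negatives):

```python
import numpy as np

def mnr_loss(queries: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives loss over (B, D) embedding matrices.

    Row i of `positives` is the correct match for row i of `queries`;
    every other row in the batch serves as a negative.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (q @ p.T)  # (B, B) scaled cosine-similarity matrix
    # Cross-entropy with the correct pairs on the diagonal
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When queries align with their own positives the loss approaches zero; when they align with someone else's positive, it blows up — exactly the pressure that pushes "indemnification" queries away from limitation-of-liability clauses.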

The Retrieval Pipeline

Our production retrieval pipeline is not a single vector search. It is a four-stage cascade:

Stage 1: Sparse retrieval (BM25). We run BM25 against an Elasticsearch index of all clauses. This catches exact-match queries that dense retrieval sometimes misses — particularly clause numbers, defined terms, and statutory references. Top 100 candidates.
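Elasticsearch computes BM25 for us, but the scoring function is simple enough to hold in your head. A self-contained toy sketch (standard Okapi BM25 with the usual k1/b defaults; not our production code):

```python
import math

def bm25_score(query_terms: list[str], doc_terms: list[str],
               corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> float:
    """Toy BM25 score of one tokenized document against a query.

    `corpus` is the full collection of tokenized documents, used for
    document frequencies and average document length.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)                        # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```

Because scoring is driven by exact term matches, a query for "Section 2.1" or a defined term hits precisely the clauses that contain it — which is what dense retrieval sometimes misses.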

Stage 2: Dense retrieval (vector search). We query Pinecone with the embedded query vector. Top 100 candidates. We use the pod-based tier with p2 pods, which gives us sub-50ms p99 latency at our current scale.

Stage 3: Reciprocal Rank Fusion. We merge the BM25 and dense results using RRF with k=60. This hybrid approach consistently outperforms either method alone by 8-12% on our benchmark.

```python
def reciprocal_rank_fusion(
    results_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge multiple ranked lists using RRF."""
    scores: dict[str, float] = {}
    for results in results_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

Stage 4: Cross-encoder re-ranking. We re-rank the top 30 fused results using a cross-encoder (fine-tuned ms-marco-MiniLM-L-12-v2 on legal text). This is the most expensive step (~150ms) but provides the largest single-stage accuracy improvement. The top 8 re-ranked results are passed to the LLM.
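Mechanically, the re-ranking stage reduces to: score every (query, candidate) pair jointly, keep the best. A minimal sketch with the cross-encoder stubbed out as a `score_fn` parameter (illustrative — in production this is the fine-tuned MiniLM cross-encoder, batched for throughput):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 8) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top_k."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_k]
```

The key difference from the earlier stages is that `score_fn` sees the query and the clause *together*, so it can resolve distinctions (indemnification vs. limitation of liability) that bi-encoder similarity blurs.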

Hallucination Prevention

In legal AI, a hallucinated case citation is not just inaccurate — it is potentially career-ending for the lawyer who relies on it. We implemented three layers of hallucination prevention:

Citation grounding. Every factual claim in the LLM's output must reference a specific retrieved chunk. We enforce this through structured output (function calling) where the model must return a list of (claim, source_chunk_id) pairs. Claims without source references are filtered.

Confidence calibration. We ask the model to self-assess confidence on a 1-5 scale for each claim. We calibrated these scores against ground truth on a holdout set. Claims with confidence below 3 are flagged for human review rather than presented as findings.

Entailment verification. For high-stakes outputs, we run a separate NLI (Natural Language Inference) model to verify that each claim is actually entailed by its cited source chunk. If the NLI model classifies the relationship as "contradiction" or "neutral," the claim is suppressed.
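Applied in sequence, the three layers amount to a filter chain over candidate claims. A hedged sketch of that chain (field names and the `nli_fn` hook are illustrative, not our actual schema):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Claim:
    text: str
    source_chunk_id: Optional[str]
    confidence: int  # model's self-assessed 1-5 score

def filter_claims(claims: list[Claim], chunks: dict[str, str],
                  nli_fn: Callable[[str, str], str],
                  min_confidence: int = 3) -> list[Claim]:
    """Keep only claims that are grounded, confident, and entailed by their source."""
    kept = []
    for c in claims:
        if c.source_chunk_id not in chunks:       # layer 1: citation grounding
            continue
        if c.confidence < min_confidence:         # layer 2: route to human review
            continue
        # layer 3: NLI must classify source -> claim as entailment
        if nli_fn(chunks[c.source_chunk_id], c.text) != "entailment":
            continue
        kept.append(c)
    return kept
```

In production the low-confidence branch routes claims to a review queue rather than silently dropping them; the sketch collapses that to a filter for brevity.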

Across 10,000+ production reviews, we have had zero hallucinated case citations.

Evaluation Framework

We evaluate the pipeline on four axes, measured weekly on a rolling benchmark of 200 queries:

  • Retrieval quality: NDCG@10, Recall@20, MRR — measuring whether the right chunks reach the LLM.
  • Answer accuracy: Human-graded on a 3-point scale (correct, partially correct, incorrect) by a contracted legal reviewer.
  • Faithfulness: Percentage of claims that are entailed by the cited source (measured by NLI model, spot-checked by humans).
  • Latency: End-to-end p50 and p99, broken down by pipeline stage. Our SLA is p99 < 8 seconds for a full contract analysis.
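Of these metrics, NDCG@10 is the least self-explanatory: for a single query it is discounted cumulative gain over the returned ranking, normalized by the DCG of the ideal ordering. A minimal sketch:

```python
import math

def ndcg_at_k(relevances: list[int], k: int = 10) -> float:
    """NDCG@k for one query, given graded relevance of results in rank order."""
    def dcg(rels: list[int]) -> float:
        # Gain discounted by log2 of (1-indexed) rank + 1
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; burying the only relevant chunk at rank 3 instead of rank 1 costs roughly half the score, which is why the metric is sensitive to exactly the ordering mistakes the re-ranker fixes.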

We run these evaluations in CI. Any PR that touches the retrieval pipeline, prompt templates, or model weights must pass a regression threshold: no metric can degrade by more than 2% relative to the previous release.
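The gate itself is only a few lines. A sketch of the 2% relative-degradation check (metric names illustrative):

```python
def passes_regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                           max_rel_drop: float = 0.02) -> bool:
    """Return False if any metric degraded by more than max_rel_drop vs. baseline."""
    return all(candidate[m] >= baseline[m] * (1.0 - max_rel_drop) for m in baseline)
```

For latency-style metrics where lower is better, the inequality flips; our CI job tracks the direction per metric.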

Key Takeaways

After 18 months in production, our core lessons are: chunk at the semantic unit of your domain, not at arbitrary token counts; hybrid retrieval (sparse + dense + re-ranking) outperforms any single method; hallucination prevention is a systems problem requiring multiple complementary techniques; and evaluation infrastructure is not optional — it is the only thing that lets you iterate with confidence. The difference between a demo and a product is the quality of your evaluation framework.

