We Built an AI That Reviews Contracts 160x Faster. Here's What Went Wrong First.


Feb 13, 2026 · Product · 4 min read


Version 1: The Overconfident Prototype

Our first LEXIS prototype was embarrassingly naive. Take a contract, chunk it, embed it, retrieve relevant sections, ask Claude to identify risks. Demo looked amazing. Client loved it. Then we tested it against real associate work product.
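If you want the shape of it in code, here's a minimal sketch of that v1 flow. The bag-of-words retrieval, the paragraph splitter, and the model name are illustrative stand-ins, not our production stack:

```python
# A minimal sketch of the v1 flow: chunk, embed, retrieve, ask Claude.
# The bag-of-words "embedding" and cosine scoring are toy stand-ins for a real
# embedding model and vector store; the model name is illustrative.
import math
from collections import Counter

import anthropic


def embed(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def review_contract_v1(contract_text: str) -> str:
    query = embed("indemnification liability cap termination assignment warranty risk")
    chunks = [p for p in contract_text.split("\n\n") if p.strip()]   # naive paragraph chunking
    top = sorted(chunks, key=lambda c: cosine(embed(c), query), reverse=True)[:8]

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Identify the legal risks in these contract excerpts:\n\n" + "\n---\n".join(top),
        }],
    )
    return msg.content[0].text      # one shot, no grounding checks, no validation
```

Everything in the next paragraph traces back to what this sketch leaves out: clause boundaries, cross-references, and any check that a flagged obligation actually appears in the contract.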

Accuracy: 61%. The system missed indemnification caps buried in cross-referenced schedules, misidentified standard boilerplate as high-risk, and hallucinated obligations that didn't exist in the contract. A 61% accurate legal review tool isn't just useless — it's dangerous.

Version 2: The Over-Engineered Monster

We overcorrected. Built a 14-stage pipeline with separate models for clause classification, entity extraction, obligation mapping, risk scoring, cross-reference resolution, and output validation. Each stage was individually impressive.

The problem: end-to-end latency was 47 minutes per contract. A senior associate takes about 4 hours for the same review, so we'd achieved a 5x speedup, but at $180 per contract in API calls. The client's associates cost $75/hour; per hour of review time, we were more expensive than the humans we were supposed to replace.
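The arithmetic is worth spelling out. With sequential stages, latency and API cost simply add up. A rough accounting sketch, with per-stage numbers that are purely illustrative, scaled so that fourteen stages land near the real totals:

```python
# Why v2's economics broke: fourteen sequential model calls whose latency and
# cost just add up. Six stages shown; per-stage numbers are illustrative.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_s: float   # wall-clock seconds per contract
    cost_usd: float    # API cost per contract


stages = [
    Stage("clause_classification", 160, 10),
    Stage("entity_extraction", 140, 8),
    Stage("obligation_mapping", 220, 16),
    Stage("risk_scoring", 180, 12),
    Stage("cross_reference_resolution", 260, 18),
    Stage("output_validation", 200, 14),
    # ...eight more stages in the real pipeline
]

latency_min = sum(s.latency_s for s in stages) / 60
cost = sum(s.cost_usd for s in stages)
print(f"6 of 14 stages: {latency_min:.0f} min, ${cost:.0f} per contract")
# Scale to all fourteen stages and you land in the neighborhood of 47 minutes
# and $180 per contract, against $75/hour for the associates doing it by hand.
```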

Version 3: The Right Architecture

We stripped it back. Three stages: intelligent chunking (clause-level, not paragraph-level), targeted retrieval with a custom re-ranker trained on legal risk patterns, and a single Claude Opus call with a carefully engineered prompt that included the top-k retrieved clauses plus contract metadata.
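The whole v3 shape fits on one screen. The sketch below uses a toy regex chunker and a risk-term-density scorer where the production components (including the custom re-ranker trained on legal risk patterns) would sit, and the model id is illustrative:

```python
# The v3 shape: clause-level chunking -> retrieval/re-ranking -> one Claude Opus call.
# chunk_clauses() and rerank() are toy stand-ins for our production components.
import re

import anthropic

RISK_TERMS = ("indemnif", "liabilit", "terminat", "cap on", "assign", "warrant", "penalt")


def chunk_clauses(contract_text: str) -> list[str]:
    # Toy clause-level chunker: split on numbered headings like "12.3 Limitation of Liability".
    return [c.strip() for c in re.split(r"\n(?=\d+(?:\.\d+)*\s)", contract_text) if c.strip()]


def rerank(clauses: list[str]) -> list[str]:
    # Stand-in for the custom re-ranker: order clauses by risk-term density.
    def score(clause: str) -> float:
        text = clause.lower()
        return sum(text.count(term) for term in RISK_TERMS) / (len(text.split()) + 1)
    return sorted(clauses, key=score, reverse=True)


def review_contract_v3(contract_text: str, metadata: dict, k: int = 20) -> str:
    top_clauses = rerank(chunk_clauses(contract_text))[:k]
    prompt = (
        f"Contract metadata: {metadata}\n\n"
        "Review the clauses below. Identify material risks, quote the supporting clause "
        "text for each finding, and note where a clause is standard boilerplate:\n\n"
        + "\n\n---\n\n".join(top_clauses)
    )
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-opus-4-1",    # illustrative id; the point is a single Opus call
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```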

Accuracy: 94.1%. Latency: 90 seconds per contract. Cost: $2.30 per contract. We were getting close.

Version 4: Production

Added confidence scoring, human-in-the-loop for low-confidence flags, evaluation pipelines running on every deploy, and feedback loops from associate corrections. Accuracy climbed to 96.3%.
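The human-in-the-loop piece is mostly plumbing, but it's what makes the accuracy number hold up in production. A sketch of the routing, with an illustrative threshold and data shapes:

```python
# Sketch of v4's human-in-the-loop routing: low-confidence findings go to an
# associate queue, and corrections feed back into the evaluation data.
# The threshold and dataclass fields are illustrative, not our production schema.
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.85   # illustrative cut-off


@dataclass
class Finding:
    clause_id: str
    risk: str
    confidence: float          # 0.0 - 1.0, attached by the scoring stage


@dataclass
class ReviewResult:
    auto_approved: list[Finding] = field(default_factory=list)
    needs_human: list[Finding] = field(default_factory=list)


def route_findings(findings: list[Finding]) -> ReviewResult:
    result = ReviewResult()
    for f in findings:
        bucket = result.auto_approved if f.confidence >= CONFIDENCE_THRESHOLD else result.needs_human
        bucket.append(f)
    return result


def record_correction(finding: Finding, corrected_risk: str, eval_set: list[dict]) -> None:
    # Associate corrections become labeled examples for the eval suite that runs on every deploy.
    eval_set.append({"clause_id": finding.clause_id, "model": finding.risk, "label": corrected_risk})
```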

The 160x speed claim: 90 seconds vs 4 hours = 160x. But the real win isn't speed — it's consistency. Our system doesn't get tired at 11 PM, doesn't miss things because it's reviewing its fifth contract of the day, and flags the same risk the same way every time.

The lesson: start simple, measure ruthlessly, and don't confuse pipeline complexity with pipeline quality.

LEXIS · Legal AI · Contract Review · Production · RAG
