Building a Legal AI That Lawyers Actually Trust


AI Engineering · Jan 2025


Sariph Shrestha · 15 min read

The Trust Problem

When we first pitched LEXIS AI to a top-20 law firm, the managing partner said something that shaped our entire engineering approach: "I don't care if your AI is 99% accurate. I need to know which 1% it got wrong, and I need to know before my client finds out."

This is the fundamental challenge of legal AI. Lawyers do not need a tool that is usually right. They need a tool that is transparently right — where every conclusion can be traced to its source, where uncertainty is surfaced rather than hidden, and where the system fails gracefully rather than confidently.

LEXIS AI now processes contract reviews for firms handling M&A transactions, commercial leases, and credit facility agreements. This post describes the architecture that earned that trust.

System Architecture

LEXIS AI is structured as a pipeline with five distinct stages, each independently testable and monitorable:

Stage 1: Document Ingestion. Contracts arrive as PDFs (60%), Word documents (35%), or scanned images (5%). We use a combination of Apache Tika for text extraction and a fine-tuned LayoutLMv3 model for understanding document structure — distinguishing headers, body text, signature blocks, exhibits, and schedules. For scanned documents, we run Azure Document Intelligence OCR with our own post-processing layer that corrects common legal OCR errors (§ misread as S, ¶ misread as P, Roman numerals misclassified).
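The correction layer can be sketched as a small battery of pattern rewrites. This is an illustrative subset, not the actual LEXIS AI rule set; the patterns and function name are assumptions:

```python
import re

# Hypothetical correction rules for common legal OCR confusions.
# The production post-processing layer is considerably larger.
LEGAL_OCR_FIXES = [
    # A bare "S" before a clause number is almost always a misread "§"
    (re.compile(r"\bS\s?(\d+(?:\.\d+)*)"), r"§ \1"),
    # A stray "P" before a paragraph number is usually a misread "¶"
    (re.compile(r"\bP\s?(\d+)\b(?=\s*[.:)])"), r"¶ \1"),
    # ... plus rules for misclassified Roman numerals, etc.
]

def postprocess_ocr(text: str) -> str:
    for pattern, replacement in LEGAL_OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text
```

Running rules in a fixed order keeps the layer deterministic and easy to regression-test against a corpus of known OCR errors.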

Stage 2: Clause Extraction and Classification. The extracted text is segmented into individual clauses using our DeBERTa-based boundary detector (described in our previous RAG article). Each clause is then classified into one of 47 clause types — indemnification, limitation of liability, termination for convenience, change of control, assignment restrictions, and so on. Classification accuracy is 94.2% across all types, with per-class F1 scores ranging from 0.89 (complex cross-default provisions) to 0.99 (standard governing law clauses).

```python
CLAUSE_TAXONOMY = {
    "indemnification": {
        "subtypes": ["mutual", "one_way_buyer", "one_way_seller", "ip_specific"],
        "risk_factors": ["uncapped", "survival_period", "knowledge_qualifier"],
        "market_benchmarks": {
            "cap_median": "1x_contract_value",
            "cap_p25": "0.5x_contract_value",
            "survival_median_months": 18,
        },
    },
    "limitation_of_liability": {
        "subtypes": ["general_cap", "consequential_damages_waiver", "fundamental_breach_carveout"],
        "risk_factors": ["below_market_cap", "no_consequential_waiver", "broad_carveouts"],
        "market_benchmarks": {
            "general_cap_median": "1x_annual_fees",
            "consequential_waiver_frequency": 0.87,
        },
    },
    # ... 45 more clause types
}
```

Stage 3: Risk Analysis. Each extracted clause is evaluated against a risk rubric. The rubric encodes three types of knowledge: (a) legal principles from our advisory board of contract attorneys, (b) market benchmark data from our corpus of 50,000+ analyzed contracts, and (c) client-specific policies and risk tolerances.

Risk scoring is not a single LLM call. We decompose it into sub-evaluations: Does this clause deviate from market standard? By how much? Is the deviation favorable or unfavorable to our client? What is the potential financial exposure? Each sub-evaluation is a separate, focused LLM call with narrow context, making outputs more reliable and easier to validate.
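The decomposition can be sketched as a function that issues one narrow question per sub-evaluation. This is a hypothetical shape, not the production code; `ask` stands in for a single focused LLM call, and the field names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RiskAssessment:
    deviates_from_market: bool
    deviation: str           # e.g. "cap is 0.5x vs. 1x market median"
    favorable_to_client: bool
    exposure: str            # e.g. "up to 1x contract value"

def assess_clause_risk(clause_text: str, benchmarks: dict,
                       ask: Callable[[str, str], str]) -> RiskAssessment:
    # Each question is a separate call with narrow context,
    # so each answer can be validated independently.
    deviates = ask("Does this clause deviate from market standard? yes/no",
                   clause_text) == "yes"
    deviation = (ask("Describe the deviation relative to: " + str(benchmarks),
                     clause_text) if deviates else "none")
    favorable = ask("Is the deviation favorable to our client? yes/no",
                    clause_text) == "yes"
    exposure = ask("Estimate potential financial exposure.", clause_text)
    return RiskAssessment(deviates, deviation, favorable, exposure)
```

Because each sub-answer is small and typed, a bad answer is localized to one question rather than buried inside a long free-form analysis.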

Stage 4: Cross-Reference Resolution. Legal contracts are webs of internal cross-references. An indemnification clause might reference a defined term in Article I, be subject to a cap defined in a separate schedule, and be modified by a side letter. LEXIS AI builds a dependency graph of cross-references and resolves them before presenting findings.

This is one of the most technically challenging components. We use a combination of regex patterns for explicit references ("as defined in Section 2.1(a)") and a trained entity linking model for implicit references ("the foregoing limitation" referring to a cap three paragraphs earlier). The entity linker is a fine-tuned Longformer that can attend across the full contract length (up to 16K tokens).
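For the explicit references, a single illustrative pattern conveys the idea; the production system uses a much larger battery of patterns, and this regex is an assumption, not the actual one:

```python
import re

# Illustrative pattern for explicit cross-references like
# "as defined in Section 2.1(a)" or "subject to Article IX".
XREF_PATTERN = re.compile(
    r"(?:as\s+(?:defined|set\s+forth)\s+in\s+)?"
    r"(Section|Article|Schedule|Exhibit)\s+"
    r"(\d+(?:\.\d+)*(?:\([a-z]\))*|[IVXLC]+|[A-Z])",
    re.IGNORECASE,
)

def extract_explicit_refs(text: str) -> list[tuple[str, str]]:
    # Returns (reference kind, reference target) pairs in document order.
    return [(m.group(1), m.group(2)) for m in XREF_PATTERN.finditer(text)]
```

Explicit references resolved this way become edges in the dependency graph; everything the regex cannot catch ("the foregoing limitation") falls through to the entity linker.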

Stage 5: Report Generation. The final report is generated by Claude 3.5 Sonnet, grounded entirely in the outputs of stages 2-4. The prompt includes the extracted clauses, risk scores, cross-reference resolutions, and market benchmarks. Every sentence in the report is annotated with its source — which clause it references, which benchmark it compares against, and the confidence score of the underlying analysis.

The Evaluation Methodology

Deploying LEXIS AI into a law firm's workflow required proving reliability to a skeptical audience. We developed a three-tier evaluation methodology:

Tier 1: Component-Level Evaluation

Each pipeline component is evaluated independently on held-out test sets:

  • Clause boundary detection: Token-level F1 = 0.96 (n=800 documents)
  • Clause classification: Macro F1 = 0.942 (n=12,000 clauses across 47 types)
  • Cross-reference resolution: Accuracy = 0.91 (n=3,200 cross-references)
  • Risk factor identification: Recall = 0.963, Precision = 0.941 (n=5,000 risk factors)

Tier 2: End-to-End Blind Testing

We run quarterly blind tests against senior associates. The test protocol:

1. Select 20 contracts not in the training corpus, stratified by type and complexity.
2. Have a senior associate (8+ years experience) review each contract independently.
3. Run LEXIS AI on the same contracts.
4. A supervising partner evaluates both outputs without knowing which is human and which is AI.

Results from the most recent blind test (Q4 2024):

| Metric | LEXIS AI | Senior Associate |
| --- | --- | --- |
| Risk factors identified | 96.3% | 91.7% |
| False positives | 4.1% | 2.3% |
| Average review time | 94 seconds | 3.8 hours |
| Critical issues missed | 0 | 1 |

LEXIS AI identifies more risk factors than the human reviewer, albeit at a higher false positive rate. The key metric: zero critical issues missed. We would rather flag something that turns out to be benign than miss something material.

Tier 3: Production Monitoring

In production, we monitor:

```python
from dataclasses import dataclass

@dataclass
class ProductionMetrics:
    # Quality metrics (measured daily)
    override_rate: float    # % of AI findings modified by reviewer
    addition_rate: float    # % of reviews where reviewer adds missed findings
    rejection_rate: float   # % of AI findings rejected as incorrect

    # Performance metrics (measured per-request)
    e2e_latency_p50_ms: float
    e2e_latency_p99_ms: float
    clause_extraction_latency_ms: float
    risk_analysis_latency_ms: float

    # Reliability metrics (measured hourly)
    error_rate: float
    timeout_rate: float
    fallback_rate: float    # % of requests hitting fallback model
```

Our current production metrics: 7.2% override rate (the reviewer changes an AI finding), 3.1% addition rate (the reviewer catches something AI missed), 1.8% rejection rate (the reviewer marks an AI finding as incorrect). These numbers have improved every quarter since launch.

Architecture Decisions We Would Make Again

Decomposed pipeline over end-to-end. It would be simpler to send the entire contract to GPT-4 with a prompt saying "find all risks." But a monolithic approach is untestable, unexplainable, and unreliable. Each stage of our pipeline can be evaluated, debugged, and improved independently.

Structured intermediate representations. Every stage communicates through typed data structures, not free-form text. Clauses are objects with IDs, types, text, positions, and metadata. Risk factors are objects with severity, source clause, benchmark comparison, and confidence. This makes the system debuggable and enables the traceability lawyers require.
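The shapes described above can be sketched as plain dataclasses. The field names here are assumptions for illustration, not the actual LEXIS AI schema:

```python
from dataclasses import dataclass

@dataclass
class Clause:
    clause_id: str
    clause_type: str        # one of the 47 taxonomy types
    text: str
    start_char: int         # position in the source document
    end_char: int
    metadata: dict

@dataclass
class RiskFactor:
    risk_id: str
    severity: str           # e.g. "low" / "medium" / "high"
    source_clause_id: str   # traceability back to the originating clause
    benchmark_comparison: str
    confidence: float
```

Because every risk factor carries a `source_clause_id`, any finding in the final report can be walked back to a specific clause span in the original contract.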

Conservative confidence thresholds. We tuned our system to have high recall at the expense of precision. LEXIS AI over-flags rather than under-flags. Lawyers universally prefer reviewing a few extra flagged items over missing a material risk. Every deployment conversation confirms this preference.
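One way to operationalize this tuning, sketched here as an assumption about the approach rather than the actual implementation: pick the flagging threshold that achieves a target recall on a validation set, and accept whatever precision results.

```python
import math

def threshold_for_recall(scores, labels, target_recall=0.96):
    """Lowest threshold t such that flagging candidates with score >= t
    recovers at least target_recall of the true risks in the validation set.

    scores: model confidence per candidate risk factor
    labels: 1 if the candidate is a real risk, else 0
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    n_needed = math.ceil(target_recall * len(positives))
    return positives[n_needed - 1]
```

Precision then becomes a monitored quantity rather than a tuned one, which matches the preference lawyers state: a few extra flags are acceptable, a missed material risk is not.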

Human-in-the-loop as a feature, not a limitation. LEXIS AI is not designed to replace the lawyer. It is designed to make the lawyer faster. The review workflow surfaces AI findings for human confirmation, not autonomous action. This framing was essential for adoption — lawyers engage with the tool as an assistant, not a replacement.

What Took 90 Seconds Used to Take 4 Hours

The result: a first-pass contract review that took a senior associate 4 hours now takes 90 seconds of AI processing plus 15-20 minutes of human review. The lawyer's time is spent on judgment — evaluating flagged risks, considering business context, advising the client — rather than on the mechanical task of reading every clause.

That managing partner who asked about the 1%? Six months after deployment, his feedback was: "The AI finds things my associates miss. But more importantly, I can see exactly why it flagged each issue. That is why we trust it."

