Building a Legal AI That Lawyers Actually Trust


AI Engineering · Jan 2025


Sariph Shrestha · 15 min read

The Trust Problem

When we first pitched LEXIS AI to a top-20 law firm, the managing partner said something that shaped our entire engineering approach: "I don't care if your AI is 99% accurate. I need to know which 1% it got wrong, and I need to know before my client finds out."

This is the fundamental challenge of legal AI. Lawyers do not need a tool that is usually right. They need a tool that is transparently right — where every conclusion can be traced to its source, where uncertainty is surfaced rather than hidden, and where the system fails gracefully rather than confidently.

LEXIS AI now processes contract reviews for firms handling M&A transactions, commercial leases, and credit facility agreements. This post describes the architecture that earned that trust.

System Architecture

LEXIS AI is structured as a pipeline with five distinct stages, each independently testable and monitorable:

Stage 1: Document Ingestion. Contracts arrive as PDFs (60%), Word documents (35%), or scanned images (5%). We use a combination of Apache Tika for text extraction and a fine-tuned LayoutLMv3 model for understanding document structure — distinguishing headers, body text, signature blocks, exhibits, and schedules. For scanned documents, we run Azure Document Intelligence OCR with our own post-processing layer that corrects common legal OCR errors (§ misread as S, ¶ misread as P, Roman numerals misclassified).
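The correction layer can be sketched as a small battery of pattern rewrites. This is an illustrative subset, not the actual LEXIS AI rule set; the patterns and function name are assumptions:

```python
import re

# Hypothetical correction rules for common legal OCR confusions.
# The production post-processing layer is considerably larger.
LEGAL_OCR_FIXES = [
    # A bare "S" before a clause number is almost always a misread "§"
    (re.compile(r"\bS\s?(\d+(?:\.\d+)*)"), r"§ \1"),
    # A stray "P" before a paragraph number is usually a misread "¶"
    (re.compile(r"\bP\s?(\d+)\b(?=\s*[.:)])"), r"¶ \1"),
    # ... plus rules for misclassified Roman numerals, etc.
]

def postprocess_ocr(text: str) -> str:
    for pattern, replacement in LEGAL_OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text
```

Running rules in a fixed order keeps the layer deterministic and easy to regression-test against a corpus of known OCR errors.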

Stage 2: Clause Extraction and Classification. The extracted text is segmented into individual clauses using our DeBERTa-based boundary detector (described in our previous RAG article). Each clause is then classified into one of 47 clause types — indemnification, limitation of liability, termination for convenience, change of control, assignment restrictions, and so on. Classification accuracy is 94.2% across all types, with per-class F1 scores ranging from 0.89 (complex cross-default provisions) to 0.99 (standard governing law clauses).

```python
CLAUSE_TAXONOMY = {
    "indemnification": {
        "subtypes": ["mutual", "one_way_buyer", "one_way_seller", "ip_specific"],
        "risk_factors": ["uncapped", "survival_period", "knowledge_qualifier"],
        "market_benchmarks": {
            "cap_median": "1x_contract_value",
            "cap_p25": "0.5x_contract_value",
            "survival_median_months": 18,
        },
    },
    "limitation_of_liability": {
        "subtypes": ["general_cap", "consequential_damages_waiver", "fundamental_breach_carveout"],
        "risk_factors": ["below_market_cap", "no_consequential_waiver", "broad_carveouts"],
        "market_benchmarks": {
            "general_cap_median": "1x_annual_fees",
            "consequential_waiver_frequency": 0.87,
        },
    },
    # ... 45 more clause types
}
```

Stage 3: Risk Analysis. Each extracted clause is evaluated against a risk rubric. The rubric encodes three types of knowledge: (a) legal principles from our advisory board of contract attorneys, (b) market benchmark data from our corpus of 50,000+ analyzed contracts, and (c) client-specific policies and risk tolerances.

Risk scoring is not a single LLM call. We decompose it into sub-evaluations: Does this clause deviate from market standard? By how much? Is the deviation favorable or unfavorable to our client? What is the potential financial exposure? Each sub-evaluation is a separate, focused LLM call with narrow context, making outputs more reliable and easier to validate.
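The decomposition can be sketched as a function that issues one narrow question per sub-evaluation. This is a hypothetical shape, not the production code; `ask` stands in for a single focused LLM call, and the field names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RiskAssessment:
    deviates_from_market: bool
    deviation: str           # e.g. "cap is 0.5x vs. 1x market median"
    favorable_to_client: bool
    exposure: str            # e.g. "up to 1x contract value"

def assess_clause_risk(clause_text: str, benchmarks: dict,
                       ask: Callable[[str, str], str]) -> RiskAssessment:
    # Each question is a separate call with narrow context,
    # so each answer can be validated independently.
    deviates = ask("Does this clause deviate from market standard? yes/no",
                   clause_text) == "yes"
    deviation = (ask("Describe the deviation relative to: " + str(benchmarks),
                     clause_text) if deviates else "none")
    favorable = ask("Is the deviation favorable to our client? yes/no",
                    clause_text) == "yes"
    exposure = ask("Estimate potential financial exposure.", clause_text)
    return RiskAssessment(deviates, deviation, favorable, exposure)
```

Because each sub-answer is small and typed, a bad answer is localized to one question rather than buried inside a long free-form analysis.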

Stage 4: Cross-Reference Resolution. Legal contracts are webs of internal cross-references. An indemnification clause might reference a defined term in Article I, be subject to a cap defined in a separate schedule, and be modified by a side letter. LEXIS AI builds a dependency graph of cross-references and resolves them before presenting findings.

This is one of the most technically challenging components. We use a combination of regex patterns for explicit references ("as defined in Section 2.1(a)") and a trained entity linking model for implicit references ("the foregoing limitation" referring to a cap three paragraphs earlier). The entity linker is a fine-tuned Longformer that can attend across the full contract length (up to 16K tokens).
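For the explicit references, a single illustrative pattern conveys the idea; the production system uses a much larger battery of patterns, and this regex is an assumption, not the actual one:

```python
import re

# Illustrative pattern for explicit cross-references like
# "as defined in Section 2.1(a)" or "subject to Article IX".
XREF_PATTERN = re.compile(
    r"(?:as\s+(?:defined|set\s+forth)\s+in\s+)?"
    r"(Section|Article|Schedule|Exhibit)\s+"
    r"(\d+(?:\.\d+)*(?:\([a-z]\))*|[IVXLC]+|[A-Z])",
    re.IGNORECASE,
)

def extract_explicit_refs(text: str) -> list[tuple[str, str]]:
    # Returns (reference kind, reference target) pairs in document order.
    return [(m.group(1), m.group(2)) for m in XREF_PATTERN.finditer(text)]
```

Explicit references resolved this way become edges in the dependency graph; everything the regex cannot catch ("the foregoing limitation") falls through to the entity linker.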

Stage 5: Report Generation. The final report is generated by Claude 3.5 Sonnet, grounded entirely in the outputs of stages 2-4. The prompt includes the extracted clauses, risk scores, cross-reference resolutions, and market benchmarks. Every sentence in the report is annotated with its source — which clause it references, which benchmark it compares against, and the confidence score of the underlying analysis.

The Evaluation Methodology

Deploying LEXIS AI into a law firm's workflow required proving reliability to a skeptical audience. We developed a three-tier evaluation methodology:

Tier 1: Component-Level Evaluation

Each pipeline component is evaluated independently on held-out test sets:

  • Clause boundary detection: Token-level F1 = 0.96 (n=800 documents)
  • Clause classification: Macro F1 = 0.942 (n=12,000 clauses across 47 types)
  • Cross-reference resolution: Accuracy = 0.91 (n=3,200 cross-references)
  • Risk factor identification: Recall = 0.963, Precision = 0.941 (n=5,000 risk factors)

Tier 2: End-to-End Blind Testing

We run quarterly blind tests against senior associates. The test protocol:

1. Select 20 contracts not in the training corpus, stratified by type and complexity.
2. Have a senior associate (8+ years experience) review each contract independently.
3. Run LEXIS AI on the same contracts.
4. A supervising partner evaluates both outputs without knowing which is human and which is AI.

Results from the most recent blind test (Q4 2024):

| Metric | LEXIS AI | Senior Associate |
| --- | --- | --- |
| Risk factors identified | 96.3% | 91.7% |
| False positives | 4.1% | 2.3% |
| Average review time | 94 seconds | 3.8 hours |
| Critical issues missed | 0 | 1 |

LEXIS AI identifies more risk factors than the human reviewer, albeit at a higher false positive rate. The key metric: zero critical issues missed. We would rather flag something that turns out to be benign than miss something material.

Tier 3: Production Monitoring

In production, we monitor:

```python
from dataclasses import dataclass

@dataclass
class ProductionMetrics:
    # Quality metrics (measured daily)
    override_rate: float    # % of AI findings modified by reviewer
    addition_rate: float    # % of reviews where reviewer adds missed findings
    rejection_rate: float   # % of AI findings rejected as incorrect

    # Performance metrics (measured per-request)
    e2e_latency_p50_ms: float
    e2e_latency_p99_ms: float
    clause_extraction_latency_ms: float
    risk_analysis_latency_ms: float

    # Reliability metrics (measured hourly)
    error_rate: float
    timeout_rate: float
    fallback_rate: float    # % of requests hitting fallback model
```

Our current production metrics: 7.2% override rate (the reviewer changes an AI finding), 3.1% addition rate (the reviewer catches something AI missed), 1.8% rejection rate (the reviewer marks an AI finding as incorrect). These numbers have improved every quarter since launch.

Architecture Decisions We Would Make Again

Decomposed pipeline over end-to-end. It would be simpler to send the entire contract to GPT-4 with a prompt saying "find all risks." But a monolithic approach is untestable, unexplainable, and unreliable. Each stage of our pipeline can be evaluated, debugged, and improved independently.

Structured intermediate representations. Every stage communicates through typed data structures, not free-form text. Clauses are objects with IDs, types, text, positions, and metadata. Risk factors are objects with severity, source clause, benchmark comparison, and confidence. This makes the system debuggable and enables the traceability lawyers require.
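The shapes described above can be sketched as plain dataclasses. The field names here are assumptions for illustration, not the actual LEXIS AI schema:

```python
from dataclasses import dataclass

@dataclass
class Clause:
    clause_id: str
    clause_type: str        # one of the 47 taxonomy types
    text: str
    start_char: int         # position in the source document
    end_char: int
    metadata: dict

@dataclass
class RiskFactor:
    risk_id: str
    severity: str           # e.g. "low" / "medium" / "high"
    source_clause_id: str   # traceability back to the originating clause
    benchmark_comparison: str
    confidence: float
```

Because every risk factor carries a `source_clause_id`, any finding in the final report can be walked back to a specific clause span in the original contract.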

Conservative confidence thresholds. We tuned our system to have high recall at the expense of precision. LEXIS AI over-flags rather than under-flags. Lawyers universally prefer reviewing a few extra flagged items over missing a material risk. Every deployment conversation confirms this preference.
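One way to operationalize this tuning, sketched here as an assumption about the approach rather than the actual implementation: pick the flagging threshold that achieves a target recall on a validation set, and accept whatever precision results.

```python
import math

def threshold_for_recall(scores, labels, target_recall=0.96):
    """Lowest threshold t such that flagging candidates with score >= t
    recovers at least target_recall of the true risks in the validation set.

    scores: model confidence per candidate risk factor
    labels: 1 if the candidate is a real risk, else 0
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    n_needed = math.ceil(target_recall * len(positives))
    return positives[n_needed - 1]
```

Precision then becomes a monitored quantity rather than a tuned one, which matches the preference lawyers state: a few extra flags are acceptable, a missed material risk is not.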

Human-in-the-loop as a feature, not a limitation. LEXIS AI is not designed to replace the lawyer. It is designed to make the lawyer faster. The review workflow surfaces AI findings for human confirmation, not autonomous action. This framing was essential for adoption — lawyers engage with the tool as an assistant, not a replacement.

What Took 90 Seconds Used to Take 4 Hours

The result: a first-pass contract review that took a senior associate 4 hours now takes 90 seconds of AI processing plus 15-20 minutes of human review. The lawyer's time is spent on judgment — evaluating flagged risks, considering business context, advising the client — rather than on the mechanical task of reading every clause.

That managing partner who asked about the 1%? Six months after deployment, his feedback was: "The AI finds things my associates miss. But more importantly, I can see exactly why it flagged each issue. That is why we trust it."

