The Vibes Problem
Most teams evaluate LLMs the same way: someone on the team types a few prompts, reads the outputs, and says "yeah, that looks good." This is vibes-based evaluation. It is how most enterprise AI projects start, and it is why many of them fail in production.
Vibes-based evaluation fails because it is not reproducible, not comprehensive, and not sensitive to regression. You cannot tell if last week's prompt change improved or degraded quality. You cannot tell if the model handles edge cases. You cannot tell if switching from GPT-4o to Claude 3.5 Sonnet is actually better for your use case or just feels better on the three examples you tried.
At Exosolve, we build evaluation infrastructure before we build the product. This post describes our evaluation framework, which we have refined across deployments in legal, finance, and infrastructure monitoring.
The Evaluation Stack
Our evaluation framework has four layers:
Layer 1: Unit-Level Evals
The atomic unit of evaluation is a single (input, expected_output, actual_output) triple. We maintain evaluation datasets as versioned JSON files in the repository:
```json
{
  "eval_id": "contract-risk-047",
  "input": {
    "clause": "Seller shall indemnify Buyer for all losses arising from breach, without limitation as to amount or duration.",
    "task": "identify_risk_factors"
  },
  "expected": {
    "risk_factors": ["uncapped_indemnification", "unlimited_survival_period"],
    "severity": "HIGH",
    "explanation_must_contain": ["no cap", "no time limit", "survival"]
  }
}
```
Each eval is tagged with difficulty (easy, medium, hard, adversarial), category (the feature or capability being tested), and priority (critical evals that must pass vs. aspirational evals that track progress).
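These records and tags can be modeled as a small dataclass. This is a hypothetical sketch mirroring the JSON example above; the field names are assumptions, not the exact production schema:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    # Field names mirror the JSON example above (assumed schema).
    eval_id: str
    input: dict
    expected: dict
    difficulty: str = "medium"      # easy | medium | hard | adversarial
    category: str = "general"       # feature or capability under test
    priority: str = "aspirational"  # critical | aspirational

def load_eval(raw: dict) -> EvalCase:
    """Build a typed record from one JSON object in the dataset."""
    return EvalCase(**raw)

case = load_eval({
    "eval_id": "contract-risk-047",
    "input": {"task": "identify_risk_factors"},
    "expected": {"severity": "HIGH"},
    "difficulty": "hard",
    "category": "risk_analysis",
    "priority": "critical",
})
```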
We maintain ~500 evals for LEXIS AI and ~300 for HALO. These grow continuously — every bug report generates a new eval that reproduces the issue.
Layer 2: Automated Grading
Human evaluation does not scale. We use three automated grading strategies depending on the task:
Exact match / structured comparison. For classification, extraction, and structured output tasks, we compare the model's output against the expected output programmatically. Did the model identify the correct risk factors? Did it classify the clause type correctly? This is deterministic and fast.
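A minimal sketch of such a deterministic grader, assuming the record shape from the JSON example above (the function and report fields are illustrative, not the production API):

```python
def grade_structured(expected: dict, actual: dict) -> dict:
    """Deterministic grading for classification/extraction tasks."""
    # Compare risk factors as sets: order should not matter.
    expected_factors = set(expected.get("risk_factors", []))
    actual_factors = set(actual.get("risk_factors", []))
    missing = expected_factors - actual_factors
    spurious = actual_factors - expected_factors

    # Check that required phrases appear in the free-text explanation.
    explanation = actual.get("explanation", "").lower()
    missing_phrases = [
        p for p in expected.get("explanation_must_contain", [])
        if p.lower() not in explanation
    ]

    severity_match = expected.get("severity") == actual.get("severity")
    return {
        "severity_match": severity_match,
        "missing_factors": sorted(missing),
        "spurious_factors": sorted(spurious),
        "missing_phrases": missing_phrases,
        "passed": severity_match and not missing
                  and not spurious and not missing_phrases,
    }
```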
LLM-as-judge. For open-ended generation tasks (explanations, summaries, reports), we use a separate LLM call to grade the output. The judge model receives the input, expected output, and actual output, and scores on dimensions like accuracy, completeness, and relevance.
```python
import anthropic

class LLMJudge:
    def __init__(self, judge_model: str = "claude-3-5-sonnet-20241022"):
        # Async client, since grade() awaits the API call.
        self.client = anthropic.AsyncAnthropic()
        self.judge_model = judge_model

    async def grade(
        self,
        input_text: str,
        expected: str,
        actual: str,
        rubric: str,
    ) -> JudgeResult:
        response = await self.client.messages.create(
            model=self.judge_model,
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Grade the following LLM output against the expected answer.

Input: {input_text}

Expected output: {expected}

Actual output: {actual}

Rubric: {rubric}

Score each dimension 1-5 and explain your reasoning.
Return JSON: {{"accuracy": int, "completeness": int, "relevance": int, "reasoning": str}}"""
            }],
        )
        # JudgeResult (defined elsewhere) parses the JSON out of the reply.
        return JudgeResult.parse(response.content[0].text)
```
We validated our LLM judge against human ratings on 200 examples and found 89% agreement (within 1 point on a 5-point scale). Good enough for regression detection, though we still do periodic human spot-checks.
Semantic similarity. For tasks where the exact wording does not matter but the meaning must be preserved, we compute cosine similarity between embeddings of the expected and actual outputs. A similarity threshold of 0.85 has worked well as a pass/fail boundary.
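The pass/fail check itself is a few lines. Here is a dependency-free sketch; `semantic_pass` is a hypothetical name, and in practice the two vectors would come from an embedding model rather than being hand-written:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_pass(expected_emb: list[float],
                  actual_emb: list[float],
                  threshold: float = 0.85) -> bool:
    """Pass if the outputs' embeddings clear the similarity threshold."""
    return cosine_similarity(expected_emb, actual_emb) >= threshold
```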
Layer 3: Regression Testing in CI
Every pull request that touches prompts, model configuration, or pipeline logic triggers our evaluation suite. The CI pipeline:
1. Runs the full eval dataset against the changed code.
2. Compares scores against the baseline (main branch scores stored in a JSON artifact).
3. Fails the PR if any critical eval regresses or if aggregate scores drop by more than 2%.
4. Generates a report showing per-category score changes.
```yaml
# .github/workflows/llm-eval.yml (simplified)
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/ml/**'
      - 'eval/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: python eval/run_evals.py --dataset eval/datasets/ --output eval/results/
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Compare against baseline
        run: python eval/compare.py --current eval/results/ --baseline eval/baselines/main.json --threshold 0.02
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval/results/
```
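The comparison step's internals are not shown; a plausible sketch of its two failure rules, assuming both artifacts map eval IDs to scores in [0, 1] (the artifact format here is an assumption):

```python
def compare_to_baseline(current: dict, baseline: dict,
                        threshold: float = 0.02,
                        critical: frozenset = frozenset()) -> list[str]:
    """Return a list of failure reasons; empty list means the PR passes."""
    failures = []

    # Rule 1: any regression on a critical eval fails the PR outright.
    for eval_id in critical:
        if current.get(eval_id, 0.0) < baseline.get(eval_id, 0.0):
            failures.append(f"critical regression: {eval_id}")

    # Rule 2: aggregate score must not drop by more than the threshold.
    cur_avg = sum(current.values()) / len(current)
    base_avg = sum(baseline.values()) / len(baseline)
    if base_avg - cur_avg > threshold:
        failures.append(
            f"aggregate drop {base_avg - cur_avg:.3f} exceeds {threshold}")

    return failures
```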
This catches regressions before they reach production. We have caught dozens of cases where a prompt improvement for one category degraded another category. Without automated regression testing, these would have shipped.
Layer 4: A/B Testing in Production
For changes that pass CI evals, we run A/B tests in production before full rollout. Our A/B framework:
- Traffic is split at the session level (not request level) to ensure consistent user experience.
- Each variant logs inputs, outputs, latency, and user feedback (thumbs up/down, edits, overrides).
- We run experiments for 1-2 weeks or until we reach statistical significance (p < 0.05 on our primary metric).
- Primary metrics vary by product: override rate for LEXIS AI, alert accuracy for HALO, task completion rate for agentic workflows.
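For binary metrics like thumbs-up rate or override rate, significance can be checked with a standard two-proportion z-test. This is a generic textbook sketch, not Exosolve's actual analysis code:

```python
import math

def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With identical rates the test returns z near zero; a large gap between variants drives the p-value below the 0.05 bar mentioned above.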
Model Selection: The Tradeoff Triangle
Every model selection decision involves three axes: quality, latency, and cost. You can optimize for two at the expense of the third.
We benchmark candidate models on our eval dataset and plot the results:
| Model | Quality (avg score) | Latency (p50) | Cost (per 1K queries) |
|---|---|---|---|
| GPT-4o | 4.31 | 1,850ms | $14.20 |
| Claude 3.5 Sonnet | 4.38 | 1,620ms | $11.40 |
| GPT-4o-mini | 3.74 | 680ms | $1.20 |
| Claude 3.5 Haiku | 3.82 | 420ms | $0.95 |
| Llama 3.1 70B (self-hosted) | 3.91 | 950ms | $3.40 |
| Fine-tuned Llama 3.1 8B | 4.12* | 180ms | $0.40 |
*On our specific task only — general capability is much lower.
The numbers tell a clear story for our legal AI use case: Claude 3.5 Sonnet offers the best quality at reasonable cost and latency. For high-volume, lower-stakes tasks (initial document classification, metadata extraction), we route to Haiku or a fine-tuned smaller model. For the core risk analysis — where quality directly impacts client trust — we use the best available model.
Building Your Eval Dataset
The hardest part of systematic evaluation is building the dataset. Our approach:
Start with production logs. Every query to your system, and every user correction of your system's output, is a potential eval. We log all inputs and outputs (with appropriate PII handling) and regularly mine this data for new evals.
Stratify by difficulty. Easy evals (clear-cut cases, common patterns) should be ~40% of your dataset. Medium evals (ambiguous cases, less common patterns) ~35%. Hard evals (edge cases, adversarial inputs, novel patterns) ~20%. Adversarial evals (deliberately tricky inputs designed to exploit known weaknesses) ~5%.
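A small check can keep the mix from drifting as the dataset grows. This sketch assumes each eval record carries the `difficulty` tag described earlier; the tolerance value is illustrative:

```python
from collections import Counter

# Target shares from the stratification guideline above.
TARGET_MIX = {"easy": 0.40, "medium": 0.35, "hard": 0.20, "adversarial": 0.05}

def difficulty_mix(evals: list[dict]) -> dict[str, float]:
    """Observed share of each difficulty tier in the dataset."""
    counts = Counter(e["difficulty"] for e in evals)
    total = len(evals)
    return {d: counts.get(d, 0) / total for d in TARGET_MIX}

def mix_drift(evals: list[dict], tolerance: float = 0.05) -> dict[str, float]:
    """Tiers whose observed share drifts beyond tolerance from target."""
    mix = difficulty_mix(evals)
    return {d: mix[d] - t for d, t in TARGET_MIX.items()
            if abs(mix[d] - t) > tolerance}
```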
Invest in adversarial examples. The most valuable evals are the ones that expose failure modes. We dedicate time each sprint to constructing adversarial inputs: contracts with contradictory clauses, alerts with correlated but unrelated anomalies, queries designed to trigger hallucination. These evals drive the most improvement.
Version everything. Your eval dataset is as important as your code. Version it in git. Review changes to it in PRs. Treat a bad eval (one with an incorrect expected output) as seriously as a bug in production code — it will mask real issues.
The Payoff
Systematic evaluation is expensive upfront. Building our initial eval dataset for LEXIS AI took three weeks of engineer time plus two weeks of legal expert review. The CI pipeline adds ~8 minutes to every PR.
The payoff is confidence. We can tell a client exactly how our system performs on specific categories of contracts. We can quantify the impact of every change. We can switch foundation models with empirical evidence rather than gut feel. And when something does go wrong in production — which it will — we can add a regression test and guarantee it never happens again.
Vibes-based evaluation might work for a prototype. For enterprise AI in production, systematic evaluation is the difference between a product and a demo.