The Vibes Problem
Most teams evaluate LLMs the same way: someone on the team types a few prompts, reads the outputs, and says "yeah, that looks good." This is vibes-based evaluation. It is how most enterprise AI projects start, and it is why many of them fail in production.
Vibes-based evaluation fails because it is not reproducible, not comprehensive, and not sensitive to regression. You cannot tell if last week's prompt change improved or degraded quality. You cannot tell if the model handles edge cases. You cannot tell if switching from GPT-4o to Claude 3.5 Sonnet is actually better for your use case or just feels better on the three examples you tried.
At Exosolve, we build evaluation infrastructure before we build the product. This post describes our evaluation framework, which we have refined across deployments in legal, finance, and infrastructure monitoring.
The Evaluation Stack
Our evaluation framework has four layers:
Layer 1: Unit-Level Evals
The atomic unit of evaluation is a single (input, expected_output, actual_output) triple. We maintain evaluation datasets as versioned JSON files in the repository:
```json
{
  "eval_id": "contract-risk-047",
  "input": {
    "clause": "Seller shall indemnify Buyer for all losses arising from breach, without limitation as to amount or duration.",
    "task": "identify_risk_factors"
  },
  "expected": {
    "risk_factors": ["uncapped_indemnification", "unlimited_survival_period"],
    "severity": "HIGH",
    "explanation_must_contain": ["no cap", "no time limit", "survival"]
  }
}
```
Each eval is tagged with difficulty (easy, medium, hard, adversarial), category (the feature or capability being tested), and priority (critical evals that must pass vs. aspirational evals that track progress).
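These records and tags can be modeled as a small dataclass. This is a hypothetical sketch mirroring the JSON example above; the field names are assumptions, not the exact production schema:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    # Field names mirror the JSON example above (assumed schema).
    eval_id: str
    input: dict
    expected: dict
    difficulty: str = "medium"      # easy | medium | hard | adversarial
    category: str = "general"       # feature or capability under test
    priority: str = "aspirational"  # critical | aspirational

def load_eval(raw: dict) -> EvalCase:
    """Build a typed record from one JSON object in the dataset."""
    return EvalCase(**raw)

case = load_eval({
    "eval_id": "contract-risk-047",
    "input": {"task": "identify_risk_factors"},
    "expected": {"severity": "HIGH"},
    "difficulty": "hard",
    "category": "risk_analysis",
    "priority": "critical",
})
```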
We maintain ~500 evals for LEXIS AI and ~300 for HALO. These grow continuously — every bug report generates a new eval that reproduces the issue.
Layer 2: Automated Grading
Human evaluation does not scale. We use three automated grading strategies depending on the task:
Exact match / structured comparison. For classification, extraction, and structured output tasks, we compare the model's output against the expected output programmatically. Did the model identify the correct risk factors? Did it classify the clause type correctly? This is deterministic and fast.
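A minimal sketch of such a deterministic grader, assuming the record shape from the JSON example above (the function and report fields are illustrative, not the production API):

```python
def grade_structured(expected: dict, actual: dict) -> dict:
    """Deterministic grading for classification/extraction tasks."""
    # Compare risk factors as sets: order should not matter.
    expected_factors = set(expected.get("risk_factors", []))
    actual_factors = set(actual.get("risk_factors", []))
    missing = expected_factors - actual_factors
    spurious = actual_factors - expected_factors

    # Check that required phrases appear in the free-text explanation.
    explanation = actual.get("explanation", "").lower()
    missing_phrases = [
        p for p in expected.get("explanation_must_contain", [])
        if p.lower() not in explanation
    ]

    severity_match = expected.get("severity") == actual.get("severity")
    return {
        "severity_match": severity_match,
        "missing_factors": sorted(missing),
        "spurious_factors": sorted(spurious),
        "missing_phrases": missing_phrases,
        "passed": severity_match and not missing
                  and not spurious and not missing_phrases,
    }
```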
LLM-as-judge. For open-ended generation tasks (explanations, summaries, reports), we use a separate LLM call to grade the output. The judge model receives the input, expected output, and actual output, and scores on dimensions like accuracy, completeness, and relevance.
```python
import anthropic

class LLMJudge:
    def __init__(self, judge_model: str = "claude-3-5-sonnet-20241022"):
        # Async client, since grade() awaits the API call.
        self.client = anthropic.AsyncAnthropic()
        self.judge_model = judge_model

    async def grade(
        self,
        input_text: str,
        expected: str,
        actual: str,
        rubric: str,
    ) -> JudgeResult:
        response = await self.client.messages.create(
            model=self.judge_model,
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Grade the following LLM output against the expected answer.

Input: {input_text}

Expected output: {expected}

Actual output: {actual}

Rubric: {rubric}

Score each dimension 1-5 and explain your reasoning.
Return JSON: {{"accuracy": int, "completeness": int, "relevance": int, "reasoning": str}}"""
            }],
        )
        # JudgeResult (defined elsewhere) parses the JSON out of the reply.
        return JudgeResult.parse(response.content[0].text)
```
We validated our LLM judge against human ratings on 200 examples and found 89% agreement (within 1 point on a 5-point scale). Good enough for regression detection, though we still do periodic human spot-checks.
Semantic similarity. For tasks where the exact wording does not matter but the meaning must be preserved, we compute cosine similarity between embeddings of the expected and actual outputs. A similarity threshold of 0.85 has worked well as a pass/fail boundary.
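The pass/fail check itself is a few lines. Here is a dependency-free sketch; `semantic_pass` is a hypothetical name, and in practice the two vectors would come from an embedding model rather than being hand-written:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_pass(expected_emb: list[float],
                  actual_emb: list[float],
                  threshold: float = 0.85) -> bool:
    """Pass if the outputs' embeddings clear the similarity threshold."""
    return cosine_similarity(expected_emb, actual_emb) >= threshold
```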
Layer 3: Regression Testing in CI
Every pull request that touches prompts, model configuration, or pipeline logic triggers our evaluation suite. The CI pipeline:
1. Runs the full eval dataset against the changed code.
2. Compares scores against the baseline (main branch scores stored in a JSON artifact).
3. Fails the PR if any critical eval regresses or if aggregate scores drop by more than 2%.
4. Generates a report showing per-category score changes.
```yaml
# .github/workflows/llm-eval.yml (simplified)
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/ml/**'
      - 'eval/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: python eval/run_evals.py --dataset eval/datasets/ --output eval/results/
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Compare against baseline
        run: python eval/compare.py --current eval/results/ --baseline eval/baselines/main.json --threshold 0.02
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval/results/
```
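The comparison step's internals are not shown; a plausible sketch of its two failure rules, assuming both artifacts map eval IDs to scores in [0, 1] (the artifact format here is an assumption):

```python
def compare_to_baseline(current: dict, baseline: dict,
                        threshold: float = 0.02,
                        critical: frozenset = frozenset()) -> list[str]:
    """Return a list of failure reasons; empty list means the PR passes."""
    failures = []

    # Rule 1: any regression on a critical eval fails the PR outright.
    for eval_id in critical:
        if current.get(eval_id, 0.0) < baseline.get(eval_id, 0.0):
            failures.append(f"critical regression: {eval_id}")

    # Rule 2: aggregate score must not drop by more than the threshold.
    cur_avg = sum(current.values()) / len(current)
    base_avg = sum(baseline.values()) / len(baseline)
    if base_avg - cur_avg > threshold:
        failures.append(
            f"aggregate drop {base_avg - cur_avg:.3f} exceeds {threshold}")

    return failures
```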
This catches regressions before they reach production. We have caught dozens of cases where a prompt improvement for one category degraded another category. Without automated regression testing, these would have shipped.
Layer 4: A/B Testing in Production
For changes that pass CI evals, we run A/B tests in production before full rollout. Our A/B framework:
- Traffic is split at the session level (not request level) to ensure consistent user experience.
- Each variant logs inputs, outputs, latency, and user feedback (thumbs up/down, edits, overrides).
- We run experiments for 1-2 weeks or until we reach statistical significance (p < 0.05 on our primary metric).
- Primary metrics vary by product: override rate for LEXIS AI, alert accuracy for HALO, task completion rate for agentic workflows.
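For binary metrics like thumbs-up rate or override rate, significance can be checked with a standard two-proportion z-test. This is a generic textbook sketch, not Exosolve's actual analysis code:

```python
import math

def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With identical rates the test returns z near zero; a large gap between variants drives the p-value below the 0.05 bar mentioned above.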
Model Selection: The Tradeoff Triangle
Every model selection decision involves three axes: quality, latency, and cost. You can optimize for two at the expense of the third.
We benchmark candidate models on our eval dataset and plot the results:
| Model | Quality (avg score) | Latency (p50) | Cost (per 1K queries) |
|---|---|---|---|
| GPT-4o | 4.31 | 1,850ms | $14.20 |
| Claude 3.5 Sonnet | 4.38 | 1,620ms | $11.40 |
| GPT-4o-mini | 3.74 | 680ms | $1.20 |
| Claude 3.5 Haiku | 3.82 | 420ms | $0.95 |
| Llama 3.1 70B (self-hosted) | 3.91 | 950ms | $3.40 |
| Fine-tuned Llama 3.1 8B | 4.12* | 180ms | $0.40 |
*On our specific task only — general capability is much lower.
The numbers tell a clear story for our legal AI use case: Claude 3.5 Sonnet offers the best quality at reasonable cost and latency. For high-volume, lower-stakes tasks (initial document classification, metadata extraction), we route to Haiku or a fine-tuned smaller model. For the core risk analysis — where quality directly impacts client trust — we use the best available model.
Building Your Eval Dataset
The hardest part of systematic evaluation is building the dataset. Our approach:
Start with production logs. Every query to your system, and every user correction of your system's output, is a potential eval. We log all inputs and outputs (with appropriate PII handling) and regularly mine this data for new evals.
Stratify by difficulty. Easy evals (clear-cut cases, common patterns) should be ~40% of your dataset. Medium evals (ambiguous cases, less common patterns) ~35%. Hard evals (edge cases, adversarial inputs, novel patterns) ~20%. Adversarial evals (deliberately tricky inputs designed to exploit known weaknesses) ~5%.
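A small check can keep the mix from drifting as the dataset grows. This sketch assumes each eval record carries the `difficulty` tag described earlier; the tolerance value is illustrative:

```python
from collections import Counter

# Target shares from the stratification guideline above.
TARGET_MIX = {"easy": 0.40, "medium": 0.35, "hard": 0.20, "adversarial": 0.05}

def difficulty_mix(evals: list[dict]) -> dict[str, float]:
    """Observed share of each difficulty tier in the dataset."""
    counts = Counter(e["difficulty"] for e in evals)
    total = len(evals)
    return {d: counts.get(d, 0) / total for d in TARGET_MIX}

def mix_drift(evals: list[dict], tolerance: float = 0.05) -> dict[str, float]:
    """Tiers whose observed share drifts beyond tolerance from target."""
    mix = difficulty_mix(evals)
    return {d: mix[d] - t for d, t in TARGET_MIX.items()
            if abs(mix[d] - t) > tolerance}
```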
Invest in adversarial examples. The most valuable evals are the ones that expose failure modes. We dedicate time each sprint to constructing adversarial inputs: contracts with contradictory clauses, alerts with correlated but unrelated anomalies, queries designed to trigger hallucination. These evals drive the most improvement.
Version everything. Your eval dataset is as important as your code. Version it in git. Review changes to it in PRs. Treat a bad eval (one with an incorrect expected output) as seriously as a bug in production code — it will mask real issues.
The Payoff
Systematic evaluation is expensive upfront. Building our initial eval dataset for LEXIS AI took three weeks of engineer time plus two weeks of legal expert review. The CI pipeline adds ~8 minutes to every PR.
The payoff is confidence. We can tell a client exactly how our system performs on specific categories of contracts. We can quantify the impact of every change. We can switch foundation models with empirical evidence rather than gut feel. And when something does go wrong in production — which it will — we can add a regression test and guarantee it never happens again.
Vibes-based evaluation might work for a prototype. For enterprise AI in production, systematic evaluation is the difference between a product and a demo.