Treating LLM evals like unit tests (sort of)

An eval suite that doesn't block merges is just a dashboard nobody opens.

The idea

Run a small, fast, deterministic eval set on every PR that touches prompts or model routing. Not the full 500-case suite, but a tight 20-case canary covering the failure modes that have burned us before.

name: llm-eval-gate
on:
  pull_request:
    paths:
      - "prompts/**"
      - "services/ai-*/**"

jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r evals/requirements.txt
      - run: python -m evals.run --suite canary --threshold 0.90
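For the workflow to block a merge, the last step has to exit non-zero when the canary score falls below the threshold. The internals of evals.run aren't shown here, so this is only a minimal sketch of what the gate logic might look like. The names pass_rate, gate_exit_code, and the run_suite callable are assumptions for illustration, not the real module.

```python
import argparse


def pass_rate(results):
    """Fraction of cases that passed. An empty suite scores 0.0 so it cannot pass silently."""
    if not results:
        return 0.0
    return sum(1 for ok in results if ok) / len(results)


def gate_exit_code(results, threshold):
    """Return 0 (CI passes) when the pass rate meets the threshold, else 1 (CI fails)."""
    return 0 if pass_rate(results) >= threshold else 1


def parse_args(argv):
    # Mirrors the two flags used in the workflow: --suite and --threshold.
    parser = argparse.ArgumentParser(prog="evals.run")
    parser.add_argument("--suite", required=True)
    parser.add_argument("--threshold", type=float, default=0.90)
    return parser.parse_args(argv)


def main(argv, run_suite):
    """run_suite is a hypothetical callable: suite name -> list of per-case booleans."""
    args = parse_args(argv)
    results = run_suite(args.suite)
    print(f"{args.suite}: {pass_rate(results):.2f} (threshold {args.threshold:.2f})")
    return gate_exit_code(results, args.threshold)
```

In a real entry point the return value would be passed to sys.exit, since GitHub Actions fails a run step on any non-zero exit code.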

Threshold, not target

I keep the gate at --threshold 0.90 even when the full suite averages 0.94. The canary exists to catch regressions, not to benchmark quality. Quality tracking belongs in the nightly full run, which has its own alert.
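One arithmetic point that may help when picking the number: on a 20-case canary, each case moves the score by 0.05, so a gate can only ever see values like 0.90 or 0.95. The sketch below assumes pass/fail scoring per case (an assumption; graded scores would behave differently) and uses hypothetical helper names.

```python
import math
from fractions import Fraction


def score_step(total_cases):
    """Smallest possible change in pass rate: one case flipping."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return Fraction(1, total_cases)


def max_failures(total_cases, threshold):
    """Largest number of failing cases that still satisfies pass_rate >= threshold.

    threshold is passed as a string (e.g. "0.90") so Fraction keeps it exact.
    """
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    limit = Fraction(threshold)
    if not 0 <= limit <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return math.floor(total_cases * (1 - limit))
```

With these definitions, max_failures(20, "0.90") is 2 and max_failures(20, "0.94") is 1. Setting the canary gate to the full suite's 0.94 average would fail the PR on a second bad case, which is closer to a quality target than a regression tripwire.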

What I'd skip

Any eval that hits a paid provider on every PR. Route the canary through a local Ollama instance or a cached-fixtures mode.
Any eval that's flaky. Flakiness erodes trust in the gate faster than slow CI does.
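A cached-fixtures mode can be sketched as a client that replays recorded responses and refuses to do anything else. This is one possible shape, not a description of an existing tool: CachedClient, FixtureMiss, fixture_key, and the JSON layout (a flat map from key to response text) are all hypothetical.

```python
import hashlib
import json


def fixture_key(model, prompt):
    """Stable key for a (model, prompt) pair, independent of dict ordering."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FixtureMiss(KeyError):
    """Raised when a prompt has no recorded response."""


class CachedClient:
    """Replays recorded responses. It never calls a provider, paid or otherwise."""

    def __init__(self, fixtures):
        self._fixtures = dict(fixtures)

    @classmethod
    def from_file(cls, path):
        # Assumed layout: {"<sha256 key>": "<recorded response text>", ...}
        with open(path, encoding="utf-8") as fh:
            return cls(json.load(fh))

    def complete(self, model, prompt):
        key = fixture_key(model, prompt)
        try:
            return self._fixtures[key]
        except KeyError:
            # Fail loudly instead of falling back to a live call.
            raise FixtureMiss(f"no fixture recorded for model={model!r}") from None
```

Replaying fixed responses also removes sampling noise, which covers the flakiness point: the same PR produces the same score on every run. The trade-off is that an edited prompt misses the cache and has to be re-recorded before the gate can pass.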

Open question

How do you version prompt changes so the eval gate's expected outputs stay honest? I'm experimenting with a prompts/vN.md scheme plus a locked fixtures file per version.
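One way the "locked fixtures per version" idea could be checked mechanically is to store a hash of the prompt text inside each fixtures file and fail the gate when they drift apart. This is a rough sketch under assumptions: the fixtures/vN.lock.json naming and the prompt_sha256 key are made up for illustration, and it only detects stale fixtures rather than deciding what the new expected outputs should be.

```python
import hashlib
import json
from pathlib import Path


def prompt_digest(text):
    """SHA-256 of the prompt text, used as the lock value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_lock(prompt_text, lock):
    """True when the lock was recorded against exactly this prompt text.

    lock is a dict with a hypothetical 'prompt_sha256' key.
    """
    return lock.get("prompt_sha256") == prompt_digest(prompt_text)


def stale_versions(prompts_dir, fixtures_dir):
    """Return the prompt versions whose lock file is missing or out of date."""
    stale = []
    for prompt_path in sorted(Path(prompts_dir).glob("v*.md")):
        # Assumed naming: prompts/v3.md pairs with fixtures/v3.lock.json
        lock_path = Path(fixtures_dir) / f"{prompt_path.stem}.lock.json"
        if not lock_path.exists():
            stale.append(prompt_path.stem)
            continue
        lock = json.loads(lock_path.read_text(encoding="utf-8"))
        if not check_lock(prompt_path.read_text(encoding="utf-8"), lock):
            stale.append(prompt_path.stem)
    return stale
```

A gate could refuse to run while stale_versions returns anything, which would force a prompt edit and its re-recorded fixtures to land in the same PR.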