Treating LLM evals like unit tests (sort of)
An eval suite that doesn't block merges is just a dashboard nobody opens.
The idea
Run a small, fast, deterministic eval set on every PR that touches prompts or model routing. Not the full 500-case suite — a tight 20-case canary that covers the failure modes that burned us before.
```yaml
name: llm-eval-gate
on:
  pull_request:
    paths:
      - "prompts/**"
      - "services/ai-*/**"
jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r evals/requirements.txt
      - run: python -m evals.run --suite canary --threshold 0.90
```
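The workflow only gates anything if the runner exits nonzero when the pass rate drops below the threshold. Here is a minimal sketch of what an entry point like evals.run might look like. The case format (a name plus a zero-argument check), the pass_rate helper, and the suites dictionary are assumptions for illustration, not the actual module.

```python
import argparse
from typing import Callable

# Hypothetical case format: (name, check), where check() returns True on pass.
Case = tuple[str, Callable[[], bool]]


def pass_rate(cases: list[Case]) -> float:
    """Fraction of cases whose check returns True."""
    if not cases:
        raise ValueError("empty suite: refusing to report a pass rate")
    passed = sum(1 for _name, check in cases if check())
    return passed / len(cases)


def main(argv: list[str], suites: dict[str, list[Case]]) -> int:
    """Parse the same flags the workflow passes and return a process exit code."""
    parser = argparse.ArgumentParser(prog="evals.run")
    parser.add_argument("--suite", required=True, choices=sorted(suites))
    parser.add_argument("--threshold", type=float, required=True)
    args = parser.parse_args(argv)

    rate = pass_rate(suites[args.suite])
    print(f"{args.suite}: pass rate {rate:.2f} (threshold {args.threshold:.2f})")
    # A nonzero exit code is what fails the CI step and blocks the merge.
    return 0 if rate >= args.threshold else 1
```

In a real module, the entry point would pass sys.argv[1:] and the loaded suites to main and hand the result to sys.exit.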
Threshold, not target
I keep the gate at --threshold 0.90 even when the full suite averages 0.94. The canary is there to catch regressions, not to benchmark quality. Quality tracking belongs in the nightly full run, which has its own alert.
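It helps to work out what a threshold means on a 20-case set, where each case is worth five points of pass rate. The sketch below is plain arithmetic under that assumption. The function names are made up for illustration.

```python
def gate_passes(passed: int, total: int, threshold: float) -> bool:
    """True when the observed pass rate clears the gate."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total >= threshold


def max_failures(total: int, threshold: float) -> int:
    """Largest number of failing cases that still clears the gate."""
    return max(
        failures
        for failures in range(total + 1)
        if gate_passes(total - failures, total, threshold)
    )


CANARY_SIZE = 20  # canary size from the text

print(max_failures(CANARY_SIZE, 0.90))  # 2
print(max_failures(CANARY_SIZE, 0.94))  # 1
```

At 0.90 the canary tolerates two failing cases. Matching the gate to the 0.94 full-suite average would tolerate only one, so a single noisy case would sit right at the edge of blocking a merge.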
What I'd skip
- Any eval that hits a paid provider from every PR. Route the canary through a local Ollama instance or a cached-fixtures mode.
- Any eval that's flaky. Flakiness erodes trust in the gate faster than a slow CI.
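A cached-fixtures mode can be as small as a client that replays recorded responses and refuses to do anything else. The following is a rough sketch under assumptions: ReplayClient, FixtureMiss, prompt_key, and the JSON layout (a flat map from key to recorded response) are all hypothetical, not an existing library.

```python
import hashlib
import json
from pathlib import Path


class FixtureMiss(KeyError):
    """Raised when a prompt has no recorded response."""


def prompt_key(model: str, prompt: str) -> str:
    """Stable key for a (model, prompt) pair."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


class ReplayClient:
    """Serves recorded completions from memory; never touches the network."""

    def __init__(self, responses: dict[str, str]) -> None:
        self._responses = dict(responses)

    @classmethod
    def from_file(cls, path: Path) -> "ReplayClient":
        # Assumed layout: a JSON object mapping prompt_key(...) to response text.
        return cls(json.loads(path.read_text(encoding="utf-8")))

    def complete(self, model: str, prompt: str) -> str:
        key = prompt_key(model, prompt)
        try:
            return self._responses[key]
        except KeyError:
            # Fail loudly instead of silently falling back to a paid call.
            raise FixtureMiss(f"no fixture for model={model!r}, key={key[:12]}") from None
```

Replay also helps with the flakiness point: the same prompt always yields the same recorded response, so a failing canary case points at a prompt or routing change, not at sampling noise.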
Open question
How do you version prompt changes so the eval gate's expected outputs stay honest? I'm experimenting with a prompts/vN.md scheme plus a locked fixtures file per version.
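One way the locking half of that scheme might work is to record content hashes when fixtures are generated and have the gate refuse to run against a mismatch. This is only a sketch of the idea. The lock fields and the names make_lock and stale_reasons are assumptions, and nothing here settles the versioning question itself.

```python
import hashlib


def digest(text: str) -> str:
    """SHA-256 hex digest of a text blob."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_lock(version: str, prompt_text: str, fixtures_text: str) -> dict[str, str]:
    """Record what the fixtures were generated from (hypothetical lock format)."""
    return {
        "version": version,
        "prompt_sha256": digest(prompt_text),
        "fixtures_sha256": digest(fixtures_text),
    }


def stale_reasons(lock: dict[str, str], prompt_text: str, fixtures_text: str) -> list[str]:
    """Explain why a lock no longer matches the files on disk; empty means current."""
    reasons = []
    if lock["prompt_sha256"] != digest(prompt_text):
        reasons.append("prompt edited without a new version")
    if lock["fixtures_sha256"] != digest(fixtures_text):
        reasons.append("fixtures changed after locking")
    return reasons
```

With this shape, editing prompts/vN.md in place shows up as a stale lock before any case runs, which pushes prompt changes toward a new version file with freshly recorded fixtures.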