Your Eval Sucks and Nobody Is Coming to Save You
Your eval doesn’t test what you think it tests.
You curate a dataset. You write scoring functions. You run your agent against 50 carefully selected inputs and optimize until the numbers go up. The numbers go up. You ship. It breaks in production on the 51st input.
That workflow is the pitch. Every eval framework, every “rigorous testing” blog post, every conference talk about “evaluation-driven development” sells it. And it’s broken in ways that more test cases can’t fix. Because the methodology is the problem.
I’ve been building agent harnesses for three years. I used to curate evals obsessively. I stopped. Here’s why.
You’re overfitting your prompts
The moment you optimize against an eval dataset, you’re fitting your prompts to that distribution. Not to the problem. To the dataset.
This is the same trap as overfitting a model to a training set, except it’s worse because nobody calls it overfitting. They call it “prompt engineering.” You tweak the system prompt until your 50 test cases pass. The prompt gets longer, more specific, more fragile. It works beautifully on inputs that look like your test data and falls apart on everything else.
You haven’t improved your agent. You’ve memorized your eval.
Evals don’t test what agents actually do
Here’s the thing nobody wants to say out loud. Most evals test the first message. A single input, a single output, a score.
An agent doesn’t live in single messages. An agent lives in long sequences - dozens of turns, tool calls and responses, context growing and getting compacted, decisions building on decisions. The thing that makes an agent useful is its behavior over time. The thing your eval tests is its behavior on one turn.
Multi-turn evaluation is genuinely hard. The success criteria are almost impossible to define. When did the agent “succeed”? At which turn? By whose definition? The agent’s output at turn 30 depends on every tool call, every context window compaction, every accumulated decision from turns 1 through 29. Your eval checks turn 1 and calls it a day.
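To make the gap concrete, here is a minimal sketch of what a single-turn scorer sees versus what a full session actually contains. Every name and structure here is hypothetical, purely for illustration:

```python
from dataclasses import dataclass, field

# Hypothetical structures for illustration; not from any real eval framework.

@dataclass
class Turn:
    input: str                 # user message or tool result fed to the model
    output: str                # what the model said or did in response
    tool_calls: list[str] = field(default_factory=list)

@dataclass
class Trajectory:
    turns: list[Turn]          # dozens of these in a real session
    compactions: int = 0       # how many times context got summarized away

def single_turn_score(expected: str, trajectory: Trajectory) -> bool:
    # What most evals actually measure: the first response, nothing else.
    return expected in trajectory.turns[0].output

def trajectory_score(trajectory: Trajectory) -> bool:
    # What you would need to measure: did the *session* succeed?
    # At which turn? After which compaction? By whose definition?
    # There is no obvious single metric, which is exactly the problem.
    raise NotImplementedError
```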
And the use cases. Agents today are absurdly versatile. The number of things they can do easily overwhelms any eval you can design. You test 50 scenarios. Your users find 5,000. The eval gives you confidence. The confidence is a lie.
The bitter lesson applies here too
Rich Sutton’s bitter lesson keeps being right. General methods leveraging computation beat handcrafted solutions. Every time.
Your eval-optimized prompts are handcrafted solutions. You spent weeks tuning them for today’s model. Next quarter a new model drops. Your carefully optimized prompts become crutches the new model doesn’t need - or worse, they actively fight the model’s improved capabilities. Parts of your harness too. The scaffolding you built to work around model limitations becomes dead weight when those limitations disappear.
Claude Code’s team ships updates almost every day. Not because they have a massive eval suite catching every regression. Because they dogfood it. They use it to build itself. That’s an eval no benchmark can replicate.
What actually works
Stop treating evals as your quality signal. They’re sanity checks. Regression tests. Nothing more.
What you should actually be doing:
Test your harness mechanisms. Your context management, your tool routing, your compaction strategy, your state transitions - these are deterministic. These are testable. Unit test the infrastructure, not the model’s output.
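For example, a compaction policy is plain code you can test without ever calling a model. Here is a sketch; the compact function, its policy, and the crude token count are made up for this example, standing in for whatever your harness actually does:

```python
# Sketch of a unit test for a deterministic harness mechanism: context
# compaction. None of this needs a model call.

def compact(messages: list[dict], max_tokens: int, count_tokens) -> list[dict]:
    """Drop the oldest non-system messages until the estimated total fits."""
    kept = list(messages)
    while sum(count_tokens(m["content"]) for m in kept) > max_tokens:
        for i, m in enumerate(kept):
            if m["role"] != "system":
                del kept[i]
                break
        else:
            break  # only system messages left; nothing more we can drop
    return kept

def test_compaction_preserves_system_prompt():
    fake_count = len  # crude stand-in: one "token" per character
    messages = [
        {"role": "system", "content": "x" * 50},
        {"role": "user", "content": "y" * 500},
        {"role": "assistant", "content": "z" * 500},
    ]
    compacted = compact(messages, max_tokens=600, count_tokens=fake_count)
    assert compacted[0]["role"] == "system"
    assert sum(fake_count(m["content"]) for m in compacted) <= 600
```

Same idea for tool routing and state transitions: given this input, the harness must do exactly this, no model in the loop.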
Follow context engineering principles. Reduce, offload, isolate. If your harness manages context well - keeps it lean, offloads token-heavy work to sub-agents, reduces aggressively - the model performs better regardless of the eval scores. Good tool design is worth more than good test data.
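The “offload” piece, for instance, can be as simple as never letting a huge tool result into the main context in the first place. A sketch, with hypothetical names (ARTIFACT_STORE and read_artifact are inventions for this example, not any framework’s API):

```python
import hashlib

# Sketch of the "offload" principle: store token-heavy tool output elsewhere
# and hand the model a small reference instead.

ARTIFACT_STORE: dict[str, str] = {}  # stand-in for disk or a sub-agent's scratchpad

def offload(tool_output: str, preview_chars: int = 200) -> str:
    key = hashlib.sha256(tool_output.encode()).hexdigest()[:12]
    ARTIFACT_STORE[key] = tool_output
    # The main context only ever sees this stub.
    return (
        f"[artifact {key}: {len(tool_output)} chars stored. "
        f"Preview: {tool_output[:preview_chars]!r}. "
        f"Call read_artifact({key!r}) if you need the rest.]"
    )

def read_artifact(key: str) -> str:
    return ARTIFACT_STORE[key]
```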
Dogfood relentlessly. Use your agent. Every day. On real work. The failure modes you discover at 2am trying to ship a feature are worth more than 1,000 curated test cases. The teams that ship good agents don’t have better evals. They have better feedback loops.
Keep evals for what they’re good at. Regression tests. Sanity checks. “Did we break something obvious?” That’s valuable. That’s worth maintaining. Just stop pretending it tells you whether your agent is good.
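Concretely, that kind of eval can be a handful of smoke tests wired into CI. A sketch, assuming some run_agent entry point; the cases here are placeholders:

```python
# Sketch of an eval kept in its lane: a few smoke tests that catch obvious
# breakage and claim nothing more. run_agent and the cases are placeholders.

SMOKE_CASES = [
    ("What is 2 + 2?", "4"),
    ("Rename the variable foo to bar in: x = foo + 1", "bar"),
]

def run_smoke_tests(run_agent) -> list[str]:
    """Return failures; an empty list means 'nothing obviously broke'."""
    failures = []
    for prompt, must_contain in SMOKE_CASES:
        output = run_agent(prompt)
        if must_contain not in output:
            failures.append(f"{prompt!r}: expected {must_contain!r} in the output")
    return failures
```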
The eval industry wants you to believe that rigor means more test cases, better metrics, fancier frameworks. It doesn’t. Rigor means using the thing you built and fixing what breaks.