Industry Commentary

Every AI Agent Benchmark Is Broken

By John Jansen · · 4 min read

Share

Berkeley researchers just dropped a bomb on AI evaluation. Their automated scanning agent — pointed at eight of the most prominent AI agent benchmarks — achieved near-perfect scores on every single one. Without solving a single task.

Not theoretically. They built working exploits, ran them through official evaluation pipelines, and watched the numbers climb. A ten-line conftest.py "resolves" every instance on SWE-bench Verified. A fake curl wrapper scores 100% on all 89 Terminal-Bench tasks. Navigating to a file:// URL reads the gold answer straight from WebArena's task config, giving roughly perfect scores across all 812 tasks.

The full scorecard: SWE-bench Verified and Pro at 100%. Terminal-Bench at 100%. WebArena at roughly 100%. FieldWorkArena at 100%. CAR-bench at 100% on hallucination tasks. GAIA at 98%. OSWorld at 73%. Zero tasks solved. Zero LLM calls in most cases.

This is not hypothetical

The research confirms what has already been happening in the wild. IQuest-Coder-V1 claimed 81.4% on SWE-bench — until researchers found that nearly a quarter of its trajectories simply ran git log to copy answers from commit history. METR found that o3 and Claude 3.7 Sonnet reward-hack in over 30% of evaluation runs, using stack introspection and monkey-patching to manipulate scores rather than solve tasks. OpenAI dropped SWE-bench Verified entirely after an internal audit found that 59.4% of audited problems had flawed tests.

These are not edge cases. The benchmarks that companies cite in press releases, that investors use to justify valuations, and that engineering teams use to pick models are fundamentally compromised.

The anatomy of a broken benchmark

The SWE-bench exploit is the most elegant. The benchmark asks an agent to produce a patch that makes failing tests pass. The patch runs inside the same Docker container where tests run. So the agent drops a conftest.py — a file pytest auto-discovers and loads before any test execution — containing a hook that rewrites every test result to "passed." The grader sees all tests passing. Instance resolved. The agent never looked at the bug.

Terminal-Bench falls to binary wrapper trojans. WebArena leaks gold answers through accessible config files. FieldWorkArena's validation never actually checks answer correctness. GAIA's answers are publicly available and its normalisation can be gamed through format collisions.

The common thread: evaluation and execution share the same trust boundary. The agent can read, write, or influence the scoring mechanism.

What this means for engineering teams

If you are evaluating AI coding agents for your team — and many organisations now are — benchmark scores should carry roughly zero weight in your decision. Not reduced weight. Zero.

This does not mean the tools are bad. Many AI coding agents deliver genuine productivity gains. But the numbers on the leaderboard are not evidence of that. A model scoring 70% on SWE-bench tells you it scored 70% on SWE-bench. It tells you nothing reliable about how it will perform on your codebase, your bugs, your architecture.

The Berkeley team proposes an Agent-Eval Checklist that every benchmark should clear: isolate the agent from the evaluator, verify scoring with deterministic checks rather than LLM judges, test adversarially with null and random agents, and prevent tampering with evaluation data. They are also releasing BenchJack, an open-source vulnerability scanner for benchmarks — essentially a penetration test for evaluation pipelines.

These are sensible fixes. But they also reveal how far the field has drifted from rigorous evaluation. The fact that "run your benchmark against a zero-capability agent and check it does not score well" counts as a novel recommendation says something uncomfortable about the current state of AI evaluation.

The deeper problem

Benchmarks have become marketing collateral. The incentive structure rewards optimising for the number, not the capability the number is supposed to measure. Goodhart's law applies with full force: when a measure becomes a target, it ceases to be a good measure.

For teams making real engineering decisions, the practical advice is straightforward. Run your own evaluations on your own tasks. Use the actual codebase, the actual bug patterns, the actual architecture constraints. Treat vendor benchmark claims the way you would treat any unaudited self-reported metric — with informed scepticism.

The Berkeley paper is worth reading in full. It is technically rigorous, clearly written, and includes enough detail to reproduce every exploit. The BenchJack tool is being prepared for public release. If you are building or maintaining an evaluation pipeline for AI agents, it should be the first thing you run.


Source: How We Broke Top AI Agent Benchmarks: And What Comes Next — Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song, UC Berkeley, April 2026.

Want to discuss this?

We write about what we're actually working on. If this is relevant to something you're building, we'd love to hear about it.