We've been running coding agents in production codebases for long enough now to notice something uncomfortable: the variance in agent reliability across our projects has very little to do with which model is on the other end of the API call, and almost everything to do with the quality of the AGENTS.md file sitting at the repo root.
That's a strong claim, so let's be precise about it. Swap Sonnet for a newer Sonnet on a project with a thoughtful AGENTS.md and you'll see modest gains — fewer retries, slightly better reasoning, the usual upgrade signature. But take a project with a well-tuned AGENTS.md and replace the doc with something generic, and the same model will produce visibly worse output: wrong import paths, ignored conventions, regressions in code that was working an hour ago. The doc is doing more work than the model upgrade is.
This isn't surprising once you think about what an agent actually does on each turn. It reads context, predicts the next useful action, executes, observes, repeats. Every token of context that helps it predict correctly is worth more than a marginal improvement in the underlying weights. AGENTS.md is the highest-leverage context you control.
The no-context baseline is higher than people think
Here's the part that surprised us. A modern coding agent dropped into a repo with no AGENTS.md isn't helpless. It will read the file tree, infer the stack from package.json or pyproject.toml, look at a few files, and form a reasonable working model. For a typical TypeScript monorepo or a standard Django project, this baseline is genuinely competent. It will follow the conventions it sees. It will use the test runner that's clearly configured. It will avoid obviously stupid moves.
The implication is uncomfortable for anyone writing AGENTS.md files: a bad one is worse than nothing. If your doc says "we use Jest" but the repo migrated to Vitest six months ago, the agent now has two competing signals and will sometimes pick the wrong one. If your doc enumerates five architectural rules and the codebase only honours three of them, you've taught the agent that the rules are aspirational rather than enforced. Every stale instruction is a small adversarial example pointed at your own workflow.
We've watched agents follow outdated AGENTS.md guidance over fresh evidence in the code itself, because the doc carries more apparent authority than any single source file. That's the failure mode: confident wrongness, induced by a document that was accurate eighteen months ago.
What a high-leverage AGENTS.md actually contains
The useful AGENTS.md files we maintain share a structure that took us a while to converge on. They're not long: the best ones run 150 to 400 lines, and anything longer tends to contain filler that dilutes the signal.
Commands that actually run. The single highest-value section is a block of shell commands the agent can execute without thinking: how to install, build, test a single file, test the whole suite, lint, typecheck, run the dev server, run a one-off script. These should be copy-pasteable and current. We verify them in CI — if pnpm test:unit path/to/file doesn't work from a fresh clone, the AGENTS.md is broken and the build fails. This sounds excessive until you've watched an agent burn ten minutes guessing at the right invocation.
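For concreteness, here's the shape that section might take. This is a sketch, not a real project's file: the pnpm script names and paths are illustrative, so substitute whatever your repo actually uses.

```sh
# AGENTS.md: Commands section (illustrative script names; assumes a pnpm repo)
pnpm install                      # install dependencies (Node 20, pnpm 9)
pnpm build                        # production build
pnpm test                         # full test suite
pnpm test:unit src/foo.test.ts    # single test file
pnpm lint                         # ESLint over the workspace
pnpm typecheck                    # tsc --noEmit
pnpm dev                          # dev server on localhost:3000
```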
Architectural invariants, not architectural history. The agent doesn't need to know why you chose a hexagonal architecture in 2022. It needs to know that domain code never imports from infrastructure/, that all database access goes through the repository layer, and that React components in app/ are server components by default. State the rule, point to one canonical example, move on. Invariants the linter or type system already enforces don't belong here — they're noise.
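A sketch of what that section can look like, with hypothetical paths standing in for real ones:

```markdown
## Invariants

- Code under domain/ never imports from infrastructure/. Canonical
  example of the boundary: domain/orders/order.ts.
- All database access goes through the repository layer in
  infrastructure/repositories/. No raw queries anywhere else.
- React components under app/ are server components by default;
  opt out explicitly with "use client".
```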
The shape of "done". What does a finished change look like in this repo? For us it's usually: tests pass, types check, the changeset has a brief entry, the PR description follows a template, no new ESLint disables without a comment. Agents are very good at hitting an explicit definition of done and very bad at inferring one.
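Spelled out, that definition looks roughly like the checklist below. Again a sketch; the script names and template path are placeholders.

```markdown
## Definition of done

- pnpm test and pnpm typecheck pass locally
- A changeset entry exists (pnpm changeset)
- The PR description follows .github/pull_request_template.md
- No new eslint-disable comments without an explanation
```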
Known traps. Every codebase has three or four places where the obvious thing is wrong. The migration script that must be run before tests. The mock that has to be reset between suites. The package that looks unused but is loaded dynamically. List them. This is the section where institutional knowledge lives, and it's the section that pays for itself within a week.
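The entries are always project-specific; these hypothetical ones show the register to aim for:

```markdown
## Known traps

- Run pnpm db:migrate before the integration tests; they assume the
  latest schema and fail with misleading errors otherwise.
- Reset the clock mock in tests/setup.ts between suites, or
  time-dependent tests bleed into each other.
- @acme/legacy-reports looks unused but is loaded dynamically at
  runtime. Do not remove it from package.json.
```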
Pointers, not prose. When the agent needs deep context, it should know where to look rather than reading a summary. "For the auth flow, read lib/auth/README.md and lib/auth/session.ts." Pointers stay correct longer than summaries do, because the underlying files are the source of truth and someone is already maintaining them.
What doesn't belong
We've cut more from our AGENTS.md files than we've added. Things that consistently fail to earn their tokens:
- Coding style rules already encoded in Prettier or ESLint config.
- Long explanations of why decisions were made. Save these for ADRs the agent can read on demand.
- Personality instructions ("be helpful, write clean code"). Modern agents already do this; the words just consume context budget.
- Aspirational rules. If "all functions must have JSDoc" isn't true today, don't write it. The agent will notice the contradiction and start producing inconsistent output.
- Lists of every directory and what it contains. The agent can read the file tree faster than you can describe it.
Treating AGENTS.md as code
The shift that changed our outcomes was treating AGENTS.md like a config file rather than a README. It gets reviewed in PRs. It has a CI check that runs the documented commands. When we change a build command, the same PR updates the doc. When we deprecate a pattern, we grep AGENTS.md before merging.
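The CI check doesn't need to be clever. Here's a minimal sketch, assuming the documented commands are pnpm scripts (the command list is illustrative): it asserts that each command string still appears in AGENTS.md, then runs it from a fresh checkout.

```sh
#!/usr/bin/env bash
# ci/check-agents-md.sh: minimal sketch; the command list is illustrative.
set -euo pipefail

commands=(
  "pnpm install --frozen-lockfile"
  "pnpm lint"
  "pnpm typecheck"
  "pnpm test"
)

for cmd in "${commands[@]}"; do
  # Fail if the doc has drifted and no longer mentions the command...
  grep -qF -- "$cmd" AGENTS.md || { echo "not in AGENTS.md: $cmd" >&2; exit 1; }
  # ...then fail if the documented command itself doesn't run.
  bash -c "$cmd"
done
```

If a command changes, the same PR has to touch both this script and the doc, which is exactly the coupling we want.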
We also started keeping a small agents/ directory for sub-area docs in larger repos — agents/frontend.md, agents/api.md — and referencing them from the root AGENTS.md. This scales better than a single growing file, because each doc has a clear owner and a clear scope.
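The root doc then stays small and delegates. A hypothetical excerpt, with placeholder owners and paths:

```markdown
## Sub-area docs

- Frontend (owner: web platform team): agents/frontend.md
- API and services (owner: backend team): agents/api.md

Read the relevant sub-doc before changing code in that area.
```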
The payoff is concrete and measurable. On our projects with maintained AGENTS.md files, agent task completion on first attempt sits noticeably higher than on projects without one, and dramatically higher than on projects with stale ones. The difference between a fresh AGENTS.md and a six-month-old one is, on our internal numbers, larger than the difference between model generations.
The honest takeaway
If you're getting mediocre results from coding agents, the instinct is to wait for a better model. That's the wrong move. The faster path is to spend two hours writing an AGENTS.md that reflects how your codebase actually works today, then commit to keeping it accurate. You will get more reliability improvement from that afternoon than from the next model release.
The agents are already good enough. The bottleneck has moved to the documents we write for them, and most teams haven't caught up to that yet.