
Where AI Code Review Discipline Has to Land

By John Jansen · 7 min read


The Register ran a piece recently that landed harder than usual in our internal channels: developers know the AI-generated code they're shipping is full of holes, and they ship it anyway. The survey numbers are damning enough on their own, but what makes the article worth pausing on is the implication. This isn't a tooling gap. It's not waiting on a better linter or a smarter agent. It's a question about what code review is for, now that a significant share of authorship has moved to a machine that doesn't care whether its output is correct.

We think most teams haven't actually answered that question. They've absorbed AI into their workflow without revisiting the contract review has always had with the codebase. That contract needs rewriting, and the rewrite is overdue.

What review has always owed the codebase

Before AI, code review did a small number of things well and a larger number of things badly. The things it did well: catch obvious correctness bugs that the author missed, enforce a shared sense of taste, spread knowledge across a team, and create a moment of friction that made people think twice before merging. The things it did badly: catch subtle correctness bugs, evaluate architectural fit at scale, and seriously interrogate whether the change should exist at all.

Most of the value, honestly, came from the friction. A human had written the change. They had reasons. The reviewer probed those reasons, the author defended or revised them, and the resulting artefact was better not because the review caught everything but because the author had to construct a defensible position before hitting merge. Review was a forcing function on the author's reasoning, not just a quality gate on the diff.

This is the part that breaks first when AI enters the loop. If the author didn't construct the reasoning — if the model did, and the author skimmed — then the forcing function is gone before review even starts. The reviewer is now the first person reasoning about the change. That's a category change in the work, and most teams haven't acknowledged it.

What changes when authorship is partly machine

Three things change, and they compound.

First, the prior on the diff shifts. Human-written code tends to be wrong in human ways: off-by-one errors, missed edge cases the author didn't consider, naming that made sense in the author's head. Model-written code tends to be wrong in model ways: plausible-looking calls to functions that don't exist, subtly wrong API usage that compiles, security patterns lifted from training data that were already outdated when the model learned them, and a tendency to confidently invent the part the author didn't specify. Reviewers tuned for human failure modes miss machine failure modes systematically.
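
One of the machine failure modes above, a plausible-looking call to a function that doesn't exist, is cheap to check mechanically. The sketch below is only an illustration of that idea in Python: `call_target_exists` is a hypothetical helper name of our own, and it assumes the module in question is importable in the review environment. It does nothing for the subtler cases, such as a call that exists but is used incorrectly.

```python
import importlib


def call_target_exists(module_name: str, attr_path: str) -> bool:
    """Return True if module_name exposes the dotted attribute attr_path.

    Hypothetical helper for illustration: it only confirms that a name
    resolves, not that the call is used correctly.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        # The module itself does not exist (or is not installed here).
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


# json.loads is real; json.parse looks plausible but is not part of the module.
print(call_target_exists("json", "loads"))
print(call_target_exists("json", "parse"))
print(call_target_exists("os", "path.join"))
```

A check like this belongs in automation rather than in a reviewer's head, which leaves the reviewer free to look for the failure modes a script cannot see.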

Second, the volume changes. A developer using an agent productively can produce three or four times the diff they used to. Review capacity does not scale the same way. So either review gets shallower per line, or it bottlenecks, or — and this is what The Register's data suggests — it gets waved through with a glance and a vague trust that someone, somewhere, would have caught the problem.

Third, the author's ownership of the code weakens. When you wrote it, you remember why. When the model wrote it and you skimmed, you remember what it looked like. Six months later, when a bug surfaces, nobody on the team has the context that used to live in the author's head. The cost of this shows up not at review time but in maintenance, and it shows up as a tax on every future change to that file.

What humans still own

If the easy parts of review are automatable — style, obvious bugs, common security antipatterns, basic test coverage — and the model can be made to check its own work to some degree, then the question is what's left that humans must do. We think the list is shorter than people pretend, but it is non-negotiable.

Whether the change should exist. No model has the context to evaluate this. It doesn't know what the product is trying to be, what the team agreed last quarter, what the unwritten constraints are, or which clever solution will paint the team into a corner in eighteen months. A human who has been in the room has to make this call, and they have to make it before the diff is written, not in review.

Whether the change fits the system. Architecture review, in the small. Does this introduce a new pattern? Does it conflict with how similar problems were solved elsewhere? Does it create a coupling that will be expensive to undo? Models are excellent at local coherence and poor at global coherence. This is a human job and will remain one for a while.

Whether the tests prove what they claim to prove. Generated tests are the most dangerous artefact in the AI-assisted codebase. They look like coverage. They often test that the code does what the code does, which is not the same as testing that the code does what it should do. A human has to read the assertions and ask: if the implementation were wrong in the way I'd most expect it to be wrong, would this test catch it? Models do not ask this question well.
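
To make the difference concrete, here is a minimal sketch in Python. Everything in it is a made-up illustration, not code from any real project: a leap-year function with the kind of bug we would most expect (it forgets the century rule), a test that mirrors the implementation, and a test whose expected values were written down independently from the calendar rule.

```python
from typing import Callable


def buggy_is_leap_year(year: int) -> bool:
    # Deliberately wrong for illustration: ignores the century rule,
    # so it reports 1900 as a leap year.
    return year % 4 == 0


def correct_is_leap_year(year: int) -> bool:
    # Gregorian rule: divisible by 4, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def mirror_test(impl: Callable[[int], bool]) -> bool:
    # Recomputes the "expected" value with the same logic the buggy
    # implementation uses, so it tests that the code does what the code does.
    return all(impl(year) == (year % 4 == 0) for year in (2020, 2023, 2024))


def spec_test(impl: Callable[[int], bool]) -> bool:
    # Expected values come from the calendar rule, written independently
    # of any implementation, and include the cases most likely to be wrong.
    cases = {2024: True, 2023: False, 1900: False, 2000: True}
    return all(impl(year) == expected for year, expected in cases.items())


# The mirror test cannot tell the buggy version from the correct one.
print(mirror_test(buggy_is_leap_year), mirror_test(correct_is_leap_year))
# The spec test fails on the bug and passes on the fix.
print(spec_test(buggy_is_leap_year), spec_test(correct_is_leap_year))
```

Running a test against a deliberately broken implementation is the reviewer's question turned into a procedure: if the test still passes, it was never proving anything.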

Whether the failure modes are acceptable. What happens when this call times out? When this input is null? When this runs at 100x current load? Models will cheerfully write code that handles the happy path beautifully and falls over on contact with reality. Humans have to interrogate the unhappy paths.
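
A small Python sketch of what that interrogation looks like in a diff. The function names, the `lookup` callable, and the `"unknown"` fallback are all assumptions made for illustration; whether a fallback is even the right behaviour is exactly the kind of decision the reviewer has to make.

```python
from typing import Callable, Optional


def display_name_happy_path(user_id, lookup):
    # Hypothetical happy-path-only version: fails if user_id is None,
    # if lookup times out, or if the record or its name is missing.
    return lookup(user_id)["name"].strip()


def display_name(
    user_id: Optional[str],
    lookup: Callable[[str], Optional[dict]],
) -> str:
    """Return a display name, with an explicit decision for each unhappy path."""
    if not user_id:
        # Null or empty input: decided here, not left to crash downstream.
        return "unknown"
    try:
        record = lookup(user_id)
    except TimeoutError:
        # The call timed out: degrade rather than propagate.
        return "unknown"
    if not record or not record.get("name"):
        # Missing record or missing name field.
        return "unknown"
    return record["name"].strip()


def slow_lookup(user_id: str) -> Optional[dict]:
    # Stand-in for a backend that times out.
    raise TimeoutError("lookup timed out")


print(display_name("u1", lambda uid: {"name": "  Ada  "}))
print(display_name(None, lambda uid: {"name": "Ada"}))
print(display_name("u1", slow_lookup))
print(display_name("u1", lambda uid: None))
```

The second version is longer, and every extra branch is a choice someone has to be able to defend in review.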

Everything else — formatting, naming consistency, common security checks, dependency hygiene, test scaffolding — can and should be pushed to automation, including AI-driven automation. The point is not to do less review. It's to spend the review budget on the things only humans can do.

The cultural piece

None of this works if the team's posture is that AI-generated code is someone else's problem. The Register piece is, at its core, a story about people merging code they don't believe in. That's a culture problem dressed up as a productivity story. The fix isn't a policy document. It's a shift in what the team considers professionally embarrassing.

It should be embarrassing to submit a PR you haven't read. It should be embarrassing to defend a design decision with "that's what the model suggested." It should be embarrassing to land a test you can't explain. These are not new standards — they're the old standards, applied to a new authorship model. The mistake teams are making is treating AI output as if it arrived with some kind of warranty. It didn't. The person who hit merge is the author, regardless of who typed the characters.

We've found that the teams adapting best to this are the ones who made the shift explicit: AI can write the code, but you own it the moment you put your name on the commit. The model is a junior pair who never gets better and never gets tired, and you are still the engineer.

That framing is unfashionable because it puts the burden back on people at exactly the moment the industry is selling them on leverage. But the alternative is what the survey describes: shipping code you know is wrong, because the system around you stopped asking whether you'd read it. We don't think that's a sustainable equilibrium, and we don't think the tooling will save anyone from it. The discipline has to come from the engineers, and the permission to demand it has to come from the people running the teams.
