Most of the AI coding pitch reduces to one number: how much faster. Copilot quotes percentage gains in time-to-completion. Cursor demos lean on the speed of accepting suggestions. Devin's marketing video is a stopwatch. The implicit model is that engineering output is a throughput problem, and that any tool which moves the bottleneck — typing, lookup, boilerplate — is a net good.
We think this framing is wrong, or at least incomplete enough to be misleading at the leadership level. The more useful question is whether AI tooling, used well, lets engineers be slower at the moments that matter, and faster only at the moments that don't. That sounds like a semantic dodge, but it implies a different way of structuring teams, reviewing code, and choosing what to ship.
What actually changes when an LLM is in the loop
The naive story is that the model writes code and the engineer reviews it. In practice, what changes is the distribution of cognitive effort across the day. Without an assistant, an engineer spends a lot of time in low-stakes mechanical work: stubbing out a handler, remembering the argument order of a library call, writing the third variant of a test fixture. That work is slow but cheap — it costs minutes and almost never goes catastrophically wrong.
With a good assistant, that mechanical layer collapses to seconds. What expands is the share of the day spent on the parts that remain: deciding what to build, choosing the shape of an interface, reading the diff and asking whether it's actually correct, noticing that the generated migration silently drops a constraint.
If you measure the day in keystrokes or PRs merged, this looks like acceleration. If you measure it in judgement calls per hour, it looks like the opposite — the proportion of work that requires judgement has gone up, because the easy stuff is gone. The engineer is now doing a denser, harder job.
This is the thing the throughput framing misses. The bottleneck didn't move to a different mechanical task. It moved to a cognitive task that doesn't respond to the same optimisations.
The throughput trap
When leadership measures AI adoption in PRs per week or lines merged per engineer, two things happen, and we've seen both in the wild.
First, engineers accept more suggestions than they should. The tool is fluent, the suggestion compiles, the tests pass, and the diff goes in. Six weeks later someone is reading a function nobody quite remembers writing, trying to work out why it handles nulls in a way that contradicts the rest of the codebase. The cost of that moment is rarely attributed back to the velocity metric that caused it.
Second, the codebase starts to drift in a particular way. LLMs are trained on large volumes of public code, so they tend to pull projects toward the most common patterns in that code. Without active resistance, choices specific to your system (the reason you wrote your own result type, the convention about where side effects live) get sanded down. Each individual PR looks reasonable. The aggregate is a codebase with a weaker spine.
Neither of these failure modes shows up in a throughput dashboard. They show up in onboarding time, in the rate of regressions per change, in how often senior engineers say "I don't recognise this" during review.
The judgement framing
The alternative framing is that AI assistance is most valuable when it buys engineers time and attention to make better decisions. Concretely:
- Spend the saved time on the diff. If a model generates a 200-line change in two minutes, the engineer should spend twenty minutes reading it, not two. The economic surplus from generation should be reinvested into review, not extracted as raw velocity.
- Use the model as a thinking partner before writing. The highest-leverage prompts in our work are rarely "write this function". They are "here are three ways we could model this, what breaks under each". The output is a conversation, not code.
- Treat acceptance as a deliberate act. Tab-to-accept is a UX choice that optimises for flow. Flow is good for writing prose. Its value is far less clear for code that will run in production for five years.
This is slower per commit. It is, we'd argue, faster per correct system, which is the unit that actually matters.
What this means for engineering leaders
The uncomfortable part for leaders is that you have to take a position. The tools are not neutral; they have a default tempo, and that tempo is fast. If you don't articulate a counter-position, your team will drift toward the defaults, because the defaults feel productive and are easy to demonstrate in a standup.
A few things we think are worth doing explicitly.
Stop reporting AI-related velocity gains as headline metrics. They are real, but reporting them creates the wrong incentive. If you must measure something, measure defect rates and review depth alongside throughput, and treat divergence between them as a signal worth investigating.
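As a rough sketch of what "treat divergence as a signal" could look like in practice, the snippet below compares two weeks of team-level numbers. Everything in it is an assumption for illustration: the `WeekStats` fields, the use of review minutes per PR as a proxy for review depth, and the rule that flags a week when throughput rises while depth falls or defect rate rises. Your own definitions will differ.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    """One week of team-level numbers. All fields are hypothetical."""
    prs_merged: int
    defects_traced: int      # defects later attributed to this week's PRs
    review_minutes: float    # total time reviewers spent on those PRs


def defect_rate(week: WeekStats) -> float:
    """Defects per merged PR (0.0 when nothing was merged)."""
    return week.defects_traced / week.prs_merged if week.prs_merged else 0.0


def review_depth(week: WeekStats) -> float:
    """Review minutes per merged PR, a crude proxy for review depth."""
    return week.review_minutes / week.prs_merged if week.prs_merged else 0.0


def diverging(prev: WeekStats, curr: WeekStats) -> bool:
    """True when throughput rose while review depth fell or defect rate rose."""
    throughput_up = curr.prs_merged > prev.prs_merged
    depth_down = review_depth(curr) < review_depth(prev)
    defects_up = defect_rate(curr) > defect_rate(prev)
    return throughput_up and (depth_down or defects_up)


# Made-up numbers: more PRs merged, but less review time per PR.
before = WeekStats(prs_merged=20, defects_traced=2, review_minutes=600.0)
after = WeekStats(prs_merged=35, defects_traced=5, review_minutes=525.0)
print(diverging(before, after))
```

The flag is a prompt to investigate, not a verdict. A week can trip it for benign reasons, such as a batch of trivial dependency bumps.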
Make review the prestige activity. In most teams, writing is high-status and reviewing is a chore. When generation is cheap, this inverts: the scarce skill is the ability to read a diff and notice what's wrong with it. Promote on that. Pair on that. Talk about it in retros.
Be explicit about what the model is not allowed to decide. Interface boundaries, data model changes, security-relevant code, anything touching money or identity. Not because the model can't generate plausible code in these areas — it can, fluently — but because the cost of a subtle error is asymmetric, and the engineer needs to be in slow-thinking mode when the change is made.
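One way to make such a list explicit is to write it down where tooling can read it. The sketch below is a minimal example: the path patterns and the `needs_slow_review` helper are hypothetical, and a real repository would have its own layout and would probably run a check like this in CI or a pre-merge hook.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for areas where a subtle error is costly.
SLOW_REVIEW_PATTERNS = [
    "migrations/*",   # data model changes
    "api/schema/*",   # interface boundaries
    "auth/*",         # identity and security-relevant code
    "billing/*",      # anything touching money
]


def needs_slow_review(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall inside a protected area."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in SLOW_REVIEW_PATTERNS)
    ]


flagged = needs_slow_review(
    ["auth/tokens/refresh.py", "docs/readme.md", "billing/invoice.py"]
)
print(flagged)  # the auth and billing paths are flagged; the docs path is not
```

A check like this does not stop anyone from generating code in these areas. It only marks the change as one where the author and reviewer are expected to slow down.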
Hire for taste, not typing speed. This was already true and is now obvious. The engineers who get the most out of these tools are the ones who can tell when the output is subtly wrong, which is a skill built from years of reading code carefully. That skill does not emerge from a workflow that optimises for acceptance rate.
The position we hold
For what it's worth, we think the right thing to optimise as the tooling matures is judgement, and we think this will become more true, not less, as models get better. A model that's wrong 20% of the time forces you to stay alert. A model that's wrong 2% of the time is much more dangerous, because the failure rate is low enough to lull the reviewer but high enough that the failures accumulate. The discipline of slow review matters more, not less, when the assistant is good.
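The arithmetic behind that claim can be sketched under loudly stated assumptions. The 20% and 2% error rates come from the paragraph above. The reviewer catch rates (95% when alert, 40% when lulled) and the count of 1,000 changes are invented purely for illustration and are not measurements of anything.

```python
def escaped_defects(changes: int, error_rate: float, catch_rate: float) -> float:
    """Expected number of wrong changes that get past review."""
    return changes * error_rate * (1.0 - catch_rate)


# Hypothetical scenario: the same 1,000 changes under two models.
# Catch rates are assumptions chosen to illustrate the argument.
alert = escaped_defects(changes=1000, error_rate=0.20, catch_rate=0.95)
lulled = escaped_defects(changes=1000, error_rate=0.02, catch_rate=0.40)

print(f"wrong 20% of the time, alert reviewer:  ~{alert:.0f} escaped")
print(f"wrong 2% of the time, lulled reviewer: ~{lulled:.0f} escaped")
```

Under these made-up numbers the better model lets slightly more defects through, because reviewer attention dropped faster than the error rate did. With an alert-reviewer catch rate of 95%, the lulled reviewer would need to catch at least half of the remaining errors just to break even.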
The throughput framing will probably win in the market for a while, because it's legible and it sells. We'd rather work on systems that are still coherent in three years than ones that shipped twice as fast in the first quarter. Those are different goals, and the tooling does not pick between them for you. That choice is still yours, and it is worth making on purpose.