Reading 12-Factor Agents Against the Reality of Shipping LLM Software

The 12-factor-agents project has been circulating in the corners of the industry that actually run LLM systems in production. It is a useful artefact — a serious attempt to codify what people are learning the hard way about shipping software where the core dependency is a probabilistic model. We have been building agent-shaped systems for clients for long enough to have opinions, and the document is worth engaging with seriously rather than nodding along to.

The original 12-factor app from Heroku worked because it captured a real shift: from snowflake servers to disposable processes, from config-in-code to config-in-env, from stateful boxes to horizontal scale. It was a Rosetta stone for a generation moving to the cloud. The question worth asking is whether 12-factor-agents does the same job for LLM software, or whether it is a useful checklist that occasionally dresses up familiar advice in new clothes.

What is genuinely new

Three of the factors carry weight that has no real precedent in web-ops folklore.

Own your prompts. This sounds obvious until you watch a team try to operate a system built on LangChain abstractions where the actual string sent to the model is buried four layers deep in framework code. The factor is really saying: the prompt is the source code of your agent. Treating it as an opaque artefact handed to you by a library is the equivalent of compiling C without being allowed to read the source. We have watched teams hit a quality ceiling they could not break through until they ripped out the abstraction and wrote the prompts as first-class assets in their repo, versioned, diffed, reviewed.
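To make "prompts as first-class assets" concrete, here is a minimal sketch of a prompt kept in the repo as a plain string with one small render function. The prompt text, the name TRIAGE_PROMPT_V2, and the triage use case are all invented for illustration; the point is only that the exact string sent to the model is visible, diffable, and testable.

```python
from string import Template

# Hypothetical prompt, stored in the repo as a reviewable string.
# Bump the version suffix when the wording changes so diffs and evals can refer to it.
TRIAGE_PROMPT_V2 = Template(
    "You are a support triage assistant.\n"
    "Classify the ticket below as one of: $labels.\n"
    "Reply with the label only.\n\n"
    "Ticket:\n$ticket\n"
)


def render_triage_prompt(ticket: str, labels: list[str]) -> str:
    """Return the exact string that will be sent to the model."""
    return TRIAGE_PROMPT_V2.substitute(
        labels=", ".join(labels),
        ticket=ticket.strip(),
    )
```

Because the render step is an ordinary function, a prompt change shows up in code review like any other change, and a unit test can pin the rendered output.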

Own your context window. This is the factor that most deserves its own essay. The context window is the working memory of the system, and how you decide what goes into it — message history, retrieved documents, tool results, system instructions, prior agent state — is the single most consequential design decision in an LLM application. The frameworks that auto-assemble context for you are making product decisions on your behalf, often badly. The factor reframes context as something you construct deliberately, the way you would construct a database query, rather than something that accretes.
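One way to picture "constructing context like a query" is a function that selects items against an explicit token budget. The sketch below is an assumption-laden illustration, not a recommended policy: ContextItem, the priority scheme, and the four-characters-per-token estimate are all ours, and a real system would use the model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    kind: str      # e.g. "system", "history", "retrieved", "tool_result"
    text: str
    priority: int  # lower number = more important


def estimate_tokens(text: str) -> int:
    # Crude assumption: about four characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def build_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep the most important items that fit the budget, in their original order."""
    chosen: set[int] = set()
    used = 0
    # Visit items from most to least important; skip any that would overflow the budget.
    for idx in sorted(range(len(items)), key=lambda i: items[i].priority):
        cost = estimate_tokens(items[idx].text)
        if used + cost <= budget:
            chosen.add(idx)
            used += cost
    return [item for i, item in enumerate(items) if i in chosen]
```

The design choice worth noticing is that dropping something from the window is now an explicit, reviewable decision rather than a side effect of whatever a framework happened to append.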

Small, focused agents. The factor pushes against the maximalist agent: the one that plans, reasons, calls tools, reflects, and replans across a long horizon. The honest observation is that long-horizon autonomous agents do not work reliably yet, and probably will not for a while. The reliable pattern is a graph of small agents, each with a bounded job, composed by deterministic code. This is not new conceptually (it is basically the Unix philosophy), but stating it plainly is valuable because the marketing pressure points the other way.
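A minimal sketch of that pattern, under invented assumptions: two small model-backed functions with bounded jobs, and ordinary code deciding the route between them. The ticket-handling scenario, the label set, and the Model type alias are hypothetical; any function from prompt string to completion string can stand in for the model.

```python
from typing import Callable

# Assumption: a "model" is any function from prompt string to completion string.
Model = Callable[[str], str]

ALLOWED_LABELS = {"billing", "bug", "other"}


def classify(ticket: str, model: Model) -> str:
    """Bounded job 1: pick a label. Anything unexpected falls back to 'other'."""
    raw = model(f"Label this ticket as billing, bug, or other:\n{ticket}")
    label = raw.strip().lower()
    return label if label in ALLOWED_LABELS else "other"


def draft_reply(ticket: str, label: str, model: Model) -> str:
    """Bounded job 2: write a reply for an already-classified ticket."""
    return model(f"Write a short reply to this {label} ticket:\n{ticket}").strip()


def handle_ticket(ticket: str, model: Model) -> dict:
    """Deterministic composition: plain code, not the model, decides the route."""
    label = classify(ticket, model)
    if label == "other":
        return {"label": label, "action": "escalate_to_human", "reply": None}
    reply = draft_reply(ticket, label, model)
    return {"label": label, "action": "send_reply", "reply": reply}
```

Each function can be tested with a fake model, and a malformed completion from the classifier degrades to escalation instead of propagating.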

What restates web-ops folklore

Several factors are good advice that would have been good advice in 2014.

Stateless reducer, launch/pause/resume, trigger from anywhere, unify execution and business state — these are, with the serial numbers filed off, the same lessons about durable execution, idempotency, and event-driven architecture that the workflow-engine community (Temporal, Airflow, Step Functions) has been preaching for a decade. The 12-factor-agents framing is not wrong; it is just that anyone who has shipped a non-trivial async system already knows this. The contribution is mostly translation: telling the LLM-curious crowd that the patterns from durable workflow engines apply directly, and that an agent is, structurally, a workflow whose next step is decided by a model rather than a DAG.
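The "workflow whose next step is decided by a model" idea can be sketched as a pure reducer over serialisable state. Everything named here (AgentState, the event dictionaries, the ask_human and finish action types) is our own illustrative assumption, not an API from the 12-factor-agents project or from any workflow engine.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class AgentState:
    events: tuple[dict, ...] = ()
    status: str = "running"  # "running" | "paused" | "done"


# Assumption: the decision function sees the event log and returns the next action.
Decide = Callable[[tuple[dict, ...]], dict]


def step(state: AgentState, decide: Decide) -> AgentState:
    """Pure reducer: old state in, new state out, no hidden memory between calls."""
    if state.status != "running":
        return state  # paused or finished runs are left untouched
    action = decide(state.events)  # in a real system this is the model call
    events = state.events + (action,)
    if action["type"] == "ask_human":
        return replace(state, events=events, status="paused")
    if action["type"] == "finish":
        return replace(state, events=events, status="done")
    return replace(state, events=events)


def resume(state: AgentState, human_reply: str) -> AgentState:
    """Resuming is just appending an event and flipping the status back."""
    event = {"type": "human_reply", "text": human_reply}
    return replace(state, events=state.events + (event,), status="running")
```

Since the whole run is the event tuple plus a status, pausing amounts to persisting the state somewhere durable and resuming amounts to loading it and calling step again, which is the same shape a durable workflow engine gives you.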

"Compact errors into context window" is a nice phrasing of a specific tactic, but it sits inside the broader and older ideas of structured error propagation and bounded retries. "Contact humans with tool calls" and "meet users where they are" are product principles, not engineering ones: they are correct, and they are also what any reasonable team would conclude after a week of user testing.
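As a rough illustration of the error-compaction tactic, the sketch below reduces an exception to one short line, appends it to the context, and counts consecutive failures so that plain code can decide when to stop. The function names and the "tool_error:" prefix are hypothetical conventions chosen for this example.

```python
def compact_error(exc: Exception, max_chars: int = 200) -> str:
    """Reduce an exception to one short line the model can act on."""
    message = " ".join(str(exc).split())  # collapse newlines and repeated whitespace
    if len(message) > max_chars:
        message = message[: max_chars - 3] + "..."
    return f"{type(exc).__name__}: {message}"


def record_tool_result(context: list[str], tool, args: dict) -> bool:
    """Call the tool and append either its result or a compact error. True on success."""
    try:
        context.append(f"tool_result: {tool(**args)}")
        return True
    except Exception as exc:
        context.append(f"tool_error: {compact_error(exc)}")
        return False


def consecutive_errors(context: list[str]) -> int:
    """Count trailing error entries; a caller can escalate once this passes a limit."""
    count = 0
    for entry in reversed(context):
        if not entry.startswith("tool_error:"):
            break
        count += 1
    return count
```

The model sees a one-line error instead of a stack trace, and the escalation rule lives in code where it can be tested.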

None of this is a criticism. Folklore consolidation is genuinely useful. But it is worth being clear-eyed that maybe half the document is good engineering advice that happens to be applied to agents, rather than insights that only emerge from working with LLMs.

Where the framing hides the hard problems

This is the part worth dwelling on, because a checklist of principles can give a false sense that the problems are solved once the principles are followed.

Evaluation is barely addressed. You can follow every factor and still have a system whose behaviour you cannot measure. Eval is the unsolved problem of LLM software. How do you know your prompt change did not regress a tail of cases you never see? How do you build the equivalent of a test suite for a non-deterministic system? The factors gesture at this — owning your prompts makes eval possible — but the actual work of building eval harnesses, golden datasets, LLM-as-judge pipelines, and regression gates is where most of the engineering effort goes once a system is past the demo stage. A document about production LLM software that does not centre evaluation is incomplete.
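To give the idea of a regression gate some shape, here is a deliberately simple sketch: score a candidate and a baseline against a golden dataset and block the change if the candidate is worse. Exact string matching is an assumption made to keep the example small; in practice the comparison is usually the hard part and might be a rubric, a structured-output check, or an LLM-as-judge call.

```python
from typing import Callable

# Assumption: a predictor maps an input string to the system's output string.
Predictor = Callable[[str], str]
GoldenSet = list[tuple[str, str]]  # (input, expected output) pairs


def pass_rate(predict: Predictor, golden: GoldenSet) -> float:
    """Fraction of golden cases where the output matches the expected answer exactly."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for prompt, expected in golden if predict(prompt).strip() == expected)
    return hits / len(golden)


def regression_gate(
    candidate: Predictor,
    baseline: Predictor,
    golden: GoldenSet,
    tolerance: float = 0.0,
) -> bool:
    """Allow the change only if the candidate is no worse than baseline minus tolerance."""
    return pass_rate(candidate, golden) >= pass_rate(baseline, golden) - tolerance
```

Run in CI, a gate like this turns "did the prompt change regress anything?" into a failing build, though it only covers the cases someone thought to put in the golden set.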

Cost and latency as design constraints. None of the factors addresses the fact that you pay per token and wait per token. The choice between one large prompt and a graph of smaller calls is not purely an engineering-cleanliness question; it is a cost-latency-quality trilemma that drives architecture. "Small, focused agents" sounds clean until you realise each hop is a round-trip to a model that might take two seconds and cost a cent.
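The arithmetic is worth writing down once. The sketch below totals latency and cost for a graph of model calls, assuming stages run one after another and hops within a stage run in parallel. The per-hop figures in the usage note reuse the illustrative two seconds and one cent from the paragraph above; they are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    latency_s: float   # wall-clock seconds for one model round-trip
    cost_cents: float  # price of that round-trip


def graph_totals(stages: list[list[Hop]]) -> tuple[float, float]:
    """Return (latency_s, cost_cents) for sequential stages of parallel hops.

    Assumes every stage contains at least one hop.
    """
    # A stage finishes when its slowest hop finishes.
    latency = sum(max(hop.latency_s for hop in stage) for stage in stages)
    # Cost is paid for every hop regardless of parallelism.
    cost = sum(hop.cost_cents for stage in stages for hop in stage)
    return latency, cost
```

With a hop of Hop(2.0, 1.0), five hops in a strict chain come to ten seconds and five cents. Restructuring the same five hops as one, then three in parallel, then one, brings latency down to six seconds while the cost stays at five cents. Parallelism buys latency, never cost.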

Model portability is treated as a side issue. In practice, the ability to swap models (for cost, for latency, for capability, or because of a provider outage) shapes how you write prompts, how you structure tool calls, and how you handle structured output. A system tightly coupled to one model's quirks is the LLM equivalent of SQL that only runs on one vendor's database. This deserves to be a factor in its own right.
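One common way to keep that option open is a narrow interface that application code depends on, with one adapter per provider behind it. The sketch below is a hypothetical shape, not any vendor's SDK: ChatModel, EchoModel, and FallbackModel are names we made up, and a real adapter would also have to normalise tool-call and structured-output formats.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, system: str, user: str) -> str: ...


class EchoModel:
    """Stand-in provider for tests; a real adapter would wrap a vendor SDK here."""

    def complete(self, system: str, user: str) -> str:
        return user.upper()


class FallbackModel:
    """Try the primary provider and fall back to the secondary if it raises."""

    def __init__(self, primary: ChatModel, secondary: ChatModel) -> None:
        self.primary = primary
        self.secondary = secondary

    def complete(self, system: str, user: str) -> str:
        try:
            return self.primary.complete(system, user)
        except Exception:
            # Broad catch is a simplification; real code would match provider errors.
            return self.secondary.complete(system, user)
```

The interface is the easy part. The portability cost mostly shows up in prompts and evals, since a prompt tuned to one model's habits needs re-testing on the other.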

The agent-vs-pipeline question is dodged. The factors apply equally to systems where the LLM is making routing decisions and systems where it is just transforming text inside a deterministic pipeline. These are very different beasts operationally, and the document does not help you decide which you are building or which you should be building. Our experience: most things that get called agents would be better built as pipelines with a model in one or two specific nodes.
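The operational difference is easier to see side by side. In the sketch below, run_pipeline fixes the step order in code and uses the model in exactly one node, while run_agent lets the model choose the next tool on every turn. Both functions, the prompt wording, and the tool dictionary are invented for illustration.

```python
from typing import Callable

# Assumption: a "model" is any function from prompt string to completion string.
Model = Callable[[str], str]
Tool = Callable[[str], str]


def run_pipeline(text: str, model: Model) -> str:
    """Pipeline: step order is fixed in code; the model only transforms text."""
    cleaned = " ".join(text.split())          # deterministic pre-processing
    summary = model(f"Summarise: {cleaned}")  # the single model node
    return summary.strip()[:280]              # deterministic post-processing


def run_agent(goal: str, model: Model, tools: dict[str, Tool], max_steps: int = 5) -> str:
    """Agent: the model picks the next tool each turn; code only enforces the cap."""
    notes = goal
    for _ in range(max_steps):
        choice = model(f"Notes: {notes}\nPick one of {sorted(tools)} or 'done'.").strip()
        if choice == "done" or choice not in tools:
            break  # stop on completion or on an unrecognised choice
        notes = tools[choice](notes)
    return notes
```

The pipeline has one possible execution path, so its cost, latency, and failure modes can be enumerated. The agent has as many paths as the model cares to take within max_steps, which is exactly what makes it harder to evaluate.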

How to use the document

Read it. Take it seriously. Then treat it as a starting set of principles rather than a finished theory. The genuinely novel factors — own your prompts, own your context, keep agents small — are worth tattooing on the wall. The factors that restate workflow-engine wisdom are worth following, and also worth recognising as the well-trodden ground they are, which means you can lean on existing tools rather than reinventing them.

The gaps — evaluation, cost-latency, portability, the pipeline-vs-agent question — are where the actual engineering judgment lives. A checklist cannot give you that judgment. It can only make sure you are not skipping the parts that someone has already figured out.

The document's real contribution may be sociological rather than technical: it gives teams a shared vocabulary for arguing about LLM architecture, which is more than the field had six months ago. That alone earns it a place on the shelf, next to the original 12-factor app, with the understanding that both are products of their moment and neither is the last word.