Industry Commentary

What 1000 Tokens per Second Unlocks in Agent Design

By John Jansen · 7 min read


MiMo-v2.5 is trending this week with a trillion-parameter model serving at roughly 1000 tokens per second, and that is the kind of number that quietly resets defaults. Not because anyone needs a chatbot to talk faster (humans read at around 5 tokens per second), but because most of our agent architectures are built around hiding latency, and the hiding stops being necessary somewhere around this regime.

We want to walk through what actually changes when the latency floor drops by more than an order of magnitude, and which of the patterns we've all been shipping turn out to be cost optimisations wearing latency costumes.

The two things we were optimising for, conflated

Every agent system we've built in the last two years has been making tradeoffs against two constraints that look similar from the outside: cost per token and time per token. They are not the same thing, and the architectural responses to each are different, but because frontier models had both problems simultaneously, the patterns blurred.

Consider the standard playbook:

  • Route easy queries to small models, hard ones to large models
  • Cache aggressively, including semantic caches over embeddings
  • Use structured output and constrained decoding to avoid retries
  • Run tool calls in parallel where possible
  • Pre-compute plans and summaries so the hot path is short
  • Stream tokens to the user so perceived latency drops

Three of those are cost optimisations (routing, caching, constrained decoding reducing retries). Three are latency optimisations (parallel tools, pre-computation, streaming UX). Most teams treat them as a single bag of "agent best practices" because the same models forced both problems on you at once.

At 1000 tok/s on a 1T model, the latency optimisations stop earning their complexity. The cost optimisations do not.

What streaming was actually buying you

Streaming token output to the UI was, for most products, an admission that the model could not finish its answer fast enough to feel responsive. A 500-token response at 50 tok/s is 10 seconds — unusable as a blocking call, perfectly fine as a stream. The streaming abstraction leaked everywhere: into the SDK, into the frontend state machine, into how you handled tool calls mid-generation, into how you logged and evaluated outputs.

At 1000 tok/s, that same 500-token response completes in about 500 ms of generation time. That is in the range of a slow database query or a third-party API call, the kind of latency you already absorb inside a blocking request. You can treat the LLM call as a synchronous function again: return a complete object, validate it, branch on it, do another call. The whole streaming machinery becomes a niche pattern for genuinely long outputs (long-form writing, code generation over thousands of lines) rather than the default.
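A minimal sketch of what "synchronous function again" can look like in practice. Everything here is illustrative: `complete` is a hypothetical blocking callable that takes a prompt and returns the full response text, and the ticket-classification task and label set are assumptions chosen only to show the call, validate, branch shape.

```python
import json
from typing import Callable

# Hypothetical signature: takes a prompt, blocks, returns the full response text.
Complete = Callable[[str], str]

ALLOWED_LABELS = {"billing", "bug", "other"}


def classify_ticket(complete: Complete, ticket: str) -> str:
    """Blocking call, then parse, validate, and branch, with one corrective retry."""
    prompt = (
        "Classify this support ticket as billing, bug, or other. "
        'Reply with JSON like {"label": "bug"}.\n'
        f"Ticket: {ticket}"
    )
    for _ in range(2):  # first attempt plus one retry
        raw = complete(prompt)  # returns a complete object, no stream handling
        try:
            label = json.loads(raw).get("label")
        except (json.JSONDecodeError, AttributeError):
            label = None  # not JSON, or JSON that is not an object
        if isinstance(label, str) and label in ALLOWED_LABELS:
            return label
        # Branch on the validation failure: ask again with a correction.
        prompt += "\nYour last reply was invalid. Use one of: billing, bug, other."
    return "other"  # fallback when both attempts fail validation
```

The retry is only tolerable as a blocking pattern because each attempt is assumed to take well under a second; at the older speeds the same code would have needed streaming or a background job.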

We think the biggest practical consequence here is that agent loops can get deeper without users noticing. A five-step reasoning chain of 500-token steps at 50 tok/s is a 50-second wait. At 1000 tok/s it is two and a half seconds. The space of "things you can do inside a single user-perceived interaction" expands by 20x. That is not just a quantitative improvement but a qualitative one: it changes which problems you would even attempt to solve with an agent.
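The arithmetic behind these figures is simple enough to keep next to your own numbers. The sketch below counts decode time only and assumes 500 output tokens per step, which is what the 50-second figure implies; it ignores prefill, network round trips, and tool latency, all of which would add to the totals.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Decode time only: ignores prefill, network round trips, and tool latency."""
    return output_tokens / tokens_per_second


def chain_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Sequential chain: each step must finish before the next one starts."""
    return steps * generation_seconds(tokens_per_step, tokens_per_second)


if __name__ == "__main__":
    for rate in (50, 1000):
        single = generation_seconds(500, rate)
        chain = chain_seconds(5, 500, rate)
        print(f"{rate:>5} tok/s: one response {single:.1f}s, five-step chain {chain:.1f}s")
```

Plugging in the two rates reproduces the numbers in the text: 10 s and 50 s at 50 tok/s, 0.5 s and 2.5 s at 1000 tok/s.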

Deeper loops, not wider ones

The instinct when speed gets cheaper is to parallelise more — run more tool calls concurrently, fan out more sub-agents, do more speculative work. We would push back on that instinct.

Widening agent loops has always been a workaround for depth being too slow. You parallelise three searches because you cannot wait to do one search, read the result, decide the next search, and repeat. Width is a hedge against the latency of sequential decision-making. When sequential decision-making gets fast, the hedge becomes overhead: you are paying for three searches when one well-chosen second search would have been better.

Deep, narrow loops with frequent re-planning are what 1000 tok/s actually unlocks. The model can look at a partial result, decide it was the wrong path, back up, try again, and still respond within a latency budget the user will tolerate. This is closer to how a person thinks: not a parallel fan-out of speculative branches, but a tight loop of try-observe-revise.

The architectural implication is that the orchestration layer should get simpler, not more complex. Less DAG, more while-loop. Less LangGraph, more for-statement.
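As a rough illustration of "less DAG, more while-loop", here is a sketch of a deep, narrow loop with a wall-clock budget. The `step` callable is hypothetical: we assume it wraps one model call plus whichever tool the model chose, sees the full history so it can re-plan, and reports whether it is finished. The 3-second budget and 20-step cap are placeholder values, not recommendations.

```python
import time
from typing import Callable

# Hypothetical step function: given the history so far, returns
# (observation, done). A real one would wrap a model call and a tool call.
Step = Callable[[list[str]], tuple[str, bool]]


def run_agent(step: Step, budget_seconds: float = 3.0, max_steps: int = 20) -> list[str]:
    """Deep, narrow loop: one action at a time, re-planning on every result."""
    history: list[str] = []
    deadline = time.monotonic() + budget_seconds
    # Stop on whichever comes first: the step cap, the time budget, or `done`.
    while len(history) < max_steps and time.monotonic() < deadline:
        observation, done = step(history)  # the step sees everything so far
        history.append(observation)
        if done:
            break
    return history
```

The deadline is only checked between steps, so a single slow step can overrun the budget. Because `step` receives the whole history, backing out of a wrong path is just another iteration rather than a separate branch in a graph.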

Real-time UX becomes structurally cheap

There is a category of product that has been almost-but-not-quite viable for two years: real-time collaborative interfaces where the model is a participant rather than a generator. Live coding assistants that respond to every keystroke. Meeting assistants that interject. Document editors where the model is constantly re-reading and offering structural feedback.

These products fail at 50 tok/s because the model's contribution arrives after the moment has passed. They start to work at 200 tok/s with careful engineering. At 1000 tok/s they become a default mode rather than a heroic effort. The architectural compromise — debouncing, batching, only invoking the model on explicit user action — was a latency compromise dressed as a UX choice.

We expect the next round of interesting agent products to look less like ChatGPT and more like Figma multiplayer with one of the cursors being an LLM. The infrastructure for that has been the bottleneck, not the ideas.

What does not change

Three things are worth saying clearly, because the excitement around inference speed tends to obscure them.

First, context length and retrieval quality are unchanged. A faster model with the wrong context still gives you the wrong answer faster. The work of getting the right tokens in front of the model — RAG, memory systems, context engineering — is exactly as important as it was last week.

Second, cost per token has not fallen in step with latency. A 1T model at 1000 tok/s is an inference engineering achievement, not a free lunch. If your unit economics depended on Haiku-class pricing, MiMo-v2.5 is not going to fix that. The cost optimisations in our list above still earn their keep.

Third, evaluation gets harder, not easier. Deeper agent loops produce more intermediate state, more decisions, more places where things can go subtly wrong. Speed lets you run longer chains; longer chains are harder to debug. We would invest more in trace tooling and step-level evals before we invested in any of the architectural changes above.
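To make "trace tooling and step-level evals" concrete, here is a deliberately small sketch using only the standard library. The names `StepTrace`, `Trace`, `record`, and `failures` are ours, not from any tracing product, and `call` stands in for whatever function actually invokes the model.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepTrace:
    index: int
    prompt: str
    output: str
    seconds: float


@dataclass
class Trace:
    steps: list[StepTrace] = field(default_factory=list)

    def record(self, prompt: str, call: Callable[[str], str]) -> str:
        """Run one step through `call` and keep its input, output, and duration."""
        start = time.perf_counter()
        output = call(prompt)
        elapsed = time.perf_counter() - start
        self.steps.append(StepTrace(len(self.steps), prompt, output, elapsed))
        return output

    def failures(self, check: Callable[[str], bool]) -> list[int]:
        """Step-level eval: indices of steps whose output fails `check`."""
        return [s.index for s in self.steps if not check(s.output)]
```

With every step recorded, a check such as "output is non-empty" or "output parses as JSON" can point at the step where a long chain first went wrong, rather than only at the final answer.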

The practical move

If we were starting a new agent project this week, the working assumption would be: treat LLM calls as fast synchronous functions, build deep sequential loops rather than wide parallel ones, design UX as if the model is a real-time participant, and keep all the cost optimisations because they are still doing real work.

The useful exercise for existing systems is to go through your architecture and ask, for each piece of complexity, whether it exists because tokens were expensive or because tokens were slow. The slow-tokens complexity can mostly come out. The expensive-tokens complexity stays.
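One lightweight way to run that exercise is to write the inventory down as data. The sketch below tags the six playbook items from earlier using the article's own split; the dictionary keys and the `removal_candidates` helper are illustrative names, and a real system's list would be longer and messier.

```python
from enum import Enum


class Reason(Enum):
    EXPENSIVE = "tokens were expensive"
    SLOW = "tokens were slow"


# The standard playbook, tagged by the constraint each item answers.
PLAYBOOK = {
    "model routing": Reason.EXPENSIVE,
    "caching": Reason.EXPENSIVE,
    "constrained decoding": Reason.EXPENSIVE,
    "parallel tool calls": Reason.SLOW,
    "pre-computed plans and summaries": Reason.SLOW,
    "token streaming": Reason.SLOW,
}


def removal_candidates(components: dict[str, Reason]) -> list[str]:
    """Components that exist only because tokens were slow."""
    return sorted(name for name, reason in components.items() if reason is Reason.SLOW)


if __name__ == "__main__":
    for name in removal_candidates(PLAYBOOK):
        print(f"review for removal: {name}")
```

Some components will not tag cleanly: caching, for example, saves time as well as money. Those are worth a separate look rather than forcing into one bucket.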

The interesting frontier is not making models faster from here — it is figuring out which products were quietly impossible at the old latency floor and are quietly possible now. Most teams have not updated their sense of what is possible, which means the next year is going to reward whoever does that work first.
