Industry Commentary

Where Model Routing Fits in the Production LLM Stack

By John Jansen · 7 min read

OpenRouter raising a $113M Series B last month is the kind of funding event that tells you less about one company and more about a category. Not Diamond, Martian, Unify, Portkey, LiteLLM, Helicone's routing primitives: there is now a recognisable layer of the stack whose job is to sit between your application and the providers, choosing which model serves which request. The question for engineering teams isn't whether routing exists; it's whether it belongs in your architecture yet.

We've been building production LLM systems long enough to have an opinion on this, and it's a less enthusiastic one than the funding round suggests. Routing earns its keep in specific conditions. Outside those conditions, it adds a hop, a vendor, and a debugging surface for problems you didn't have.

What the routing layer actually does

Strip away the marketing and an inference router does some combination of four things: fallback (if Anthropic 5xx's, try OpenAI), cost optimisation (cheapest model that meets a quality bar), capability routing (send code to one model, summarisation to another), and rate-limit smoothing (spread load across providers when you're bumping ceilings).

These are four genuinely different problems. Most teams reach for routing because of one, then inherit the complexity of all four. That's worth being honest about. A fallback proxy is a few hundred lines of code. A cost-optimising router that maintains quality SLOs across a shifting model landscape is a real product, and the reason companies like OpenRouter can raise nine figures.

The argument for pinning

There's a respectable case for picking one provider, possibly one model, and staying there. It looks like this:

Prompts are not portable. Anyone who has moved a non-trivial prompt from GPT-4 to Claude to Gemini knows that "the same prompt" produces materially different outputs, with different failure modes, different tendencies toward verbosity, different tool-calling behaviour. Eval suites built against one model don't transfer cleanly. If your application depends on careful prompt engineering — and most serious ones do — every additional model in your routing pool is another surface you're testing, monitoring, and maintaining.

Provider-specific features compound this. Anthropic's prompt caching, OpenAI's structured outputs and Responses API, Gemini's long-context behaviour, the various flavours of computer use and tool calling — none of these route cleanly. The moment you depend on a provider feature, your router either degrades gracefully (giving you worse output) or doesn't route at all (giving you no router).

Latency budgets matter more than people admit. A routing hop usually adds 20–80ms of overhead before time to first token (TTFT), plus whatever decision logic runs. For batch workloads this is invisible. For interactive agents already chaining four model calls, it's four more hops you're paying for, and the cost compounds with retries.
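To put numbers on that, here is a back-of-envelope sketch. It assumes the calls run sequentially and that every retry pays the hop again; the function name and the retry rate are ours, for illustration only.

```python
def added_latency_ms(calls: int, hop_ms: float, retries_per_call: float = 0.0) -> float:
    """Total routing overhead for a chain of sequential model calls.

    Assumes each call, and each retry of a call, pays the routing hop once.
    """
    return calls * (1 + retries_per_call) * hop_ms


# Four chained calls at the low and high end of the 20-80ms range.
low = added_latency_ms(calls=4, hop_ms=20)
high = added_latency_ms(calls=4, hop_ms=80)

# Hypothetical: one call in four is retried.
high_with_retries = added_latency_ms(calls=4, hop_ms=80, retries_per_call=0.25)

print(f"{low:.0f}-{high:.0f} ms added; {high_with_retries:.0f} ms with retries")
```

Under those assumptions a four-call agent turn picks up 80 to 320ms of routing overhead, and 400ms at the high end once a quarter of calls retry.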

For a team shipping a focused product on a known workload, pinning to a primary provider with a hand-rolled fallback gets you 90% of what a router provides, with full visibility into the decision logic and no extra vendor in the critical path.
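A hand-rolled fallback can be very small. The sketch below is a minimal version, assuming each provider is wrapped as a plain callable that takes a prompt and returns text; the `Provider` type, `complete_with_fallback`, and the two stand-in providers are hypothetical names, not any SDK's API.

```python
from typing import Callable, Sequence

# Hypothetical provider interface: any callable taking a prompt and returning text.
Provider = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch only retryable errors (5xx, timeouts)
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for illustration; real ones would wrap SDK calls.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("upstream 503")


def steady_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = complete_with_fallback(
    "hello", [("primary", flaky_primary), ("secondary", steady_secondary)]
)
print(used, text)
```

A production version also needs timeouts, per-provider prompt variants, and logging of which provider served each request. That is where the remaining lines go, and all of it stays in code you can read.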

When routing earns its keep

That said, we've watched routing become genuinely load-bearing in a few patterns.

Heterogeneous workloads at scale. If you're running a platform — not an application — where end users or tenants drive workloads you don't fully control, the request mix is genuinely diverse. Some calls want fast and cheap, some want frontier reasoning, some want long context, some want vision. Hand-coding that decision tree is a maintenance burden that grows with the model landscape. A router with a sensible policy DSL becomes the right abstraction.
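As a rough illustration of what a routing policy looks like once it is data rather than nested conditionals, here is a first-match rule table. The request fields, thresholds, and model tier names are all placeholders we made up; real policies would also carry cost and quality constraints.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    task: str
    input_tokens: int
    has_image: bool = False


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Request], bool]
    model: str  # placeholder tier name, not a real model identifier


# Ordered rules: the first match wins. Thresholds are illustrative.
RULES = [
    Rule("vision", lambda r: r.has_image, "vision-model"),
    Rule("long-context", lambda r: r.input_tokens > 100_000, "long-context-model"),
    Rule("reasoning", lambda r: r.task in {"planning", "analysis"}, "frontier-model"),
]
DEFAULT_MODEL = "fast-cheap-model"


def route(request: Request) -> str:
    """Return the model tier for a request using first-match semantics."""
    for rule in RULES:
        if rule.matches(request):
            return rule.model
    return DEFAULT_MODEL


print(route(Request(task="summarise", input_tokens=500)))
print(route(Request(task="planning", input_tokens=500)))
```

Adding a model means adding a rule, not rewriting a decision tree. That is the property a policy DSL is buying you.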

Cost-sensitive high-volume inference. When you're spending six figures a month on tokens and 70% of your calls are tasks where a smaller model would suffice, routing on task classification is real money. The break-even question is roughly: do the savings exceed the engineering cost of building and maintaining the classifier-plus-router? At low volume, no. At high volume, easily yes, and outsourcing it to a routing provider with their own optimisation telemetry is rational.
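The break-even comparison is simple enough to write down. This sketch assumes the downgradable share is measured as a share of spend rather than of calls, and every figure in it (the price ratio, the engineering cost) is a hypothetical input, not a benchmark.

```python
def monthly_savings(
    spend: float, downgradable_share: float, small_model_cost_ratio: float
) -> float:
    """Token spend saved by moving the downgradable share of traffic to a cheaper model.

    small_model_cost_ratio is the cheaper model's cost as a fraction of the
    current model's cost for the same work (0.2 means it costs 20% as much).
    """
    return spend * downgradable_share * (1 - small_model_cost_ratio)


def routing_pays_off(
    spend: float,
    downgradable_share: float,
    small_model_cost_ratio: float,
    monthly_engineering_cost: float,
) -> bool:
    """True when savings exceed the cost of building and running classifier plus router."""
    savings = monthly_savings(spend, downgradable_share, small_model_cost_ratio)
    return savings > monthly_engineering_cost


# Hypothetical inputs: 70% downgradable, small model at 20% of the price,
# $15k/month of engineering time to build and maintain the routing stack.
for spend in (5_000, 100_000):
    saved = monthly_savings(spend, 0.7, 0.2)
    print(f"${spend:,}/month spend: saves about ${saved:,.0f}, "
          f"pays off: {routing_pays_off(spend, 0.7, 0.2, 15_000)}")
```

With these inputs, $5k a month of spend saves about $2,800 and does not cover the engineering cost, while $100k a month saves about $56,000 and clears it comfortably.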

Resilience requirements that exceed any single provider's SLA. This is the most underrated case. No frontier provider has the uptime numbers of mature cloud infrastructure. If your product has a hard availability requirement — say it's embedded in a workflow where LLM downtime stops revenue — multi-provider routing isn't optimisation, it's table stakes. The question becomes whether you build the fallback logic yourself or buy it.
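The arithmetic behind that is worth seeing once. The sketch assumes provider failures are independent, which is optimistic since outages can be correlated, and the 99.5% figure is an illustrative input rather than any provider's measured uptime.

```python
import math
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Probability that at least one provider is up, assuming independent failures."""
    p_all_down = math.prod(1 - a for a in availabilities)
    return 1 - p_all_down


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of unavailability in a 365-day year."""
    return (1 - availability) * 24 * 365


single = 0.995  # hypothetical single-provider availability
pair = combined_availability([single, single])

print(f"one provider:  {single:.4%} up, {downtime_hours_per_year(single):.1f} h/year down")
print(f"two providers: {pair:.4%} up, {downtime_hours_per_year(pair):.2f} h/year down")
```

Under those assumptions, one provider at 99.5% means roughly 44 hours of downtime a year, and two independent ones bring that to well under an hour. Real-world gains will be smaller to the extent that failures overlap or your failover path has its own bugs.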

Rapid model evaluation. Teams that genuinely want to A/B new models against production traffic benefit from a routing layer as an experimentation substrate. This is less about runtime decisions and more about the router being a control plane for traffic shaping.
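The traffic-shaping part can be sketched with deterministic hash bucketing, a common way to split traffic so the same key always lands on the same arm. The function and model names are hypothetical, and the choice of key (user, session, or request ID) is an assumption you would make per experiment.

```python
import hashlib


def assign_model(
    request_key: str, candidate: str, control: str, candidate_share: float
) -> str:
    """Deterministically assign a key to the candidate or control model.

    The key is hashed to a number in [0, 1); keys falling below
    candidate_share go to the candidate model.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return candidate if bucket < candidate_share else control


# Send roughly 10% of users to the model under evaluation.
arms = [assign_model(f"user-{i}", "new-model", "current-model", 0.10) for i in range(10_000)]
print(f"candidate share: {arms.count('new-model') / len(arms):.3f}")
```

Keying on a stable identifier keeps each user's experience consistent for the life of the experiment, and changing `candidate_share` ramps traffic without a deploy if the value lives in config.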

The runtime-vs-deploy-time framing

The deeper shift the funding round points at is real: model choice is moving from a deploy-time decision to a runtime one. We think this is correct as a long-term direction and overstated as a present-day necessity.

It's correct because the model landscape is genuinely volatile. Frontier capability shifts on a monthly cadence. Price-performance frontiers move. New models appear that are dramatically better at specific tasks. A system that hard-codes claude-3-5-sonnet-20241022 in twelve places across a codebase has accumulated technical debt the moment that string ships.

It's overstated because most teams don't need runtime model selection — they need configurable model selection. A config-driven approach where model identifiers live in one place, versioned, with a clean abstraction for swapping, gets you most of the agility benefit. Promoting that to a runtime policy engine is a step that should be justified by workload heterogeneity, not by the existence of the abstraction.
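A config-driven setup can be as plain as the sketch below: one registry, keyed by the role a model plays in the application, so swapping a model is a one-line change. The role names, the `ModelConfig` fields, and the second entry are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model_id: str
    max_output_tokens: int


# Single source of truth for model identifiers. Application code asks for a
# role ("default", "cheap") and never mentions a model string directly.
MODELS: dict[str, ModelConfig] = {
    "default": ModelConfig("anthropic", "claude-3-5-sonnet-20241022", 1024),
    "cheap": ModelConfig("example-provider", "example-small-model", 512),  # placeholder
}


def model_for(role: str) -> ModelConfig:
    """Look up the configured model for a role, failing loudly on unknown roles."""
    try:
        return MODELS[role]
    except KeyError:
        raise ValueError(f"no model configured for role {role!r}") from None


print(model_for("default").model_id)
```

In practice the registry would be loaded from a versioned config file rather than defined inline, but the shape is the same, and nothing about it requires a runtime policy engine.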

A practical heuristic

When we advise teams on this, our rough cut is:

  • Single product, focused workload, under ~$10k/month in inference: pin to a primary, write a 200-line fallback. You don't need a router. You need good evals.
  • Multi-tenant platform or genuinely diverse workloads: a routing layer is probably correct. Buy versus build depends on whether your policy logic is generic (buy) or contains domain-specific quality signals (build, or buy with custom policies).
  • High-volume cost-sensitive inference: routing pays for itself, but the savings come from the classifier and the eval harness, not the router itself. The router is the cheap part.
  • Hard availability requirements: multi-provider is non-negotiable; the question is just where the failover logic lives.

What this means for the category

The interesting thing about a maturing routing layer isn't that more teams will adopt routers. It's that the boundary in the LLM stack is starting to harden in a familiar shape: applications on top, routing and observability in the middle, providers underneath. That's the shape every other infrastructure category eventually takes, and once the middle layer is stable, applications stop caring quite as much about which provider is underneath.

We don't think this means providers become commodities — model quality differs too much and the gap matters too much for that. But it does mean the cost of switching, hedging, and mixing drops. For teams building today, the practical implication is to design for that future without paying for it prematurely. Keep model identifiers configurable. Keep prompts versioned per model. Write evals that can run against any candidate. Then add the routing layer when your workload actually demands it — not when the category's funding announcements suggest it should.
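The last two habits fit together: prompts keyed by model and version, and an eval runner that takes any completion function. The sketch below is one minimal way to arrange that; the prompt text, model names, and the `run_evals` helper are all hypothetical, and the fake model simply echoes its prompt.

```python
from typing import Callable

# Prompts versioned per model: the same task, worded differently per target.
PROMPTS: dict[tuple[str, str], str] = {
    ("model-a", "v2"): "Summarise in one sentence: {text}",
    ("model-b", "v1"): "Give a one-sentence summary.\n\n{text}",
}

# An eval case is an input text plus a check on the model's output.
EvalCase = tuple[str, Callable[[str], bool]]


def run_evals(
    model: str,
    prompt_version: str,
    complete: Callable[[str], str],
    cases: list[EvalCase],
) -> float:
    """Run every case through `complete` with the model's own prompt; return the pass rate."""
    template = PROMPTS[(model, prompt_version)]
    passed = sum(1 for text, check in cases if check(complete(template.format(text=text))))
    return passed / len(cases)


# Stand-in for a real model call: echoes the prompt back.
def fake_model(prompt: str) -> str:
    return prompt


cases: list[EvalCase] = [
    ("The cat sat on the mat.", lambda out: "cat" in out),
    ("Dogs bark at night.", lambda out: "cat" in out),
]
print(run_evals("model-a", "v2", fake_model, cases))
print(run_evals("model-b", "v1", fake_model, cases))
```

Because the runner only needs a callable, a new candidate model is one wrapper function and one prompt entry away from being scored against the same cases.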
