Industry Commentary

Where Context Compression Fits in the LLM Stack

By John Jansen · 7 min read

Headroom hitting the top of GitHub trending with 3,500+ stars in a day is not, on its own, a signal worth writing about. What makes it interesting is what it sits next to: codegraph, various LLM-friendly serializers, the steady stream of "context engineering" tooling. Taken together, these projects are carving out a layer in the LLM stack that didn't have a name a year ago.

We've been calling it context compression internally, and we think teams shipping LLM products in 2025 need a position on it, the same way they already have positions on retrieval, on memory, and on prompt caching. All four are related, but they are not the same thing, and conflating them leads to bad architecture decisions.

What context compression actually is

Headroom's claim is 60–95% token reduction on tool outputs and RAG chunks before they reach the model. Strip away the marketing range and the technique is straightforward: when a tool call returns a 12KB JSON blob, most of it is structural noise — repeated keys, null fields, verbose timestamps, IDs the model doesn't need. The compression layer takes that blob and produces a denser representation: sometimes a different serialization (YAML, custom DSLs), sometimes a summarization, sometimes a graph form that preserves relationships while dropping redundancy.
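As a rough illustration of the deterministic end of that spectrum, here is a minimal Python sketch that strips nulls, empty containers, and unwanted keys from a tool response and then serializes it compactly. It is not how Headroom works internally; the `compact` and `serialize` helpers, the `drop_keys` parameter, and the sample payload are all assumptions made up for this example.

```python
import json

def compact(value, drop_keys=frozenset()):
    """Recursively drop None values, empty containers, and unwanted dict keys."""
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            if key in drop_keys:
                continue
            item = compact(item, drop_keys)
            if item is None or item == {} or item == []:
                continue
            out[key] = item
        return out
    if isinstance(value, list):
        # List items are compacted but never removed, so positions stay meaningful
        return [compact(item, drop_keys) for item in value]
    return value

def serialize(value):
    # Sorted keys and fixed separators make the output byte-stable across runs
    return json.dumps(value, sort_keys=True, separators=(",", ":"))

# Hypothetical tool response, invented for illustration
raw = {
    "id": "evt_123",
    "request_id": "req_9f8e7d",
    "created": "2025-01-15T10:32:07.123456Z",
    "customer": {"name": "Ada", "email": None, "metadata": {}},
    "items": [{"sku": "A1", "qty": 2, "discount": None}],
}

small = serialize(compact(raw, drop_keys={"request_id"}))
print(len(json.dumps(raw, indent=2)), "->", len(small), "characters")
```

Character counts are only a proxy for token counts, and the savings on a real payload depend entirely on how much of it is structural noise.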

Codegraph and similar tools do the same thing for code: instead of dumping a 2000-line file into context, you serialize a graph of symbols, call relationships, and type signatures. The model gets more signal per token.

The key property is that compression happens after retrieval and before the model sees the context. It is not retrieval (deciding what to fetch). It is not memory (deciding what to persist across turns). It is not caching (avoiding recomputation of identical prefixes). It is a transformation on the payload itself.

Why it deserves its own layer

There is a temptation to fold compression into retrieval — "just retrieve smaller chunks" — or into the prompt builder. We think this is wrong for three reasons.

First, the compression function is content-type-specific in a way that retrieval is not. A vector store doesn't care whether it's returning a code snippet or a Stripe webhook payload; it returns the chunk you asked for. But compressing a webhook payload optimally requires knowing it's a webhook payload — knowing which fields are stable IDs, which are timestamps you can truncate, which are nested structures the model will ignore. Compression is schema-aware in a way retrieval isn't.

Second, compression has a cost-quality tradeoff that needs its own knobs. You might want aggressive compression on background context and lossless serialization on the chunk the model is being asked to reason about directly. That decision belongs to a layer that knows the role of each piece of context, not the layer that fetched it.

Third, and this is the one that bites teams in production, compression interacts with the prefix (KV) cache in non-obvious ways. If your compression is non-deterministic (an LLM summarizer, for instance), its output differs from request to request, so every request invalidates the cached prefix from that point onward. If it's deterministic but content-dependent, you need to think carefully about ordering: compress-then-cache and cache-then-compress are different systems with different cost profiles.

What we compress, and where

Our working position, refined across a few client builds:

Tool outputs: always compress, deterministically, at the tool boundary. When an agent calls a tool, the response should pass through a tool-specific serializer before it ever enters the conversation. This is the highest-leverage spot — tool outputs are the bulkiest, most repetitive context in any agent system, and they're the easiest to compress because you control the schema. Deterministic compression means the prefix cache still works on the second turn.

RAG chunks: compress selectively, based on rank. The top-1 chunk usually deserves to go in verbatim — it's the one the model is most likely reasoning over directly. Chunks 2 through N are often there for context or disambiguation, and aggressive compression (summarization, key-fact extraction) is fine. Treating all retrieved chunks identically wastes tokens on the long tail.
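A minimal Python sketch of rank-based compression, under the assumption that the retriever has already sorted chunks best-first. The `first_sentences` helper is a deliberately crude stand-in for a real summarizer or key-fact extractor, and both function names and parameters are hypothetical.

```python
def first_sentences(text, n=1):
    """Crude extractive stand-in for a summarizer: keep the first n sentences."""
    parts = [p.strip() for p in text.split(". ") if p.strip()]
    if not parts:
        return ""
    kept = ". ".join(parts[:n])
    return kept if kept.endswith(".") else kept + "."

def compress_by_rank(chunks, verbatim_top=1, tail_sentences=1):
    """Pass the top-ranked chunks through untouched; shrink the long tail.

    chunks must already be sorted best-first by the retriever.
    """
    out = []
    for rank, chunk in enumerate(chunks):
        if rank < verbatim_top:
            out.append(chunk)  # the chunk the model reasons over directly
        else:
            out.append(first_sentences(chunk, tail_sentences))
    return out

# Made-up chunks for illustration
chunks = [
    "Refunds take 5 days. Contact support for delays. Weekends are excluded.",
    "Billing runs monthly. Invoices are emailed. Taxes vary by region.",
]
for line in compress_by_rank(chunks):
    print(line)
```

Because the stand-in is purely extractive it is also deterministic; swapping in an LLM summarizer for the tail would change that.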

Code context: graph form, almost always. This is where codegraph-style tools earn their keep. Full file dumps are almost never the right answer once you're past trivial scope. A graph of relevant symbols with bodies attached only for the functions actually being modified is dramatically better, both for token count and for model accuracy.
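To make "bodies attached only for the functions actually being modified" concrete, here is a small Python sketch built on the standard-library `ast` module. It is not codegraph's format or API. The `outline` function, its `expand` parameter, and the output layout are assumptions for this example, and it only handles top-level, non-async functions and direct calls by name.

```python
import ast

def outline(source, expand=frozenset()):
    """Emit a one-line signature per top-level function, with direct calls noted.

    Functions named in `expand` are emitted with their full source instead.
    """
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        if node.name in expand:
            lines.append(ast.get_source_segment(source, node))
            continue
        args = ", ".join(a.arg for a in node.args.args)
        # Only calls to plain names (not methods) are recorded as edges
        calls = sorted({
            n.func.id for n in ast.walk(node)
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
        })
        suffix = f"  # calls: {', '.join(calls)}" if calls else ""
        lines.append(f"def {node.name}({args}): ...{suffix}")
    return "\n".join(lines)

SOURCE = '''
def load(path):
    with open(path) as f:
        return f.read()

def parse(text):
    return text.split()

def run(path):
    text = load(path)
    return parse(text)
'''

print(outline(SOURCE, expand={"parse"}))
```

Here only `parse` keeps its body; `load` and `run` collapse to a signature plus the names they call, which is usually enough for the model to see how the pieces connect.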

Conversation history: this is memory, not compression. Worth flagging because teams often blur them. Summarizing prior turns to fit a window is a memory policy decision — what to remember, what to forget. It happens to use compression-like techniques but belongs in the memory layer, where it can be reasoned about as state management.

Where in the pipeline

The placement question matters more than people realize. Three options, in order of how we think about them:

  1. At the source — tool outputs are compressed by the tool wrapper itself, before they hit any conversation state. Pro: simple, deterministic, cache-friendly. Con: you commit to a single compression strategy per tool.

  2. At the assembler — the prompt builder applies compression based on context role and budget. Pro: dynamic, can react to overall context pressure. Con: harder to make deterministic, easier to break the cache.

  3. At the model boundary — a last-mile pass that rewrites the full prompt if it's over budget. Pro: simple fallback. Con: usually too late and too lossy.

We default to (1) for tools and code, (2) for RAG, and treat (3) as an emergency valve, not a design choice.
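Option (1) can be sketched as a decorator that sits on the tool itself, so nothing uncompressed ever reaches conversation state. This is a minimal Python illustration, not a reference implementation: `compressed_tool`, `weather_serializer`, and the `get_weather` tool with its payload are all hypothetical.

```python
import functools
import json

def compressed_tool(serializer):
    """Decorator: pass a tool's raw result through a tool-specific serializer."""
    def decorate(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            return serializer(tool(*args, **kwargs))
        return wrapper
    return decorate

def weather_serializer(payload):
    # Keep only the fields the model needs, in a fixed order, so the same
    # payload always yields the same string
    kept = {
        "city": payload["city"],
        "temp_c": round(payload["temp_c"]),
        "sky": payload["sky"],
    }
    return json.dumps(kept, separators=(",", ":"))

@compressed_tool(weather_serializer)
def get_weather(city):
    # Stand-in for a real API call; the response shape is invented
    return {"city": city, "temp_c": 21.37, "sky": "clear",
            "station_id": "x-991", "raw": None}

print(get_weather("Oslo"))
```

The tradeoff named in (1) is visible here: the serializer is fixed per tool, so a caller that needs `station_id` has no way to ask for it without changing the wrapper.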

The cache interaction

This is the part most teams underweight. Prefix caching on Anthropic, OpenAI, and the open-source inference stacks depends on byte-identical prefixes. Any compression step that is non-deterministic across requests destroys the cache for everything that follows it.

The practical rule: compression that runs on stable content (tool schemas, code structure, system prompts) should be deterministic and happen early in the prompt. Compression that runs on volatile content (user turn, latest tool output) can be more aggressive, but should happen late in the prompt where cache misses are unavoidable anyway.
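The ordering rule can be sketched in a few lines of Python. The `Segment` type, the `assemble` function, and the sample prompt text are assumptions for illustration, and the shared-prefix measurement counts characters as a proxy; real caches match on tokens and often only at provider-defined boundaries.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    name: str
    text: str
    stable: bool  # True if the text is identical across requests

def assemble(segments):
    """Place stable segments first and volatile ones last, keeping relative order."""
    # sorted() is a stable sort, so segments with equal keys keep their input order
    ordered = sorted(segments, key=lambda s: not s.stable)
    return "\n\n".join(s.text for s in ordered)

def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))

base = [
    Segment("system", "You are a billing assistant.", True),
    Segment("tool_schemas", "tools: refund(id), lookup(id)", True),
]
req1 = assemble([Segment("user", "Refund order 17", False)] + base)
req2 = assemble([Segment("user", "Where is order 42?", False)] + base)

print(shared_prefix_len(req1, req2), "of", len(req1), "characters shared")
```

Even though the user turn was passed in first, both requests start with the same stable block, so the shared prefix covers the system prompt and tool schemas.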

Get this ordering wrong and you'll see your input token counts drop while your latency and effective inference costs rise, because every request is recomputing prefixes that should have been cached. We've seen this happen. It's not obvious from the logs unless you're specifically watching cache hit rates.
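One way to watch for this is to track uncached input tokens alongside total input tokens. The Python sketch below assumes each request log carries an `input_tokens` count (total, including cached) and a `cached_input_tokens` count. Those field names are hypothetical and providers report usage under different names, and the numbers are invented to show the failure mode, not measured.

```python
def summarize(usage_records):
    """Aggregate total tokens, uncached tokens, and cache hit rate for a batch."""
    total = sum(r["input_tokens"] for r in usage_records)
    cached = sum(r["cached_input_tokens"] for r in usage_records)
    return {
        "input_tokens": total,
        "uncached_tokens": total - cached,
        "hit_rate": cached / total if total else 0.0,
    }

# Invented figures: before compression, prompts are long but mostly cached
before = [
    {"input_tokens": 10_000, "cached_input_tokens": 8_000},
    {"input_tokens": 10_000, "cached_input_tokens": 8_500},
]
# After a non-deterministic compressor placed early in the prompt:
# prompts are shorter, but nothing is served from cache
after = [
    {"input_tokens": 4_000, "cached_input_tokens": 0},
    {"input_tokens": 4_000, "cached_input_tokens": 0},
]

print("before:", summarize(before))
print("after: ", summarize(after))
```

In this made-up case total input tokens fall from 20,000 to 8,000, yet the tokens that actually have to be recomputed rise from 3,500 to 8,000, which is the pattern to look for.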

What we think the stack looks like

If we were sketching the layers today: retrieval at the bottom, deciding what exists in the candidate set. Compression in the middle, deciding how that candidate set gets represented. Memory orthogonal to both, deciding what persists. Caching underneath all of it, optimizing the runtime. The model on top, doing the actual work.

The interesting consequence is that compression — done well — makes the other layers cheaper. Better compression means you can afford to retrieve more chunks (because each costs less), persist more memory (because summarization is more efficient), and keep more in-prompt (because the cache is doing real work). It's a multiplier, not a standalone optimization.

The rise of Headroom and its neighbors isn't a fad. It's the stack growing a layer it needed. Teams that pick a position now (what gets compressed, where, and how it plays with the cache) will spend the next year shipping. Teams that don't will spend it explaining why their token bill is the way it is.
