Industry Commentary

Where AI Tooling Sits in the Supply Chain Threat Model

By John Jansen · 7 min read


Two recent incidents are worth sitting with. The Context.ai token exposure that propagated through Vercel's build environment, and the LiteLLM credential leak that touched Mercor's evaluation pipeline, are being discussed as ordinary supply chain breaches. We think that framing undersells what happened. These were not npm typosquats or a compromised CI runner. They were failures at a layer that most security programs still treat as a set of convenient developer tools rather than production infrastructure.

The uncomfortable part is that the layer has quietly become infrastructure. And the controls around it have not caught up.

What these incidents actually exposed

Both breaches share a shape. A developer-facing AI tool — an observability SDK in one case, a proxy gateway in the other — was handed broad credentials so it could do the thing it was bought to do: see traffic, route requests, evaluate outputs. Those credentials, once held by the tool, had access well beyond the scope a reasonable threat model would grant them. When the tool's own security posture slipped, the blast radius was not the tool. It was every system the tool had been trusted to observe or mediate.

This is not new in pattern. It is the same shape as the SolarWinds or Codecov incidents. What is new is the density and velocity of these integrations. A twelve-person startup today might be running Context or Langfuse or Helicone for observability, LiteLLM or Portkey as a gateway, Braintrust or LangSmith for evals, Cursor or Windsurf for code, plus half a dozen MCP servers that nobody has audited. Each of these holds API keys. Several of them see prompts and completions in plaintext, which means they see customer data, internal documents, and whatever else gets stuffed into a context window.

The aggregate trust surface is large, growing fast, and — this is the part that matters — structurally different from the dependency surface that existing tooling was designed to cover.

Why SBOMs miss this

Software Bills of Materials are good at answering one question: what code is running inside the artifact we ship? They are less useful at answering the question that actually matters here: what external systems does our running code trust, and with what scope?

An SBOM will happily list the openai Python package. It will not tell you that you are routing every production inference through a hosted LiteLLM instance, that this instance has keys for four model providers, that it logs full request bodies to a third-party storage bucket, and that the team operating it is five people in a different jurisdiction. SPDX and CycloneDX have extensions for services and external references, but in practice almost nobody populates them, and no scanner flags them.
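For what it's worth, the machinery to declare these relationships does exist. A sketch of what the rarely-populated CycloneDX `services` section could look like for a hosted gateway — the names, endpoint, and classification value here are illustrative, not taken from any real BOM:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "services": [
    {
      "bom-ref": "service-hosted-llm-gateway",
      "name": "Hosted LiteLLM gateway",
      "endpoints": ["https://gateway.example.com/v1/chat/completions"],
      "authenticated": true,
      "data": [
        { "flow": "bi-directional", "classification": "prompts-and-completions" }
      ]
    }
  ]
}
```

The schema can express the runtime relationship; the problem is that nothing in a typical build pipeline generates or checks entries like this.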

The mental model of an SBOM is static: dependencies are things you compile against. AI tooling dependencies are dynamic: they are runtime relationships with credentialed scope, often established by a developer clicking through an OAuth flow or pasting a key into a dashboard. The artifacts that get scanned do not reflect them.

This is the gap. It is not that SBOMs are wrong. It is that they cover a different thing.

Why vendor review misses it too

Vendor security review exists, and in theory should catch this. In practice, it catches it for the vendors that procurement knows about. The AI tooling layer is frequently adopted bottom-up. An engineer signs up for a free tier, wires it into a side project, finds it useful, and by the time it is load-bearing nobody can remember when it got added. SOC 2 reports get requested for the CRM and the payroll system. They rarely get requested for the eval harness.

Even when review happens, the questions asked are calibrated for SaaS of a prior era. Data residency. Encryption at rest. Employee background checks. These matter, but they do not probe the specific failure modes of AI middleware: how are provider API keys stored and rotated, what prompt and completion data is retained and for how long, what is the isolation model between tenants in a multi-tenant gateway, who can access the logs, and what happens to cached responses. A vendor can pass a standard review comfortably while failing every one of those.

A more honest threat model

We have started, internally, to treat AI tooling as a distinct category with its own review checklist. The shape of it, roughly:

  • Credential scope and custody. Does this tool require a provider key, and if so, can we issue it a scoped key, a separate billing account, or proxy through our own gateway? Assume the tool will be breached and design the key accordingly.
  • Data path inventory. For every tool that sits in the request path or the observability path, what data does it actually see? Not what the marketing page says — what the network traffic shows. Prompts, completions, tool calls, system prompts, user identifiers.
  • Retention and training posture. What is retained, for how long, by whom, and is any of it used to improve the vendor's own models or shared with sub-processors.
  • Blast radius on compromise. If this vendor's admin panel is taken over tomorrow, what can the attacker do. Read historical prompts? Rotate keys? Inject responses? The last one is the sleeper — a compromised gateway can return adversarial completions and it would take a long time to notice.
  • Operational maturity. How old is the company, how many engineers, where are they, what does their incident history look like, do they have a security contact who replies.
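To make the checklist concrete, here is one way it could be captured as a vendor-register record — a minimal illustrative schema, not a standard, with field names mirroring the bullets above:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AIToolEntry:
    """One row in a hypothetical AI-tooling vendor register."""
    name: str
    holds_provider_keys: bool        # credential custody
    key_scoped: bool                 # could we issue a scoped key?
    sees_plaintext_prompts: bool     # data path inventory
    retention_days: Optional[int]    # retention posture (None = unknown)
    can_inject_responses: bool       # sits in the request path

    def blast_radius(self) -> List[str]:
        """What a compromise of this vendor exposes, per the checklist."""
        risks = []
        if self.holds_provider_keys and not self.key_scoped:
            risks.append("unscoped provider keys leak")
        if self.sees_plaintext_prompts:
            risks.append("historical prompts readable")
        if self.can_inject_responses:
            risks.append("adversarial completions injectable")
        if self.retention_days is None:
            risks.append("unknown data retention")
        return risks


# A typical bottom-up adoption: a hosted gateway wired in with a full-scope key.
gateway = AIToolEntry(
    name="hosted-llm-gateway",
    holds_provider_keys=True,
    key_scoped=False,
    sees_plaintext_prompts=True,
    retention_days=None,
    can_inject_responses=True,
)
```

Even this crude model makes the point: the default configuration of a gateway fails every line of the checklist at once.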

None of this is novel as security practice. What is novel is applying it consistently to the AI middleware layer, where the cultural default has been to move fast and wire things up.

The infrastructure reframe

The useful shift is to stop thinking of these tools as developer conveniences and start thinking of them as infrastructure with the same standards you would apply to a database or a message broker. Nobody would put a hosted Postgres in the critical path without knowing who operates it, what their backup story is, and what happens if they go down. The AI tooling layer deserves the same scrutiny, and currently does not get it.

Practically, this means a few things. Self-hosting the gateway layer is worth the operational cost for most teams past a certain scale, because it collapses the credential custody problem back inside your perimeter. Scoped, per-tool API keys with aggressive rotation should be default. Observability tools should receive redacted or sampled data, not full request bodies, unless there is a specific reason otherwise. And the inventory of AI dependencies should live somewhere a CISO can actually see it, alongside the rest of the vendor register.
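The redaction-and-sampling step is simple enough to sit in front of whatever observability SDK you use. A minimal sketch, assuming nothing about any particular vendor's API — the regex patterns are deliberately crude stand-ins for a real PII/secret scanner:

```python
import hashlib
import re

# Crude illustrative patterns; a production deployment would use a
# proper secret/PII scanner, not two regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
API_KEY = re.compile(r"\b(sk|pk)-[A-Za-z0-9]{16,}\b")


def redact(text: str) -> str:
    """Strip obvious emails and provider-style keys before the
    payload leaves our perimeter."""
    text = EMAIL.sub("[email]", text)
    return API_KEY.sub("[key]", text)


def should_sample(request_id: str, rate: float = 0.1) -> bool:
    """Deterministic sampling: hash the request id, so the same request
    is always in or out and there is no RNG state to coordinate."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 1000
    return bucket < rate * 1000
```

The design choice worth noting is deterministic sampling: hashing the request id means every replica of a service makes the same in/out decision for a given request, so a sampled trace is never half-captured.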

What we think happens next

Our expectation is that the SBOM standards will extend to cover runtime service dependencies more rigorously, probably driven by procurement requirements rather than voluntary adoption. We also expect a new category of tooling — something like a runtime dependency graph for AI services — to emerge, because the problem is real and nobody currently solves it well. The teams that treat this seriously now, before regulation forces it, will spend less time writing incident post-mortems later.

The Context.ai and LiteLLM incidents are not anomalies. They are the early, loud version of a pattern that will become quieter and more common as the tooling layer matures and consolidates. The right response is not to stop using AI tooling. It is to stop pretending it sits outside the threat model.

Want to discuss this?

We write about what we're actually working on. If this is relevant to something you're building, we'd love to hear about it.