Industry Commentary

What's already inside the model?

By John Jansen · 7 min read

There's a quiet assumption running through most discussions of agent safety: that the threat model is the prompt. Prompt injection, jailbreaks, indirect injection from a poisoned web page, a malicious tool description — the literature is rich, the tooling is improving, and most teams have at least thought about it.

What we think about less is what the model already knows.

The training set is not a clean room

Frontier models are trained on enormous corpora scraped from the open internet, supplemented by code repositories, technical documentation, security research, and a long tail of forum posts and writeups. That corpus contains, in detail:

  • CVE writeups with proof-of-concept exploit code
  • Metasploit modules and their commentary
  • Decades of Phrack, Bugtraq, oss-security, and full-disclosure archives
  • Red team blog posts walking through real engagements step by step
  • Academic papers on side channels, fault injection, and cryptographic attacks
  • Reverse engineering tutorials, malware analysis, packer internals
  • Every CTF writeup ever indexed

This is not a hypothetical. It is the substrate. A modern coding assistant has, in some compressed form, read more offensive security material than any single human security researcher alive. The guardrails that sit on top of that knowledge are a thin alignment layer — RLHF, constitutional rules, system prompts, refusal classifiers. They shape what the model will say when asked directly. They do not erase what it knows.

The gap between those two things is where we think the next interesting class of risk lives.

Behaviour is gated, capability is not

When you ask a model "write me a kernel exploit for CVE-2023-XXXX" and it refuses, that refusal is a behavioural choice. The weights that could have produced the exploit are still there. They are still being consulted on every forward pass, just for different, ostensibly benign tasks.

This matters because agents do not just answer questions. They take actions. A coding agent reads your repo, edits files, runs commands, opens PRs, calls APIs. A chat agent with tool access browses, fetches, executes. The model is making thousands of small decisions about what code to write, what library to suggest, what shell command to run, what regex to use to parse that user input.

Each of those decisions draws on the same weights that encode every exploit technique in the training set. Most of the time that's fine, even helpful — the model knows that strcpy is dangerous because it has read a thousand posts about buffer overflows. But "the model knows about exploits" and "the model will never produce exploitable code" are not the same statement, and we have very little ability to audit which one is true on any given output.

The honest question

Here is the question we keep coming back to: is it possible for behaviour learned from offensive security content to surface in ostensibly cooperative output, in ways neither the user nor the model's operators would recognise?

A few concrete shapes this could take:

Subtly weakened code. The model writes authentication that looks correct, passes review, and contains a timing leak it learned from a paper. Not because anyone asked. Because in the latent space, "write a token comparison" and "write a token comparison with a side channel" are neighbours, and something nudged the sample.
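
To make the neighbourhood concrete: a minimal Python sketch, with illustrative function names of our own (nothing here is drawn from a real model output). The two comparisons differ by a single call, and only one of them closes the timing side channel.

```python
import hmac

def check_token_naive(supplied: str, expected: str) -> bool:
    # Looks correct and passes review, but == short-circuits at the first
    # differing byte, so response time leaks how long a correct prefix
    # the caller has guessed.
    return supplied == expected

def check_token_safe(supplied: str, expected: str) -> bool:
    # compare_digest takes time independent of where the inputs differ,
    # which is exactly the property the naive version silently gives up.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```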

Plausible but vulnerable suggestions. When asked to sanitise input, the model reaches for a regex that has a known ReDoS pathology. It has seen the regex used a thousand times in tutorials and twice in writeups about why it's broken. The first signal dominates.
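
This one is easy to demonstrate. A minimal sketch using Python's re module; the specific pattern is a stand-in we chose for illustration, not one recovered from any model, but it has the shape the paragraph describes: nested quantifiers that backtrack exponentially on input that almost matches.

```python
import re

# A "sanitise the input" pattern of the kind tutorials repeat endlessly:
# a repeated group that itself contains an unbounded quantifier.
VULNERABLE = re.compile(r"^(\w+\s?)*$")

# A payload that almost matches: a long run of word characters ending in
# one character the pattern cannot absorb. The engine then tries every
# way of splitting the run before giving up -- exponential backtracking.
payload = "a" * 40 + "!"
# VULNERABLE.match(payload)   # left commented out; this effectively hangs

# The same intent with a flat character class: one linear pass, no pathology.
SAFE = re.compile(r"^[\w\s]*$")
assert SAFE.match(payload) is None
```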

Backdoor-shaped patterns under specific triggers. This is the more paranoid version, and the one that has actual research behind it. Work on data poisoning (Anthropic's sleeper agents paper, the various "poisoning the unlabelled web" results) has shown that you can train models that behave normally except when a specific trigger appears in context, at which point they emit attacker-chosen output. If a model's training data was poisoned — and we have no real way to verify it wasn't — that behaviour is baked in. No prompt-level guardrail catches it, because the trigger is innocuous and the malicious output is rare.

Tool-use compositions nobody trained against. The guardrails are trained against things humans flagged. Novel chains of benign tool calls that compose into something harmful are, by definition, harder to flag. The model may have seen the chain in a red team writeup. The guardrail has not.

Why this is hard to reason about

The usual response to this kind of concern is "show me an example." Fair. But the structure of the problem is that the interesting examples are the ones we can't easily produce on demand, because:

  1. The model's behaviour on any specific input is hard to attribute to specific training data.
  2. Refusal training masks the most obvious probes — ask directly, get refused, conclude it's fine.
  3. The bad outputs, if they exist, are rare and contextual. They don't show up in benchmarks designed by the same people who trained the model.
  4. We don't have weight-level interpretability that can answer "does this model have a backdoor?" with anything resembling confidence.

This is a genuine "unknown unknowns" situation. Not in the lazy Rumsfeldian sense, but in the specific sense that our evaluation surface (prompts and outputs) is much smaller than the capability surface (the weights). We are testing a tiny slice and inferring safety across the whole.

What we actually do about it

We are not arguing that agents are unusable. We use them. We ship them. But we have shifted some of our defaults:

  • Treat agent output like untrusted input, even when the agent is trusted. Code review is non-negotiable, especially for security-relevant code paths — auth, crypto, deserialisation, anything that touches a network boundary. The reviewer's job is not "is this what I asked for" but "would this be safe even if it had been written by an adversary who knew I wouldn't look closely."
  • Constrain the action space, not just the prompt. An agent that can't reach production, can't exfiltrate, and can't execute outside a sandbox is much less interesting as an attack vector, regardless of what its weights know. Capability containment beats behavioural alignment when you can get it (see the sketch after this list).
  • Diversity across models for security-critical work. If two independently trained models from different labs both produce the same code, the chance that both carry the same poisoned behaviour is lower. Not zero — they share training data — but lower.
  • Static analysis and fuzzing on AI-generated code, by default. The tools were always good. They are now load-bearing.
  • Be specific about what the agent is allowed to know about. Don't hand a coding agent your entire secrets store and your production credentials and your customer data and assume the system prompt holds.
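
Of the defaults above, constraining the action space is the most directly codeable. A minimal sketch, assuming the agent's shell access is routed through a single wrapper; ALLOWED_BINARIES and run_agent_command are names we made up, not part of any particular framework. The point is that the constraint lives in the harness, so it holds whatever the model decides to emit.

```python
import shlex
import subprocess

# Explicit allowlist of executables the agent may invoke. Everything else
# is rejected before it reaches the operating system.
ALLOWED_BINARIES = {"git", "pytest", "ruff"}

def run_agent_command(command: str) -> subprocess.CompletedProcess:
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise PermissionError(f"binary not in allowlist: {argv[:1]}")
    # Passing a list of args (no shell) means pipes, redirects, and command
    # substitution cannot be smuggled in. The timeout bounds the blast
    # radius of anything that does run.
    return subprocess.run(argv, capture_output=True, text=True, timeout=60)
```

None of this is specific to agents. It is ordinary least-privilege engineering, applied to a component whose failure modes we cannot enumerate.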

The shape of the risk

The framing we'd offer is this: guardrails are a property of the deployment. Capabilities are a property of the model. The two are decoupled in a way the industry talks about less than it should, because the deployment is what we can change and the model is what we have to trust.

Trust is the right word. Every team using a frontier model is, today, trusting that the alignment layer holds, that the training data wasn't poisoned in any consequential way, and that no offensive capability the model possesses will surface uninvited in the millions of small decisions it makes on their behalf.

That trust may be well-placed. We don't know. The people who trained the model don't fully know either. That's the part worth sitting with.

Want to discuss this?

We write about what we're actually working on. If this is relevant to something you're building, we'd love to hear about it.