The Part Nobody Talks About
When people worry about AI agents going wrong, they picture the model itself — the LLM hallucinating, going rogue, ignoring its instructions. The discourse is almost entirely about prompts, alignment, and guardrails inside the model.
That conversation is missing the point.
The actual vulnerabilities in AI agent systems almost never live in the model. They live in the scaffolding around it — the loop that calls the model, parses its output, decides which tools to execute, and feeds the results back in. This is the code that turns a stateless text predictor into something that can read your email, run shell commands, and make API calls. And in most agent frameworks, this code is shockingly thin.
How an AI Agent Actually Works
Strip away the marketing and an AI agent is a while loop. Here is the basic shape, simplified but not exaggerated:
while True:
response = llm.chat(messages)
if response.has_tool_calls:
for tool_call in response.tool_calls:
result = execute_tool(tool_call.name, tool_call.arguments)
messages.append(result)
else:
break # model is done
The model receives a conversation history, produces a response, and sometimes that response includes structured tool calls — "run this bash command", "read this file", "call this API". The wrapper code parses those tool calls, executes them, appends the results to the conversation, and sends it all back to the model for the next iteration.
The model itself never touches your filesystem, never opens a network connection, never executes code. It produces text. The wrapper does everything else.
This distinction matters enormously for security, because it means the attack surface isn't the model. It's the execute_tool function.
Where the Gaps Are
Consider what happens in that loop:
1. Tool call parsing is a trust boundary. The model returns JSON (or something like it) specifying which tool to call and with what arguments. The wrapper parses this and dispatches execution. In many frameworks, this parsing is permissive — it will attempt to execute whatever the model asks for. The model is untrusted input. The tool dispatcher treats it as trusted instructions.
2. Tool execution is typically ungated. When the model says "run bash with argument rm -rf /", something needs to decide whether to actually do that. In many agent frameworks, especially those designed for autonomous operation, the answer is: nothing. The tool runs. Some frameworks add confirmation prompts for human-in-the-loop approval, but the moment you remove the human (which is the entire point of autonomous agents), you remove the gate.
3. Results flow back unfiltered. After a tool executes, its output — which could be the contents of a file, an API response, or error messages — gets appended to the conversation and sent back to the model. If that output contains prompt injection (instructions disguised as data), the model may follow those instructions on the next loop iteration. The wrapper has no concept of "this text came from an untrusted source". It is all just conversation history.
4. The loop has no memory of intent. Each iteration of the loop is stateless from a security perspective. The model asked to read a file to check something benign. The file contained an injected instruction. On the next iteration, the model follows that instruction and calls a different tool. The wrapper has no way to know that the model's behaviour has been hijacked — it just sees another valid tool call.
The Prompt Injection Problem Is Really a Tool Execution Problem
Prompt injection gets discussed as a model problem: how do we make models robust against adversarial instructions embedded in their input? But the model is not the thing that does damage. The model generates text. The text becomes dangerous when the wrapper code executes it.
A model that has been prompt-injected into wanting to exfiltrate data can only succeed if:
- The tool dispatcher gives it a tool capable of making network requests
- The execution layer doesn't restrict which endpoints that tool can contact
- No monitoring layer notices that the agent's behaviour has deviated from its original task
All three of those are wrapper problems. The model is just the text generator in the middle.
This is why hardening the prompt (system message, role definitions, "you must never...") is necessary but insufficient. You're asking the model to police itself while giving the execution layer no ability to enforce anything. It is like putting a security policy in a README and leaving the front door unlocked.
What Good Architecture Looks Like
The frameworks that handle this well tend to share a few characteristics:
Scoped tool access. Instead of giving an agent a bash tool that can run any command, give it specific tools with specific capabilities. "Read files in this directory" rather than "execute arbitrary shell commands." The principle of least privilege applies here exactly as it does in traditional software.
Tool-level gating. Before executing any tool call, the wrapper evaluates whether this specific call is permitted given the agent's current context and task. This can be rule-based (allowlists, argument validation), or it can involve a separate evaluation — but it has to exist somewhere outside the model's own judgement.
Output sanitisation. Data flowing back into the conversation from tool execution should be treated as untrusted. Some frameworks are starting to add tagging — marking content as "tool output" versus "user instruction" versus "system prompt" — so the model can (in theory) weight them differently. This is early and imperfect, but it recognises the right problem.
Execution boundaries. Run the agent in a sandbox. Not just the model — the entire execution environment. If the tool dispatcher is running in a container with no network access and a read-only filesystem, the blast radius of a compromised agent is bounded regardless of what the model tries to do.
Audit trails. Every tool call, every argument, every result should be logged. When something goes wrong (and it will), you need to reconstruct the exact sequence: what did the model ask for, what did the wrapper execute, and what came back. Most agent frameworks don't log at this level of detail, which makes post-incident analysis nearly impossible.
The Industry Is Slowly Getting This
Cloudflare's recent work on sandboxed execution (workerd, Containers for Workers) is pointed in the right direction — it treats agent execution as a containment problem, not a prompt engineering problem. Same with Anthropic's approach to computer use, which runs tool execution inside a constrained VM rather than on the host system.
But the majority of agent frameworks in production today — the ones people are building startups on, the ones running in enterprise environments — still rely primarily on the model's good behaviour. The wrapper code is often a few hundred lines of Python that parses JSON and calls functions. No gating, no scoping, no sandboxing.
We are building increasingly powerful autonomous systems where the security model is "the LLM will probably say no if you ask it to do something bad." That is not a security model. That is optimism in production.
The Hard Part
The fundamental tension is that gating tool execution reduces agent capability. The whole point of an autonomous agent is that it can decide what tools to use without asking. Every restriction you add — every allowlist, every confirmation prompt, every sandboxed execution environment — makes the agent slower, less flexible, or less autonomous.
There is no free lunch here. The question is whether you're making that tradeoff consciously or whether you're not making it at all because you haven't realised the loop around the model is where the real risk lives.
Most teams building with AI agents right now are focused on making the model smarter, the prompts better, the tools more powerful. Very few are focused on the fifty lines of code that sit between the model and the tools, deciding what actually gets executed.
That's where the work needs to happen.