How to use this review

An agent that reads external content and holds real tools is a production system with an attack surface. It is not a prompt you tuned once and forgot. This review is the gate it passes before it ships and the audit you re-run every quarter while it is live. Capabilities drift, dependencies update, and the threat model you wrote in week one no longer matches what the agent can do by week twelve.

For each item, decide where the agent actually sits today: L0 (not yet), L1 (basic), L2 (standard), or L3 (hardened). Rate honestly. Aspiration is not a control. A checklist that scores your intentions tells you nothing about your exposure.

Note

Many deployments are L0 across the board. That is the starting position, not a failure. Improvement starts from an honest baseline. An L0 you have named and accepted is worth more than an L2 you assumed and never verified.

Work top-down. The items are ordered so the earliest ones return the most risk reduction per unit of effort. Per-agent identity and default-deny tooling close larger fractions of the attack surface than anything further down the list, so they come first. If you only have time to fix three things before a launch, fix the first three.

Diagram: an agent at the center surrounded by its threat surfaces (identity and privilege, harness and network, secrets, supply chain, channels and memory), each annotated with the one defense that contains it.
The agent threat landscape. Each surface maps to one primary defense, reviewed before deployment and re-checked each quarter.

Identity, scope, and privilege

This layer bounds the blast radius of everything else. If the agent runs as you, holds shared credentials, or carries tools it never uses, no downstream control recovers the ground you lost here.
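
A minimal sketch of what default-deny tooling with per-agent identity can look like in a harness. The names here (`AgentPolicy`, `is_tool_allowed`, `unused_tools`, and the `triage-bot` tool names) are hypothetical and chosen for illustration; how the check is wired into a real harness depends on that harness.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical per-agent policy: one identity, one explicit tool allowlist."""
    agent_id: str
    allowed_tools: frozenset = field(default_factory=frozenset)


def is_tool_allowed(policy: AgentPolicy, tool_name: str) -> bool:
    # Default-deny: a tool that is not explicitly listed is refused.
    return tool_name in policy.allowed_tools


def unused_tools(policy: AgentPolicy, tools_seen_in_logs: set) -> set:
    # Tools that are granted but never observed in use are candidates for removal.
    return set(policy.allowed_tools) - set(tools_seen_in_logs)


# Illustrative policy for an imaginary ticket-triage agent.
triage_bot = AgentPolicy(
    agent_id="triage-bot",
    allowed_tools=frozenset({"read_ticket", "add_label", "send_email"}),
)

print(is_tool_allowed(triage_bot, "shell"))                   # not on the allowlist
print(unused_tools(triage_bot, {"read_ticket", "add_label"}))  # granted, never used
```

Comparing the allowlist against tools actually seen in logs is one way to find the "tools it never uses" case during the quarterly re-audit.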

Harness, network, and audit

The harness is where you regain the control the model gives away. Every guarantee in this section holds regardless of what the model decides to do. That is why this layer is worth the investment.

The reconstruction test

The audit log earns its place only if you can reconstruct, after the fact, exactly which tool calls an agent made and with what arguments. If the answer to "what did it do?" is a shrug, the logging is theater. Test reconstruction before you ship, not during the incident.
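
The reconstruction test can be rehearsed with something as small as the sketch below: the harness appends one JSON line per tool call, and a separate function rebuilds the call sequence for one agent from the log alone. The record fields and function names are assumptions for illustration. In a real deployment the log would live somewhere the agent cannot write to, which this local-file sketch does not enforce.

```python
import json
import time
from pathlib import Path


def record_tool_call(log_path: Path, agent_id: str, tool: str, arguments: dict) -> None:
    # Called by the harness, not the agent. One JSON object per line, append-only.
    entry = {"ts": time.time(), "agent_id": agent_id, "tool": tool, "arguments": arguments}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def reconstruct(log_path: Path, agent_id: str) -> list:
    # Rebuild, in order, every (tool, arguments) pair one agent produced,
    # using nothing but the log file.
    calls = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            if entry["agent_id"] == agent_id:
                calls.append((entry["tool"], entry["arguments"]))
    return calls
```

If `reconstruct` cannot answer "which tool, with which arguments, in which order" for a staged run, the log is missing a field, and that is cheaper to learn before launch.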

Secrets, supply chain, channels, and memory

These surfaces share a property: they leak slowly and quietly until they do not. A cleartext secret, an auto-updated dependency, a bot in a shared channel, or an agent-writable memory file can each turn a contained agent into a pivot point.
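
For the cleartext-secret case, a redaction pass over tool output before it reaches the model or the logs might look like the sketch below. The two patterns are illustrative assumptions, not a complete detector: one matches the shape of an AWS access key ID, the other matches simple `name = value` assignments for a few common secret names. Hooking `redact` into a PostToolUse-style hook is harness-specific and not shown.

```python
import re

# (pattern, replacement) pairs. Illustrative only; a real deployment needs
# patterns for the credential formats it actually handles.
REDACTIONS = [
    # Shape of an AWS access key ID: "AKIA" followed by 16 uppercase alphanumerics.
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED]"),
    # name = value or name: value, for a few common secret names.
    (
        re.compile(r"\b(api[_-]?key|token|password)\b(\s*[:=]\s*)\S+", re.IGNORECASE),
        r"\g<1>\g<2>[REDACTED]",
    ),
]


def redact(text: str) -> str:
    # Apply every pattern in order and return the scrubbed text.
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


print(redact("password = hunter2 and key AKIAABCDEFGHIJKLMNOP"))
```

Pattern-based redaction is a backstop. It does not replace keeping the secret off disk and out of the agent's reach in the first place.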

Six threat vectors, one defense each

Each vector below maps to one primary defense. The mapping is deliberately reductive: it gives you a single thing to verify per vector at the gate, not an exhaustive treatment. Depth lives in the module. This is the pre-flight check.

Threat vector | One defense
Prompt injection | Strict-ignore guard system prompt on every agent, plus provenance tagging on fetched content.
Rogue or over-privileged agent | Default-deny tools, narrow scopes per role, sandbox for shell-using agents, no shared credentials.
Harness and runtime compromise | Default-deny outbound, VPN-gate everything internal, PreToolUse hooks, audit log outside the agent.
Secret leak | Secrets off disk to password manager or vault, redact in PostToolUse hooks, per-agent identity with scoped tokens.
Supply-chain compromise | Pin model and dependency versions, inspect tool descriptions, prefer signed and provenanced artifacts.
Channels, memory, multi-agent compromise | Private channels with per-agent identity, memory reviewed as code, inter-agent boundaries treated as external input.

Watch

One defense per vector is the floor, not the ceiling. A high-consequence agent (irreversible production actions, access to sensitive data) needs layered controls per vector, not a single line item ticked. Use the one-defense column to confirm nothing is missing, then go deeper wherever the blast radius justifies it.
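
As one concrete instance of a single defense from the table, default-deny outbound can be sketched at the application layer as an exact-match host allowlist. `ALLOWED_HOSTS` and `outbound_allowed` are hypothetical names, and `api.example.com` is a placeholder. This is a sketch of the idea only; a firewall or egress proxy would normally enforce it below the process as well.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist. Anything not listed here is denied.
ALLOWED_HOSTS = frozenset({"api.example.com"})


def outbound_allowed(url: str) -> bool:
    parts = urlsplit(url)
    # Refuse anything that is not HTTPS, including strings that are not URLs.
    if parts.scheme != "https":
        return False
    # Exact hostname match. Suffix or substring matching would let
    # "api.example.com.evil.net" through.
    return parts.hostname in ALLOWED_HOSTS


print(outbound_allowed("https://api.example.com/v1/items"))
print(outbound_allowed("https://api.example.com.evil.net/"))
```

A high-consequence agent would layer this with network-level enforcement rather than rely on the in-process check alone.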

Adversarial sign-off

The checklist tells you which controls exist. The adversarial questions tell you whether they hold. Run these before sign-off, with someone playing attacker, and treat any answer you cannot give as an open finding that blocks the launch.

Record the result as an L-level summary per section, with the reviewer, the date, and the open findings. A signed-off review with three named L0 items and a remediation date is a real artifact. A review with no findings usually means nobody looked hard enough.
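
One possible shape for that record, as a sketch. The class names, the example reviewer and dates, and two rules are assumptions rather than anything this review prescribes: a section's summary level is taken as the level of its weakest item, and any item below a chosen target level is counted as an open finding.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum
from typing import List, Optional


class Level(IntEnum):
    L0 = 0  # not yet
    L1 = 1  # basic
    L2 = 2  # standard
    L3 = 3  # hardened


@dataclass
class ItemRating:
    item: str
    level: Level
    remediation_due: Optional[date] = None


@dataclass
class SectionReview:
    section: str
    ratings: List[ItemRating] = field(default_factory=list)

    def summary_level(self) -> Level:
        # Assumption: a section is only as strong as its weakest item.
        # A section with no rated items is reported as L0.
        return min((r.level for r in self.ratings), default=Level.L0)


@dataclass
class ReviewRecord:
    reviewer: str
    review_date: date
    sections: List[SectionReview] = field(default_factory=list)

    def open_findings(self, target: Level = Level.L2) -> List[ItemRating]:
        # Assumption: anything rated below the target level is an open finding.
        return [r for s in self.sections for r in s.ratings if r.level < target]


# Illustrative data only.
record = ReviewRecord(
    reviewer="reviewer@example.com",
    review_date=date(2025, 1, 15),
    sections=[
        SectionReview("Identity, scope, and privilege", [
            ItemRating("Per-agent identity", Level.L2),
            ItemRating("Default-deny tools", Level.L0, remediation_due=date(2025, 2, 15)),
        ]),
    ],
)

for section in record.sections:
    print(section.section, section.summary_level().name)
```

Keeping the remediation date on the item itself makes a named L0 with a due date distinguishable from one nobody has planned for.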

Checklist

The four questions every agent must survive before sign-off

  • What is the most damaging tool call that could be induced from a fetched webpage, and do the controls that bound it fire?
  • What does the audit log show after that tool call, and is the action reconstructable from it alone?
  • Which credential, if exfiltrated, lets an attacker pivot beyond this agent, and are its scope and TTL tight enough?
  • If this agent were suspected of being compromised right now, what is the kill-switch command, and has it been exercised?
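
The rule that an unanswerable question blocks the launch can be made mechanical. The sketch below assumes the four questions above are tracked under hypothetical short keys and that answers are recorded as free text; a missing or blank answer is returned as a blocking finding.

```python
# Hypothetical short keys for the four sign-off questions, in order.
QUESTIONS = (
    "most_damaging_tool_call",
    "audit_log_reconstruction",
    "pivot_credential",
    "kill_switch",
)


def blocking_findings(answers: dict) -> list:
    # A question with no answer, or a whitespace-only answer, blocks sign-off.
    return [q for q in QUESTIONS if not (answers.get(q) or "").strip()]


answers = {
    "most_damaging_tool_call": "send_email to arbitrary recipient; bounded by allowlist",
    "kill_switch": "   ",
}
print(blocking_findings(answers))
```

The check only proves an answer was written down. Whether the answer holds is still for the person playing attacker to decide.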

Putting an agent into production?

We run these pre-deployment reviews and quarterly re-audits on production AI systems, and we train teams to run them in-house. Tell us what you are about to ship.