How to use this review
An agent that reads external content and holds real tools is a production system with an attack surface. It is not a prompt you tuned once and forgot. This review is the gate it passes before it ships and the audit you re-run every quarter while it is live. Capabilities drift, dependencies update, and the threat model you wrote in week one no longer matches what the agent can do by week twelve.
For each item, decide where the agent actually sits today: L0 (not yet), L1 (basic), L2 (standard), or L3 (hardened). Rate honestly. Aspiration is not a control. A checklist that scores your intentions tells you nothing about your exposure.
Many deployments are L0 across the board. That is the starting position, not a failure. Improvement starts from an honest baseline. An L0 you have named and accepted is worth more than an L2 you assumed and never verified.
Work top-down. The items are ordered so the earliest ones return the most risk reduction per unit of effort. Per-agent identity and default-deny tooling close larger fractions of the attack surface than anything further down the list, so they come first. If you only have time to fix three things before a launch, fix the first three.
Identity, scope, and privilege
This layer bounds the blast radius of everything else. If the agent runs as you, holds shared credentials, or carries tools it never uses, no downstream control recovers the ground you lost here.
- Each agent has its own service account and its own credentials. No agent runs under the operator's identity. No two agents share an OAuth token or a tool credential. Shared identity collapses your ability to attribute an action, to revoke one agent without breaking another, or to scope a token to the work it actually does.
- Each agent has only the tools it needs. Default deny. Add capabilities consciously, one at a time, with a reason. Where a tool exposes scope (channel, repository, path, domain), narrow the scope at the tool boundary, not in the prompt. A prompt that says "only touch this repo" is a wish. A tool that rejects every other repo is a control.
- High-risk tools require human-in-the-loop confirmation. Sending email, posting to public channels, executing arbitrary shell, deleting state, invoking irreversible actions: the operator confirms each one, every time. If confirmation volume becomes impractical, that is the diagnosis, not the obstacle. The agent has too many sensitive capabilities and should be split.
- Agents that need shell access run in a sandbox. Container, VM, or microVM with ephemeral state, sized to the risk. A sandbox does not stop injection. It bounds what injection can do once it lands.
- The orchestrator is not implicitly trusted. Hand-offs from an orchestrator are external input. Where possible they are signed by the sender and verified at the receiver's Policy Enforcement Point. Teams forget the internal trust boundary more often than any other, which is exactly why it is worth checking.
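The difference between a prompt-level wish and a tool-level control can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `ToolGate`, `ToolGrant`, the tool names, and the scope values are all hypothetical, and a real harness would wire the `confirm` callback to an actual operator prompt.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


class ToolDenied(Exception):
    """Raised when a tool call is rejected at the tool boundary."""


@dataclass
class ToolGrant:
    # Scope values this agent may touch (repository, channel, address, ...).
    allowed_scopes: frozenset[str]
    # High-risk tools require an operator to confirm every single call.
    needs_confirmation: bool = False


@dataclass
class ToolGate:
    # Grants keyed by tool name. Anything absent is denied by default.
    grants: dict[str, ToolGrant] = field(default_factory=dict)

    def check(self, tool: str, scope: str,
              confirm: Callable[[str, str], bool]) -> None:
        grant = self.grants.get(tool)
        if grant is None:
            raise ToolDenied(f"tool not granted: {tool}")
        if scope not in grant.allowed_scopes:
            raise ToolDenied(f"scope not granted: {tool} on {scope}")
        if grant.needs_confirmation and not confirm(tool, scope):
            raise ToolDenied(f"operator declined: {tool} on {scope}")


# Hypothetical grants for one agent: two tools, each narrowed to one scope.
gate = ToolGate({
    "git_push": ToolGrant(frozenset({"org/docs"})),
    "send_email": ToolGrant(frozenset({"ops@example.com"}),
                            needs_confirmation=True),
})
```

With this shape, `gate.check("git_push", "org/other", confirm)` raises regardless of what the prompt or the model says, which is the property the checklist item asks for.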
Harness, network, and audit
The harness is where you regain the control the model gives away. Every guarantee in this section holds regardless of what the model decides to do. That is why this layer is worth the investment.
- Network egress is default-deny. Each agent's outbound destinations are an explicit allow-list per tool. An agent that can reach arbitrary hosts is an exfiltration channel waiting for a payload to use it.
- Internal services are reachable only over a VPN. Nothing is exposed on the public internet, and agents hold no SSH credentials that would let them hop between hosts. A compromised agent cannot move laterally because the path does not exist.
- The harness exposes hooks at tool boundaries. PreToolUse and PostToolUse hooks redact secrets, validate arguments, and emit audit events. These are the insertion points for enforcement that does not depend on the model behaving.
- Audit logs are written outside the agent's reach. Tool calls, decisions, redacted arguments, and outcomes stream to a tamper-evident store the agent cannot modify. If an agent can edit its own audit trail, you have a log, not evidence.
- Resource and cost limits are configured. Loop iteration cap, per-tool rate limit, token and cost budget, and an outbound bytes-per-session ceiling. These bound both a runaway agent and an attacker running up your compute bill.
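As a rough sketch of how several of these items can meet in one place, the function below combines a per-tool egress allow-list with a tool-call cap and an outbound byte ceiling. Everything here is an assumption made for illustration: `pre_tool_use`, `SessionBudget`, `EGRESS_ALLOW`, and the hostnames are hypothetical and do not correspond to any particular harness API.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical per-tool egress allow-list; hosts are placeholders.
EGRESS_ALLOW: dict[str, set[str]] = {
    "web_fetch": {"docs.example.com"},
    "issue_tracker": {"tracker.example.com"},
}


@dataclass
class SessionBudget:
    max_tool_calls: int
    max_outbound_bytes: int
    tool_calls: int = 0
    outbound_bytes: int = 0


def pre_tool_use(tool: str, url: str, payload: bytes,
                 budget: SessionBudget) -> tuple[bool, str]:
    """Return (allowed, reason). Intended to run before the tool executes."""
    host = urlsplit(url).hostname
    # Default deny: unknown tools and unlisted hosts are both rejected.
    if host is None or host not in EGRESS_ALLOW.get(tool, set()):
        return False, f"egress denied: {tool} -> {host}"
    if budget.tool_calls + 1 > budget.max_tool_calls:
        return False, "tool-call cap reached"
    if budget.outbound_bytes + len(payload) > budget.max_outbound_bytes:
        return False, "outbound byte ceiling reached"
    # Only allowed calls consume budget.
    budget.tool_calls += 1
    budget.outbound_bytes += len(payload)
    return True, "ok"
```

A check like this is a second line behind network-level egress rules, not a replacement for them: it gives a readable denial reason for the audit log, while the firewall enforces the same policy even if the harness has a bug.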
The audit log earns its place only if you can reconstruct, after the fact, exactly which tool calls an agent made and with what arguments. If the answer to "what did it do?" is a shrug, the logging is theater. Test reconstruction before you ship, not during the incident.
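One common way to make a log tamper-evident is to chain entries by hash, so that editing any past entry breaks every hash after it. The sketch below assumes that approach. The entry fields and function names are hypothetical, and the chain only counts as evidence if the store, or at least the latest hash, lives somewhere the agent cannot write.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Canonical JSON so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_event(log: list[dict], tool: str, args: dict, outcome: str) -> dict:
    """Append one tool call; args are assumed to be redacted already."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "tool": tool, "args": args,
            "outcome": outcome, "prev": prev}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and link; any edit or reordering returns False."""
    prev = GENESIS
    for seq, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if (entry["prev"] != prev or entry["seq"] != seq
                or entry["hash"] != _digest(body)):
            return False
        prev = entry["hash"]
    return True


def reconstruct(log: list[dict]) -> list[tuple[str, dict]]:
    """Answer 'what did it do?' as an ordered list of (tool, args)."""
    return [(e["tool"], e["args"]) for e in log]
```

Running `verify_chain` and `reconstruct` against a staging session before launch is one concrete form of the reconstruction test described above.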
Secrets, supply chain, channels, and memory
These surfaces share a property: they leak slowly and quietly until they do not. A cleartext secret, an auto-updated dependency, a bot in a shared channel, an agent-writable memory file. Any one of them turns a contained agent into a pivot point.
- No cleartext secrets in workspaces or configuration files. This is the L1 floor. Secrets are injected via environment variables, ideally pulled from a password manager or vault at session start.
- Where consequence is high, credentials are short-lived and just-in-time. A vault issues per-call, scope-limited credentials, and the agent's process never sees them. A credential that expires in minutes is worth little to whoever exfiltrates it.
- Tool outputs and arguments pass through redaction hooks. Known secret patterns are stripped before content enters the model's context or leaves the harness, in both directions.
- Models, MCP servers, plugins, and skills are pinned to specific versions. Auto-update is disabled and upgrades are deliberate. Inspect tool descriptions before deployment, because a tool description is untrusted input that the model reads as instruction.
- Channels are private and per-agent. Avoid bots in shared channels and unauthenticated email-to-agent paths unless they are explicitly required and confirmed. A shared channel is a public injection inbox.
- Memory surfaces are reviewed as code. Long-lived agent files (CLAUDE.md and equivalents) are committed, reviewed, and never auto-updated by the agent itself. An agent that can rewrite its own instructions can be made to persist an injection across sessions.
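The redaction item in the list above can be illustrated with a small pattern-based filter. This is a sketch under stated assumptions: the three patterns are examples only, the list is nowhere near complete, and `redact` is a hypothetical helper that a harness would call on both tool arguments and tool outputs.

```python
import re

# Illustrative patterns only; a real deployment needs a maintained list.
SECRET_PATTERNS = [
    # Shape of an AWS access key ID.
    re.compile(r"AKIA[0-9A-Z]{16}"),
    # HTTP bearer token in a header or log line.
    re.compile(r"(?i)bearer\s+[a-z0-9._~+/-]+=*"),
    # PEM-encoded private key block, across multiple lines.
    re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
]


def redact(text: str) -> str:
    """Replace every match of a known secret pattern with a marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Pattern matching only catches secrets whose shape is known in advance, which is why it sits alongside the short-lived credential item rather than replacing it.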
Six threat vectors, one defense each
Each vector below maps to one primary defense. The mapping is deliberately reductive: it gives you a single thing to verify per vector at the gate, not an exhaustive treatment. Depth lives in the module. This is the pre-flight check.
| Threat vector | One defense |
|---|---|
| Prompt injection | Strict-ignore guard system prompt on every agent, plus provenance tagging on fetched content. |
| Rogue or over-privileged agent | Default-deny tools, narrow scopes per role, sandbox for shell-using agents, no shared credentials. |
| Harness and runtime compromise | Default-deny outbound, VPN-gate everything internal, PreToolUse hooks, audit log outside the agent. |
| Secret leak | Move secrets off disk into a password manager or vault, redact in PostToolUse hooks, use per-agent identity with scoped tokens. |
| Supply-chain compromise | Pin model and dependency versions, inspect tool descriptions, prefer signed and provenanced artifacts. |
| Channels, memory, multi-agent compromise | Private channels with per-agent identity, memory reviewed as code, inter-agent boundaries treated as external input. |
One defense per vector is the floor, not the ceiling. A high-consequence agent (irreversible production actions, access to sensitive data) needs layered controls per vector, not a single line item ticked. Use the one-defense column to confirm nothing is missing, then go deeper wherever the blast radius justifies it.
Adversarial sign-off
The checklist tells you which controls exist. The adversarial questions tell you whether they hold. Run these before sign-off, with someone playing attacker, and treat any answer you cannot give as an open finding that blocks the launch.
- "What is the most damaging tool call that could be induced from a fetched webpage?" Name it, then verify that the controls meant to bound it actually fire. If you cannot name it, you do not yet know your exposure.
- "What does the audit show after that tool call?" Pull the log and confirm the action is reconstructable from it. A control with no trace is a control you cannot prove worked.
- "Which credential, if exfiltrated, lets the attacker pivot beyond this agent?" Trace the scope and TTL of every credential the agent touches. Anything broad or long-lived is a finding.
- "If this agent is suspected compromised right now, what is the kill-switch command?" Run it in a drill. Revocation you have never exercised is revocation you are guessing about.
Record the result as an L-level summary per section, with the reviewer, the date, and the open findings. A signed-off review with three named L0 items and a remediation date is a real artifact. A review with no findings usually means nobody looked hard enough.
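One possible shape for that record is sketched below. The class and field names are hypothetical. The rule that an unanswered adversarial question blocks launch comes from the text above, while treating a finding with no remediation date as blocking is an added assumption you may want to relax.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    L0 = 0  # not yet
    L1 = 1  # basic
    L2 = 2  # standard
    L3 = 3  # hardened


@dataclass
class Finding:
    item: str
    level: Level
    remediation_due: Optional[date] = None


@dataclass
class ReviewRecord:
    reviewer: str
    reviewed_on: date
    section_levels: dict[str, Level]
    open_findings: list[Finding] = field(default_factory=list)
    unanswered_questions: list[str] = field(default_factory=list)

    def blocks_launch(self) -> bool:
        # An adversarial question nobody could answer is a blocking finding.
        if self.unanswered_questions:
            return True
        # Assumption: a named finding is acceptable only with a due date.
        return any(f.remediation_due is None for f in self.open_findings)
```

Under this sketch, a record with three named L0 findings that each carry a remediation date does not block, which matches the "real artifact" described above.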
Checklist
The four questions every agent must survive before sign-off
- What is the most damaging tool call that could be induced from a fetched webpage, and do the controls that bound it fire?
- What does the audit log show after that tool call, and is the action reconstructable from it alone?
- Which credential, if exfiltrated, lets an attacker pivot beyond this agent, and is its scope and TTL tight enough?
- If this agent is suspected compromised right now, what is the kill-switch command, and has it been exercised?