Incident response for compromised agents

How agent incidents differ

Classic incident response assumes a deterministic, inspectable principal: a server, a service account, a workstation. You pull it offline, read its logs, rotate its key, and you end up with a fixed account of what happened. A compromised agent violates each of those assumptions.

The principal is non-deterministic. The same prompt can produce a different tool-call sequence on the next run, so replay alone won't reproduce the incident, and the agent's own account of what it did can't be trusted as evidence. The agent holds credentials in its own right (OAuth tokens, API keys, vault leases, short-lived session tokens), so revoking the human operator's access leaves the agent's access untouched. It has also written to memory and audit surfaces, so the record you'd use to investigate may itself be poisoned. And it reaches downstream systems through channels that look benign: a Slack message, a commit, a calendar invite, a row written to a shared store.

So containment can't run sequentially. You work identity, memory, channels, and downstream propagation in parallel, because each one is a live path the compromise keeps using while you handle the others.

The shift that matters

A compromised agent authenticates as itself, writes the evidence you'll later read, and reaches other systems through traffic that trips no alarms. Treat its memory and audit trail as suspect until you've diffed them against a known-good baseline.

Phase 1 · Detect and triage

The first fifteen minutes set up everything that follows. Confirm the symptom, preserve state before it decays, classify severity, and find the person who can revoke the agent's credentials. Order matters here: evidence preservation comes before any action that mutates the agent.

Identify the symptom. An anomalous tool call, a secret in output, an unexpected egress destination, a monitoring alert, a third-party report. Write down what tripped first. It anchors the timeline.
Preserve evidence first. Snapshot the conversation transcript, the tool-call audit log, the egress log, and any sandbox or process state before you touch the agent.
Identify operator and on-call. Who owns this agent, and who can revoke its credentials right now? If you can't answer that in the first few minutes, the gap is your first post-incident finding.
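
The preservation step can be sketched as a copy-and-hash routine. This is a minimal sketch, assuming the transcript, audit log, and egress log have already been exported as regular files; the function name `snapshot_evidence`, the manifest layout, and the file names are hypothetical. Recording a SHA-256 digest at collection time gives you something to check the copies against later.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot_evidence(sources: list[Path], dest: Path) -> dict:
    """Copy each evidence file into dest and record its SHA-256 digest.

    Hypothetical helper: assumes evidence is exported as regular files.
    """
    dest.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an earlier snapshot
    manifest = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "files": {},
    }
    for src in sources:
        copied = dest / src.name
        if copied.exists():
            raise FileExistsError(f"two sources share the name {src.name}")
        shutil.copy2(src, copied)  # copy2 also preserves modification times
        manifest["files"][src.name] = hashlib.sha256(copied.read_bytes()).hexdigest()
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Writing the snapshot somewhere the agent's credentials cannot reach keeps the copy out of the set of surfaces you have to treat as suspect.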

Preserve evidence first

Do not restart the agent. A restart rolls the context window and can destroy the only record of the injected instruction. When you do stop it, send SIGTERM rather than deleting the container. You want the process to exit cleanly with its forensic state intact, not vaporised.
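
A clean stop can be sketched as follows. This assumes the harness runs as a local child process that you hold a `subprocess.Popen` handle for; the helper name `stop_gracefully` and the timeout value are hypothetical. The sketch deliberately never escalates to a hard kill on its own, so a person makes that call.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout: float = 30.0):
    """Ask the process to exit and wait for it.

    Returns the exit code, or None if the process is still running after
    the timeout. No hard kill is issued here.
    """
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
```

A `None` result means the process ignored or is still handling the signal. Snapshot whatever state you can reach before deciding on a harder stop.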

Classify severity before you contain. The level dictates how aggressively you reverse external actions later.

Level	Definition	Implication
L1	Agent confused, no irreversible action taken.	Contain and investigate. No external reversal needed.
L2	Agent acted outside scope, but the actions can be undone.	Contain, then roll back every out-of-scope action.
L3	Irreversible action taken: a deployment, a payment, a public post, a deletion, or suspected exfiltration.	Full response. Assume external impact and notify accordingly.
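
The table above can be expressed as a small decision function. This is a sketch under stated assumptions: `actions` contains only the actions attributed to the incident, and the `Action` fields and the `suspected_exfiltration` flag are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    L1 = "contain and investigate"
    L2 = "contain, then roll back out-of-scope actions"
    L3 = "full response, assume external impact"


@dataclass(frozen=True)
class Action:
    description: str
    in_scope: bool
    reversible: bool


def classify(actions: list[Action], suspected_exfiltration: bool = False) -> Severity:
    """Map the actions attributed to the incident onto the three levels."""
    # Any irreversible action, or suspected exfiltration, is the top level.
    if suspected_exfiltration or any(not a.reversible for a in actions):
        return Severity.L3
    # Out-of-scope but reversible actions can still be rolled back.
    if any(not a.in_scope for a in actions):
        return Severity.L2
    return Severity.L1
```

The checks run from most to least severe, so a single irreversible action decides the level regardless of what else is on the list.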

Phase 2 · Contain

Containment is the parallel-path phase. The agent's identity, its hosting, its channels, and the systems it has already written to are four independent exposures. Close them concurrently, not one after another. Target window is the first thirty minutes.

Kill the agent process. Stop the harness loop with SIGTERM. Avoid SIGKILL and avoid deleting the container, since either can destroy forensic state.
Revoke credentials. Every OAuth token scoped to the agent, every API key in its environment, any active vault lease issued to it, and any session credentials (STS, OIDC) still inside their TTL.
Quarantine the host or sandbox. Disable network egress and freeze the sandbox or container. Don't destroy it until forensics are done.
Disable channels. Remove the bot from Slack, Discord, and Telegram. Disable the email-to-agent address. Pause scheduled tasks and webhooks.
Notify downstream systems. Flag anything the agent wrote to in the past N hours for review, choosing N so the window reaches back before the earliest suspected compromise. This is the propagation path that outlives the agent process.
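
Running the steps above concurrently can be sketched with a thread pool. This is a minimal sketch, not a definitive orchestration: the step names and callables are hypothetical placeholders for whatever actually revokes tokens or freezes the sandbox in your environment. The property it illustrates is that one failing step does not block the others.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def contain_in_parallel(steps: dict[str, Callable[[], None]]) -> dict[str, str]:
    """Run every containment step concurrently and report per-step status.

    A failure in one step must not stop the others, so each exception is
    captured and returned instead of raised.
    """
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = {name: pool.submit(step) for name, step in steps.items()}
        for name, future in futures.items():
            error = future.exception()  # blocks until that step finishes
            results[name] = "ok" if error is None else f"failed: {error}"
    return results
```

Any step reported as failed is still an open exposure, so it goes straight back to a human owner.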

Phase 3 · Investigate

With the agent contained, reconstruct what happened from the evidence you preserved, not from the agent. Five questions structure the first twenty-four hours. Together they cover the exposures you worked in parallel during containment.

Reconstruct the timeline. What was the user prompt? What did the agent fetch? Which tool calls fired, in what order, and what came back? This is the backbone every other finding hangs on.
Identify the injection vector. Which fetched URL, file, channel message, or sub-agent output carried the adversarial content? If there's no vector, the cause may be credential exposure rather than injection.
Identify credential exposure. Which secrets sat in the agent's environment, and which surfaced in tool arguments, tool outputs, transcripts, or logs? A secret printed once is a secret compromised.
Identify memory poisoning. Did the agent write to an agent file, a vector store, a shared blackboard, or any persistent surface? Poisoned memory re-compromises the next run after you think you're done.
Identify lateral movement and external impact. Did the agent invoke other agents, touch hosts beyond its declared scope, or open new egress destinations? Was data exfiltrated, were external accounts modified, were public channels posted to?
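
Timeline reconstruction usually means merging several logs into one ordered sequence. The sketch below assumes each log has already been parsed into `Event` records with comparable timestamps (for example, all timezone-aware UTC); the `Event` type and `merge_timeline` are hypothetical names, and parsing is left out.

```python
import heapq
from datetime import datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str
    detail: str


def merge_timeline(*logs: Iterable[Event]) -> list[Event]:
    """Merge several logs into one timeline ordered by timestamp.

    Each input is sorted first, so out-of-order logs are tolerated.
    Timestamps must be mutually comparable (do not mix naive and aware).
    """
    ordered = [sorted(log, key=lambda e: e.at) for log in logs]
    return list(heapq.merge(*ordered, key=lambda e: e.at))
```

Keeping the `source` on every event lets you see at a glance which entries come from logs the agent could write to and which do not.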

Note

"The agent denies it did X" is not evidence. The model's self-report is generated text, subject to the same compromise you're investigating. Verify every claim against logs the agent never controlled.
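
Checking a self-report against independent logs reduces to a set comparison. This sketch assumes you have extracted tool-call identifiers from both sources into sets of strings; the function name and the result keys are hypothetical.

```python
def unverified_claims(claimed: set[str], logged: set[str]) -> dict[str, set[str]]:
    """Compare what the agent says it did with what independent logs show.

    claimed: tool-call identifiers taken from the agent's self-report.
    logged:  identifiers from a log the agent could not write to.
    """
    return {
        "claimed_but_not_logged": claimed - logged,
        "logged_but_not_claimed": logged - claimed,
    }
```

Entries under `logged_but_not_claimed` are the ones to chase first, since they are actions the independent log recorded and the agent's account left out.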

Phase 4 · Remediate

Remediation closes the exposure and patches the vector so the same incident class can't recur. The recurring failure here is partial rotation: rotating only the credential that obviously leaked and leaving the rest in place.

Rotate every secret the agent handled. Anything that appeared in a transcript, log, or tool output, not just the one that visibly leaked.
Scrub poisoned memory. Restore agent files, vector stores, scratchpads, and shared blackboards from a known-good baseline. Diff before and after, and review every change.
Reverse external actions where you can. Withdraw a public post, retract an email, roll back a deployment, refund a payment. Where reversal isn't possible, document it.
Patch the vector. Add the adversarial URL or pattern to the screen model's denylist, tighten the relevant tool scope, and add a hook for the specific argument pattern.
Reset the agent. Re-issue its identity, re-mint credentials, and redeploy from a known-good configuration in a fresh sandbox.
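
The diff step in memory scrubbing can be sketched for file-backed memory. This assumes the agent's persistent memory and the known-good baseline both exist as directory trees on disk; vector stores and shared blackboards would need their own export first. The names `digest_tree` and `diff_against_baseline` are hypothetical.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_against_baseline(baseline: Path, current: Path) -> dict[str, list[str]]:
    """List files added, removed, or changed relative to the baseline."""
    old, new = digest_tree(baseline), digest_tree(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

The output only tells you where to look. Each added or changed file still needs a human review before you decide whether to keep or discard it.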

Rotate the full set

Rotate every secret the agent has handled, not only the one that obviously leaked. A compromised agent has read its own environment, so assume the attacker knows everything in it. Partial rotation leaves a working key behind and turns a closed incident into a second one.
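
A rotation plan that follows this rule can be sketched as below. It assumes the vault or KMS owner runs it in a place where the secret values are already accessible, with `secrets` mapping names to values and `evidence` holding the text of transcripts, logs, and tool outputs; all names are hypothetical. Nothing is ever dropped from the plan: surfacing only affects ordering.

```python
def surfaced_secrets(secrets: dict[str, str], evidence: list[str]) -> set[str]:
    """Return names of secrets whose value appears in any evidence text."""
    return {
        name
        for name, value in secrets.items()
        if value and any(value in text for text in evidence)
    }


def rotation_plan(secrets: dict[str, str], evidence: list[str]) -> dict[str, list[str]]:
    """Rotate everything; secrets seen in evidence simply go first."""
    surfaced = surfaced_secrets(secrets, evidence)
    return {
        "rotate_first": sorted(surfaced),
        "rotate_next": sorted(secrets.keys() - surfaced),
    }
```

A plain substring match misses encoded or truncated copies of a secret, which is one more reason the plan rotates the full set rather than only the matches.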

Phase 5 · Learn

The incident isn't closed when the agent comes back online. It's closed when the response is documented, the controls are updated, and you've shown the next response will be faster.

Write a post-incident report. Symptom, timeline, root cause, controls that held, controls that failed, and the remediation actions taken.
Update the threat ladder. If a control failed, the L-level for that vector regresses. Document what has to happen to recover it.
Update the relevant checklist. The pre-deployment review and any per-control checklist gain a line that keeps this incident class from recurring.
Drill the response. Schedule a tabletop in 90 days that replays the incident and confirms the response is now faster.
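
The report fields listed above can be captured in a simple record so that an incomplete report is easy to spot. This is an illustrative sketch; the class name, field names, and `missing_sections` helper are hypothetical rather than a prescribed template.

```python
from dataclasses import dataclass, field


@dataclass
class PostIncidentReport:
    symptom: str
    root_cause: str
    timeline: list[str] = field(default_factory=list)
    controls_held: list[str] = field(default_factory=list)
    controls_failed: list[str] = field(default_factory=list)
    remediation_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Name the sections that are still empty, in field order."""
        return [name for name, value in vars(self).items() if not value]
```

An empty section may be legitimate, for instance when no control failed, but it should be a deliberate statement rather than an omission.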

Roles to confirm in advance

The slowest part of a real response is finding who has the authority to act. Confirm these roles before an incident, not during one. If a single person fills several of them, that concentration is itself a risk worth recording.

Role	Responsibility
Agent owner	Holds authority over scope, credentials, and deployment.
Security on-call	Coordinates the response across systems.
Vault / KMS owner	Runs credential revocation and rotation.
Channel admin	Removes the bot from channels and disables webhooks.
Communications	Handles external-facing notifications when data was exposed.
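
The role table can be checked mechanically ahead of time. The sketch below takes a mapping from role to the person assigned, flags roles nobody holds, and flags people who hold more than one. The role names come from the table; `review_roles` and the result keys are hypothetical.

```python
from collections import defaultdict

REQUIRED_ROLES = [
    "Agent owner",
    "Security on-call",
    "Vault / KMS owner",
    "Channel admin",
    "Communications",
]


def review_roles(assignments: dict[str, str]) -> dict[str, object]:
    """Flag roles with nobody assigned and people who hold several roles."""
    unfilled = [role for role in REQUIRED_ROLES if not assignments.get(role)]
    held = defaultdict(list)
    for role in REQUIRED_ROLES:
        person = assignments.get(role)
        if person:
            held[person].append(role)
    concentrated = {p: roles for p, roles in held.items() if len(roles) > 1}
    return {"unfilled": unfilled, "concentrated": concentrated}
```

Both outputs are findings to record: an unfilled role is a gap, and a concentrated one is the risk described above.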

Checklist

Anti-patterns during response

Restarting the agent before snapshotting evidence.
Killing the container before exporting logs.
Treating "the agent denies it did X" as evidence.
Rotating only the credential that obviously leaked.
Skipping memory review because "the agent doesn't write to memory" (verify before you believe it).
Closing the incident without updating the relevant L-level.
