Your support automation agent stopped producing useful responses forty minutes ago. You do not know this yet. You are asleep. The first signal you will get is a user complaint in your Slack, probably around 6am. By then, hundreds of support tickets will have been handled poorly, and the cost of those bad interactions (churn, manual remediation, trust damage) will already be on the board.
This is not a hypothetical. It is the default state of most production AI agent deployments today. The teams that have avoided it are the ones that treated agent incident management as a first-class engineering concern before they needed it.
Incident response for traditional software has a clear starting point: something stopped working, and you can usually tell what and when. The service returned 500 errors starting at 02:17 UTC. The database connection pool was exhausted at 02:14 UTC. The deployment that went out at 01:45 UTC introduced the bug. The timeline is reconstructible from logs and metrics.
Agent incidents often do not have a clear starting point because the agent did not stop working. It kept running. It just started producing wrong outputs. Wrong here might mean: subtly off in a way that requires domain knowledge to catch, or confidently incorrect in a way that looks plausible to monitoring systems that only check for errors.
The second difference is causality. When a service breaks, the cause is usually in the code path: a bug, a dependency failure, resource exhaustion. When an agent degrades, the cause could be in the prompt (someone changed it), the tools (a downstream API changed its behavior), the model (the LLM provider updated the model weights), or the input distribution (the types of queries the agent is receiving shifted). Diagnosing which of these caused the incident is a fundamentally different kind of investigation.
You cannot respond to an incident you do not know about. For agents, detection requires something most teams do not have: output quality monitoring.
Infrastructure monitoring will tell you the agent is running. It will tell you run count, run latency, and infrastructure-level errors. What it will not tell you is whether the runs that completed successfully actually produced good outputs. For that, you need output quality checks built into your monitoring pipeline.
What this looks like in practice:
The alert should be generated by your monitoring system, not by a user. To make this work, you need thresholds defined ahead of time. What is the acceptable output quality pass rate for your agent? What is the acceptable cost-per-run range? What is the acceptable step count distribution?
These thresholds require you to know your baseline, which means you need to have been measuring these things before the incident. This is the ops practice most teams skip: establish the baseline, define the thresholds, and set the alerts before you need them.
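As a sketch of what that check can look like, assuming you already log completed runs with their output, cost, and step count somewhere queryable. The quality check and the alerting hook below are placeholders for whatever evaluation and paging you actually use:

```python
# A minimal sketch of an output-quality alert check. passes_quality_check and
# send_alert are hypothetical stand-ins for your own eval and paging integrations.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class AgentRun:
    run_id: str
    output: str
    cost_usd: float
    step_count: int

# Thresholds defined ahead of time, from a baseline measured before the incident.
QUALITY_PASS_RATE_FLOOR = 0.92   # alert if the sampled pass rate drops below this
COST_PER_RUN_CEILING = 0.40      # alert if the average cost per run exceeds this (USD)
STEP_COUNT_CEILING = 12          # alert if the average step count exceeds this

def check_agent_health(
    runs: Iterable[AgentRun],
    passes_quality_check: Callable[[str], bool],
    send_alert: Callable[[str], None],
) -> None:
    runs = list(runs)
    if not runs:
        send_alert("agent health check: no completed runs in window")
        return

    pass_rate = sum(passes_quality_check(r.output) for r in runs) / len(runs)
    avg_cost = sum(r.cost_usd for r in runs) / len(runs)
    avg_steps = sum(r.step_count for r in runs) / len(runs)

    if pass_rate < QUALITY_PASS_RATE_FLOOR:
        send_alert(f"output quality pass rate {pass_rate:.2f} below floor {QUALITY_PASS_RATE_FLOOR}")
    if avg_cost > COST_PER_RUN_CEILING:
        send_alert(f"avg cost per run ${avg_cost:.2f} above ceiling ${COST_PER_RUN_CEILING}")
    if avg_steps > STEP_COUNT_CEILING:
        send_alert(f"avg step count {avg_steps:.1f} above ceiling {STEP_COUNT_CEILING}")
```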
Once the alert fires, you move to mitigation. When your agent is actively degrading in production, you have a small set of options. The right one depends on the nature of the incident, but you need to know what they are before 2am.
Rollback to a previous version. If the degradation started after a prompt update or agent logic change, rolling back to the last known-good version is the fastest path to restoration. This requires that you version your prompts and agent configurations, and that you have a rollback mechanism that does not require a code deployment. The time to build that mechanism is not during an incident.
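A minimal sketch of what that can look like, assuming prompt versions live in a store the running agent reads at request time rather than in the deployed code. The file-backed store here is illustrative; a database or config service works the same way:

```python
# A minimal sketch of prompt/config versioning with rollback that does not
# require a code deployment. The file path is a hypothetical stand-in for
# wherever your versioned configs actually live.
import json
from pathlib import Path

STORE = Path("agent_config.json")

def save_version(version: str, config: dict) -> None:
    data = json.loads(STORE.read_text()) if STORE.exists() else {"versions": {}, "active": None}
    data["versions"][version] = config
    data["active"] = version
    STORE.write_text(json.dumps(data, indent=2))

def rollback(to_version: str) -> None:
    # Rolling back is just repointing "active" at a known-good version.
    data = json.loads(STORE.read_text())
    if to_version not in data["versions"]:
        raise ValueError(f"unknown version: {to_version}")
    data["active"] = to_version
    STORE.write_text(json.dumps(data, indent=2))

def load_active_config() -> dict:
    # The agent calls this on every run, so a rollback takes effect immediately.
    data = json.loads(STORE.read_text())
    return data["versions"][data["active"]]
```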
Disable the agent and fall back to a manual process. If rollback is not possible or the incident cause is unknown, disabling the agent and routing to a human-handled fallback may be the right call. This requires that the fallback exists and is reachable. Again, the time to build the fallback is before you need it.
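In its simplest form, that is a kill switch checked on every request, with the human queue as the other branch. A sketch, assuming a feature flag read from the environment and hypothetical routing helpers:

```python
# A minimal sketch of a kill switch with a human fallback path. The flag name
# and the queue helpers are hypothetical; the point is that the check runs on
# every request and the fallback route exists before the incident.
import os

def agent_enabled() -> bool:
    # Read the flag on every request (here from the environment, but a
    # feature-flag service works too), so flipping it needs no deployment.
    return os.environ.get("SUPPORT_AGENT_ENABLED", "true").lower() == "true"

def handle_ticket(ticket: dict) -> str:
    if agent_enabled():
        return run_agent(ticket)          # normal path
    return route_to_human_queue(ticket)   # fallback path, built before you need it

def run_agent(ticket: dict) -> str:
    return f"agent response for {ticket['id']}"  # placeholder for the real agent

def route_to_human_queue(ticket: dict) -> str:
    return f"ticket {ticket['id']} queued for a human"  # placeholder for the real queue
```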
Circuit break the degraded tool or dependency. If the root cause is a specific tool call — a search API returning bad results, a database query returning stale data — you may be able to disable just that tool and let the agent continue with degraded capability, rather than disabling it entirely.
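One way to sketch that is a per-tool circuit breaker: after enough consecutive failures, calls to the tool short-circuit for a cooldown period and the agent carries on without it. The thresholds and the definition of "failure" (an exception here; it could equally be a failed quality check on the tool's output) are assumptions:

```python
# A minimal sketch of a per-tool circuit breaker, so one degraded dependency
# can be disabled without taking down the whole agent.
import time

class ToolCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 300.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened, or None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.reset_after_s:
            # Cooldown elapsed: close the circuit and let traffic through again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, tool_fn, *args, **kwargs):
        if self.is_open():
            return None  # agent continues with degraded capability
        try:
            result = tool_fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return None
```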
Implement a cost guard. If the agent is in a runaway retry loop or accumulating cost unexpectedly, a cost guard that terminates runs above a threshold is a containment mechanism that limits blast radius while you investigate.
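A sketch of such a guard, assuming you can attribute a cost to each model or tool call inside the agent loop. The budget numbers are placeholders you would set from your own baseline:

```python
# A minimal sketch of a per-run cost guard: accumulate cost and step count as
# the agent runs, and terminate the run once either budget is exceeded.
class RunBudgetExceeded(Exception):
    pass

class CostGuard:
    def __init__(self, max_cost_usd: float = 1.00, max_steps: int = 20):
        self.max_cost_usd = max_cost_usd
        self.max_steps = max_steps
        self.cost_usd = 0.0
        self.steps = 0

    def record_step(self, step_cost_usd: float) -> None:
        # Call this after every model or tool call in the agent loop.
        self.cost_usd += step_cost_usd
        self.steps += 1
        if self.cost_usd > self.max_cost_usd or self.steps > self.max_steps:
            raise RunBudgetExceeded(
                f"run terminated: ${self.cost_usd:.2f} spent over {self.steps} steps"
            )
```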
After the incident is resolved, the post-incident review for agents has some additional questions compared to traditional software incidents: Did monitoring catch the degradation, or did a user? Was there a baseline to compare the degraded behavior against? Could you tell whether the cause was the prompt, a tool, the model, or the input distribution? Did a rollback path exist that did not require a code deployment? Was there a fallback to route to while you investigated?
Early in most teams' agent operations practice, the answer to most of these questions is "we did not have that." That is fine. The point of the post-incident review is to add it.
The single most important thing you can do for agent incident management is to build the practice before you have an incident. Define your SLOs. Establish your monitoring. Set your alerts. Write your runbooks. Test your rollback mechanism. Make sure the on-call rotation knows what they are on call for.
None of this is new thinking. It is SRE applied to a new system. The failure mode for most teams is not that they do not know the principles — it is that they ship the agent before they have the operations practice in place. Do not be that team.
Get alerted before your users are
Agent Opz monitors output quality, cost, and uptime across your entire agent fleet and alerts you before users notice.
Get early access