The SRE guide to AI agent reliability

Mar 2026 · 10 min read · Agent Opz Team

Site reliability engineering is the discipline of applying software engineering to operations problems. Its core insight is that reliability is a feature. It can be engineered like any other feature, and it demands the same rigor: clear targets, measurable metrics, systematic investigation, and continuous improvement.

Every principle of SRE applies to AI agents, and the adaptation required is smaller than most people expect. What agents do demand is specificity: the exact form that SLOs, error budgets, monitoring, and runbooks take for an agent system differs from their form for a traditional service. This guide maps those differences.

Service level objectives for agents

An SLO is a target for a reliability metric over a time window. For a web service, the canonical SLO is availability: 99.9% of requests return a 2xx response within 500ms. Simple to define, simple to measure.

For an agent, availability has two components that you need to define separately:

Infrastructure availability — the agent accepts and starts runs. This is equivalent to the web service availability SLO. It is relatively easy to measure: what fraction of run attempts result in a started run?

Output quality rate — the agent produces outputs that meet your quality standard. This is harder to measure because it requires defining and evaluating quality. But it is the SLO that actually matters to users. An agent that starts every run but produces wrong outputs 40% of the time is not meeting its reliability target.

A complete agent SLO set:

Example agent SLOs
Infrastructure availability: >99.5% of run attempts start within 5s
Output quality rate: >95% of completed runs pass quality validation
Latency (p95): <8s to first output token for synchronous agents
Cost-per-run (p99): <$0.05
Hard failure rate: <1% of runs result in an unhandled error

The right numbers depend on your agent and your users. The point is to have numbers, to measure against them, and to treat violation as an incident.
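One way to keep those numbers honest is to encode them as data that your alerting pipeline reads, rather than as prose in a doc. Here is a minimal sketch in Python; the `Slo` class and its field names are illustrative, not a real library:

```python
from dataclasses import dataclass

@dataclass
class Slo:
    """One reliability target over a rolling window (illustrative)."""
    name: str
    target: float         # minimum compliant fraction, e.g. 0.995
    window_days: int = 30

    def is_met(self, compliant_runs: int, total_runs: int) -> bool:
        if total_runs == 0:
            return True  # no traffic is not a violation
        return compliant_runs / total_runs >= self.target


# The example SLO set above, expressed as data an alerting job can evaluate.
AGENT_SLOS = [
    Slo("infra_availability", target=0.995),  # run starts within 5s
    Slo("output_quality", target=0.95),       # passes quality validation
    Slo("hard_failures", target=0.99),        # i.e. <1% unhandled errors
]
```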

Error budgets for agents

An error budget is the allowed amount of unreliability implied by your SLO. If your output quality SLO is 95%, your error budget is 5%: you are allowed 5% of runs to fail quality checks before you are in violation. The error budget frames reliability as a shared resource — it can be spent on shipping new features that introduce risk, and it is exhausted by incidents.

Error budgets change team behavior. When a team knows it has consumed 80% of its error budget for the month, it makes different decisions about whether to ship a risky agent change before month end. That is the point: the error budget makes the cost of unreliability visible in terms that drive good decisions.

For agents, error budgets should be tracked for output quality rate and hard failure rate separately. These are different budget pools with different drivers and different remediation actions.
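In code, budget accounting is a small calculation on top of run counts you already track. A hedged sketch with hypothetical names, keeping the two pools separate as described:

```python
from dataclasses import dataclass

@dataclass
class ErrorBudget:
    """Tracks one budget pool over a window (illustrative)."""
    slo_target: float  # e.g. 0.95 for a 95% output quality SLO

    def consumed_fraction(self, bad_runs: int, total_runs: int) -> float:
        """Fraction of the window's budget spent so far."""
        if total_runs == 0:
            return 0.0
        allowed = (1.0 - self.slo_target) * total_runs  # budget, in runs
        return bad_runs / allowed if allowed else 1.0


# Separate pools with separate drivers and separate remediations.
quality_budget = ErrorBudget(slo_target=0.95)
failure_budget = ErrorBudget(slo_target=0.99)

if quality_budget.consumed_fraction(bad_runs=412, total_runs=10_000) > 0.8:
    print("80% of the quality budget is gone: think twice before shipping")
```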

The monitoring stack for agent reliability

Traditional service monitoring has three layers: infrastructure metrics (CPU, memory, network), application metrics (request rate, latency, error rate), and business metrics (transactions processed, revenue, user engagement). Agent monitoring needs the same three layers with agent-specific definitions at each level.

Infrastructure layer: compute utilization, memory, network, cold start rate (for serverless agents), queue depth (for async agents). Standard infrastructure monitoring tools handle this.

Agent execution layer: run count, run success rate, latency (wall clock, not just LLM latency), step count distribution, tool call count and failure rate, retry rate, cost-per-run, token count per step. This layer requires agent-specific instrumentation. You need to emit these metrics from your agent orchestration layer.
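What that instrumentation can look like in practice: a wrapper around the run entry point that emits the execution-layer metrics on every run. This is a sketch, not a prescription; `emit` stands in for whatever metrics client you use (StatsD, OpenTelemetry), and the `agent.run(task)` interface and result fields are assumptions about your orchestrator:

```python
import time

def emit(metric: str, value: float, **tags) -> None:
    """Stand-in for your metrics client (StatsD, OTel, etc.)."""
    print(f"{metric}={value} {tags}")

def run_agent_instrumented(agent, task):
    """Wrap one run to emit the execution-layer metrics described above.

    Assumes `agent` exposes .name and .run(task), and that the result
    carries step_count, tool_failures, and cost_usd. All illustrative.
    """
    start = time.monotonic()
    emit("agent.run.count", 1, agent=agent.name)
    try:
        result = agent.run(task)
        emit("agent.run.success", 1, agent=agent.name)
        emit("agent.run.steps", result.step_count, agent=agent.name)
        emit("agent.run.tool_failures", result.tool_failures, agent=agent.name)
        emit("agent.run.cost_usd", result.cost_usd, agent=agent.name)
        return result
    except Exception:
        emit("agent.run.hard_failure", 1, agent=agent.name)
        raise
    finally:
        # Wall-clock latency, not just LLM latency.
        emit("agent.run.latency_s", time.monotonic() - start, agent=agent.name)
```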

Output quality layer: pass rate against your validation function, distribution of output characteristics (length, structure, sentiment if relevant), comparison of current output distribution against baseline. This layer is the hardest to build and the most important. Without it, your monitoring is measuring the wrong thing.
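A starting point for that layer is a per-window summary compared against a frozen baseline. The sketch below checks only pass rate and output length; real drift detection would compare richer distributions. The run fields are hypothetical:

```python
from statistics import mean

def quality_snapshot(runs: list[dict]) -> dict:
    """Summarize one window of completed runs.

    Each run dict is assumed to carry 'passed' (bool, from your
    validation function) and 'output_len' (int). Illustrative fields.
    """
    return {
        "pass_rate": sum(r["passed"] for r in runs) / len(runs),
        "mean_output_len": mean(r["output_len"] for r in runs),
    }

def drifted(current: dict, baseline: dict, rel_threshold: float = 0.2) -> bool:
    """Alert when any summary stat moves more than 20% off its baseline."""
    return any(
        abs(current[k] - baseline[k]) / baseline[k] > rel_threshold
        for k in baseline
    )
```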

Toil and automation

One of SRE's core concepts is toil: manual, repetitive operational work that could be automated. SRE practice calls for teams to spend no more than 50% of their time on toil, with the remainder on engineering work that reduces future toil.

For agent operations, common sources of toil, with the automation that removes each:

Manually reviewing outputs that users flag as wrong (automate with output validation in the run path)
Manually re-running failed runs (automate with retry policies, backoff, and per-run budget caps)
Manually chasing cost anomalies (automate with run-level cost guards and cost alerts)
Manually rolling back bad prompt changes (automate with versioned prompts and one-step rollback)

The automation behind each item is engineering work that eliminates ongoing operational toil. It is the SRE model applied to agents.

Runbooks for agent incidents

A runbook is a documented procedure for responding to a known type of incident. For agents, the runbooks you need most:

Output quality regression: Agent is completing runs but output quality has dropped. Steps: confirm the regression with output quality metrics. Check for recent prompt changes. Check for model provider incidents or model version changes. Check for distribution shift in input types. Mitigation: roll back to last known-good prompt version. Investigation: diff the current prompt against the rolled-back version. Evaluate on a sample of failing inputs to understand the regression.
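The "roll back to last known-good prompt version" step presumes prompts are versioned, immutable artifacts rather than strings edited in place. A minimal sketch of that shape, with hypothetical names:

```python
# Prompts as versioned, immutable artifacts: rollback is a pointer move,
# not an edit. All names here are illustrative.
PROMPTS = {
    "support_agent/v12": "You are a support agent. Answer from the docs...",
    "support_agent/v13": "You are a support agent. Always cite sources...",
}
ACTIVE = {"support_agent": "support_agent/v13"}
LAST_KNOWN_GOOD = {"support_agent": "support_agent/v12"}

def rollback(agent: str) -> str:
    """Mitigation: revert to the last version that passed evaluation."""
    ACTIVE[agent] = LAST_KNOWN_GOOD[agent]
    return ACTIVE[agent]
```

With this shape, the investigation step is literally a diff of two dictionary entries.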

Cost spike: Cost-per-run has increased significantly from baseline. Steps: check step count distribution — are runs taking more steps? Check retry rate — are retries increasing? Check tool call costs — are any external tools billing more? Check context length — is context accumulation contributing? Mitigation: implement a cost guard at the run level to prevent runaway spending while investigating.
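A run-level cost guard can be as small as a counter that every model and tool call passes through. A sketch, with an illustrative cap:

```python
class CostGuardExceeded(Exception):
    pass

class CostGuard:
    """Run-level spending cap: charge it after each model or tool call."""

    def __init__(self, max_usd_per_run: float = 0.25):
        self.max_usd = max_usd_per_run
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        self.spent += usd
        if self.spent > self.max_usd:
            # Fail the run loudly rather than spend without bound.
            raise CostGuardExceeded(
                f"run spent ${self.spent:.2f}, cap is ${self.max_usd:.2f}"
            )

# In the agent loop: guard.charge(step_cost_usd) after every billed call.
```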

Tool dependency failure: A downstream tool or API the agent depends on is failing. Steps: confirm which tool is failing and at what rate. Check the tool provider's status page. If the failure is partial, check whether the agent is falling back correctly. Mitigation: if the tool failure is complete, consider disabling agents that require it and routing to manual handling.
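Disabling agents that require a failing tool is easier when the decision is mechanical. One common shape is a circuit breaker over a sliding window of tool results; the thresholds here are placeholders to tune:

```python
from collections import deque

class ToolCircuitBreaker:
    """Trip when a tool's recent failure rate crosses a threshold."""

    def __init__(self, window: int = 50, max_failure_rate: float = 0.5):
        self.results = deque(maxlen=window)  # True = call succeeded
        self.max_failure_rate = max_failure_rate

    def record(self, success: bool) -> None:
        self.results.append(success)

    @property
    def open(self) -> bool:
        """When open, route affected runs to manual handling instead."""
        if len(self.results) < self.results.maxlen:
            return False  # not enough signal yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_failure_rate
```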

Latency regression: Agent runs are taking significantly longer than baseline. Steps: check LLM provider latency (is the provider having issues?). Check step count — are agents taking more steps? Check tool call latency — are downstream tools slow? Check context length — is context accumulation slowing later steps? Mitigation: if the cause is LLM provider latency, consider switching to a backup model endpoint.
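Switching to a backup endpoint is another decision worth making mechanical, so on-call flips a routing rule rather than editing agent code. A deliberately tiny sketch with hypothetical endpoint names:

```python
PRIMARY = "llm-primary"  # hypothetical endpoint identifiers
BACKUP = "llm-backup"

def pick_endpoint(p95_first_token_s: float, slo_s: float = 8.0) -> str:
    """Route to the backup while the primary is breaching the latency SLO."""
    return BACKUP if p95_first_token_s > slo_s else PRIMARY
```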

Chaos engineering for agents

Chaos engineering is the practice of intentionally introducing failures to test a system's resilience. The classic form is terminating random instances to test fault tolerance. For agents, chaos engineering takes different forms:

Injecting tool failures: make a dependency return errors or time out, and verify the agent degrades the way you designed it to
Injecting model latency: slow responses from the LLM endpoint, and verify that timeouts and alerts fire
Injecting malformed inputs: feed edge-case inputs, and verify that quality validation catches the bad outputs
Exercising rollback: deploy a deliberately degraded prompt version in staging and roll it back

The last one is particularly important: if your rollback mechanism has never been tested, you do not know if it works. Test it before you need it.
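Tool failure injection, the first item above, can be a ten-line wrapper applied in staging. The sketch below assumes your tools are plain callables:

```python
import random

def chaotic(tool_fn, failure_rate: float = 0.1, seed: int | None = None):
    """Wrap a tool call so it fails randomly. Staging only.

    Use this to verify the agent's fallback behavior and to rehearse
    the tool-dependency-failure runbook before a real outage.
    """
    rng = random.Random(seed)

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("chaos: injected tool failure")
        return tool_fn(*args, **kwargs)

    return wrapper
```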

The reliability mindset for agent teams

The most important thing SRE transferred from Google's practices to the broader industry was not a tool or a framework — it was a mindset. Reliability is not the natural state of complex systems. It is an engineered property that requires deliberate effort to create and maintain.

That mindset applies to AI agents. Your agent will not remain reliable without active effort. The prompt will drift. The input distribution will change. Model providers will update their models. Tool dependencies will change their behavior. The reliability you have today is not guaranteed tomorrow.

Building an agent reliability practice means accepting that reality and investing in the systems — monitoring, alerting, deployment safety, incident response — that let you detect and respond to reliability changes faster than your users notice them.

The SRE toolkit for your agent fleet

Agent Opz gives you SLO tracking, output quality monitoring, deployment safety, and runbook-ready incident data — built for agents from the ground up.

Get early access