SREs Don't Need Replacing, They Need Pairing
Most SRE work follows a pattern. An alert fires, you check the metrics, pull the logs, form a theory, fix it. Then you do the same thing next week. That sequence is a workflow. When agents can execute it and produce structured data at every step, the work changes shape.
Most SRE work follows a pattern. An alert fires. You look at the metrics. You check the logs. You correlate what you see with what you know about the system. You form a theory, verify it, fix it. Then you do the same thing next week when the same class of issue shows up somewhere else.
Agents are built for this work. Not because they're smarter than you, but because they don't get bored, they don't skip steps, and they can hold the context of a hundred resources while methodically working through each one.
Why agents fit here
Incident investigation isn't one big decision. It's dozens of small ones. What changed? Which service is actually failing? Is this cause or symptom? What should I check next? The hard part isn't any individual step; it's maintaining context across all of them. You're switching between Grafana, your log platform, kubectl, your runbook, a Slack thread from the last time this happened, and the mental model of how these services actually depend on each other.
That's where humans drop things. You skip a step because you're pretty sure it's not the problem. You forget to check the upstream dependency because the logs from the downstream service looked suspicious enough. You lose track of which services you've already verified. An agent executing a structured investigation doesn't get impatient, jump ahead, or skip verification steps. It works through each one in order, carries the full context forward, and doesn't take shortcuts based on intuition. Sometimes the methodical approach finds what the shortcut would have missed.
The investigation is a workflow
Once you look at an investigation this way, something becomes obvious: it's already a workflow. Your observability stack has the data. The problem has never been the data. It's that a human has to stitch it all together in their head at 2am under pressure.
Say an alert fires on elevated error rates. You open Grafana, see the spike, check which services are affected, pull logs from the time window, notice a pattern in the errors, check the upstream dependency. Every one of those steps produces data that feeds the next one.
In swamp, you write extensions that know how to talk to Prometheus, query your log platform, and check service health. An agent takes those models, shapes them into the right investigation workflow for the problem at hand, and executes it. Every step produces versioned, immutable data: the metrics snapshot, the log analysis, the health check results, the root cause assessment.
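To make the shape concrete, here is a minimal Python sketch of steps that each produce a versioned, read-only record and pass it forward. This is not swamp's API. Every name in it (`StepResult`, `run_investigation`, the three stub steps) is hypothetical, and the stubs return made-up values where a real extension would query Prometheus or a log platform.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping

# Hypothetical names throughout; this only illustrates the idea, not swamp itself.

@dataclass(frozen=True)
class StepResult:
    step: str
    version: int
    data: Mapping  # read-only view of the step's output

Step = Callable[[dict], dict]

def run_investigation(alert: dict, steps: list[tuple[str, Step]]) -> list[StepResult]:
    """Run steps in order; each step sees the alert plus every earlier output."""
    context = {"alert": alert}
    results = []
    for version, (name, step) in enumerate(steps, start=1):
        output = step(context)
        # Store a read-only copy so later steps cannot rewrite earlier evidence.
        results.append(StepResult(name, version, MappingProxyType(dict(output))))
        context = {**context, name: output}
    return results

# Stub steps with placeholder values standing in for real queries.
def metrics_snapshot(ctx: dict) -> dict:
    return {"service": ctx["alert"]["service"], "error_rate": 0.12}

def log_analysis(ctx: dict) -> dict:
    spiking = ctx["metrics"]["error_rate"] > 0.05
    return {"pattern": "upstream timeout" if spiking else None}

def root_cause(ctx: dict) -> dict:
    return {"service": ctx["metrics"]["service"], "rootCause": ctx["logs"]["pattern"]}

results = run_investigation(
    {"service": "payment-gateway"},
    [("metrics", metrics_snapshot), ("logs", log_analysis), ("diagnosis", root_cause)],
)
for r in results:
    print(r.version, r.step, dict(r.data))
```

The point of the sketch is the data flow: the diagnosis step never re-reads prose, it reads the structured output of the steps before it.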
The diagnosis and the fix are connected through the data model, not through prose in a chat window. The agent reads the investigation results, checks the pre-flight constraints, and executes the fix. The data from "what's wrong" feeds directly into "how to fix it" through typed CEL expressions, not through an LLM re-reading its own output and hoping it remembers correctly.
swamp data query 'workflowName == "incident-response" && specName == "diagnosis"' \
  --select '{"service": attributes.service, "issue": attributes.rootCause, "action": attributes.recommendedAction}'

service             issue                     action
─────────────────── ────────────────────────  ──────────────────
payment-gateway     upstream timeout          circuit-breaker
auth-service        connection pool exhausted restart with drain
cache-layer         memory pressure           scale replicas
When the remediation workflow runs, it reads this data to decide what to do.
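As a rough illustration of that hand-off, the Python sketch below turns the three diagnosis rows above into a remediation plan, gated by a pre-flight check. It is a sketch under assumptions, not how swamp does it: the `Diagnosis` type, the `preflight` rule about healthy replicas, and the replica counts are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch: none of these names or rules come from swamp.

@dataclass(frozen=True)
class Diagnosis:
    service: str
    root_cause: str
    recommended_action: str

# Actions this sketch knows how to run (the ones in the query output above).
KNOWN_ACTIONS = {"circuit-breaker", "restart with drain", "scale replicas"}

def preflight(d: Diagnosis, healthy_replicas: dict[str, int]) -> list[str]:
    """Return reasons the action is unsafe; an empty list means it may proceed."""
    problems = []
    if d.recommended_action not in KNOWN_ACTIONS:
        problems.append(f"unknown action: {d.recommended_action}")
    # Assumed rule: draining needs a second healthy replica to take the traffic.
    if d.recommended_action == "restart with drain" and healthy_replicas.get(d.service, 0) < 2:
        problems.append("fewer than 2 healthy replicas; drain would drop traffic")
    return problems

def plan_remediation(diagnoses: list[Diagnosis], healthy_replicas: dict[str, int]) -> list[dict]:
    """Decide per service whether the recommended action runs or is blocked."""
    plan = []
    for d in diagnoses:
        problems = preflight(d, healthy_replicas)
        plan.append({
            "service": d.service,
            "action": d.recommended_action,
            "status": "blocked" if problems else "run",
            "reasons": problems,
        })
    return plan

diagnoses = [
    Diagnosis("payment-gateway", "upstream timeout", "circuit-breaker"),
    Diagnosis("auth-service", "connection pool exhausted", "restart with drain"),
    Diagnosis("cache-layer", "memory pressure", "scale replicas"),
]
healthy = {"payment-gateway": 3, "auth-service": 1, "cache-layer": 2}  # made-up counts
plan = plan_remediation(diagnoses, healthy)
for entry in plan:
    print(entry["service"], entry["action"], entry["status"], entry["reasons"])
```

A blocked entry carries its reasons as data, so a human making the call sees why the guardrail fired rather than a bare failure.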
It compounds
Building an investigation workflow for a class of incident takes effort. You need extensions that talk to your observability stack, model types that understand your service topology, checks that validate whether a remediation is safe. That's SRE work: encoding what you know about how your systems behave.
But the workflow survives the incident. The next time that class of issue fires, the agent runs the existing workflow and the investigation takes minutes instead of an hour. The structured data from the previous incident is still there, so the agent can compare what it's seeing now against what happened before. Every run that reveals a gap is an opportunity to add a check or a log query, and the investigation gets more thorough over time without anyone maintaining a runbook.
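One way to picture the comparison against earlier incidents is the small Python sketch below. It assumes each stored diagnosis records which checks were run, which is an assumption of this example rather than something swamp is known to store, and the history entries are invented.

```python
def compare_with_history(current: dict, history: list[dict]) -> dict:
    """Count earlier incidents with the same service and root cause, and list
    checks those runs performed that the current run did not."""
    same_cause = [
        h for h in history
        if h["service"] == current["service"] and h["rootCause"] == current["rootCause"]
    ]
    # Union of every check seen in matching incidents (empty set if none match).
    earlier_checks = set().union(*(h["checks"] for h in same_cause))
    return {
        "recurrence": len(same_cause),
        "missing_checks": sorted(earlier_checks - set(current["checks"])),
    }

# Invented records, shaped like the diagnosis data queried earlier.
history = [
    {"service": "auth-service", "rootCause": "connection pool exhausted",
     "checks": ["metrics", "logs", "upstream-health"]},
    {"service": "auth-service", "rootCause": "memory pressure",
     "checks": ["metrics"]},
]
current = {"service": "auth-service", "rootCause": "connection pool exhausted",
           "checks": ["metrics", "logs"]}

comparison = compare_with_history(current, history)
print(comparison)
```

A non-empty `missing_checks` list is the kind of gap that becomes the next check added to the workflow.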
A runbook tells someone what to do. This does it, produces structured evidence, and gets better every time it runs.
More SREs, not fewer
SREs have spent years building the exact mental model that agents need: what operations are safe, what order things need to happen in, what to check before and after a change, what failure looks like and how to recover from it. I wrote about this in the encoding knowledge post. SREs aren't learning a new discipline to work with agents. They're encoding the discipline they already have.
The skepticism SREs bring is healthy. You've seen what happens when untested automation runs against production. The question isn't whether agents should touch your infrastructure. It's whether the system around them is rigorous enough to make it safe. Deterministic execution against typed schemas with pre-flight checks is how you get there.
We've seen this pattern before. Better automation doesn't reduce operational complexity. It increases the amount of infrastructure people are willing to build and operate. The bottleneck moves from execution to judgment. When agents handle the investigation and execute remediations against encoded guardrails, you need more people making the calls: is this remediation safe, should we scale or fix the underlying issue, is this alert noise or the early signal of something bigger. That's the work SREs have always wanted to spend their time on instead of grepping logs at 2am.