The pager goes off at 3:11 a.m. It's the fifth time tonight, and it's the same alert: HighMemoryUsage on a node that's running a memory-mapped cache doing exactly what it was designed to do. You ack it half-asleep, knowing it'll fire again in twelve minutes. By the time the real incident shows up at 4:40 — a slow API degradation that's quietly eating your error budget — you're too fried to see it clearly. That's not a tooling failure. That's a design failure, and most of us have lived it.

I run production OpenStack, Kubernetes, Terraform, and the observability stack that watches all of it. I've spent more nights than I'd like fighting my own alerts. So when "AI for SRE" started showing up in every vendor deck, my first reaction was a tired no. The last thing on-call needs is an autonomous bot deciding to restart my database at 3 a.m. based on a hunch.

But there's a version of this that actually works, and it has nothing to do with autonomy. It's about using AI for the narrow set of things it's genuinely good at — clustering noise, summarizing storms, drafting hypotheses, surfacing the right runbook — while a human stays the final decision-maker. AI triages and proposes. You decide what pages a human and what gets fixed. That's what I mean by humanizing AI: not making the machine more human, but using the machine to keep the on-call human rested, focused, and in control.