Prompt Engineering Patterns for SRE Playbooks and Postmortems

Originally published on kuryzhev.cloud

The Scenario

Your runbook was written six months ago by someone who no longer works here — and it's 3am, three payments-api pods are OOMKilling in a cascade, and the on-call engineer is staring at a Confluence page that references a Datadog dashboard that got renamed in February. This is the real failure mode nobody talks about in postmortems: the documentation debt that compounds silently until it explodes during a P1.

I've been in that seat. The thing that changed how I handle incident documentation wasn't a better wiki tool or a stricter postmortem template. It was treating prompt engineering for SRE workflows as reusable infrastructure — not one-off ChatGPT queries. Generic prompts fail in SRE contexts for a specific reason: they have no system topology, no severity framing, and no structured output contract. You ask "how do I fix an OOMKill?" and you get a textbook answer that has nothing to do with your 512Mi memory limit, your Redis connection pool, or your GKE 1.29 cluster. What you actually need is a prompt pattern that injects your environment's context and enforces a structured response you can act on immediately.
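As a minimal sketch of what such a prompt pattern might look like, the Python below injects environment context into a prompt and checks the model's reply against a structured output contract. Everything named here is an assumption made for illustration and does not come from the original post: the `IncidentContext` fields, the four contract keys, and the helper names are all hypothetical. The call to an actual LLM is deliberately left out.

```python
import json
from dataclasses import dataclass, field

# Hypothetical output contract: keys the model's reply must contain.
REQUIRED_KEYS = ("diagnosis", "immediate_actions", "verification", "escalation")


@dataclass
class IncidentContext:
    """Environment facts injected into every prompt (field names are illustrative)."""
    service: str
    severity: str
    symptom: str
    topology: dict = field(default_factory=dict)


def build_runbook_prompt(ctx: IncidentContext) -> str:
    """Render a prompt carrying topology, severity framing, and an output contract."""
    topology = "\n".join(
        f"- {key}: {value}" for key, value in sorted(ctx.topology.items())
    )
    return (
        f"You are assisting an on-call SRE during a {ctx.severity} incident.\n"
        f"Service: {ctx.service}\n"
        f"Symptom: {ctx.symptom}\n"
        f"System topology:\n{topology}\n"
        "Respond with a single JSON object containing these keys: "
        f"{', '.join(REQUIRED_KEYS)}.\n"
        "Base every step on the topology above; do not give generic advice."
    )


def validate_response(raw: str) -> dict:
    """Reject any model reply that is not a JSON object with the contract keys."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    return data


# Example context built from the details mentioned in the scenario.
example_ctx = IncidentContext(
    service="payments-api",
    severity="P1",
    symptom="three pods OOMKilled in a cascade",
    topology={
        "memory_limit": "512Mi",
        "cache": "Redis connection pool",
        "cluster": "GKE 1.29",
    },
)
example_prompt = build_runbook_prompt(example_ctx)
```

Keeping the contract keys in one constant means the prompt text and the validator cannot drift apart, which is one way to treat the prompt as reusable infrastructure and not a one-off query.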

This post walks through three production-tested prompt engineering patterns for SRE playbooks: structured context injection for runbook generation, two-step postmortem synthesis, and LLM-as-reviewer for runbook auditing. These patterns are most powerful when maintained like code — versioned, reviewed, and updated after every incident.

Related reading

Building an AI SRE That Learns From Every Outage: Inside Nexus Sentinel

What is SRE? A Beginner's Guide to Site Reliability Engineering

Runbooks in Minutes: An On-Call Incident Copilot with HazelJS

Designing a Developer-Centric Incident Response Playbook

How We Stopped Losing 45 Minutes Every Time Production Broke

We stopped writing eval cases by hand. Now every prod incident becomes one.
