Originally published on kuryzhev.cloud
The Scenario
Your runbook was written six months ago by someone who no longer works here. It's 3am, three payments-api pods are being OOMKilled in a cascade, and the on-call engineer is staring at a Confluence page that references a Datadog dashboard that was renamed in February. This is the failure mode that rarely makes it into postmortems: documentation debt that compounds silently until it explodes during a P1.
I've been in that seat. The thing that changed how I handle incident documentation wasn't a better wiki tool or a stricter postmortem template. It was treating prompt engineering for SRE workflows as reusable infrastructure rather than one-off ChatGPT queries. Generic prompts fail in SRE contexts for a specific reason: they carry no system topology, no severity framing, and no structured output contract. You ask "how do I fix an OOMKill?" and you get a textbook answer that has nothing to do with your 512Mi memory limit, your Redis connection pool, or your GKE 1.29 cluster. What you actually need is a prompt pattern that injects your environment's context and enforces a structured response you can act on immediately.
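To make that idea concrete, here is a minimal sketch of context injection plus an output contract. The names (`IncidentContext`, `build_prompt`, `parse_response`) and the three JSON keys are hypothetical choices for illustration, not part of any particular tool, and the call to the model itself is left out on purpose.

```python
import json
from dataclasses import dataclass

# Hypothetical output contract: the keys we require the model to return.
REQUIRED_KEYS = ("likely_cause", "immediate_actions", "verification_steps")


@dataclass
class IncidentContext:
    """Environment facts that a generic prompt would otherwise be missing."""
    service: str
    severity: str
    symptom: str
    topology: dict[str, str]


def build_prompt(ctx: IncidentContext) -> str:
    """Render the context and the output contract into a single prompt string."""
    topology_lines = "\n".join(f"- {key}: {value}" for key, value in ctx.topology.items())
    return (
        f"Severity: {ctx.severity}\n"
        f"Service: {ctx.service}\n"
        f"Symptom: {ctx.symptom}\n"
        f"Environment:\n{topology_lines}\n\n"
        "Respond with a single JSON object containing exactly these keys: "
        f"{', '.join(REQUIRED_KEYS)}. Each value must be a list of strings."
    )


def parse_response(raw: str) -> dict:
    """Reject any model reply that does not satisfy the output contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"response missing keys: {missing}")
    return data


# Example values taken from the scenario above.
ctx = IncidentContext(
    service="payments-api",
    severity="P1",
    symptom="three pods OOMKilled in a cascade",
    topology={
        "memory_limit": "512Mi",
        "cache": "Redis connection pool",
        "cluster": "GKE 1.29",
    },
)
prompt = build_prompt(ctx)
```

The useful part is the split: `build_prompt` is the piece you version and review like code, and `parse_response` is what lets you fail fast when the model drifts from the contract instead of discovering it mid-incident.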
This post walks through three production-tested prompt engineering patterns for SRE playbooks: structured context injection for runbook generation, two-step postmortem synthesis, and LLM-as-reviewer for runbook auditing. These patterns are most powerful when maintained like code: versioned, reviewed, and updated after every incident.






