The prompt regression that cost me the worst Sunday of 2025 was twelve characters long.

I had been tuning the system prompt of an LLM feature for about a month. The feature was a structured-output endpoint that took a free-form user message and returned a JSON object with five fields. It worked. Users liked it. Conversion was up. On a Friday afternoon I added what I thought was a small clarification to the prompt: a sentence telling the model to "prefer concise wording" in one of the fields. I deployed. I went into the weekend feeling good about my week.

By Sunday morning, support had three tickets and a Slack DM from our biggest customer. The endpoint was returning malformed JSON on roughly 14% of requests. Not always. Just often enough that retries hid it from our dashboards. The "prefer concise wording" sentence had pushed the model into a mode where it sometimes dropped a comma in the JSON output, because conciseness and valid JSON syntax were apparently in tension in a way I had not anticipated.

The fix was rolling back the prompt. The fix took ninety seconds. The problem was that I had no idea which version of the prompt was running. I had edited the prompt in three places that month: the source code, a draft in Notion, a quick test in the model playground. The deployed version was a copy-paste of one of them. I had no diff. I had no test for "does this prompt still return valid JSON for our golden set of inputs." I had no rollback button. The Sunday was spent reconstructing what I had typed, retesting it by hand against a list of fifty inputs I dug out of the logs, and deploying the rebuilt prompt.