Each roast took 50 seconds per upload. Quality was unknown: we had a feeling, not data. The prompt had been written "by instinct" and never seriously evaluated. The question was simple: how do you know whether a prompt is good, and how do you improve it without spending the whole day reading roasts by hand?
The answer: automate the evaluation work using AI itself, in a loop. Write a tool that sends 30 photos to Claude, measures quality metrics, and produces a report. Modify the prompt, rerun, compare. Five iterations later, here's what we learned.
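To make the loop concrete, here is a minimal Go sketch of the "run, measure, report" step. It is an illustration under assumptions, not the actual tool: the Sample and Report types, the two metrics (latency and whether the output parsed), and the roast callback are all hypothetical stand-ins for whatever the real harness measures.

```go
package main

import (
	"sort"
	"time"
)

// Sample is one evaluated photo. The fields are illustrative assumptions,
// not the metrics the real tool records.
type Sample struct {
	Photo     string
	Latency   time.Duration
	ValidJSON bool
}

// Report summarizes one run of a prompt version over the photo set.
type Report struct {
	PromptVersion string
	N             int
	ValidRate     float64
	MedianLatency time.Duration
}

// Summarize aggregates per-photo samples into a Report.
func Summarize(version string, samples []Sample) Report {
	r := Report{PromptVersion: version, N: len(samples)}
	if len(samples) == 0 {
		return r
	}
	latencies := make([]time.Duration, 0, len(samples))
	valid := 0
	for _, s := range samples {
		latencies = append(latencies, s.Latency)
		if s.ValidJSON {
			valid++
		}
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	r.MedianLatency = latencies[len(latencies)/2] // upper median when N is even
	r.ValidRate = float64(valid) / float64(len(samples))
	return r
}

// Evaluate calls roast once per photo, times each call, and summarizes the run.
// roast is a hypothetical callback that reports whether the output was usable.
func Evaluate(version string, photos []string, roast func(photo string) bool) Report {
	samples := make([]Sample, 0, len(photos))
	for _, p := range photos {
		start := time.Now()
		ok := roast(p)
		samples = append(samples, Sample{Photo: p, Latency: time.Since(start), ValidJSON: ok})
	}
	return Summarize(version, samples)
}
```

Running Evaluate once per prompt version over the same photo set gives two Report values that can be compared side by side, which is the "modify, rerun, compare" part of the loop.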
Context: RateMyFace
RateMyFace is an AI roast-by-photo site: the user uploads a photo, Claude analyzes it and generates satirical text along with a score and a "tier label" (e.g. "WiFi Signal With Legs"). The result is rendered as a collectible trading card.
The stack: Go monolith, SQLite, Claude CLI (claude --print) called as a subprocess. The prompt asked Claude to produce 5 roast styles (standard, rap, Shakespeare, passive-aggressive mom, Gordon Ramsay) + a score + a label, all in JSON.
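As a rough sketch of what that subprocess call and JSON contract could look like in Go: the RoastResult field names and JSON keys below are assumptions (the post only says the output contains five roast styles, a score, and a label), and how the photo reaches Claude is left out because the source does not describe it. Only the claude --print invocation itself comes from the stack described above.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// RoastResult mirrors the JSON the prompt asks for.
// The key names are hypothetical; the real schema may differ.
type RoastResult struct {
	Roasts    map[string]string `json:"roasts"` // style name -> roast text
	Score     int               `json:"score"`
	TierLabel string            `json:"tier_label"`
}

// ParseRoast extracts and validates the JSON object from raw CLI output.
func ParseRoast(raw []byte) (RoastResult, error) {
	var r RoastResult
	// Keep only the outermost {...} in case the output has surrounding text.
	start := bytes.IndexByte(raw, '{')
	end := bytes.LastIndexByte(raw, '}')
	if start < 0 || end < start {
		return r, fmt.Errorf("no JSON object in output")
	}
	if err := json.Unmarshal(raw[start:end+1], &r); err != nil {
		return r, fmt.Errorf("decode roast: %w", err)
	}
	if len(r.Roasts) != 5 {
		return r, fmt.Errorf("expected 5 roast styles, got %d", len(r.Roasts))
	}
	return r, nil
}

// RunClaude invokes the CLI in print mode and parses its stdout.
// Passing the prompt as a positional argument is an assumption about
// how the site calls the CLI.
func RunClaude(ctx context.Context, prompt string) (RoastResult, error) {
	out, err := exec.CommandContext(ctx, "claude", "--print", prompt).Output()
	if err != nil {
		return RoastResult{}, fmt.Errorf("claude --print: %w", err)
	}
	return ParseRoast(out)
}
```

Keeping the parsing separate from the subprocess call means the validation can be exercised on saved outputs without paying for a 50-second roast each time.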







