I wanted to know which AI notetaker transcribes most accurately — Granola, Fathom, or Otter. So I did the obvious thing: I recorded a real meeting, ran it through all three, and compared the transcripts.

That experiment is worthless, and it took me one afternoon to see why. To score a transcript you need the correct transcript to score it against. But the only record of what was actually said in my meeting was… the transcripts I was trying to grade. I was marking the exam with the students' own answers. There was no answer key.

The fix turned out to be the interesting part, and it's a trick worth stealing for any speech-to-text evaluation: if you don't have ground truth, manufacture it. Write the script first, synthesize the audio from it, and now the exact words are something you typed — not something you have to reconstruct after the fact.
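
A minimal sketch of what "keeping the answer key" can look like in practice. Everything here is an assumption for illustration: the two-line `SCRIPT` is a hypothetical stand-in for the full meeting, word error rate (WER) is one common scoring choice rather than the metric named above, and the `normalize` rules are a guess at reasonable tokenization. The audio synthesis step is left out, since it depends on whichever text-to-speech tool is used.

```python
import re

# Hypothetical stand-in for the typed script; this text IS the ground truth.
SCRIPT = [
    ("Sarah", "Churn rose from 5.2% to 6.8% in Q3."),
    ("David", "Let's fix SSO first."),
]
REFERENCE = " ".join(text for _, text in SCRIPT)


def normalize(text):
    """Lowercase and tokenize, keeping figures like $19 or 5.2% as single tokens."""
    return re.findall(r"\$?\d+(?:\.\d+)?%?|[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical notetaker output that mishears "Q3" as "queue three".
hypothesis = "churn rose from 5.2% to 6.8% in queue three let's fix sso first"
print(round(word_error_rate(REFERENCE, hypothesis), 3))
```

With the sample above, the mishearing costs one substitution and one insertion, so the score is 2 errors over 12 reference words. Because the reference is the typed script, the score does not depend on anyone's memory of the meeting.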

Generate the meeting, keep the answer key

I wrote an 80-second, two-speaker product meeting and deliberately stuffed it with the tokens that actually matter in a work call and that ASR engines love to fumble: quarter labels (Q3, Q2), percentages (5.2%, 6.8%, 41%, 58%), dollar figures ($16 → $19), jargon (churn, cohort, activation, SSO, deep links, p95, P1), names (Sarah, David, Priya, Marcus), and a few real action items with deadlines.
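
Since those tokens are the ones that matter, a transcript can also be checked against them directly. The sketch below is an illustration under stated assumptions: `KEY_TOKENS` is a partial list drawn from the categories above, the sample transcript is invented, and the match is deliberately strict (case-insensitive but otherwise verbatim), so a spoken-out form such as "19 dollars" counts as a miss for `$19`.

```python
# Partial, illustrative list drawn from the categories above; extend it
# with the full set from the real script.
KEY_TOKENS = ["Q3", "5.2%", "6.8%", "$19", "churn", "SSO", "p95", "Priya", "Marcus"]


def missing_tokens(transcript, key_tokens=KEY_TOKENS):
    """Return the key tokens that never appear verbatim in the transcript.

    Matching is case-insensitive; surrounding punctuation is stripped from
    each whitespace-separated word before comparing.
    """
    words = {w.strip(".,;:!?()\"'") for w in transcript.lower().split()}
    return [t for t in key_tokens if t.lower() not in words]


# Invented transcript for illustration only, not real notetaker output.
sample = (
    "Churn went from 5.2% to 6.8% in Q3, so Priya wants SSO fixed "
    "and Marcus will watch P95. Price moves to 19 dollars."
)
print(missing_tokens(sample))
```

A looser variant could map number words and currency phrases back to digits before comparing; whether that counts as a correct transcription is a judgment call that is worth fixing before scoring any tool.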