This article originally appeared in The Forward Pass, a weekly newsletter for ML engineers who ship.

The Synthetic Data Trap: When It Helps, When It Lies

When it helps, when it lies, and what to watch for in your eval set.

by Maxim Enis

Synthetic data has quietly become the default answer to every data problem. Not enough training examples? Generate them. Edge cases underrepresented? Synthesize them. Eval set feels thin? Ask the model to write more questions. I have seen this pattern across dozens of teams in the last year, and the results are uneven in a very specific way: synthetic data works well for training augmentation, but when you use it to build eval sets it corrupts them in ways that stay invisible until you ship something embarrassingly wrong.