The Synthetic Data Trap: When It Helps, When It Lies

This article originally appeared in The Forward Pass, a weekly newsletter for ML engineers who ship. Get a free issue every month →

When it helps, when it lies, and what to watch for in your eval set.

by Maxim Enis · 5 min read

Synthetic data has quietly become the default answer to every data problem. Not enough training examples? Generate them. Edge cases underrepresented? Synthesize them. Eval set feels thin? Ask the model to write more questions. I have seen this pattern across dozens of teams in the last year, and the results are uneven in a very specific way: synthetic data works well when you use it for training augmentation, and it corrupts your evals in ways that are invisible until you ship something embarrassingly wrong.
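One way to make that split concrete is to tag every example with its provenance and never let generated rows reach the held-out set. The sketch below is a minimal illustration, not a prescribed pipeline: the `Example` record, its `source` field, and the `split_by_provenance` helper are all hypothetical names, and it assumes you can reliably label which rows were generated.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str
    source: str  # hypothetical provenance tag: "human" or "synthetic"


def split_by_provenance(examples, eval_fraction=0.2, seed=0):
    """Hold out eval data from human-sourced rows only.

    Synthetic rows are used purely as training augmentation.
    """
    if not 0.0 <= eval_fraction <= 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")

    human = [ex for ex in examples if ex.source == "human"]
    # Anything not explicitly tagged "human" is treated as synthetic,
    # so untagged rows can never leak into the eval set.
    synthetic = [ex for ex in examples if ex.source != "human"]

    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(human)

    n_eval = int(len(human) * eval_fraction)
    eval_set = human[:n_eval]
    train_set = human[n_eval:] + synthetic
    return train_set, eval_set
```

Treating every row without an explicit "human" tag as synthetic is a deliberately conservative choice here: a mislabeled row costs you one eval example, not a contaminated benchmark.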
