Training data has always been the hard part of machine learning. Not the architecture, not the compute, but the data itself. You need enough of it, you need it clean and diverse, and you need it to cover the cases that actually matter. Before LLMs, that mostly meant labeling pipelines and feature engineering. With LLMs, the problem doesn't go away, but it shifts. When you fine-tune a foundation model for a specific domain, you're teaching behavior: how the model should interpret requests, handle ambiguity, and recognize when a request is genuinely impossible. We’ve learned that the last of these is the hardest to teach.
It's hardest because production data has a blind spot. Every example in a production training corpus is a success story: the model did something right, it got evaluated, it shipped. The cases where it should have refused, the edge cases, the impossible requests… those never make it into the logs. So you end up with a model trained entirely on successful queries. When it hits something it can't fulfill, there's no learned behavior to fall back on. It improvises, usually badly.
We ran into this while building Sidekick, our AI assistant for merchants. Sidekick works on two layers: an outer planner that interprets the merchant's overall intent, and a set of specialized skill models that each handle a specific capability. Given a request like "send a discount to my best customers", the planner routes the work to the skills it needs: segmentation, analytics, email, and so on. As we covered in our article on building production-ready agentic systems, keeping those skills performing well takes continuous work.
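To make the two-layer shape concrete, here is a minimal Python sketch. It is not Sidekick's implementation: the skill names come from the example above, but the handler functions, the `KEYWORDS` table, and the `plan` and `run` helpers are all hypothetical, and a simple keyword lookup stands in for the planner model.

```python
from typing import Callable

# Hypothetical skill handlers. In a real system each would be a specialized
# model; here they only return a description of the work they would do.
def segmentation_skill(request: str) -> str:
    return f"[segmentation] build a customer segment for: {request}"

def analytics_skill(request: str) -> str:
    return f"[analytics] rank customers for: {request}"

def email_skill(request: str) -> str:
    return f"[email] draft a message for: {request}"

SKILLS: dict[str, Callable[[str], str]] = {
    "segmentation": segmentation_skill,
    "analytics": analytics_skill,
    "email": email_skill,
}

# Assumption: a keyword table stands in for the planner's intent model.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "segmentation": ("customers", "segment"),
    "analytics": ("best", "top", "sales"),
    "email": ("send", "email"),
}

def plan(request: str) -> list[str]:
    """Outer layer: decide which skills a request needs."""
    words = request.lower().split()
    return [
        skill
        for skill, keywords in KEYWORDS.items()
        if any(keyword in words for keyword in keywords)
    ]

def run(request: str) -> list[str]:
    """Inner layer: dispatch the request to each planned skill in order."""
    return [SKILLS[name](request) for name in plan(request)]

if __name__ == "__main__":
    for line in run("send a discount to my best customers"):
        print(line)
```

In this toy version, a request that matches no skill produces an empty plan. Each skill still has to decide for itself what to do with a request it receives but cannot fulfill.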













