TL;DR

Shopify automated training data curation by running four LLM judges in consensus mode, then used the curated data to teach Sidekick refusal behavior. This fixes the production-data blind spot, where only successful queries get logged. The approach cuts annotation costs while improving agentic reliability in production, and the pattern transfers to other domain-specific models.

Training data has always been the hard part of machine learning. Not the architecture, not the compute, but the data itself. You need enough of it, you need it clean and diverse, and you need it to cover the cases that actually matter. Before LLMs, that mostly meant labeling pipelines and feature engineering. With LLMs, the problem doesn't go away, but it shifts. When you fine-tune a foundation model for a specific domain, you're teaching behavior: how the model should interpret requests, handle ambiguity, and recognize when a request is genuinely impossible. We've learned that the last one is the hardest to teach.

It's hardest because production data has a blind spot. Every example in a production training corpus is a success story: the model did something right, it got evaluated, it shipped. The cases where it should have refused, the edge cases, the impossible requests… those never make it into the logs. So you end up with a model trained entirely on successful queries. When it hits something it can't fulfill, there's no learned behavior to fall back on. It improvises, usually badly.

We ran into this while building Sidekick, our AI assistant for merchants. Sidekick works on two layers: an outer planner that interprets the merchant's overall intent, and a set of specialized skill models that each handle a specific capability. The planner routes "send a discount to my best customers" to segmentation, analytics, email, and so on. As we covered in our article on building production-ready agentic systems, keeping those skills performing well takes continuous work.
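The curation idea named in the summary, four LLM judges voting in consensus on whether a candidate example belongs in the training set, can be sketched roughly as follows. This is a minimal illustration, not Shopify's implementation: the `Candidate` type, the stub judges, the keyword heuristics, and the default of requiring unanimous agreement are all assumptions made for the example. In a real pipeline each judge would be a call to a separate LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A judge takes (query, response) and returns True if it accepts the example.
Judge = Callable[[str, str], bool]


@dataclass
class Candidate:
    query: str
    response: str


# Hypothetical heuristics standing in for real LLM judgments.
def is_impossible(query: str) -> bool:
    return "teleport" in query.lower()


def is_refusal(response: str) -> bool:
    return response.lower().startswith("i can't")


# Four stub judges (assumed; each would be a different model in practice).
def judge_a(query: str, response: str) -> bool:
    # Accept only when refusing lines up with the request being impossible.
    return is_impossible(query) == is_refusal(response)


def judge_b(query: str, response: str) -> bool:
    # Accept any non-empty response.
    return len(response.strip()) > 0


def judge_c(query: str, response: str) -> bool:
    # Reject impossible requests that were not refused.
    return not (is_impossible(query) and not is_refusal(response))


def judge_d(query: str, response: str) -> bool:
    # Reject refusals of requests that were actually possible.
    return not (is_refusal(response) and not is_impossible(query))


def curate(candidates: List[Candidate], judges: List[Judge],
           min_votes: Optional[int] = None) -> List[Candidate]:
    """Keep candidates accepted by at least min_votes judges.

    Defaults to unanimous agreement (an assumption of this sketch).
    """
    if min_votes is None:
        min_votes = len(judges)
    kept = []
    for c in candidates:
        votes = sum(1 for judge in judges if judge(c.query, c.response))
        if votes >= min_votes:
            kept.append(c)
    return kept


JUDGES = [judge_a, judge_b, judge_c, judge_d]
CANDIDATES = [
    Candidate("Teleport my inventory to Mars", "I can't do that with the tools available."),
    Candidate("Teleport my inventory to Mars", "Done! Inventory teleported."),
    Candidate("Show last week's sales", "Here are last week's sales."),
    Candidate("Show last week's sales", "I can't help with that."),
]

curated = curate(CANDIDATES, JUDGES)
```

Under the unanimous rule, only the correct refusal and the correct fulfillment survive. Lowering `min_votes` trades precision for volume, which is the main knob a consensus setup like this exposes.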

shopify.engineering

Teaching Sidekick to say no: automated data curation with LLM judge consensus (2026) - Shopify

Our Sidekick team used LLM judges to automatically curate training data and teach domain-specific models how to refuse impossible queries.

Monday, June 15, 2026
