A model that scores 95% on your test set feels like the finish line. Then you ship it, and you find out it was the starting line. The model was maybe 10% of the work; everything that makes it survive production is the other 90%.

We deploy machine-learning systems for companies, and the projects that stall almost never stall on model accuracy. They stall on the engineering around the model. Here's what actually breaks.

1. Training-serving skew

Your model was trained on clean, batch-computed features. In production it gets features computed by different code, at request time, sometimes from a slightly different source. The training and serving feature distributions drift apart and accuracy quietly craters: no error, no crash, just worse predictions.

The fix is to share one feature-computation path between training and serving (a feature store or shared library), and to log production features so you can compare their distributions against the training data. If training and serving don't compute features the same way, nothing else matters.
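Here is a minimal sketch of both halves of that fix, in plain Python. Everything named in it is an assumption for illustration: the raw fields (`amount`, `items`, `orders`), the feature names, and the choice of the population stability index (PSI) as the comparison metric. PSI is just one common way to measure how far a logged production feature has moved from its training distribution; a two-sample statistical test or simple summary statistics would serve the same purpose.

```python
import math
from typing import Dict, Sequence


def compute_features(raw: Dict[str, float]) -> Dict[str, float]:
    """Single feature path. Import this from BOTH the training job and the
    serving handler instead of reimplementing it. Fields are hypothetical."""
    return {
        "log_amount": math.log1p(max(raw["amount"], 0.0)),
        "items_per_order": raw["items"] / max(raw["orders"], 1.0),
    }


def population_stability_index(
    train: Sequence[float], prod: Sequence[float], bins: int = 10
) -> float:
    """Compare one feature's production values against its training values.

    Buckets are equal-width over the training range; production values that
    fall outside that range are clamped into the first or last bucket.
    """
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p = bucket_shares(train)
    q = bucket_shares(prod)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Training features, and "logged production" features with a shifted input.
    train_rows = [{"amount": a, "items": 4, "orders": 2} for a in range(1, 501)]
    prod_rows = [{"amount": a * 20, "items": 4, "orders": 2} for a in range(1, 501)]

    train_vals = [compute_features(r)["log_amount"] for r in train_rows]
    prod_vals = [compute_features(r)["log_amount"] for r in prod_rows]

    psi = population_stability_index(train_vals, prod_vals)
    # 0.25 is only an illustrative cutoff; tune it per feature.
    print(f"PSI for log_amount: {psi:.3f}", "-> investigate" if psi > 0.25 else "-> ok")
```

The point of the design is that `compute_features` lives in one module that both pipelines import, so the check only has to catch changes in the inputs, not differences between two implementations. Run the comparison per feature on a schedule and alert when a score crosses whatever threshold you settle on.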