Earlier this year, a bunch of people figured out they could use a customer service chatbot for a popular fast-food chain as a free coding assistant. It went viral. Some customers came looking for burritos and others left with LeetCode solutions. Everyone got what they wanted except the company paying for the inference.
The chatbot was backed by a capable general-purpose model, and nothing enforced what it should and shouldn't answer. If you asked it to invent a novel approach to bubble sort, it would try. The model didn't know it was only supposed to be a burrito bot; it just saw a prompt and responded.
If your AI endpoint doesn't restrict who can send it requests and has no way to limit what it will and won't answer, any general-purpose model you expose becomes a general-purpose model for everyone, on your dime.
That's an easy way to become the victim of inference theft.
Inference theft occurs when someone repurposes your AI application as a model endpoint that you never intended to expose. They route requests through your application and let you pay the inference bill. Inference theft is one of the fastest ways to create a denial-of-wallet event.
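To make those two controls concrete, here is a minimal sketch of a gate that sits in front of the model call. Everything in it is an assumption made for illustration: the `Gate` class, the API key, the keyword list, and the budget number are all hypothetical, and the word count is only a crude stand-in for real token accounting. A production system would more likely use proper authentication and a classifier or policy layer than a keyword match.

```python
from dataclasses import dataclass, field

# Hypothetical values, for illustration only.
ALLOWED_KEYS = {"store-kiosk-123"}
ON_TOPIC_TERMS = {"burrito", "burritos", "order", "menu", "refund", "delivery"}
DAILY_TOKEN_BUDGET = 2_000  # per caller


@dataclass
class Gate:
    """Decides whether a request may reach the model at all."""

    # Running total of estimated tokens spent per API key.
    used_tokens: dict = field(default_factory=dict)

    def check(self, api_key: str, prompt: str) -> tuple[bool, str]:
        # 1. Restrict who can send requests.
        if api_key not in ALLOWED_KEYS:
            return False, "unauthorized"

        # 2. Limit what the endpoint will answer (naive keyword match).
        words = {w.strip(".,!?").lower() for w in prompt.split()}
        if not words & ON_TOPIC_TERMS:
            return False, "off_topic"

        # 3. Cap spend per caller so abuse cannot run up the bill.
        cost = len(prompt.split())  # crude token estimate
        spent = self.used_tokens.get(api_key, 0)
        if spent + cost > DAILY_TOKEN_BUDGET:
            return False, "budget_exceeded"

        self.used_tokens[api_key] = spent + cost
        return True, "ok"
```

Only requests that pass all three checks would be forwarded to the model. The point is the ordering: the cheap checks run before any inference is paid for.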






