If you call an open-weight model behind an API, whether that is your own box, a hosted endpoint, or a router, you are trusting that the thing answering is the model on the label. Providers have every incentive to serve a smaller or more aggressively quantized model under load. I wanted to know if you can catch that from the outside.
Short version: the obvious method fails, a less obvious one works, and it only works if you accumulate evidence.
Attempt 1: grade the output. Dead on arrival.
The intuitive idea is to send a prompt, look at the answer, and flag low-quality responses. I scored served outputs by perplexity under the model that was supposed to produce them, reported as bits per byte, where lower means the text is more probable. The result was backwards. A cheaper model (I used a 0.5B as the impostor) produces simpler, more generic, more predictable text, and predictable text tends to have low perplexity under any model. The impostor's output scored as more probable than the genuine model's own output, by about 0.65 bits per byte, on 9 of 10 prompts. So "flag the improbable answers" rewards the cheaper model. Scratch that.
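For concreteness, here is a minimal sketch of the metric and of the rule that failed. It assumes you already have per-token natural-log probabilities for the served text, scored under the reference model; how you get those depends on your stack and is not shown. The names bits_per_byte and naive_flag are mine, made up for illustration, and this is not the exact code behind the numbers above.

```python
import math


def bits_per_byte(token_logprobs, text):
    """Negative log2-likelihood of `text`, averaged over its UTF-8 bytes.

    token_logprobs: natural-log probabilities, one per token of `text`,
    as scored by the reference model (assumed input, obtained elsewhere).
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_bits = -sum(token_logprobs) / math.log(2)  # convert nats to bits
    return total_bits / n_bytes


def naive_flag(served_bpb, baseline_bpb):
    """The rule that failed: flag a response that is LESS probable than baseline."""
    return served_bpb > baseline_bpb
```

Under this rule, a response that comes in below the genuine model's baseline is never flagged, and that is exactly where the impostor's more predictable text lands.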
Attempt 2: the scoring challenge.






