I see a lot of claims about which model is "best." Best at what? For whom? At what cost?

I got tired of guessing. So I ran my own comparison.

The setup

I took 500 real queries from my production logs – a mix of:

Code generation (120 queries)