I Benchmarked DeepSeek, Qwen, Kimi & GLM for 30 Days: The Numbers

I'll be honest: I didn't set out to write this. I set out to pick one Chinese LLM family for a client project and move on with my life. Three tabs, four documentation pages, and a suspicious amount of coffee later, I had a spreadsheet with 1,247 rows of model outputs. So here we are. This is the post I wish had existed when I started.

Why I Actually Cared

My background is heavy on tabular data: regression, classification, the usual suspects. LLMs weren't in my wheelhouse until I shipped a few chatbot features and realized the cost line on my monthly invoices had started to look like a phone number. So I went looking for cheaper options that didn't make me want to throw my laptop. DeepSeek, Qwen, Kimi, and GLM kept surfacing: all four are OpenAI-compatible, all are reachable through a single endpoint at global-apis.com/v1, and all are priced aggressively.

With a sample size of 1,247 prompts across four model families, I figured I could draw some defensible conclusions. Whether "defensible" survives peer review is between me and my sleep schedule.