The leaderboard moved — again. Between April and June 2026, at least five major open-weight coding models shipped, two of them from labs most Western developers haven't heard of. If you read a "best open-source LLMs" guide from three months ago, it's already wrong.
This post is current as of June 8, 2026. Every benchmark number below has a source. Where benchmarks are self-reported by labs (which most are — we'll get into that), we say so.
First: Stop Trusting HumanEval Scores
Any HumanEval score above 85% can be ignored for ranking purposes. Qwen, DeepSeek, Codestral, Llama: all of them cross that threshold now, so the score no longer separates them. The benchmark is saturated, and there's strong evidence of training data contamination across the board.
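To put a rough number on what saturation means here, the sketch below treats a HumanEval score as a pass rate over the benchmark's 164 problems and estimates its sampling noise with a plain binomial standard error. The helper names and the example scores are hypothetical, chosen only for illustration, and the independence assumption is a simplification: models are scored on the same problems, so a paired test would be tighter. It also says nothing about contamination, which is a separate problem.

```python
import math

HUMANEVAL_PROBLEMS = 164  # size of the original HumanEval problem set


def score_standard_error(pass_rate: float, n_problems: int = HUMANEVAL_PROBLEMS) -> float:
    """Binomial standard error of a pass rate measured on n_problems tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


def distinguishable(score_a: float, score_b: float,
                    n_problems: int = HUMANEVAL_PROBLEMS, z: float = 1.96) -> bool:
    """Rough check: is the gap wider than z standard errors of the difference?

    Assumption: the two scores are treated as independent samples.
    """
    se_diff = math.sqrt(
        score_standard_error(score_a, n_problems) ** 2
        + score_standard_error(score_b, n_problems) ** 2
    )
    return abs(score_a - score_b) > z * se_diff


# Hypothetical scores, not measurements of any real model.
print(round(score_standard_error(0.90) * 100, 1))  # 2.3 (percentage points)
print(distinguishable(0.88, 0.92))  # False: a 4-point gap is within noise
print(distinguishable(0.60, 0.80))  # True: a 20-point gap is not
```

Under these assumptions, a single score near 90% carries about 2.3 points of noise on its own, so a gap of a few points between two models above the threshold is not enough to rank them.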
The numbers that actually discriminate in 2026:






