I gave GLM-4.5-Air (106B, open weights) 12 coding tasks through opencode on my RTX 3090. It scored 0% — never edited a single file.
Same model, same GPU, same tasks, but driven by a ~150-line LangGraph agent instead: 93%.
The model was never the problem. The orchestrator was. Here's the benchmark — including the part nobody else measures, the electricity cost per correct task.
Setup
RTX 3090 (24 GB) + 128 GB RAM, models via ollama, Q4 quants, temp 0.2







