Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.
So I built a tracker. For ~95 days it has logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex (the % of real coding tasks each agent completes unassisted) and plotted each one as candlesticks: open = yesterday's pass rate, close = today's pass rate, wick = the 90% confidence interval for today's sample. Same idea as a stock K-line, except the "price" is how often the agent actually solves the task.
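For anyone who wants to reproduce the candles, here is a minimal sketch of the construction described above. It assumes each day is stored as a `(passed, total)` pair, and it uses a Wilson score interval for the 90% wick. Both are my assumptions for illustration: the names `Candle`, `wilson_interval`, and `build_candles` are hypothetical, and a different interval method would work just as well.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class Candle:
    open: float   # previous day's pass rate
    close: float  # this day's pass rate
    low: float    # lower end of the confidence interval (bottom of the wick)
    high: float   # upper end of the confidence interval (top of the wick)


def wilson_interval(passed: int, total: int, confidence: float = 0.90) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Two-sided critical value, about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def build_candles(daily: list[tuple[int, int]]) -> list[Candle]:
    """Turn consecutive (passed, total) days into candles.

    The first day has no "yesterday", so it produces no candle.
    """
    candles = []
    for i in range(1, len(daily)):
        prev_passed, prev_total = daily[i - 1]
        passed, total = daily[i]
        low, high = wilson_interval(passed, total)
        candles.append(Candle(prev_passed / prev_total, passed / total, low, high))
    return candles
```

The wick width depends on the sample size, so a day with few tasks gets a long wick and a day-to-day move smaller than the wick is not evidence of a real change.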
Here's what the data says — and it's more interesting than "it got dumber."
Claude Code: a real step up, then a recent slide
Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:







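The per-version baseline (median of the first 14 days after each release) can be sketched roughly like this. It assumes daily pass rates keyed by date and a mapping from version name to release date. The function name `version_baselines` and both data shapes are hypothetical, and counting the release day itself as day one is my assumption.

```python
from datetime import date, timedelta
from statistics import median


def version_baselines(
    daily_rates: dict[date, float],
    releases: dict[str, date],
    window_days: int = 14,
) -> dict[str, float]:
    """Median pass rate over the first `window_days` days of each release.

    Assumption: the window starts on the release date itself and is
    half-open, i.e. [released, released + window_days).
    """
    baselines = {}
    for version, released in releases.items():
        window_end = released + timedelta(days=window_days)
        window = [
            rate
            for day, rate in daily_rates.items()
            if released <= day < window_end
        ]
        # Skip versions with no logged days rather than inventing a value.
        if window:
            baselines[version] = median(window)
    return baselines
```

Using the median rather than the mean keeps one unusually bad or good day in the window from dragging the baseline around.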