I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like

Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.

So I built a tracker. For about 95 days it has logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex (the percentage of real coding tasks each agent completes unassisted) and plotted the results as candlesticks: open = yesterday's pass rate, close = today's pass rate, wick = the 90% confidence interval for that day's sample. It's the same idea as a stock K-line chart, except the "price" is how often the agent actually solves the task.
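To make the candlestick mapping concrete, here is a minimal sketch of how one day's candle could be built from raw pass/fail counts. The post doesn't say which interval method the tracker uses, so the Wilson score interval is an assumption here, and the names `wilson_interval`, `Candle`, and `make_candle` are hypothetical.

```python
import math
from dataclasses import dataclass

# Two-sided 90% normal quantile (assumption: a normal-approximation-style interval).
Z_90 = 1.6449


def wilson_interval(passed: int, total: int, z: float = Z_90) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (assumed method, not confirmed by the post)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


@dataclass
class Candle:
    open: float   # yesterday's pass rate
    close: float  # today's pass rate
    low: float    # lower end of today's 90% interval (bottom of the wick)
    high: float   # upper end of today's 90% interval (top of the wick)


def make_candle(prev_rate: float, passed: int, total: int) -> Candle:
    """Build one day's candle from yesterday's rate and today's task counts."""
    low, high = wilson_interval(passed, total)
    return Candle(open=prev_rate, close=passed / total, low=low, high=high)
```

One consequence of this layout: the wick describes only today's sample, so yesterday's rate (the open) can land outside it when the day-to-day move is larger than the sampling noise.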

Here's what the data says — and it's more interesting than "it got dumber."

Claude Code: a real step up, then a recent slide

Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:
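As a rough sketch, the per-version baseline described above could be computed like this. The function name `version_baseline`, the date-keyed input, and the choice to count the release day as day 0 of the window are all assumptions, not details from the tracker itself.

```python
from datetime import date, timedelta
from statistics import median


def version_baseline(
    daily_rates: dict[date, float],
    release_day: date,
    window_days: int = 14,
) -> float:
    """Median daily pass rate over the first `window_days` days of a release.

    Assumption: the window starts on the release day itself; days with no
    recorded rate are skipped rather than treated as zero.
    """
    window = []
    for offset in range(window_days):
        day = release_day + timedelta(days=offset)
        if day in daily_rates:
            window.append(daily_rates[day])
    if not window:
        raise ValueError("no pass-rate data in the baseline window")
    return median(window)
```

Using the median rather than the mean keeps a single unusually bad or good day in those first two weeks from dragging the baseline.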

