I asked Claude Code to refactor 100 functions across a Python service I owned. It did the job in two passes. CI was green on both. The PR description was so neat I almost felt bad shipping it on a Friday.
Two weeks later, on-call paged me because the p95 of one endpoint had drifted from 180 ms to 240 ms. I started bisecting. The bisect landed on the refactor PR. I started reading the refactor PR. Seven of the 100 functions were slower in production. CI never noticed because CI does not measure "slower." It measures "returns the same value."
This post is about what those seven slow functions had in common, why mutation tests and unit tests both missed them, and the four checks I now run before I let Claude, or any AI, refactor anything that ships under load.
The setup, so you can tell whether this generalizes
The codebase: a Python 3.12 service with about 18k lines of business logic, FastAPI on the edge, asyncpg to Postgres, a Redis cache, and a CPU-bound scoring module that runs on every request. The 100 functions were a curated batch: small to medium, pure where possible, all with unit tests. I asked Claude Code to apply a standard set of cleanups: early returns, extracted variables for magic numbers, comprehensions where loops did one thing, dataclass conversions for ad hoc tuples.






