"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."
A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show: TelegramRetryAfter, then asyncio.TimeoutError, then sqlite3.OperationalError: database is locked, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened again, 140 minutes later, like clockwork.
The temptation when you see this kind of cascade is to throw the whole architecture out. "SQLite can't handle our scale, let's move to Postgres." "Bare asyncio is too low-level, let's add a queue." "Let's rewrite it in Go."
I didn't do any of those things. The fix was 18 lines of code in one middleware file. The bot has been up for weeks since.
Here's the diagnosis, the fix, and the takeaway. The code is real (with client specifics stripped out) and the numbers are real.







