A Production Python Telegram Bot Was Crashing Every 2 Hours. The Fix Was 18 Lines.

"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."

A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show — TelegramRetryAfter, then asyncio.TimeoutError, then sqlite3.OperationalError: database is locked, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened again, 140 minutes later, like clockwork.
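The rule in the quote at the top can be applied mechanically to a cascade like this one. What follows is a minimal sketch, not the tooling actually used on this bot: the function name `first_failure`, the regular expression, and the sample log lines are all assumptions made up for illustration. Only the exception names come from the traceback described above.

```python
import re
from typing import Iterable, Optional

# Exception names taken from the cascade described above.
# Extend this pattern for whatever your own logs contain.
ERROR_PATTERN = re.compile(
    r"TelegramRetryAfter|asyncio\.TimeoutError|sqlite3\.OperationalError"
)


def first_failure(lines: Iterable[str]) -> Optional[str]:
    """Return the first log line that mentions a known error, or None."""
    for line in lines:
        if ERROR_PATTERN.search(line):
            # Stop here: later errors are treated as reactions to this one.
            return line.rstrip("\n")
    return None


if __name__ == "__main__":
    # Hypothetical log excerpt, invented for illustration only.
    sample = [
        "12:00:01 INFO handled update",
        "12:00:02 ERROR TelegramRetryAfter",
        "12:00:09 ERROR asyncio.TimeoutError",
        "12:00:15 ERROR sqlite3.OperationalError: database is locked",
    ]
    print(first_failure(sample))
```

Run against a real log file (for example `first_failure(open(path))`, where `path` is your own log location), this returns the earliest matching line and ignores everything after it, which is the point of the exercise.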

The temptation when you see this kind of cascade is to throw the whole architecture out. "SQLite can't handle our scale, let's move to Postgres." "Bare asyncio is too low-level, let's add a queue." "Let's rewrite it in Go."

I didn't do any of those things. The fix was 18 lines of code in one middleware file. The bot has been up for weeks since.

Here's the diagnosis, the fix, and the takeaway. The code is real (with client specifics removed), and so are the numbers.
