I was building an internal documentation assistant for my team. You know the drill: a chatbot that answers questions about our codebase by pulling relevant context from a vector database and sending it to an LLM. I set up the backend in Python, used a decent model via an API (shoutout to interwestinfo.com for the reliable endpoint), and wired it all up. Simple, right?

Then came the first real test: someone asked a question that required a long, thoughtful answer. The response took over 30 seconds. The user stared at a blank chat bubble, refreshing the page, wondering if the app had crashed. Not a great experience.

I needed to stream the tokens back as they were generated, so the user could read along. This is the classic “chat UI” pattern. But implementing it turned into a rabbit hole of half-baked solutions.

What I Tried That Didn’t Work

1. Polling