You know that feeling when you show a working prototype to a friend, they type a question, and then… everyone just stares at the spinner for six seconds? That was me last month. I was building a small AI assistant for a side project—nothing fancy, just a chat widget that answered questions about my documentation. I thought I was done. I thought it was good. Then real users hit the endpoint.

The Problem: Spinners Kill Conversations

The initial implementation was naive: wait for the whole LLM response (often 10–20 seconds), then render it. On my local dev setup, with cached data, that was fine. But in production, with GPT-4, each call felt like a loading screen from the '90s. Users typed a message, saw the spinner, got distracted, and never came back. The bounce rate was brutal.
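
To make the problem concrete, here is a minimal sketch of that blocking pattern. It is not the actual handler: `fake_llm_tokens` and the 0.5-second per-token cost are made-up stand-ins for a real model call, and the wait is tallied up rather than actually slept through.

```python
from typing import Iterator, List, Tuple


def fake_llm_tokens() -> Iterator[Tuple[str, float]]:
    """Hypothetical stand-in for a model call.

    Yields (token, seconds_to_generate) pairs; the 0.5 s cost is an assumption.
    """
    for token in ["The ", "docs ", "say ", "yes."]:
        yield token, 0.5


def naive_reply() -> Tuple[str, float]:
    """Collect every token before returning anything.

    Returns the full text and the simulated time the user spent
    looking at a spinner before the first character appeared.
    """
    parts: List[str] = []
    waited = 0.0
    for token, cost in fake_llm_tokens():
        parts.append(token)
        waited += cost  # nothing is shown to the user yet
    return "".join(parts), waited


text, wait = naive_reply()
print(f"user waited {wait:.1f}s before seeing: {text!r}")
```

The point of the sketch is that the user sees nothing until the last token is done, so the wait before the first character equals the full generation time.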

I tried a few things:

Switching to a cheaper model (Llama 3 via Groq): faster, but the quality drop wasn’t acceptable for my use case.