The Problem We Were Actually Solving
Our real goal wasn't fancy LLM prompts or real-time leaderboards. It was keeping the Rails app under 450 ms at p99 during peak load, when every team simultaneously scanned a code, requested a new clue, and tried to outbid the person next door for a limited-time power-up. We load-tested with Locust at 5,000 concurrent users and saw that the slowest endpoint was /next-hint, which queried a pgvector vector store at 180 ms per query. That left only 270 ms for Rails routing, Redis reads for rate limiting, and our custom concurrency limiter.
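
To make the budget arithmetic explicit, here is a minimal sketch in Python. The two figures come from the numbers above; the function name and constants are illustrative and not part of our codebase.

```python
# Latency budget from the figures in the text (all values in milliseconds).
P99_TARGET_MS = 450    # p99 target for the Rails app
VECTOR_QUERY_MS = 180  # measured pgvector query time on /next-hint


def remaining_budget(target_ms: int, *spent_ms: int) -> int:
    """Return the time left once the known costs are subtracted from the target."""
    return target_ms - sum(spent_ms)


left = remaining_budget(P99_TARGET_MS, VECTOR_QUERY_MS)
print(f"{left} ms left for routing, Redis reads, and the concurrency limiter")
```

Any further fixed cost on the request path can be passed as an extra argument to see how quickly the remaining 270 ms shrinks.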
The marketing slide said "AI," but the product team really wanted a hint scheduler that wouldn't melt under load. We bolted a 1,553-line llama.cpp wrapper, written by the data science intern, onto the hint endpoint, thinking we could cache all possible answers in a nightly cron job. The wrapper had a known hallucination rate of 3.2% on our own test set, but nobody configured the grammar mask to enforce that answers contain only location names. So when someone asked "Where is the next clue hidden?" the engine happily returned "Under your chair in the Sagrada Familia crypt," even though the venue map had no crypt. One user's screenshot went viral, and suddenly the whole event looked like a scam.
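
For illustration, here is a rough sketch of the kind of constraint that was missing. It assumes the venue map can be exported as a flat list of location names. The names below are hypothetical, and so are the helper functions; none of this is the intern's wrapper. The sketch builds a grammar in GBNF, the grammar format llama.cpp accepts, that only permits one of the known names, and adds a plain allowlist check that could run on cached answers as a second line of defense. How the grammar string is handed to the model depends on the wrapper and is not shown.

```python
# Hypothetical venue locations; the real list would come from the venue map.
VENUE_LOCATIONS = ["Main Stage", "Registration Desk", "Food Court"]


def gbnf_literal(text: str) -> str:
    """Quote a string as a GBNF literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def location_grammar(locations: list[str]) -> str:
    """Build a one-rule GBNF grammar whose only valid outputs are the given names."""
    if not locations:
        raise ValueError("at least one location is required")
    alternatives = " | ".join(gbnf_literal(name) for name in sorted(set(locations)))
    return f"root ::= {alternatives}"


def is_known_location(answer: str, locations: list[str]) -> bool:
    """Post-generation check: accept an answer only if it is exactly a known name."""
    return answer.strip() in set(locations)


print(location_grammar(VENUE_LOCATIONS))
print(is_known_location("Under your chair in the Sagrada Familia crypt", VENUE_LOCATIONS))
```

The allowlist check is deliberately strict: an answer that fails it is rejected rather than repaired, which is the safer default when answers are cached ahead of time by a nightly job.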






