I recently built a production-grade real-time Voice AI workspace from scratch. While the whole system has many moving parts, two components required the most careful engineering: the authentication middleware between services and the Speech-to-Text (STT) pipeline.
Here’s exactly how I approached and solved both.
The Middleware Problem
I needed two local microservices, a WebRTC audio server and a FastMCP server, to communicate securely.
I didn’t want to introduce a database, Redis, or any hardcoded secrets. The solution had to be lightweight, stateless, and still reasonably secure for internal communication.
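To make these constraints concrete, here is a minimal sketch of one stateless option: short-lived tokens signed with HMAC-SHA256. It is an illustration under stated assumptions, not necessarily the design used in this project. The function names (`new_shared_key`, `issue_token`, `verify_token`), the 30-second lifetime, and the idea that a launcher generates the key at startup and passes it to both services through their environment are all hypothetical.

```python
import base64
import hashlib
import hmac
import secrets
import time
from typing import Optional


def new_shared_key() -> bytes:
    """Create a random 256-bit key.

    Assumption: a launcher calls this once at startup and hands the key to
    both services (for example via an environment variable), so nothing is
    hardcoded and nothing is stored in a database or Redis.
    """
    return secrets.token_bytes(32)


def issue_token(key: bytes, service: str, ttl_seconds: int = 30,
                now: Optional[float] = None) -> str:
    """Return '<payload>.<signature>', both URL-safe base64 encoded."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    payload = f"{service}:{expires}".encode()
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "."
            + base64.urlsafe_b64encode(signature).decode())


def verify_token(key: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the caller's service name, or None if the token is invalid."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
        service, expires_text = payload.decode().rsplit(":", 1)
        expires = int(expires_text)
    except (ValueError, UnicodeDecodeError):
        return None  # malformed token

    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, expected):
        return None

    current = time.time() if now is None else now
    if current >= expires:
        return None  # expired
    return service
```

The receiving service needs only the shared key and a clock to validate a request, so no session state is kept anywhere. The trade-off is that a token cannot be revoked before it expires, which is why the lifetime in this sketch is kept short.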









