My core is backend engineering: Java/Spring, .NET, Python, and cloud services. Over the last few months I've been building something well outside that comfort zone: a platform that lets businesses deploy AI-powered voice and WhatsApp assistants, built on LiveKit, retrieval-augmented generation (RAG), and telephony/SIP integrations.

What it does. Businesses can stand up an AI assistant that answers customer calls and WhatsApp messages, pulls accurate answers from their own knowledge base via RAG, and routes or escalates when it needs to. Under the hood it ties together SIP telephony, a real-time media pipeline (LiveKit/WebRTC), speech processing, and an LLM orchestration layer.

The unfamiliar part. Almost none of the real-time stack was in my background. WebRTC, SDP/media negotiation, ICE, codec handling, SIP trunking, AudioHook-style streaming: this is low-level, finicky territory where a single wrong assumption costs you a day. Coming from request/response backend systems, building a mental model for continuous, stateful, real-time media was the steepest part of the learning curve.

How AI let me punch above my weight. I didn't ask AI to "build a voice agent." I used it as an on-demand expert on the protocol details while I owned the architecture and business logic. Concretely: