When I started building Hoovik, a distributed video conferencing platform, I expected WebRTC signaling and transcription pipelines to be the hardest problems.

They weren’t.

The real engineering challenge was building a production-ready, real-time multimodal emotion inference engine that could process live video meetings under strict latency constraints.

Unlike offline ML systems, live meeting environments are unstable by default:

microphones get muted