Half a year ago I asked a simple question: during an online call, could a short, to-the-point hint appear on my screen in a second or two — while the other person is still talking? Not an after-the-fact transcript, but help in the moment.
The result is a desktop assistant (macOS + Windows). Below is an honest breakdown of what turned out to be hard, and which solutions worked. Engineering only, no marketing.
Architecture in one paragraph
On the device there are only two things: audio capture and a thin UI overlay. All the "brains" (provider keys, prompts, model selection) live on the server. The client gets a short-lived per-session token and streams audio; the server returns the transcript and the generated answer. I picked this split not for "security theater" but because the alternative is baking keys and prompts into the binary, and anything shipped inside a client binary can be extracted.
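To make the "short-lived per-session token" idea concrete, here is a minimal sketch of how a server could issue and verify one using only the Python standard library. This is an illustration under assumptions, not the scheme the assistant actually uses: the token format (a JSON payload plus an HMAC-SHA256 signature), the names `SERVER_SECRET`, `issue_session_token`, `verify_session_token`, and the 15-minute default lifetime are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret: it stays on the server and is never shipped to the client.
SERVER_SECRET = b"replace-me-server-side-only"


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    """Inverse of _b64: restore the padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_session_token(session_id: str, ttl_seconds: int = 900,
                        now: Optional[float] = None) -> str:
    """Return 'payload.signature' binding a session id to an expiry time."""
    issued_at = time.time() if now is None else now
    payload = json.dumps(
        {"sid": session_id, "exp": int(issued_at) + ttl_seconds},
        separators=(",", ":"),
    ).encode("utf-8")
    signature = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_session_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the session id if the token is authentic and unexpired, else None."""
    try:
        payload_part, signature_part = token.split(".")
        payload = _unb64(payload_part)
        signature = _unb64(signature_part)
    except ValueError:
        return None  # wrong number of parts or invalid base64

    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload was altered or signed with another key

    try:
        claims = json.loads(payload)
        session_id, expires_at = claims["sid"], claims["exp"]
    except (ValueError, KeyError, TypeError):
        return None

    current = time.time() if now is None else now
    if current >= expires_at:
        return None  # token has expired
    return session_id
```

The client only ever holds the token string, so a leaked token is useful for one session and only until it expires, while the provider keys and prompts never leave the server.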
Hard part #1: system audio, not the microphone






