Every "add captions to your short" tool works the same way: you upload your clip to their servers, they transcribe and render it in the cloud, and they meter your exports. That means an upload wait, a queue, file-size caps, a per-export bill, and your footage sitting on someone else's disk.

I wanted to know if you could do the whole thing in the browser instead. Turns out you can, and the result (CapStudio) has a strange property for a video tool: it costs me almost nothing to run, because there is no render farm and no transcription API. The only server-side work is auth, billing, and syncing a tiny project file. That is the entire reason one person can run it.

Here is how the pieces fit together.

Transcription: Whisper on WebGPU, in a tab

Transcription runs locally with @huggingface/transformers (transformers.js v4), which can execute Whisper on WebGPU. Whisper expects mono 16 kHz input, so the clip's audio is first decoded with decodeAudioData, then resampled and downmixed through an OfflineAudioContext into a single Float32Array, which is fed to the pipeline.
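
Here is a minimal sketch of that decode step, assuming the clip arrives as a Blob (for example from a file input). The helper names targetFrameCount and decodeToMono16k are made up for illustration and are not CapStudio's actual code; only the Web Audio calls are standard.

```typescript
// Sketch only: turn a clip's audio track into the mono 16 kHz Float32Array
// that Whisper expects. Helper names are hypothetical.

const WHISPER_SAMPLE_RATE = 16000;

/** Frames needed to hold `durationSeconds` of audio at `sampleRate` (never less than 1). */
function targetFrameCount(durationSeconds: number, sampleRate: number): number {
  return Math.max(1, Math.ceil(durationSeconds * sampleRate));
}

async function decodeToMono16k(clip: Blob): Promise<Float32Array> {
  const encoded = await clip.arrayBuffer();

  // Step 1: decode the compressed audio. The result keeps the original
  // channel count and uses the AudioContext's own sample rate.
  const ctx = new AudioContext();
  const decoded = await ctx.decodeAudioData(encoded);
  await ctx.close(); // a real implementation would also close on failure

  // Step 2: play the decoded buffer into a 1-channel, 16 kHz offline graph.
  // Rendering resamples to 16 kHz and downmixes to mono in one pass.
  const offline = new OfflineAudioContext(
    1,
    targetFrameCount(decoded.duration, WHISPER_SAMPLE_RATE),
    WHISPER_SAMPLE_RATE,
  );
  const source = offline.createBufferSource();
  source.buffer = decoded;
  source.connect(offline.destination);
  source.start();

  const rendered = await offline.startRendering();
  return rendered.getChannelData(0);
}
```

The returned array is what gets handed to the speech-recognition pipeline. In transformers.js that would typically be a pipeline created for the "automatic-speech-recognition" task with the device option set to "webgpu". The exact model id and options are left out here because they are assumptions this post does not pin down.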