I benchmarked OpenAI's new GPT-Realtime-Translate against four other live translation systems

OpenAI shipped GPT-Realtime-Translate on May 8. It's their first model purpose-built for live speech translation, and it supports 70+ input languages.

I've been building a live translation pipeline at VoiceFrom, so I ran it through the same eval harness I use on our own system and three competitors: Google Meet, LiveVoice, and Palabra. Same source audio, same scoring, eight language pairs.

How I scored it:

Accuracy: GEMBA-MQM v2, an LLM judge that annotates specific translation errors (type + severity) rather than giving a single score. 10 scoring passes per segment, outlier removal, rank-reciprocal weighted aggregation. Ranked #1 on WMT24.
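To make the aggregation step concrete, here is a minimal Python sketch of how repeated judge scores for one segment could be combined. The post does not spell out the outlier rule or the weighting, so both are assumptions here: outliers are dropped with a 1.5x IQR fence, and "rank-reciprocal" is read as weighting each surviving score by 1/rank, where rank 1 is the score closest to the median. The function name `aggregate_passes` and the parameter `iqr_k` are hypothetical, not part of GEMBA-MQM.

```python
from statistics import median, quantiles


def aggregate_passes(scores, iqr_k=1.5):
    """Combine repeated judge scores for one segment into a single value.

    Assumed scheme (not confirmed by the post):
      1. drop scores outside [Q1 - k*IQR, Q3 + k*IQR]
      2. rank the survivors by distance from their median (closest = rank 1)
      3. return the average weighted by 1 / rank
    """
    if not scores:
        raise ValueError("need at least one score")

    if len(scores) >= 4:
        # quantiles(n=4) returns the three quartile cut points.
        q1, _, q3 = quantiles(scores, n=4)
        spread = q3 - q1
        low, high = q1 - iqr_k * spread, q3 + iqr_k * spread
        kept = [s for s in scores if low <= s <= high]
    else:
        # Too few passes to estimate quartiles; keep everything.
        kept = list(scores)

    center = median(kept)
    ranked = sorted(kept, key=lambda s: abs(s - center))
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)
```

With ten passes where nine agree and one is wildly off, the stray pass is discarded before weighting, so a single bad judge run does not move the segment score.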

Latency: Automated Ear-Voice Span, the time between when a source phrase is spoken and when the translation starts playing.
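As a rough illustration of the latency metric, the sketch below computes Ear-Voice Span from phrase-level timestamps. It assumes the span is measured from the onset of the source phrase to the onset of the translated audio, and that source and target phrases have already been aligned; the alignment step, the `AlignedPhrase` type, and the function names are hypothetical and not taken from the harness described here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AlignedPhrase:
    source_start_s: float  # when the speaker starts the phrase (seconds)
    target_start_s: float  # when the translated audio starts playing (seconds)


def ear_voice_spans(phrases):
    """Return the per-phrase lag in seconds (target onset minus source onset)."""
    spans = []
    for p in phrases:
        span = p.target_start_s - p.source_start_s
        if span < 0:
            # A translation cannot start before its source; flag bad alignment.
            raise ValueError(f"negative span for {p}")
        spans.append(span)
    return spans


def summarize(spans):
    """Report the typical and worst-case lag over a recording."""
    return {"median_s": median(spans), "max_s": max(spans)}
```

Reporting the median alongside the maximum keeps one long stall from hiding behind an otherwise fast system, or from dominating the headline number.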
