OpenAI is shipping GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper - a new generation of voice models built to reason, translate, and transcribe on the fly.
ChatGPT has had an audio mode for a while, and Google offers a similar real-time conversation feature through Gemini. But the models behind these voice interactions have been significantly weaker than their text-only counterparts, especially compared to text reasoning models that take time to think through problems.
According to OpenAI, that's no longer cutting it. A modern voice agent needs to understand what someone actually means, keep track of context, roll with changes, use tools, and respond appropriately - all at the same time.
OpenAI describes three new interaction patterns, which can also be combined. With "Voice-to-Action," a user describes what they need out loud, and the system reasons through the request, calls the right tools, and completes the task.
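The client side of a "Voice-to-Action" flow boils down to dispatching model-issued tool calls to local code. A minimal sketch, with entirely hypothetical tool names and arguments (the real Realtime API handles transcription and tool selection server-side):

```python
# Hypothetical tool the voice agent can invoke; name and signature are
# invented for illustration.
def book_table(restaurant: str, guests: int) -> str:
    return f"Booked a table for {guests} at {restaurant}."

# Registry mapping tool names to local functions, as a client app would keep.
TOOLS = {"book_table": book_table}

def handle_tool_call(name: str, arguments: dict) -> str:
    """Dispatch a model-issued tool call to the matching local function."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**arguments)

# A tool call as the model might emit it after reasoning over the spoken request.
result = handle_tool_call("book_table", {"restaurant": "Luigi's", "guests": 4})
print(result)  # Booked a table for 4 at Luigi's.
```

The key point of the pattern is that the spoken request never maps to a fixed command; the model decides which tool to call and with what arguments.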
With "Systems-to-Voice," software turns context into spoken guidance. A travel app could tell a passenger that their connecting flight is still reachable despite a delay, give them the fastest route to the new gate, and confirm their luggage transfer.
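The flight example above amounts to turning structured app state into a text-to-speak payload. A rough sketch, with invented field names and wording; in practice the generated text would be handed to a realtime voice model for synthesis:

```python
from dataclasses import dataclass

# Hypothetical app state for a connecting flight; all fields are
# illustrative assumptions, not part of any OpenAI API.
@dataclass
class ConnectionStatus:
    flight: str
    delay_minutes: int
    gate: str
    walk_minutes: int
    bags_transferred: bool

def to_spoken_guidance(s: ConnectionStatus) -> str:
    """Render structured context as a sentence suitable for speech synthesis."""
    bags = ("Your luggage has been transferred."
            if s.bags_transferred else "Please recheck your luggage.")
    return (f"Despite a {s.delay_minutes}-minute delay, you can still make "
            f"flight {s.flight}. Head to gate {s.gate}; the walk takes about "
            f"{s.walk_minutes} minutes. {bags}")

print(to_spoken_guidance(ConnectionStatus("LH 452", 25, "B17", 8, True)))
```

The pattern inverts the usual flow: the software initiates the conversation based on its own state, rather than waiting for the user to ask.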