New open-source voice model listens nonstop and decides every 0.4 seconds whether to speak or stay silent

Researchers want to close the gap between today's audio speech models and real listeners. Their system handles dialog, translation, and sound recognition all at once.

Today's voice models, like GPT-4o or Qwen 3.5-Omni, work like a dictation machine with a button: they only respond once the recording ends. Streaming systems like Moshi for dialog or Paraformer for live subtitles do listen continuously, but each handles only one task and treats sounds like coughing as background noise.

Researchers from China, Hong Kong, and Singapore want to combine both approaches in what they call "audio interaction." The model listens to an audio stream continuously, breaks it into 0.4-second chunks, and decides after each chunk whether to stay silent or speak. Translation, transcription, chatting, and reacting to everyday noises all run in a single three-billion-parameter model.
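The chunk-and-decide loop described above can be illustrated with a minimal Python sketch. This is not the researchers' code. The 16 kHz sample rate, the names `iter_chunks`, `run_stream`, and `toy_decide`, and the loudness threshold are all assumptions made for illustration. In the real system the speak-or-stay-silent decision comes from the model itself. Here a simple loudness check stands in for it.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4    # chunk length reported in the article
SAMPLE_RATE = 16_000   # assumption: the article does not state a sample rate


def iter_chunks(samples: Sequence[float],
                sample_rate: int = SAMPLE_RATE,
                chunk_seconds: float = CHUNK_SECONDS) -> Iterator[Sequence[float]]:
    """Yield consecutive full chunks; a trailing partial chunk is dropped."""
    size = round(sample_rate * chunk_seconds)
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def run_stream(samples: Sequence[float],
               decide: Callable[[Sequence[float]], bool]) -> List[str]:
    """Call decide(chunk) after every chunk and record the outcome."""
    return ["speak" if decide(chunk) else "silent"
            for chunk in iter_chunks(samples)]


def toy_decide(chunk: Sequence[float]) -> bool:
    """Hypothetical stand-in for the model: speak when the chunk is loud."""
    return sum(abs(x) for x in chunk) / len(chunk) > 0.1


# Two seconds of audio: one second of silence, then one second of signal.
stream = [0.0] * SAMPLE_RATE + [0.5] * SAMPLE_RATE
decisions = run_stream(stream, toy_decide)
print(decisions)
```

At the assumed sample rate, each 0.4-second chunk holds 6,400 samples, so the two-second example stream produces five decisions.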

The model listens to a continuous audio stream and decides moment by moment whether to stay silent or react, combining classical and streaming audio capabilities in one system. | Image: Xie et al.

One special token every 0.4 seconds
