Researchers want to close the gap between today's speech models and human listeners. Their system handles dialog, translation, and sound recognition all at once.
Today's voice models, such as GPT-4o or Qwen 3.5-Omni, work like a push-button dictation machine: they respond only once the recording ends. Streaming systems such as Moshi for dialog or Paraformer for live subtitles do listen continuously, but each handles only one task at a time and treats sounds like coughing as background noise.
Researchers from China, Hong Kong, and Singapore want to combine both approaches in what they call "audio interaction." The model listens to an audio stream continuously, breaks it into 0.4-second chunks, and decides after each chunk whether to stay silent or speak. Translation, transcription, chatting, and reacting to everyday noises all run in a single three-billion-parameter model.
The model listens to a continuous audio stream and decides moment by moment whether to stay silent or react, combining classical and streaming audio capabilities in one system. | Image: Xie et al.
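The chunk-and-decide loop described above can be sketched roughly as follows. This is only an illustration, not the researchers' implementation: the 16 kHz sample rate is an assumption (the article gives none), and `toy_policy` is a hypothetical loudness check standing in for the actual model, which makes the silent-or-speak call itself.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4   # chunk length reported in the article
SAMPLE_RATE = 16_000  # assumption: the article does not state a sample rate


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: float = CHUNK_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks; the last one may be shorter."""
    chunk_len = round(sample_rate * chunk_seconds)  # 6400 samples at 16 kHz
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def toy_policy(chunk: Sequence[float]) -> str:
    """Hypothetical stand-in for the model: 'speak' only if the chunk is loud."""
    energy = sum(x * x for x in chunk) / max(len(chunk), 1)
    return "speak" if energy > 0.01 else "silent"


def run_stream(
    samples: Sequence[float],
    decide: Callable[[Sequence[float]], str] = toy_policy,
) -> List[str]:
    """Make one silent-or-speak decision after every chunk of the stream."""
    return [decide(chunk) for chunk in iter_chunks(samples)]


# One second of silence followed by one second of a constant loud signal.
quiet = [0.0] * SAMPLE_RATE
loud = [0.5] * SAMPLE_RATE
decisions = run_stream(quiet + loud)
print(decisions)
```

With these toy inputs, two seconds of audio yield five decisions. The third chunk straddles the boundary between the quiet and loud halves, which shows that a decision always covers a fixed time window rather than a complete utterance.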
One special token every 0.4 seconds









