How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

Introducing NVIDIA Nemotron 3.5 ASR, a streaming multilingual speech-to-text model with 600M parameters that transcribes 40 language-locales from a single checkpoint, in real time, with punctuation and capitalization built in. It is the successor to the popular Nemotron 3 ASR model (English only), which was released on Hugging Face and as a NIM earlier this year.

Since its release, Nemotron 3 ASR has been validated by independent benchmarks at Artificial Analysis. It ranks 2nd in latency among all streaming ASR models, with just 0.07 seconds to final transcript after end of speech, and it sits in the "most attractive quadrant" of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, which places it among the best models on the combined accuracy-latency tradeoff.

The model uses a Cache-Aware FastConformer-RNNT architecture that streams audio without the redundant recomputation that makes most streaming ASR slow, so you get low latency and high accuracy rather than one at the expense of the other.

Nemotron 3.5 ASR ships as open weights on Hugging Face: you can inspect, fine-tune, and deploy it without API dependencies or per-call billing. No data leaves your infrastructure unless you choose to send it. And because it's a strong base model, you can fine-tune it for your own language, domain, or accent. The second half of this post walks through exactly how.
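To make the cache-aware idea concrete, here is a toy sketch in plain Python. It is not the FastConformer-RNNT architecture or any NeMo API. It assumes a single hypothetical causal layer (a moving average over the current frame and two past frames) and compares two ways of streaming it: re-running the layer over everything heard so far, versus keeping a small cache of past frames and processing only the new chunk. The names `causal_layer`, `stream_with_recompute`, `stream_with_cache`, and `KERNEL` are all illustrative.

```python
from typing import List, Tuple

KERNEL = 3  # hypothetical: each output sees the current frame plus 2 past frames


def causal_layer(frames: List[float], left_context: List[float]) -> List[float]:
    """Toy stand-in for one causal encoder layer: the mean of each frame and
    the KERNEL - 1 frames before it. left_context must hold KERNEL - 1 frames."""
    padded = left_context + frames
    start = len(left_context)
    return [
        sum(padded[i - (KERNEL - 1): i + 1]) / KERNEL
        for i in range(start, len(padded))
    ]


def stream_with_recompute(chunks: List[List[float]]) -> Tuple[List[float], int]:
    """Buffered streaming: re-run the layer over all audio seen so far."""
    zeros = [0.0] * (KERNEL - 1)
    buffer: List[float] = []
    outputs: List[float] = []
    frames_processed = 0
    for chunk in chunks:
        buffer = buffer + chunk
        full = causal_layer(buffer, zeros)       # recomputes old frames too
        frames_processed += len(buffer)
        outputs.extend(full[len(buffer) - len(chunk):])  # keep only new outputs
    return outputs, frames_processed


def stream_with_cache(chunks: List[List[float]]) -> Tuple[List[float], int]:
    """Cache-aware streaming: keep the last KERNEL - 1 frames as state and
    process each new frame exactly once."""
    cache = [0.0] * (KERNEL - 1)
    outputs: List[float] = []
    frames_processed = 0
    for chunk in chunks:
        outputs.extend(causal_layer(chunk, cache))
        frames_processed += len(chunk)
        cache = (cache + chunk)[-(KERNEL - 1):]  # slide the cached left context
    return outputs, frames_processed


if __name__ == "__main__":
    audio = [float(x) for x in range(8)]
    chunks = [audio[i:i + 2] for i in range(0, len(audio), 2)]
    offline = causal_layer(audio, [0.0] * (KERNEL - 1))
    recomputed, recompute_cost = stream_with_recompute(chunks)
    cached, cache_cost = stream_with_cache(chunks)
    print("same output as offline:", recomputed == offline and cached == offline)
    print("frames processed (recompute vs cache):", recompute_cost, cache_cost)
```

With four chunks of two frames each, both strategies reproduce the offline output, but the recompute path processes 20 frames in total while the cached path processes 8. The gap grows with the length of the stream, which is the intuition behind avoiding redundant recomputation. A real cache-aware model caches per-layer activations rather than raw frames.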
