Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech | NVIDIA Technical Blog

Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like acetaminophen, amlodipine, cefazolin, and Biktarvy are not part of everyday vocabulary. Procedure names, anatomy terms, and specialty-specific diagnoses introduce the same problem in a different form. Off-the-shelf speech systems can sound fluent and still miss the words that matter most to a clinical workflow.

Synthetic data generation (SDG) can help close this gap, but only if the synthesized speech is phonetically accurate. A text-to-speech (TTS) system that mispronounces a medication or procedure name produces training or evaluation data that teaches the wrong pronunciation. Instead of fixing the original problem, it can make the failure more difficult to detect. When correctly implemented, SDG enables a team to stand up a domain benchmark in hours without collecting real clinical audio or waiting on annotation pipelines or IRB approval.

This post presents a clinical automatic speech recognition (ASR) workflow for generating pronunciation-aware synthetic audio, reviewing clinical terms, and evaluating recognition quality. NVIDIA agent skills guide the workflow, while NVIDIA NeMo Data Designer and NVIDIA Nemotron Speech provide the data generation and speech services.
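To make "evaluating recognition quality" concrete, here is a minimal sketch of two metrics that such an evaluation might report: overall word error rate (WER) and recall on a list of clinical terms. It is an illustration only, not the scoring code used by the workflow in this post. The function names (`tokenize`, `word_error_rate`, `term_recall`), the sample sentences, and the term list are all hypothetical, and the sketch assumes single-word terms and simple lowercase tokenization.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def term_recall(reference: str, hypothesis: str, terms: list[str]) -> float:
    """Fraction of listed terms found in the reference that the hypothesis also contains.

    Assumption: every term is a single word; multi-word terms need phrase matching.
    """
    ref_words, hyp_words = set(tokenize(reference)), set(tokenize(hypothesis))
    expected = [t.lower() for t in terms if t.lower() in ref_words]
    if not expected:
        return 1.0  # nothing to recall in this utterance
    return sum(t in hyp_words for t in expected) / len(expected)


# Hypothetical example: one drug name is replaced by a different drug.
reference = "The patient was given cefazolin before the procedure this morning"
hypothesis = "The patient was given cephalexin before the procedure this morning"
clinical_terms = ["cefazolin", "amlodipine"]

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
print(f"Clinical term recall: {term_recall(reference, hypothesis, clinical_terms):.2f}")
```

In this example a single substitution in a ten-word sentence gives a WER of 0.10, which looks healthy, while clinical term recall is 0.00 because the one drug name in the utterance was transcribed as a different drug. Reporting both numbers is one way to surface a transcript that sounds fluent but misses the words that matter.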
