Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like acetaminophen, amlodipine, cefazolin, and Biktarvy are not part of everyday vocabulary. Procedure names, anatomy terms, and specialty-specific diagnoses introduce the same problem in a different form. Off-the-shelf speech systems can sound fluent and still miss the words that matter most to a clinical workflow.

Synthetic data generation (SDG) can help close this gap, but only if the synthesized speech is phonetically accurate. A text-to-speech (TTS) system that mispronounces a medication or procedure name produces training or evaluation data that teaches the wrong pronunciation. Instead of fixing the original problem, it can make the failure more difficult to detect. When correctly implemented, SDG enables a team to stand up a domain benchmark in hours, without collecting real clinical audio, waiting on annotation pipelines, or seeking institutional review board (IRB) approval.

This post presents a clinical automatic speech recognition (ASR) workflow for generating pronunciation-aware synthetic audio, reviewing clinical terms, and evaluating recognition quality. NVIDIA agent skills guide the workflow, while NVIDIA NeMo Data Designer and NVIDIA Nemotron Speech provide the data generation and speech services.
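To make "evaluating recognition quality" concrete, here is a minimal sketch of two metrics that such a benchmark might report: word error rate (WER) per utterance, and recall on a list of clinical terms. It is plain Python and does not use any NeMo Data Designer or Nemotron Speech API. The function names (`normalize`, `word_error_rate`, `term_recall`) are hypothetical, the transcripts are invented for illustration, and the term matching assumes single-word terms.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def term_recall(pairs: list[tuple[str, str]], terms: list[str]) -> float:
    """Share of term occurrences in the references that the ASR output kept.

    Assumption: each term is a single word; multi-word terms would need
    phrase matching instead of token lookup.
    """
    expected = 0
    found = 0
    for reference, hypothesis in pairs:
        ref_tokens = set(normalize(reference))
        hyp_tokens = set(normalize(hypothesis))
        for term in terms:
            token = term.lower()
            if token in ref_tokens:
                expected += 1
                found += token in hyp_tokens
    return found / expected if expected else float("nan")


# Illustrative (reference transcript, ASR hypothesis) pairs, not real data.
pairs = [
    ("Start amlodipine five milligrams daily.",
     "start amlodipine five milligrams daily"),
    ("Give cefazolin before the incision.",
     "give cephazolin before the incision"),
]
terms = ["amlodipine", "cefazolin", "Biktarvy"]

for reference, hypothesis in pairs:
    print(f"WER={word_error_rate(reference, hypothesis):.2f}  {hypothesis}")
print(f"clinical term recall={term_recall(pairs, terms):.2f}")
```

In this example the second utterance has a WER of only 0.20, yet the one word it gets wrong is the drug name, so term recall drops to 0.50. Reporting a term-level metric alongside WER is one way to surface the fluent-but-wrong failures described above.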