Building an agent is more than just “call an API.” It requires stitching together retrieval, speech, safety, and reasoning components so they behave like one cohesive system. Each layer has its own interface, latency constraints, and integration challenges, and you start to feel them as soon as you move beyond a simple prototype.

In this tutorial, you’ll learn how to build a voice-powered RAG agent with guardrails using the latest NVIDIA Nemotron models released at CES 2026 for speech, RAG, safety, and reasoning. By the end, you’ll have an agent that:

Listens to spoken input

Uses multimodal RAG to ground itself in your data

Reasons over long context