Back to Articles

Introduction The Benchmark Data Pipeline Evaluation Methodology Findings A. How well do models perform on our benchmark for codeswitching? WER results (lower is better) SWER and AER results (lower is better) B. What additional cost does code-switching add compared to plain monolingual speech? C. How does code-switching break ASR systems? Limitations Conclusion

Introduction

Over half of the world's population speaks more than one language. And for many bilingual speakers, code-switching — seamlessly switching between languages, even mid-sentence — is a natural part of everyday communication. Whether in casual conversations, contact centers, or IT helpdesks, speakers fluidly adapt to whichever language feels most natural in the moment.

Despite the prevalence of bilingual speakers across the world, there has been little work focused on how voice agents handle code-switched speech in enterprise settings. So, when a customer asked us how our voice agents would perform for their largely bilingual customer base who routinely code-switched, we decided to build our own benchmark and dataset to evaluate models. We focused on automatic speech recognition (ASR) — the first step in any voice agent pipeline — because transcription errors propagate forward into every downstream component. In enterprise settings, where a misrouted ticket or misunderstood policy question has real operational consequences, getting the transcript right is an especially important step of the voice agent pipeline.