TL;DR: We built a multilingual benchmark of 2,312 child–AI conversational prompts in 23 languages, evaluated four production-grade language models against it, and validated the LLM-as-judge pipeline with five independent judges. The dataset, the model responses, the judge scores, and the iOS companion app are all open source.
Why a benchmark for kids?
A four-year-old asked Alexa for a "challenge", and Alexa surfaced a real instruction to put a coin into a live electrical outlet. A toddler with a speech impediment asked for music, and the AI solicited inappropriate clothing details. These are not synthetic edge cases; they are real incidents, and they motivated this work.
Voice assistants (Alexa, Siri, Google) are already part of how young children interact with technology. But the next wave is bigger: LLM-backed agents are moving into homes, classrooms, and tutoring apps, where they will become daily companions for children. They will be closer than a smart speaker, embedded in children's education, and trusted with the kind of unfiltered questions kids only ask the people they feel safe with. The benchmarks that drive LLM development (TruthfulQA, MMLU, ARC, HELM, MT-Bench) were written for adult users, in English, with adult prompts. There is no widely used, multilingual, behaviourally grounded benchmark for evaluating how an AI handles a child's voice and a child's needs.







