Generative Simulation Benchmarking for heritage language revitalization programs with embodied agent feedback loops
My Learning Journey into Heritage Language AI
It started with a quiet realization during a late-night coding session. I was experimenting with generative AI for language modeling, training transformer-based systems on low-resource languages like Quechua, Navajo, and Māori. The models performed decently on standard benchmarks—BLEU scores, perplexity, and translation accuracy—but something felt hollow. These metrics captured fluency, not cultural resonance. They measured correctness, not connection.
I remember staring at a generated sentence in Quechua that was grammatically perfect but semantically meaningless to a native elder. The AI had mapped words correctly but missed the metaphorical weight, the ceremonial context, and the embodied knowledge embedded in the language. That's when I realized: heritage language revitalization isn't just about vocabulary and syntax—it's about living interaction between speakers, environments, and cultural practices.
This article documents my personal exploration into building a new benchmarking framework—one that uses generative simulations and embodied agent feedback loops to evaluate and improve heritage language programs. It's not a finished product; it's a journey of discovery, failure, and iterative refinement.






