Véronique Özkaya is CEO of DATAmundi.ai, delivering high-quality human data for leading global AI labs and enterprises.

​At first glance, AI is viewed as a global technology. However, if you look at its linguistic foundations, AI remains far from global.

Of course, AI generates content and writes in dozens of languages, translates instantly and powers products used across continents. The trouble is that most AI systems still think in one language. You guessed it: English.​

Despite the frequent claim that today’s models are “multilingual,” the reality is that modern AI has largely been built on English. As highlighted by the World Economic Forum, most AI systems are trained on only a small subset, roughly 100 languages, of the approximately 7,000 languages spoken worldwide.​

Analyses of large public training datasets for large language models (LLMs) show a strong dominance of English. For example, studies such as Meta’s LLaMA 2 paper indicate that roughly 90% of training tokens are English, while broader web data suggests English still accounts for nearly half of online content. If AI models such as ChatGPT are primarily trained on internet data, this imbalance inevitably shapes and skews how they understand and represent the world.​