AGI Is Not Multimodal

Wednesday, June 4, 2025

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." – Terry Winograd

The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy even our most basic intuitions about it. They have emerged not because they are thoughtful solutions to the problem of intelligence, but because they scaled effectively on hardware we already had. Seduced by the fruits of scale, some have come to believe that it provides a clear pathway to AGI. The most emblematic case of this is the multimodal approach, in which massive modular networks are optimized for an array of modalities that, taken together, appear general. However, I argue that this strategy is sure to fail in the near term; it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination. Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and see modality-centered processing as an emergent phenomenon.

Preface: Disembodied definitions of Artificial General Intelligence (emphasis on general) exclude crucial problem spaces that we should expect AGI to be able to solve. A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc. As I will discuss in the next section, what is needed for these problems is a form of intelligence that is fundamentally situated in something like a physical world model. For more discussion on this, look out for Designing an Intelligence.
Edited by George Konidaris, MIT Press, forthcoming.

Why We Need the World, and How LLMs Pretend to Understand It

TLDR: I first argue that true AGI needs a physical understanding of the world, as many problems cannot be converted into a problem of symbol manipulation. It has been suggested by some that LLMs are learning a model of the world through next-token prediction, but it is more likely that LLMs are learning bags of heuristics to predict tokens. This leaves them with a superficial understanding of reality and contributes to false impressions of their intelligence.

The most shocking result of the predict-next-token objective is that it yields AI models that reflect a deeply human-like understanding of the world, despite having never observed it like we have. This result has led to confusion about what it means to understand language and even to understand the world, something we have long believed to be a prerequisite for language understanding. One explanation for the capabilities of LLMs comes from an emerging theory suggesting that they induce models of the world through next-token prediction. Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models to similar internal representations, and their favorite rendition of the idea that “language mirrors the structure of reality,” a notion that has been espoused by at least Plato, Wittgenstein, Foucault, and Eco. While I’m generally in support of digging up esoteric texts for research inspiration, I’m worried that this metaphor has been taken too literally. Do LLMs really learn implicit models of the world? How could they otherwise be so proficient at language?

One source of evidence in favor of the LLM world modeling hypothesis is the Othello paper, wherein researchers were able to predict the board of an Othello game from the hidden states of a transformer model trained on sequences of legal moves.
However, there are many issues with generalizing these results to models of natural language. For one, whereas Othello moves can provably be used to deduce the full state of an Othello board, we have no reason to believe that a complete picture of the physical world can be inferred from a linguistic description. What sets the game of Othello apart from many tasks in the physical world is that Othello fundamentally resides in the land of symbols, and is merely implemented using physical tokens to make it easier for humans to play. A full game of Othello can be played with just pen and paper, but one can’t, e.g., sweep a floor, do dishes, or drive a car with just pen and paper. To solve such tasks, you need some physical conception of the world beyond what humans can merely say about it. Whether that conception of the world is encoded in a formal world model or, e.g., a value function is up for debate, but it is clear that there are many problems in the physical world that cannot be fully represented by a system of symbols and solved with mere symbol manipulation.

Another issue, stated in Melanie Mitchell’s recent piece and supported by this paper, is that generative models can score remarkably well on sequence prediction tasks while failing to learn models of the worlds that created such sequence data, e.g. by learning comprehensive sets of idiosyncratic heuristics. For example, it was pointed out in this blog post that OthelloGPT learned sequence prediction rules that don’t actually hold for all possible Othello games, like “if the token for B4 does not appear before A4 in the input string, then B4 is empty.” While one can argue that it doesn’t matter how a world model predicts the next state of the world, it should raise suspicion when that prediction reflects a better understanding of the training data than of the underlying world that led to such data.
This, unfortunately, is the central fault of the predict-next-token objective, which seeks only to retain information relevant to the prediction of the next token. If it can be done with something easier to learn than a world model, it likely will be.

To claim without caveat that predicting the effects of earlier symbols on later symbols requires a model of the world like the ones humans generate from perception would be to abuse the “world model” notion. Unless we disagree on what the world is, it should be clear that a true world model can be used to predict the next state of the physical world given a history of states. Similar world models, which predict high-fidelity observations of the physical world, are leveraged in many subfields of AI, including model-based reinforcement learning, task and motion planning in robotics, causal world modeling, and areas of computer vision, to solve problems instantiated in physical reality. LLMs are simply not running physics simulations in their latent next-token calculus when they ask you if your person, place, or thing is bigger than a breadbox. In fact, I conjecture that the behavior of LLMs is owed not to a learned world model, but to brute-force memorization of incomprehensibly abstract rules governing the behavior of symbols, i.e. a model of syntax.

Quick primer:

Syntax is a subfield of linguistics that studies how words of various grammatical categories (e.g. parts of speech) are arranged together into sentences, which can be parsed into syntax trees. Syntax studies the structure of sentences and the atomic parts of speech that compose them.

Semantics is another subfield concerned with the literal meaning of sentences, e.g., compiling “I am feeling chilly” into the idea that you are experiencing cold.
Semantics boils language down to literal meaning, which is information about the world or human experience.

Pragmatics studies the influence of physical and conversational context on speech interactions, like when someone knows to close an ajar window when you tell them “I am feeling chilly.” Pragmatics involves interpreting speech while reasoning about the environment and the intentions and hidden knowledge of other agents.

Without getting too technical, there is intuitive evidence that somewhat separate systems of cognition are responsible for each of these linguistic faculties. Look no further than the ability of humans to generate syntactically well-formed sentences that have no semantic meaning, e.g. Chomsky’s famous sentence “Colorless green ideas sleep furiously,” or sentences with well-formed semantics that make no pragmatic sense, e.g. responding merely with “Yes, I can” when asked, “Can you pass the salt?” Crucially, human language understanding arises from the fusion of the disparate cognitive abilities underpinning these faculties. For example, there isn’t anything syntactically wrong with the sentence, “The fridge is in the apple,” as a syntactic account of “the fridge” and “the apple” would categorize them as noun phrases that can be used to produce a sentence with the production rule, S → (NP “is in” NP). However, humans recognize an obvious semantic failure in the sentence that becomes apparent after attempting to reconcile its meaning with our understanding of reality: we know that fridges are larger than apples, and could not fit inside them.
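
To make the separation concrete, here is a minimal sketch in Python of the production rule above alongside a separate plausibility check. Everything in it is an assumption made for illustration: the three-noun lexicon, the rough size values, and the function names are hypothetical, and the hand-written size table is only a stand-in for the kind of world knowledge the essay has in mind, not a claim about how humans or LLMs represent it.

```python
import re

# Hypothetical lexicon: noun -> rough typical size in metres (illustrative values only).
TYPICAL_SIZE_M = {"fridge": 1.8, "apple": 0.08, "kitchen": 4.0}

# Toy version of S -> NP "is in" NP, with NP -> "the" NOUN.
SENTENCE = re.compile(r"^the (\w+) is in the (\w+)\.?$", re.IGNORECASE)


def parse(sentence):
    """Return (contained, container) if the sentence fits the rule, else None."""
    match = SENTENCE.match(sentence.strip())
    if match is None:
        return None
    contained, container = (word.lower() for word in match.groups())
    # Only nouns in the toy lexicon count as valid noun phrases here.
    if contained not in TYPICAL_SIZE_M or container not in TYPICAL_SIZE_M:
        return None
    return contained, container


def is_syntactic(sentence):
    """Syntax only: does the sentence match the production rule?"""
    return parse(sentence) is not None


def is_plausible(sentence):
    """Syntax plus a crude check against assumed knowledge about object sizes."""
    parsed = parse(sentence)
    if parsed is None:
        return False
    contained, container = parsed
    return TYPICAL_SIZE_M[contained] < TYPICAL_SIZE_M[container]


for text in ("The fridge is in the apple.", "The apple is in the fridge."):
    print(text, "| syntactic:", is_syntactic(text), "| plausible:", is_plausible(text))
```

Both sentences pass the syntactic check, but only the second passes the plausibility check. The point of the sketch is that the rejection comes from information outside the grammar; where that information should come from is exactly what is in dispute.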
