While large language models (LLMs) have mastered the art of processing text and images, they remain largely confined to the digital realm. Moving from generating code to folding laundry requires a fundamental shift in how AI perceives the world. Microsoft is attempting to bridge this gap with Rho-alpha (⍴ɑ), a new robotics foundation model designed to bring adaptivity to physical tasks.
Rho-alpha falls under the category of Vision-Language-Action (VLA) models. These systems ingest visual data and natural language commands to output robot arm actions. However, standard VLAs often struggle with precision tasks where vision is obstructed or insufficient, such as manipulating a slippery object or inserting a plug behind a desk. Rho-alpha addresses this by integrating tactile sensing directly into its decision-making process, a capability Microsoft refers to as “VLA+.”
The architecture of VLA+
The core innovation of Rho-alpha lies in how it processes sensory data. Most multimodal models tokenize every input, converting images and text into discrete units that a transformer can process. Tactile feedback, however, is a high-frequency, continuous signal that encodes force and resistance, and it does not map cleanly onto discrete tokens.
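To make the tokenization problem concrete, here is a minimal sketch in Python of what happens when a continuous force reading is pushed through a small, fixed vocabulary of bins. This is an illustration only, not Rho-alpha's actual pipeline: the function names, the 0 to 10 N sensor range, the 16-bin vocabulary, and the synthetic 1 kHz force trace are all assumptions made for the example.

```python
import math

def tokenize(signal, low, high, num_bins):
    """Map each continuous reading to a discrete bin index (a 'token')."""
    width = (high - low) / num_bins
    tokens = []
    for x in signal:
        idx = int((x - low) / width)
        tokens.append(min(max(idx, 0), num_bins - 1))  # clamp to valid bins
    return tokens

def detokenize(tokens, low, high, num_bins):
    """Recover an approximate reading from each token (the bin center)."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]

# Hypothetical tactile trace: 50 ms sampled at 1 kHz, a steady 2 N grip
# with a small 40 Hz ripple of the kind a slipping object might cause.
SAMPLE_RATE_HZ = 1000
force = [
    2.0 + 0.05 * math.sin(2 * math.pi * 40 * n / SAMPLE_RATE_HZ)
    for n in range(50)
]

# Assumed sensor range of 0 to 10 N and an assumed 16-token vocabulary.
tokens = tokenize(force, low=0.0, high=10.0, num_bins=16)
recovered = detokenize(tokens, low=0.0, high=10.0, num_bins=16)

ripple_before = max(force) - min(force)
ripple_after = max(recovered) - min(recovered)
print(f"distinct tokens: {sorted(set(tokens))}")
print(f"ripple before tokenizing: {ripple_before:.4f} N")
print(f"ripple after tokenizing:  {ripple_after:.4f} N")
```

In this toy setup every sample lands in the same bin, so the ripple that might signal slip vanishes from the tokenized stream. A finer vocabulary would preserve more detail, but at the cost of more tokens per second for a signal that already arrives far faster than camera frames.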






