Motivation

Robot manipulation is a robot's ability to physically interact with objects: grasping them, moving them precisely, and adapting to changes in the environment. Traditional Imitation Learning (IL) approaches [ACT, Diffusion Policy] learn directly from human demonstrations, mapping visual observations to actions. While effective in controlled settings, these policies generalize poorly beyond the conditions they were trained in. Vision-Language-Action (VLA) models [RT-2, OpenVLA, π series] represent a promising new paradigm. A VLA typically consists of a VLM backbone and an action expert: the VLM, pretrained on internet-scale vision-language data, provides rich high-level semantic understanding of the scene and the natural language instruction; the action expert then takes this semantic representation and outputs concrete robot actions. The entire architecture is trained end-to-end, so that a VLA can, in principle, both understand what it is asked to do and execute it, rather than simply memorizing fixed scene-action mappings as traditional IL approaches do.

A typical VLA model consisting of a VLM backbone and an action expert (image from π₀)

A VLA model is first pretrained on large-scale, diverse data to acquire general visual and language understanding, then finetuned on a smaller dataset of demonstrations for a target task and environment. However, recent work has raised serious concerns about this finetuning process. Several studies suggest that finetuning causes VLAs to degrade into imitation learners that memorize scene-specific action sequences from the training distribution rather than genuinely understanding the scene through the VLM backbone. LIBERO-PRO finds that model trajectories remain nearly identical when the target object is replaced or removed, or when the instruction is corrupted. LIBERO-Plus further shows that models fail when the target object is displaced.