Vision-Language-Action models (VLAs) are a popular recipe for generalist robot policies: take a vision-language model pretrained on internet-scale image-text data, then continue training it on (comparatively scarce) robot demonstrations. A growing trend trains these models to reason before they act, emitting an intermediate chain-of-thought (CoT) that bridges high-level intent and low-level control, much like step-by-step thinking in language models.
But what should a robot actually think about? Today's reasoning VLAs usually follow a fixed, hand-designed template: enumerate every visible object's bounding box, create a high-level plan, describe affordances, identify the end-effector gripper position, and so on, at every single step. Such a template is expensive to design: teams spend months writing annotation guidelines. More fundamentally, it may prescribe the wrong things to think about. Listing every object in a cluttered scene can drown out the one cue that matters. Re-planning at every timestep can be redundant. Verbose reasoning also slows the policy down at test time, where generating a few seconds of text per control step creates real lag.
In our paper "Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning," we introduce R&B-EnCoRe (short for Refine and Bootstrap Embodiment-specific Chain-of-Thought Reasoning), which is a self-improvement pretraining cycle for embodied reasoning VLAs. Instead of relying on a fixed reasoning template, the model generates its own candidate reasoning, refines it by measuring how much each piece of CoT reasoning actually helps predict the correct action, and bootstraps a stronger policy by retraining on the refined data. The whole cycle is self-supervised: no external rewards, no verifiers, no human annotation.
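To make the refine step concrete, here is a minimal sketch of one way to score reasoning by how much it helps predict the demonstrated action. It is an illustration under stated assumptions, not the paper's actual procedure: the function names (`refine_reasoning`, `build_refined_dataset`), the per-piece log-likelihood gain criterion, the `min_gain` threshold, and the toy scorer standing in for a real VLA are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical interface: log p(action | observation, reasoning) under the current policy.
LogProbFn = Callable[[str, str, str], float]


def refine_reasoning(
    observation: str,
    action: str,
    pieces: List[str],
    log_prob: LogProbFn,
    min_gain: float = 0.0,
) -> List[str]:
    """Keep only reasoning pieces that raise the log-likelihood of the demonstrated action."""
    baseline = log_prob(observation, "", action)  # score with no reasoning at all
    kept = []
    for piece in pieces:
        gain = log_prob(observation, piece, action) - baseline
        if gain > min_gain:
            kept.append(piece)
    return kept


def build_refined_dataset(demos: List[Dict], log_prob: LogProbFn) -> List[Dict]:
    """Attach refined reasoning to each demonstration; retraining on the result is out of scope here."""
    return [
        {
            "observation": d["observation"],
            "reasoning": refine_reasoning(d["observation"], d["action"], d["pieces"], log_prob),
            "action": d["action"],
        }
        for d in demos
    ]


def toy_log_prob(observation: str, reasoning: str, action: str) -> float:
    """Toy stand-in for a VLA (assumption): short reasoning that names the action's target helps."""
    target = action.split()[-1]
    helpful = target in reasoning and len(reasoning.split()) <= 6
    return -1.0 if helpful else -3.0


demos = [
    {
        "observation": "cluttered table",
        "action": "grasp mug",
        "pieces": [
            "objects: mug, plate, fork, napkin, spoon, bowl",
            "plan: pick up the mug",
            "gripper: at (0.10, 0.25)",
        ],
    }
]
refined = build_refined_dataset(demos, toy_log_prob)
print(refined[0]["reasoning"])  # ['plan: pick up the mug']
```

In this toy setup the long object list and the gripper readout add nothing to the action's likelihood, so only the plan survives. In a real cycle the scorer would be the current policy itself, and the refined examples would become training data for the next round.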










