MolmoPoint is a new vision-language model architecture that replaces text-based coordinate outputs with a token-based pointing mechanism, which selects regions directly from the visual features.
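As a rough illustration of what selecting a region from visual features can look like, the sketch below scores each image patch against a query vector and returns the center of the best-scoring patch, rather than emitting coordinates as text. It is not MolmoPoint's actual implementation. The dot-product scoring, the row-major patch grid, and every name here (`select_patch`, `patch_center`, `grid_w`, `patch_size`) are assumptions made for the example.

```python
from typing import List, Tuple


def select_patch(query: List[float], patch_features: List[List[float]]) -> int:
    """Return the index of the patch whose feature best matches the query.

    Assumption: similarity is a plain dot product; a real model may differ.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in patch_features]
    return max(range(len(scores)), key=scores.__getitem__)


def patch_center(index: int, grid_w: int, patch_size: int) -> Tuple[float, float]:
    """Map a row-major patch index to the (x, y) pixel center of that patch."""
    row, col = divmod(index, grid_w)
    return ((col + 0.5) * patch_size, (row + 0.5) * patch_size)


# Toy data (hypothetical): a 2x2 grid of 14-pixel patches, 3-dim features.
patch_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
query = [0.1, 0.2, 0.9]

idx = select_patch(query, patch_features)           # index of the chosen patch
point = patch_center(idx, grid_w=2, patch_size=14)  # pixel location of the point
print(idx, point)
```

In this toy setup the point is read off the selected patch's grid position, so no coordinate string has to be generated or parsed.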

MolmoPoint and MolmoWeb extend the Molmo family from visual understanding to visual action, giving researchers open tools for models that can point, navigate, and interact with…
