Microsoft Research's Lens proves detailed captions matter more than raw scale for training efficient image generators

Microsoft Research presents Lens, a text-to-image model with just 3.8 billion parameters that matches much larger rivals on benchmarks at a fraction of the training cost. The secret sauce: 800 million detailed image captions generated by GPT-4.1 instead of vague web alt-text. Code and weights are available under an open-source license.

Monday, June 8, 2026

While Microsoft's MAI team grabs the spotlight with souped-up image models, Microsoft Research is proving how far you can go with limited compute, thanks to detailed captions and smart architecture choices.

Microsoft Research is introducing Lens, a text-to-image model that aims to compete with much larger rivals while using a fraction of the compute during training. According to the technical report, Lens needs roughly one-fifth the compute that comparable models like Z-Image require for pre-training. It beats models many times its size across several benchmarks. Hunyuan-Image-3.0, for example, has about 80 billion parameters. Lens has just 3.8 billion.

Lens and Lens-Turbo score high on benchmarks while keeping inference time short and model size small; larger models need far more compute. | Image: Microsoft

In macro photography, Lens nails the skin texture and color contrasts of a red-eyed tree frog. | Image: Microsoft

Rich captions matter more than raw data volume
