While Microsoft's MAI team grabs the spotlight with souped-up image models, Microsoft Research is proving how far you can go with limited compute, thanks to detailed captions and smart architecture choices.
Microsoft Research is introducing Lens, a text-to-image model that aims to compete with much larger rivals while using a fraction of their training compute. According to the technical report, Lens needs roughly one-fifth of the pre-training compute of comparable models such as Z-Image. It also beats models many times its size on several benchmarks: Hunyuan-Image-3.0, for example, has about 80 billion parameters, roughly 21 times Lens's 3.8 billion.
Lens and Lens-Turbo score high on benchmarks while keeping inference time short and model size small; larger models need far more compute. | Image: Microsoft
In macro photography, Lens nails the skin texture and color contrasts of a red-eyed tree frog. | Image: Microsoft
Rich captions matter more than raw data volume
