Originally published on my blog. Cross-posted here with a canonical link.

This is Part 3 of a 3-part series on the transformer revolution in vision and language:

Part 1: Transformers — The Architecture That Changed AI

Part 2: Vision Transformers — How Transformers Learned to See

Part 3: Vision Language Models — When AI Learns to See and Talk (this post)