Vision Transformers — How Transformers Learned to See (Part 2 of 3)

Originally published on my blog. Cross-posted here with a canonical link.

Recap: The Transformer Revolution (Part 1)

In Part 1 of this series, we explored how the Transformer architecture — introduced in Google's 2017 paper "Attention Is All You Need" — upended natural language processing. The key ideas were self-attention (letting every token attend to every other token), positional encodings (injecting sequence order without recurrence), and multi-head attention (learning multiple relationship patterns in parallel). Transformers replaced RNNs and LSTMs as the backbone of language models, eventually powering GPT, BERT, and everything that followed.
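As a quick refresher on the first of those ideas, here is a minimal sketch of single-head self-attention in plain Python. It is a simplification, not the paper's full formulation: the query, key, and value projections are assumed to be the identity (a real Transformer learns separate weight matrices for each), and the names `softmax`, `self_attention`, and `tokens` are made up for this illustration.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention over a list of equal-length vectors.

    Assumption: queries, keys, and values are the token vectors themselves
    (identity projections), which keeps the sketch short.
    """
    d = len(tokens[0])
    outputs, all_weights = [], []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in tokens
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, tokens))
            for j in range(d)
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "tokens" with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and says how much one token attends to every other token, itself included. Nothing in this computation depends on token order, which is why positional encodings have to be added separately.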

But Transformers were designed for sequences of tokens: words, subwords, characters. Images are not sequences. They are 2D grids of pixels with spatial structure, local patterns, and hierarchical features. For most of the deep learning era, a completely different family of architectures dominated vision: convolutional neural networks.

So how did Transformers learn to see? That is the story of this post.

Related reading

Vision Language Models — When AI Learns to See and Talk (Part 3 of 3)

Transformers — The Architecture That Changed AI (Part 1 of 3)

How Transformers Work — From Self-Attention to Modern LLM Architecture

Understanding Attention in Transformers — Intuition Before Equations

The Sequence Knowledge #870: Liquid Models and the Search for a…

The Sequence Knowledge #878: Beyond Transformer: What We Learned
