Originally published on my blog. Cross-posted here with a canonical link.
Recap: The Transformer Revolution (Part 1)
In Part 1 of this series, we explored how the Transformer architecture — introduced in Google's 2017 paper "Attention Is All You Need" — upended natural language processing. The key ideas were self-attention (letting every token attend to every other token), positional encodings (injecting sequence order without recurrence), and multi-head attention (learning multiple relationship patterns in parallel). Transformers replaced RNNs and LSTMs as the backbone of language models, eventually powering GPT, BERT, and everything that followed.
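As a quick refresher, here is a minimal sketch of two of those ideas in plain Python: sinusoidal positional encodings and single-head scaled dot-product self-attention. It is an illustration under simplifying assumptions, not the paper's full layer: queries, keys, and values are the input vectors themselves (a real layer applies learned projection matrices first), multi-head attention is left out, and the toy embeddings and the helper names `sinusoidal_position` and `self_attention` are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sinusoidal_position(pos, dim):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections.

    Assumption: each token vector serves as its own query, key, and value.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, tokens))
            for j in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 4-dimensional "token embeddings" (made-up values).
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]

# Add position information so that token order is visible to attention.
inputs = [
    [e + p for e, p in zip(emb, sinusoidal_position(pos, 4))]
    for pos, emb in enumerate(embeddings)
]

outputs, weights = self_attention(inputs)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the whole sequence, so every row sums to 1. Nothing in the loop depends on a previous step's output, which is why this computation parallelizes in a way recurrence does not.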
But Transformers were designed for sequences of tokens: words, subwords, characters. Images are not sequences. They are 2D grids of pixels with spatial structure, local patterns, and hierarchical features. For most of the deep learning era, a completely different family of architectures dominated vision: convolutional neural networks.
So how did Transformers learn to see? That is the story of this post.






