Self-Attention is not just “looking at important words.”

It is a matrix operation.

And that is exactly why Transformers scale.

Core Idea

Self-Attention lets each token compare itself with every other token in the same sequence.