BERT reads the whole input at once, using context from both directions. GPT reads left to right and predicts what comes next. Forever.
That difference sounds limiting. It's not.
When you train a decoder-only transformer on billions of tokens of text and code, predicting the next token forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what it takes to predict text well.
GPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.
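To make "predict the next token" concrete, here is a minimal sketch of the generation loop. It is not how GPT is implemented: a toy bigram counter stands in for the transformer, and the names `train_bigram`, `generate`, and the tiny corpus are all made up for illustration. The part that carries over is the loop itself: predict one token, append it, feed the longer sequence back in.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token.

    Toy stand-in for a trained model (assumption: a real GPT conditions
    on the whole prefix, not just the previous token).
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Greedy autoregressive loop: predict, append, repeat."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the toy model never saw anything after this token
        # Pick the most frequent follower (greedy decoding).
        next_token = followers.most_common(1)[0][0]
        out.append(next_token)
    return out


# Hypothetical corpus, chosen so no two followers tie in frequency.
corpus = "the cat sat . the cat sat . the dog ran .".split()
model = train_bigram(corpus)
print(generate(model, ["the"], 4))  # ['the', 'cat', 'sat', '.', 'the']
```

Swap the bigram table for a transformer that scores every vocabulary token given the full prefix, and swap greedy selection for sampling, and the loop keeps the same shape.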
What You'll Learn Here
