Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization challenge: parameters associated with common tokens receive constant gradient updates, while parameters tied to rare tokens may go hundreds or thousands of steps without receiving any meaningful signal. Under standard Stochastic Gradient Descent (SGD), every parameter uses the same learning rate, so frequently updated weights converge quickly while rare-token weights often remain close to their random initialization.

This is where Adam’s adaptive optimization becomes important. While Adam is commonly described as SGD with momentum, its most impactful feature in practice is variance normalization. Adam tracks the historical gradient statistics for each parameter independently and automatically adjusts update sizes based on how often reliable gradient information has been observed. Parameters that rarely receive updates end up getting proportionally larger effective learning rates, allowing underrepresented features to learn much faster than they would under vanilla SGD.