Stochastic Gradient Descent's (SGD's) Frequency Bias and How Adam Fixes It

Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization challenge: parameters associated with common tokens receive constant gradient updates, while parameters tied to rare tokens may go hundreds or thousands of steps without receiving any meaningful signal. Under standard Stochastic Gradient Descent (SGD), every parameter uses the same learning rate, so frequently updated weights converge quickly while rare-token weights often remain close to their random initialization.
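To make the imbalance concrete, here is a minimal sketch that counts how many training steps touch each token's parameters. It assumes a Zipf-like token distribution (probability proportional to 1/rank), and the function name `count_updates`, the vocabulary size, and the batch size are all hypothetical choices for illustration rather than figures from any real training run.

```python
import random

def count_updates(vocab_size=1000, steps=2000, batch_tokens=32, seed=0):
    """Count how many steps touch each token's parameters.

    Assumption: tokens follow a Zipf-like law, p(rank) proportional to 1/rank.
    A token's parameters get a gradient only on steps where the token
    appears in the sampled batch.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    updates = [0] * vocab_size
    for _ in range(steps):
        # Distinct tokens present in this step's batch.
        batch = set(rng.choices(range(vocab_size), weights=weights, k=batch_tokens))
        for tok in batch:
            updates[tok] += 1
    return updates

if __name__ == "__main__":
    updates = count_updates()
    print("most common token updated on", updates[0], "steps")
    print("rarest token updated on", updates[-1], "steps")
```

Under these assumptions the most common token is updated on nearly every step, while the rarest is updated on only a small fraction of them. With a single shared learning rate, that gap in update counts translates directly into a gap in how far each set of parameters can move.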

This is where Adam’s adaptive optimization becomes important. Adam is commonly described as SGD with momentum, but its most impactful feature in practice is per-parameter normalization by a running estimate of the gradient's second moment (loosely, variance normalization). Adam tracks running averages of the gradient and the squared gradient for each parameter independently, and divides each update by the square root of the squared-gradient average. Parameters that rarely receive gradient signal accumulate a small squared-gradient average, so they end up with larger effective learning rates, allowing underrepresented features to learn much faster than they would under vanilla SGD.
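The effect can be sketched with two scalar parameters, one that receives a gradient on every step and one that receives a gradient only every 100th step. The code below is a hand-written version of the standard Adam update with its usual default coefficients, not any particular library's implementation. The helper names `total_movement` and `movement_ratio`, the constant gradient of 1.0, and the step counts are assumptions chosen for illustration.

```python
import math

def total_movement(optimizer, period, steps=1000, lr=0.01,
                   b1=0.9, b2=0.999, eps=1e-8):
    """Distance one scalar parameter travels when its gradient is 1.0
    on every `period`-th step and 0.0 on all other steps."""
    param, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 1.0 if t % period == 0 else 0.0
        if optimizer == "sgd":
            # Plain SGD: step size is proportional to the raw gradient.
            param -= lr * grad
        else:
            # Adam: running averages of the gradient and squared gradient,
            # with bias correction for their zero initialization.
            m = b1 * m + (1 - b1) * grad
            v = b2 * v + (1 - b2) * grad * grad
            m_hat = m / (1 - b1 ** t)
            v_hat = v / (1 - b2 ** t)
            param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(param)

def movement_ratio(optimizer, rare_period=100):
    """How much farther the frequent parameter moves than the rare one."""
    frequent = total_movement(optimizer, period=1)
    rare = total_movement(optimizer, period=rare_period)
    return frequent / rare

if __name__ == "__main__":
    print("SGD  frequent/rare movement ratio:", movement_ratio("sgd"))
    print("Adam frequent/rare movement ratio:", movement_ratio("adam"))
```

Under SGD the ratio matches the frequency gap (about 100 in this setup), because each parameter moves only in proportion to the gradients it sees. Under Adam the ratio is considerably smaller, because the rare parameter's small squared-gradient average inflates each of its steps. The gap does not vanish, but the rare parameter is no longer penalized in full proportion to its rarity.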
