Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch

The Transformer’s attention mechanism has barely changed since 2017. Most efficiency work has tried to replace softmax attention outright. A new paper takes a different route. It keeps softmax attention and bolts on a correction branch.

A team of researchers from Northwestern University, Tilde Research, and the University of Washington introduces a parameterized Local Linear Attention called ‘Parallax’ that scales to LLM pretraining and is co-designed with the Muon optimizer.

Parallax does not chase efficiency by cutting compute. It adds compute deliberately, then makes that compute cheaper to run on modern GPUs.

What is Parallax

Parallax builds on Local Linear Attention (LLA). LLA comes from the test-time regression framework. That framework reads attention as a regression solver over key-value pairs.
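The regression reading can be made concrete with a toy example. The sketch below is not the paper's implementation. It assumes scalar keys and values, a single query, and textbook kernel regression, and the function names are made up for illustration. Under those assumptions, softmax attention is a local constant fit (a softmax-weighted mean of the values), while a local linear fit adds a slope term built from the weighted key-value covariance.

```python
import math

def softmax_weights(q, keys):
    """Softmax over the scores q * k_i (scalar keys, an assumption for brevity)."""
    scores = [q * k for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_constant(q, keys, values):
    """Softmax-weighted mean of the values.

    This is the minimizer of sum_i w_i * (v_i - c)**2 over c, i.e. the
    usual softmax attention output for one query.
    """
    w = softmax_weights(q, keys)
    return sum(wi * vi for wi, vi in zip(w, values))

def local_linear(q, keys, values):
    """Prediction at q of the weighted least-squares line through (k_i, v_i)."""
    w = softmax_weights(q, keys)
    k_bar = sum(wi * ki for wi, ki in zip(w, keys))
    v_bar = sum(wi * vi for wi, vi in zip(w, values))
    var_k = sum(wi * (ki - k_bar) ** 2 for wi, ki in zip(w, keys))
    cov_kv = sum(wi * (ki - k_bar) * (vi - v_bar)
                 for wi, ki, vi in zip(w, keys, values))
    slope = cov_kv / var_k
    # Local constant output plus a covariance-based correction term.
    return v_bar + slope * (q - k_bar)

# Toy data: the values are an exact linear function of the keys.
keys = [0.0, 0.5, 1.0, 1.5]
values = [2.0 * k + 1.0 for k in keys]
q = 1.0

print("local constant:", local_constant(q, keys, values))
print("local linear:  ", local_linear(q, keys, values))
```

On this toy data the local linear fit recovers 2q + 1 up to floating-point error, while the local constant fit is pulled toward the keys with the largest weights. How Parallax parameterizes and learns its own correction branch is described in the paper and is not reproduced here.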