NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B

Knowledge distillation (KD) transfers “dark knowledge” from a large teacher model to a smaller student. The student learns from the teacher’s full output probability distribution over tokens, not just correct answers. This is done via per-position Kullback–Leibler (KL) divergence over next-token probability distributions.

This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers — such as Phi-4-mini or Qwen3-4B — because token positions do not correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families.

NVIDIA researchers introduced X-Token, a logit-distribution-based method for cross-tokenizer KD (Knowledge distillation). It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes.

The Problem X-Token is Solving

Two prior approaches dominate cross-tokenizer KD. ULD (Universal Logit Distillation) sidesteps vocabulary alignment by rank-sorting both distributions and minimizing L1 distance. It discards token identity entirely. GOLD adds span alignment and a hybrid loss. It partitions tokens into a 1-to-1 string-matched common subset, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the current state of the art.

The Problem X-Token is Solving

NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B

NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B

Other newsrooms on this story

Related reading

From Bayesian to deep knowledge tracing — upgrading NumPath's student model…

Leading Inference Providers Achieve Lowest Token Cost With Open Source Models…

NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid…

Orthrus: Parallel Token Generation That Doesn't Change Your Model's Output

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6×…

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a…

Other newsrooms on this story

Related reading

From Bayesian to deep knowledge tracing — upgrading NumPath's student model…

Leading Inference Providers Achieve Lowest Token Cost With Open Source Models…

NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid…

Orthrus: Parallel Token Generation That Doesn't Change Your Model's Output

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6×…

NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a…