TL;DR

This tutorial benchmarks NVIDIA Apex's FusedAdam (2x vs. AdamW), FusedLayerNorm, and FusedRMSNorm for Transformer training alongside torch.amp. Fused operators reduce wall-clock time and therefore cost per epoch, which matters for teams scaling foundation model development.

In this tutorial, we work through a practical use of NVIDIA Apex, focusing on the components that still matter in modern GPU training workflows. Instead of treating Apex as a general mixed-precision library, we separate the older parts from the still-useful ones and test them directly. We begin by checking the CUDA runtime, building Apex with the required CUDA and C++ extensions, and detecting which fused kernels are actually available in the environment. This matters because a Python-only Apex installation can appear successful while silently missing the high-performance kernels that make Apex useful.

After the setup, we benchmark FusedAdam against PyTorch AdamW, compare FusedLayerNorm and FusedRMSNorm with standard normalization layers, and run both legacy apex.amp and modern torch.amp examples. We then bring everything together in a small Transformer training experiment, where we compare a vanilla FP32 PyTorch path with a fused Apex-plus-AMP path to assess the real effect on throughput.
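The point about Python-only installs can be checked before any benchmark runs. The sketch below is not the tutorial's own detection code. It assumes that Apex's compiled extensions are importable as top-level modules named `amp_C` and `fused_layer_norm_cuda`; those names may differ between Apex versions. The helper names `module_present` and `detect_apex` are hypothetical.

```python
import importlib.util

# Assumed names of Apex's compiled extension modules (may vary by Apex version).
APEX_EXTENSIONS = {
    "amp_C": "multi-tensor kernels used by FusedAdam",
    "fused_layer_norm_cuda": "kernels behind FusedLayerNorm / FusedRMSNorm",
}


def module_present(name):
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def detect_apex():
    """Report whether the apex package and each assumed extension are present."""
    report = {"apex": module_present("apex")}
    for name in APEX_EXTENSIONS:
        report[name] = module_present(name)
    return report


report = detect_apex()
for name, found in report.items():
    print(f"{name:24s} {'found' if found else 'missing'}")

# A Python-only install shows up as: apex found, compiled extensions missing.
if report["apex"] and not all(report[n] for n in APEX_EXTENSIONS):
    print("Apex is importable, but at least one compiled extension is missing.")
```

Locating a module with `find_spec` avoids importing it, so the check stays cheap and does not fail on a partially built install.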

import os, sys, time, subprocess, importlib

import torch

# Fail fast when no CUDA GPU is visible; the fused kernels require one
assert torch.cuda.is_available(), (
    "No CUDA GPU found. In Colab: Runtime > Change runtime type > Hardware accelerator = GPU"
)
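Once a GPU is confirmed, the optimizer comparison described in the introduction can be approximated with a small timing harness. The following is a rough sketch, not the article's benchmark. The helpers `time_step` and `make_optimizer_step`, the tensor sizes, and the iteration counts are all illustrative assumptions. It times AdamW everywhere and adds `apex.optimizers.FusedAdam` only when it can be imported and constructed.

```python
import time


def time_step(step_fn, warmup=3, iters=10, sync=None):
    """Return mean seconds per call of step_fn, measured after warmup calls."""
    sync = sync or (lambda: None)
    for _ in range(warmup):
        step_fn()
    sync()  # drain queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    sync()  # wait for the timed work to finish
    return (time.perf_counter() - start) / iters


def make_optimizer_step(opt_cls, torch, device, n_params=8, size=256):
    """Build parameters with fixed random gradients and return opt.step."""
    params = [
        torch.nn.Parameter(torch.randn(size, size, device=device))
        for _ in range(n_params)
    ]
    for p in params:
        p.grad = torch.randn_like(p)
    return opt_cls(params, lr=1e-3).step


try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    candidates = {"AdamW": torch.optim.AdamW}
    try:
        from apex.optimizers import FusedAdam
        candidates["FusedAdam"] = FusedAdam
    except ImportError:
        print("Apex not installed; timing AdamW only.")
    for name, cls in candidates.items():
        try:
            step = make_optimizer_step(cls, torch, "cuda")
        except RuntimeError as err:
            # Assumed failure mode when the CUDA extensions were not built.
            print(f"{name}: skipped ({err})")
            continue
        ms = 1e3 * time_step(step, sync=torch.cuda.synchronize)
        print(f"{name}: {ms:.3f} ms/step")
```

CUDA kernels launch asynchronously, so the harness synchronizes before and after the timed loop. Without those calls the clock would mostly measure launch overhead. Results depend heavily on parameter count and tensor size, so the printed numbers are only indicative.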

marktechpost.com

How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

Benchmark NVIDIA Apex FusedAdam and FusedLayerNorm against PyTorch, then pair fused kernels with torch.amp for faster Transformer training.

Tuesday, June 2, 2026

