Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

NVIDIA published nvidia/Qwen3.6-35B-A3B-NVFP4 on May 28, 2026 — a post-training FP4-quantized variant of Alibaba's 35B MoE model that fits on a single H100 by cutting VRAM from ~71 GB to ~23 GB. If you're on an A100 or consumer GPU, jump to the gotchas section first — this quantization format does not run on your hardware.

71 GB → 23 GB: What Gets Quantized and What Doesn't

NVFP4 quantization targets the weights and activations of linear operators inside transformer and MoE blocks specifically — LayerNorms, embeddings, and biases stay in BF16/F32 for numerical stability . The selective 4-bit compression yields a 3.06× reduction in disk footprint and VRAM versus the BF16 base, dropping from roughly 71 GB to ~23 GB equivalent on Hopper hardware .

Quick Answer: nvidia/Qwen3.6-35B-A3B-NVFP4 fits a 35B MoE reasoning model on a single H100 by applying 4-bit quantization to linear operator weights and activations, reducing VRAM from ~71 GB to ~23 GB (3.06×) with under 1-point accuracy loss on standard benchmarks. Hopper or Blackwell required — A100 and RTX 4090 lack FP4 compute paths entirely.

The calibration pipeline used two datasets: cnn_dailymail (300K+ English news articles) and NVIDIA's Nemotron-Post-Training-Dataset-v2 for multi-turn dialogue coverage, processed with NVIDIA Model Optimizer v0.44.0 . The dual-dataset approach is worth noting: a quantization calibrated only on news articles would likely regress on structured, multi-turn instruction-following — and the benchmark results bear that out.

71 GB → 23 GB: What Gets Quantized and What Doesn't

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

Related reading

Qwen 3.6 35B-A3B for Local AI in 2026: The 24GB VRAM Line That Gets You 120…

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe

Qwen3-Coder-Next for Local AI in 2026: Which GPU Can Actually Run Alibaba's #1…

Production-Ready W4A8 vLLM Integration Recovery Techniques

Related reading

Qwen 3.6 35B-A3B for Local AI in 2026: The 24GB VRAM Line That Gets You 120…

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe

Qwen3-Coder-Next for Local AI in 2026: Which GPU Can Actually Run Alibaba's #1…

Production-Ready W4A8 vLLM Integration Recovery Techniques