Gemma-4 31B + vLLM on RTX 6000 PRO : A Real-Load Benchmark

Back to Articles

Throughput, latency, and queue depth for Gemma-4 31B served on vLLM under progressive load, from 12 to 24 concurrency The numbers that matter: 1.17k tok/s peak, ~0.7s median TTFT, and tail latency as the one thing to watch.

Model Overview

Gemma-4-31B-it-FP8 is a 30.7B parameter dense Transformer built by Google DeepMind, designed for frontier-level reasoning, coding, multimodal understanding, and agentic workflows. It supports a 256K-token context window, accepts text, image, and video input, and but we deployed for a 4K-token context window only to mimic the requirements for ShareGPT dataset.

This FP8-block checkpoint was quantized by RedHatAI using LLM Compressor, compressing the weights and activations of linear operators within transformer blocks to the FP8 data type while keeping the vision tower, embedding, and output head layers in their original precision.

Back to Articles

Model Overview

Gemma-4 31B + vLLM on RTX 6000 PRO : A Real-Load Benchmark

Gemma-4 31B + vLLM on RTX 6000 PRO : A Real-Load Benchmark

Other newsrooms on this story

Related reading

Gemma 4 31B Runs Fastest on SambaCloud

I Ran Every Gemma 4 Model on My Home Lab. E4B Crushes E2B. Here's the Data.

Gemma 4: The 128K Multimodal Powerhouse in Your Terminal

Gemma 4 12B: Google's encoder-free multimodal AI now runs on a laptop

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas,…

Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM

Other newsrooms on this story

Related reading

Gemma 4 31B Runs Fastest on SambaCloud

I Ran Every Gemma 4 Model on My Home Lab. E4B Crushes E2B. Here's the Data.

Gemma 4: The 128K Multimodal Powerhouse in Your Terminal

Gemma 4 12B: Google's encoder-free multimodal AI now runs on a laptop

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas,…

Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM