Back to Articles

Throughput, latency, and queue depth for Gemma-4 31B served on vLLM under progressive load, from 12 to 24 concurrency The numbers that matter: 1.17k tok/s peak, ~0.7s median TTFT, and tail latency as the one thing to watch.

Model Overview

Gemma-4-31B-it-FP8 is a 30.7B parameter dense Transformer built by Google DeepMind, designed for frontier-level reasoning, coding, multimodal understanding, and agentic workflows. It supports a 256K-token context window, accepts text, image, and video input, and but we deployed for a 4K-token context window only to mimic the requirements for ShareGPT dataset.

This FP8-block checkpoint was quantized by RedHatAI using LLM Compressor, compressing the weights and activations of linear operators within transformer blocks to the FP8 data type while keeping the vision tower, embedding, and output head layers in their original precision.