I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not

Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered fifteen needle-in-a-haystack questions across four context sizes on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with the ~30 lines of Python to reproduce it on your own hardware.

I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like "the engine is 3.5 litres". It is not a single property. A context-window number means at least three different things: will the model accept this many tokens, will it remember what's in the middle of them, and how fast does the first answer token arrive? The answers diverge sharply once you leave the datacenter-GPU regime that most spec sheets assume.
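To make those three questions concrete, here is a minimal sketch of how one probe can answer all of them in a single pass. It is not the test rig from this post. Every name in it (`FILLER`, `build_haystack`, `probe`, `ask`) is hypothetical, and `ask` stands in for whatever streaming client you use: it is assumed to take a prompt and lazily yield text chunks, so the clock starts before the request is sent.

```python
import time
from typing import Callable, Iterator, Optional

# Hypothetical filler sentence; a real run would use varied text.
FILLER = "The quick brown fox jumps over the lazy dog. "


def build_haystack(needle: str, n_filler: int, depth: float) -> str:
    """Bury `needle` among filler sentences at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    cut = round(n_filler * depth)
    return FILLER * cut + needle + " " + FILLER * (n_filler - cut)


def probe(ask: Callable[[str], Iterator[str]], prompt: str, expected: str) -> dict:
    """Answer the three questions for one prompt: accepted? recalled? time to first token?"""
    start = time.perf_counter()
    ttft: Optional[float] = None
    chunks = []
    try:
        for chunk in ask(prompt):
            if ttft is None:
                # The first chunk arrives only after prefill has finished.
                ttft = time.perf_counter() - start
            chunks.append(chunk)
    except Exception:
        # Assumption: the client raises when the prompt does not fit or the request fails.
        return {"accepted": False, "recalled": False, "ttft_s": None}
    answer = "".join(chunks)
    return {
        "accepted": ttft is not None,
        "recalled": expected.lower() in answer.lower(),
        "ttft_s": ttft,
    }
```

Calling `probe` on `build_haystack(...)` at a few depths and context sizes yields one row per run. The point of returning all three fields together is that a run can pass the first check and still fail either of the other two.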

This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.

The setup

Hardware: RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H
