Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered fifteen needle-in-a-haystack questions across four context sizes on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with the ~30 lines of Python to reproduce it on your own hardware.

I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like "the engine is 3.5 litres". It is not a single property. A context-window number bundles at least three different questions: will the model accept this many tokens, will it remember what's in the middle of them, and how fast does the first answer token arrive? The answers diverge sharply once you run on a laptop GPU instead of the hardware most spec sheets assume.
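To make those three questions concrete, here is a minimal sketch of how one probe could record all three at once. It is separate from the actual rig at the bottom of the post and makes several assumptions: `build_haystack`, `probe`, and `fake_stream` are hypothetical names, the backend is any callable that takes a prompt and yields text chunks, and a backend that refuses an over-long prompt is assumed to raise an exception. The fake backend exists only so the sketch runs without a model.

```python
import time
from typing import Callable, Dict, Iterable


def build_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Place `needle` among `n_filler` filler sentences.

    depth is relative: 0.0 puts the needle first, 1.0 puts it last.
    """
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)


def probe(stream_fn: Callable[[str], Iterable[str]], prompt: str, expected: str) -> Dict:
    """Answer the three questions for one prompt.

    accepted: did the backend take the prompt without raising? (assumption:
              over-length prompts surface as an exception)
    recalled: does the answer contain the expected string?
    ttft_s:   seconds until the first streamed chunk arrived
    """
    result = {"accepted": False, "recalled": False, "ttft_s": None}
    chunks = []
    start = time.perf_counter()
    try:
        for chunk in stream_fn(prompt):
            if result["ttft_s"] is None:
                result["ttft_s"] = time.perf_counter() - start
            chunks.append(chunk)
    except Exception:
        return result  # rejected or failed: leave accepted=False
    result["accepted"] = True
    result["recalled"] = expected.lower() in "".join(chunks).lower()
    return result


def fake_stream(prompt: str) -> Iterable[str]:
    """Stand-in backend, NOT a real model: echoes the code if the prompt has it."""
    yield "The code is "
    yield "7421." if "7421" in prompt else "unknown."


prompt = build_haystack(
    needle="The secret code is 7421.",
    filler="The sky was grey that day.",
    n_filler=200,
    depth=0.5,  # needle in the middle of the context
)
print(probe(fake_stream, prompt + " What is the secret code?", "7421"))
```

Swapping `fake_stream` for a real streaming client is the only change needed to point this at a model. The useful part is that the three fields can disagree: a prompt can be accepted and recalled correctly while `ttft_s` is still far too slow for interactive use.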

This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.

The setup

Hardware: RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H