Production vLLM is 100,000+ lines of C++, CUDA, and Python. It powers most of the industry's LLM serving — but reading it cold is brutal.

So I built a study series around nano-vLLM, an open-source reimplementation of vLLM's core ideas in ~1,200 lines of pure Python. Every algorithm is visible. Every design decision is legible. It turned out to be the perfect lens for actually understanding how LLMs generate text.

The result is an 11-chapter interactive guide. No ML background required — every piece of jargon is explained from scratch with analogies, diagrams, annotated source code, interactive simulators, and quizzes.

What it covers:

What Is LLM Inference? — tokens, autoregressive generation, Q/K/V attention, HBM vs SRAM