Quick answer: What is LLM inference?

LLM inference is the runtime process that turns a user prompt into a model answer. In a few seconds, the system tokenizes text, maps tokens into embeddings, computes attention, stores KV cache, retrieves extra context when needed, and generates the response one token at a time.

TL;DR: LLM inference is not just next-token prediction. A prompt moves through tokenization, embeddings, prefill, attention, KV cache, decode, batching, retrieval, and memory layers before becoming an answer. Modern inference is a latency, cost, and orchestration problem.

Over the past couple of years, inference has evolved from “the model just generates tokens” into one of the most complex engineering systems in AI. While you wait 2–3 seconds for a response, dozens of mechanisms are already working behind the scenes: tokenization, embeddings, attention, KV cache, request routing, retrieval, batching, memory management, and entire optimization pipelines.

In one of our earlier articles, we explained the core fundamentals of inference: key concepts, optimization techniques, and hardware trends. But that was a year ago, and the focus of the field is shifting extremely fast.

Inference now is more about system orchestration – a coordinated runtime system where all elements work together to produce an answer under latency and cost constraints.

Today we’re going to put all the pieces together into one pipeline.
You’ll see the full path from tokens to generated answers, and we’ll answer the most interesting question: what actually happens in the 2.5 seconds between your prompt and the model’s response? There is more going on there than most people realize.

In today’s episode:

- LLM inference in two phases: prefill and decode
- Prefill unpacked
- The first layer: Tokens as the runtime currency
- Embeddings: From token IDs to meaningful geometry
- Attention: Where representations become context and prefill meets decode
- What’s behind decode? The role of attention and KV cache
- Context is not only inside the model
- Inference optimization: batching, chunking, and parallelism
- Why modern inference is system orchestration
- Why attention is not the same as understanding
- Sources and further reading

LLM inference in two phases: prefill and decode

When you write a prompt and send it to a model, a surprisingly complex pipeline starts running. But at the core, the process has two main stages: first, the model processes your request; then, it generates the response. One stage flows directly into the other:

- Prefill – this is the first stage, when the model reads the entire prompt and builds an internal representation of the context. Since all prompt tokens are already known, this step can be heavily parallelized and runs very fast on the GPU. Then prefill flows into →
- Decode – the model generates the response one token at a time. Each new token depends on the previous ones, so this stage is mostly sequential and slower.

The first output token usually takes the longest, because the model must first process the whole prompt.
After that, generation becomes a steady stream of tokens.

When many users send requests at once, inference systems try to balance several goals:

- low latency, meaning fast responses
- high throughput, to serve many users efficiently
- GPU memory efficiency
- high GPU utilization

Speaking of latency, we need to distinguish between two important metrics:

| Metric | What it measures | Main stage | What it affects |
| --- | --- | --- | --- |
| Time to First Token (TTFT) | The time between sending a prompt and receiving the first generated token | Mostly prefill latency | How fast the model starts responding |
| Time per Output Token (TPOT) | The average time required to generate each token after the first one | Mostly decode latency | How fast the response streams after generation begins |

So, total latency is approximately: TTFT + (TPOT × number of output tokens).

As for the hardware, the key detail is that prefill is compute-bound, while decode is memory-bandwidth-bound.

But why does each phase use the GPU differently? To understand that, and how systems can be optimized for efficiency and lower GPU usage, we need to look at how all the LLM workflow components – tokenization, embeddings, attention, and others – are distributed across prefill and decode. There’s much more interesting stuff behind this pipeline than just a sequence of steps for processing text and generating responses.

Prefill unpacked

The first layer: Tokens as the runtime currency

Let’s start from the very beginning. Before a model can process and generate anything, text gets broken into tokens. The tokenizer splits raw text into smaller pieces, which are then converted into numerical IDs. Depending on the tokenizer, a token can be a whole word, part of a word, punctuation, whitespace, or even a byte sequence. The pieces are meant to be small enough to generalize, yet meaningful enough to preserve structure and semantics.
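To make the step from raw text to pieces to numerical IDs concrete, here is a minimal sketch. The vocabulary (`VOCAB`), the greedy longest-match rule, and the function names are our own assumptions for illustration. Real tokenizers (BPE and similar) learn their vocabulary from data and differ in the details.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB is a hypothetical, hand-written vocabulary used only for illustration.
VOCAB = {
    "<unk>": 0,
    " ": 1,
    "token": 2,
    "ization": 3,
    "is": 4,
    "fun": 5,
    "!": 6,
}


def tokenize(text, vocab=VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    max_len = max(len(piece) for piece in vocab)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                pieces.append(candidate)
                i += size
                break
        else:
            # No vocabulary piece starts here: emit the unknown token.
            pieces.append("<unk>")
            i += 1
    return pieces


def encode(text, vocab=VOCAB):
    """Convert text into the list of token IDs the model would receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]
```

With this toy vocabulary, `encode("tokenization is fun!")` produces seven IDs for a 20-character string: the word “tokenization” becomes two tokens, and each space is a token of its own. The same text under a different vocabulary would produce a different sequence length, which is why the tokenizer directly affects context use and cost.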
In production, tokenization is effectively a learned compression layer sitting between human language and GPU compute.

However, this part of the workflow is not only about counting tokens. The way text gets split shapes almost everything about modern AI systems: final sequence lengths, context limits, latency, memory usage, throughput, and even pricing.

Moreover, not all tokens are equal. A system needs to “understand” what kinds of tokens flow through it. An inference pipeline can involve the following token types, which behave very differently:

- Input tokens are relatively cheap because models process them mostly in parallel during the prefill stage.
- Output tokens are more expensive because generation is sequential: the model predicts one token at a time. They belong to the decode stage.
- Reasoning tokens can silently multiply compute usage by generating long internal chains of thought before the final answer appears.
- Cached tokens reduce cost by reusing previously processed context.
- Retrieval and tool-use tokens often dominate agentic systems because every loop adds more context back into the window.

This strongly influences how people design AI systems. A long conversation, a RAG pipeline, or an autonomous agent is now fundamentally a token-management problem. The smartest systems appear to be the ones “deciding” which tokens are actually worth processing, storing, retrieving, or generating in the first place.

Tokenization happens before inference itself starts, but efficient tokenization and processing only the necessary tokens is one of the directions for optimizing compute and memory use.

Tokens are what the input consists of – now let’s look at how they start to come “alive” inside the model.

Embeddings: From token IDs to meaningful geometry

After tokenization, the system only has token IDs – integers like 14382 or 5021. On their own, these numbers carry no meaning the model can use.
In AI, this meaning is hidden in geometry.

An embedding layer maps every token ID to a dense vector – a learned coordinate in a high-dimensional space. The model then learns relationships between these representations through distance and direction. Similar concepts end up near each other, and this is key to generalization (like generalizing from “cat” to “dog” or from “room” to “bedroom”) without memorizing every possible sentence individually.

Technically, this happens through an embedding matrix: a trainable lookup table where each token maps to a vector. During training, those initially random vectors organize into a semantic space where patterns emerge naturally.

Since models also need to know the order of tokens in the sequence, positional encodings are used to inject order information into the vectors. Many systems use a technique called RoPE (Rotary Position Embedding), which rotates vectors (in practice, the attention queries and keys) by an angle that depends on token position, allowing attention layers to track relative distance between tokens efficiently. This concrete geometry is finally what the network can reason over.

Only after this step does the real computation begin →

Attention: Where representations become context and prefill meets decode

The model now has vector representations that carry both semantic meaning and positional structure, and those vectors flow into attention layers where tokens start interacting with each other.

Attention is truly the computational center of modern autoregressive AI models.

First, the vectors are projected into three representations: query (Q) – what the token is looking for; key (K) – what the token exposes about itself; value (V) – the information each token can contribute.

Then, the overall computational process starts to form the context. A token enters as a vector and scans the sequence for useful context, comparing its query against all keys in search of relevant tokens. Higher similarity means higher attention.
Then it pulls in the most relevant information (a weighted sum of the value vectors, with weights given by the attention scores), and exits as a more context-aware version of itself. This repeats layer after layer until the model builds the final representation used for generation.

During prefill, all input tokens are already known, so the GPU can process them in parallel. Internally, the model computes attention states – keys and values – for every token and stores them in the KV cache. This KV cache is extremely important for further optimization. It is what connects prefill and decode, and the reason why the attention mechanism is part of the decode phase as well.

In practice, prefill behaves like a series of large matrix-matrix multiplications, which GPUs are extremely good at, so GPU utilization is usually near maximum here.

What’s behind decode? The role of attention and KV cache

The decode phase works completely differently – it is much less parallel. Every new token depends on all previously generated tokens, so generation becomes sequential:

- generate one token
- append it to the context
- attend to all previous tokens
- compute attention for the next token
- generate the next token
- repeat

The KV cache helps avoid recomputing old attention states: the model computes key and value tensors only for the newest token and appends them to the cache, instead of recalculating keys and values for the whole sequence at every step.

This makes decode much cheaper computationally and model inference more practical. But not without drawbacks. Now the bottleneck becomes memory bandwidth, because every request needs its own cache and cache size grows linearly with sequence length. The hardware starts to spend much of its time moving model weights, activations, and KV tensors through memory instead of doing heavy math. (We will explore some ways of mitigating these problems a little later.)

Figure: A pipeline from a query to a response

Context is not only inside the model

Attention transforms raw token representations into contextual representations.
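Before looking beyond the model, the prefill-to-decode handoff through the KV cache described above can be made concrete with a small sketch. It assumes a single attention head, a single layer, no positional encoding, and plain Python lists instead of GPU tensors. All names here (`KVCache`, `prefill`, `decode_step`, `recompute_from_scratch`) are hypothetical and chosen for illustration, not taken from any framework.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given keys/values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


class KVCache:
    """Per-request store of key and value vectors, one entry per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def prefill(prompt_vectors, w_q, w_k, w_v, cache):
    """Compute K/V for every prompt token and fill the cache.

    A real system does this as one batched matrix multiplication;
    the loop here is only for readability.
    """
    for x in prompt_vectors:
        cache.append(matvec(w_k, x), matvec(w_v, x))
    # The last prompt token attends to the whole prompt.
    return attend(matvec(w_q, prompt_vectors[-1]), cache.keys, cache.values)


def decode_step(x, w_q, w_k, w_v, cache):
    """Handle one new token: compute only its K/V, then attend over the cache."""
    cache.append(matvec(w_k, x), matvec(w_v, x))
    return attend(matvec(w_q, x), cache.keys, cache.values)


def recompute_from_scratch(vectors, w_q, w_k, w_v):
    """Reference path without a cache: rebuild K/V for the whole sequence."""
    keys = [matvec(w_k, x) for x in vectors]
    values = [matvec(w_v, x) for x in vectors]
    return attend(matvec(w_q, vectors[-1]), keys, values)
```

The point of the sketch is the cost difference: `decode_step` performs two projections for the new token and reads the cache, while `recompute_from_scratch` repeats the projections for every token in the sequence yet returns the same attention output. It also shows why decode leans on memory: each step reads every cached key and value, and the cache grows by one entry per generated token.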
The model itself, however, is no longer the only place where context exists. Retrieval and memory systems now extend models far beyond their native context windows, becoming a critical part of the inference stack.

Embeddings also operate outside the model itself. They power search and retrieval from vector databases, ranking systems, long-term memory, and recommendation engines. Vector databases used to function mostly as passive storage layers. In modern agentic systems, retrieval becomes part of the reasoning loop itself: agents search iteratively, refine queries, compare results, write back new information, and reuse past experience over time.

Figure: A pipeline with retrieval

Memory also transforms into an active infrastructure layer for planning, tool use, self-correction, and continuity across workflows. Modern systems now have to decide what is actually worth remembering, how to forget outdated information, how to handle conflicting memories, and how to reduce retrieval noise before it floods the context window.

All these elements have to be designed together for the whole model, agent, or system to be accurate and convenient to use. So →

Inference optimization: batching, chunking, and parallelism

The methods we are about to discuss apply to both the prefill and decode stages.

Besides generating a response, the main goal of the inference pipeline is to maintain high throughput, low latency, and efficient KV cache and memory management. That’s why inference systems optimize how multiple requests share the GPU, using techniques like batching and chunking.

Batching improves throughput because many requests share the same model weights. Traditional static batching processes groups of requests together, but it is inefficient because shorter requests must wait for the longest one in the batch to finish.
Modern runtimes use dynamic/continuous batching, which inserts new requests as others finish, so queries enter and leave batches continuously.

Chunked prefill splits large prompts into smaller pieces, so that a long prefill does not completely pause ongoing decodes.

Because decode is limited by memory bandwidth rather than compute, GPUs are often underutilized during decode unless many requests are batched together. Some inference engines (e.g. vLLM) exploit this by mixing prefills and decodes in the same GPU workload to maximize throughput.

Another problem is that large models often can’t fit on a single GPU. The solution is to split them across multiple GPUs:

- Pipeline parallelism splits the model by layers across devices.
- Tensor parallelism splits computations inside layers themselves, such as attention heads or MLP matrices.
- Sequence parallelism splits operations along the token dimension to reduce activation memory.

To improve the efficiency of the system as a whole, we can optimize each part of the pipeline separately: better attention mechanisms, better KV cache compression, and better KV cache management. One example we found interesting is CompactAttention, which separates KV selection from attention execution and avoids unnecessary memory movement.

First, it selects the most useful parts of the KV cache. Then it converts them into compact KV block tables that can be accessed directly from paged KV cache memory without copying or rearranging tensors. Paged KV memory is a scheme introduced with PagedAttention and widely used in frameworks such as vLLM, where the KV cache is split into small memory “pages” that can be dynamically allocated across GPU memory. CompactAttention keeps accuracy close to dense attention but significantly accelerates long-context inference, reaching up to 2.72× attention speedup at 128K context length during chunked prefill.
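The paged KV memory idea mentioned above can be illustrated with a toy block-table allocator. This is a sketch under our own assumptions, not vLLM’s or CompactAttention’s actual implementation: integer block IDs stand in for GPU memory pages, and the class and method names (`PagedKVAllocator`, `append_token`, `release`) are hypothetical.

```python
class PagedKVAllocator:
    """Toy block-table allocator; block IDs stand in for GPU memory pages."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size                 # tokens that fit in one page
        self.free_blocks = list(range(num_blocks))   # pool of unused pages
        self.block_tables = {}                       # request ID -> list of block IDs
        self.token_counts = {}                       # request ID -> tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (block ID, offset in block)."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.token_counts.get(request_id, 0)
        if count == len(table) * self.block_size:
            # Every page owned by this request is full: take one from the pool.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.token_counts[request_id] = count + 1
        return table[count // self.block_size], count % self.block_size

    def release(self, request_id):
        """Return a finished request's pages to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```

Each request owns a list of pages that do not need to be contiguous, and a page is claimed only when the previous one fills up. Memory is therefore reserved in small increments as the sequence grows, and pages freed by a finished request can be reused immediately by any other request.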
CompactAttention mitigates two issues – inefficient sparse GPU execution and expensive KV gathering operations.

Image credit: CompactAttention original paper

Apart from what we discussed, there are many other optimization options that you can use depending on your particular case and needs.

Why modern inference is system orchestration

2026 is the year when we started to view AI systems from a new perspective. A few years ago it was probably enough to think about inference as “the model generates the next token”; today this has changed a lot with the appearance of reasoning and agentic systems. We now work with multi-layered systems built around the core pipeline, and modern inference is heavily influenced by the orchestration layer around generation. Part of the context may come from the prompt, part from cache or conversation history, part from retrieval and memory, and part from tool calls or agent loops. The final answer appears only after this context is selected, ranked, filtered, and packed into a form the model can use. That is why context engineering is becoming one of the key directions in AI systems. The central question is simple: how well does the system manage the flow of context around the model? How can it deliver the smallest possible set of high-signal information within a limited attention budget? These questions are now moving to the center of the inference stack.

The same shift applies to memory management (including agentic memory). Poor memory becomes a direct source of degraded inference quality, higher cost, worse tool use, wrong agent decisions, and more hallucinations. There is also a distinct task of building systems that can identify what to store long-term, what to keep local, when to summarize, when to trim, and how to prevent conflicts or context poisoning.

Retrieval design is also connected to the overall trend. When relevant knowledge “lives” outside the model’s context window, retrieval becomes part of the model’s cognitive loop.
Bad retrieval floods the context window with low-signal, irrelevant information and reduces answer quality.

So this is probably one of the biggest shifts in modern AI system design: when we talk about models and agents, we mean entire systems with retrieval, tools, context, and memory management. More and more intelligence now lives in the orchestration layers around models, and inference can no longer be just a straight line from input to output. Choosing the best orchestration layer with all the proper puzzle pieces is now as important as choosing a good model.

Sources and further reading

- Attention Is All You Need | Paper
- Mastering LLM Techniques: Inference Optimization | NVIDIA blog post
- Inference Optimization of Foundation Models on AI Accelerators | Paper
- Trustworthy AI Inference Systems: An Industry Research View | Paper
- Efficient Memory Management for Large Language Model Serving with PagedAttention | Paper
- CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection | Paper

FAQ

What is LLM inference?
LLM inference is the process where a trained model turns an input prompt into an output answer by processing tokens, computing attention, and generating new tokens.

Prefill vs decode: what is the difference?
Prefill processes the full prompt in parallel and mostly determines time to first token. Decode generates the answer one token at a time and mostly determines streaming speed.

Why does KV cache matter in LLM inference?
KV cache stores previously computed attention keys and values so the model does not recompute the whole context for every new generated token.

Why is decode slower than prefill?
Decode is sequential: each new token depends on previous tokens. It is also often limited by memory bandwidth because the system must repeatedly read model weights and KV cache.

Why is modern inference about orchestration?
Modern inference combines model execution with routing, batching, retrieval, memory, caching, and tool use.
The final answer depends on how well the whole system manages context, latency, and cost.