How LLM Inference Works

Arpit Bhayani


When you enter a prompt into an LLM, the model converts your text into numbers, processes them, and returns a response one token at a time. In this article, we go through the journey of LLM inference and see how it works.

What are Large Language Models?

LLMs are just neural networks built on the transformer architecture. Unlike earlier architectures that processed text sequentially, transformers can analyze entire sequences in parallel, making them more efficient to train and deploy.

The fundamental building block of these models is the transformer layer, which consists of two primary components:

  1. a self-attention mechanism, and
  2. a feed-forward neural network.

LLMs stack dozens of these layers, creating deep networks capable of capturing complex patterns in language.

Transformers rely on self-attention, which evaluates how each word relates to the rest of the sequence, not just its neighbouring words.

Model size refers to the number of parameters in the network. A 7-billion parameter model has 7 billion floating-point numbers that encode the knowledge learned during training. These parameters are organized into weight matrices that transform the input data at each layer.

Models like GPT-4, Claude, and Llama are decoder-only transformers, meaning they use only the decoder part of the original transformer architecture. This makes them autoregressive, generating one token at a time based on all previously generated tokens, which is perfect for text generation tasks.

Tokenization

Before any computation happens, the model needs to convert your text input into numbers. This process, called tokenization, breaks text into smaller units called tokens.

The most common tokenization approach in modern LLMs is Byte Pair Encoding (BPE). BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pairs of adjacent tokens to create new tokens.

# Example of BPE tokenization process
# Input text: "unhappiness"

# Initial: ['u', 'n', 'h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's']
# After merges: ['un', 'happi', 'ness']

Because of BPE, common words get represented as single tokens (efficient), while rare or unknown words get broken into familiar subword pieces (flexible).
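To make the merge step concrete, here is a toy sketch of a single training-time merge over a tiny corpus. The corpus, the helper names, and the resulting merge are purely illustrative; real tokenizers learn tens of thousands of merges from large datasets.

from collections import Counter

def bpe_merge_step(corpus):
    """Find the most frequent adjacent pair across all words and merge it."""
    pair_counts = Counter()
    for word in corpus:
        pair_counts.update(zip(word, word[1:]))
    best = max(pair_counts, key=pair_counts.get)

    merged_corpus = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])  # merge the pair into one token
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus, best

corpus = [list("unhappiness"), list("happy"), list("unhappy")]
corpus, merged_pair = bpe_merge_step(corpus)
# merged_pair -> e.g. ('h', 'a'); repeating this step builds up larger subword tokens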

The tokenization process works by encoding your input text into UTF-8 bytes, then applying the learned merge rules to compress the byte sequence into tokens. Each token maps to an integer ID that the model can work with.

# Simplified tokenization example, assuming a Hugging Face tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "The AI model generates text"
tokens = tokenizer.encode(text)
# e.g., [464, 15592, 2746, 18616, 2420] (exact IDs depend on the tokenizer)

Tokenization directly impacts model performance and costs. More tokens mean more computation, higher API costs, and a greater chance of hitting context length limits. This is why non-English text often costs more to process: a tokenizer trained primarily on English data typically needs more tokens per word for other languages.
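A quick way to see this effect is to count tokens with an off-the-shelf BPE tokenizer. The sketch below assumes the tiktoken library and its cl100k_base encoding; exact counts vary by tokenizer and model.

import tiktoken  # assumed dependency: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Good morning, how are you today?"
hindi = "सुप्रभात, आज आप कैसे हैं?"

print(len(enc.encode(english)))  # relatively few tokens per word for English
print(len(enc.encode(hindi)))    # typically many more tokens for the same meaning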

Token Embeddings

Once text becomes tokens, the next step transforms these discrete token IDs into continuous vector representations that neural networks can process. This happens through an embedding layer, essentially a lookup table that maps each token ID to a high-dimensional vector.

For a model with a vocabulary of 50,000 tokens and an embedding dimension of 4,096, the embedding matrix has shape [50000, 4096]. Each row represents one token, and the values in that row form the embedding vector for that token.

token_id = 464  # Token ID for "The"
embedding_vector = embedding_matrix[token_id]
# Result: a vector of 4096 floating-point numbers

These embedding vectors capture semantic meaning learned during training. Words with similar meanings have embedding vectors that point in similar directions in this high-dimensional space.
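A tiny NumPy sketch of what "similar directions" means. The vectors below are made-up 4-dimensional stand-ins for real embedding rows, which have thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (illustrative values only)
v_cat = np.array([0.9, 0.1, 0.3, 0.0])
v_dog = np.array([0.8, 0.2, 0.4, 0.1])
v_car = np.array([0.0, 0.9, 0.0, 0.7])

print(cosine_similarity(v_cat, v_dog))  # close to 1: related meanings
print(cosine_similarity(v_cat, v_car))  # much lower: unrelated meanings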

Transformers do not inherently understand the order of tokens. To address this, positional information is injected into the model: classic approaches add positional encodings (fixed or learned) to the embeddings, while many modern models use relative schemes such as Rotary Position Embeddings (RoPE), which rotate the query and key vectors inside attention rather than modifying the input embeddings.
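To give a flavour of RoPE, here is a minimal NumPy sketch of the rotation idea using the half-split variant found in GPT-NeoX-style implementations; in a real model this is applied to the query and key vectors of every attention head, not to the raw embeddings.

import numpy as np

def rope(x, positions, base=10000):
    """Rotate pairs of dimensions of x ([seq_len, dim], dim even) by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per dimension pair
    angles = positions[:, None] * freqs[None, :]   # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(5, 64)                # queries for 5 tokens, head_dim 64
q_rot = rope(q, positions=np.arange(5))   # position information is now baked into q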

The Transformer Architecture

The transformer processes embedding vectors through its layers. Each transformer layer applies two main operations: multi-head self-attention and feed-forward networks.

The self-attention mechanism computes three matrices for each token: Query (Q), Key (K), and Value (V). These come from multiplying the input embeddings by three learned weight matrices.

# Self-attention computation
Q = input @ W_query   # Shape: [batch, seq_len, dim]
K = input @ W_key     # Shape: [batch, seq_len, dim]
V = input @ W_value   # Shape: [batch, seq_len, dim]

The weight matrices W_query, W_key, and W_value are learned during training. They are randomly initialized and then adjusted through backpropagation to extract the most useful patterns from the embeddings.

The attention mechanism then computes how much each token should attend to every other token. This happens through a scaled dot-product attention calculation:

# Attention scores
scores = (Q @ K.transpose()) / sqrt(dim)
attention_weights = softmax(scores)
output = attention_weights @ V

The scaling factor (square root of dimension) prevents the dot products from becoming too large, which would cause the softmax function to saturate and produce extremely small gradients during training.
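A quick numerical illustration of why the scaling matters; the vectors here are random stand-ins for a single query and key.

import numpy as np

dim = 4096
q, k = np.random.randn(dim), np.random.randn(dim)

# Raw dot products of high-dimensional vectors are large (std ~ sqrt(dim) ~ 64),
# which would push the softmax toward a near one-hot distribution
print(q @ k)
print((q @ k) / np.sqrt(dim))  # scaled back to roughly unit magnitude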

Multi-head attention runs this process multiple times in parallel with different learned projection matrices. A model might use 32 attention heads, each learning to focus on different aspects of the relationships between tokens. The outputs from all heads get concatenated and projected back to the model dimension.
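A minimal NumPy sketch of this split-compute-concatenate pattern; shapes and weight names are illustrative, and the causal mask is omitted for brevity.

import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: [seq_len, dim]; each W_*: [dim, dim]."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):  # [seq_len, dim] -> [num_heads, seq_len, head_dim]
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)   # [heads, seq, seq]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax

    heads = weights @ V                                      # [heads, seq, head_dim]
    concat = heads.transpose(1, 0, 2).reshape(seq_len, dim)  # concatenate the heads
    return concat @ W_o                                      # project back to model dim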

After attention, the output passes through a feed-forward network, which consists of two linear transformations with a non-linear activation function in between. This typically expands the dimensionality by 4x before projecting back down.

# Feed-forward network
hidden = activation(input @ W1 + b1)   # Expand to 4x dimension
output = hidden @ W2 + b2              # Project back down

Inference Phases - Prefill and Decode

The prefill phase happens when you first submit a prompt. The model processes all input tokens in parallel, computing the Query, Key, and Value matrices for each token simultaneously. This phase is compute-bound, meaning the GPU’s computational throughput determines performance.

During prefill, the attention mechanism performs matrix-matrix multiplications, which GPUs excel at. Each input token can attend to itself and every token before it (a causal mask hides later positions), and the model computes attention scores for all of these positions in one batched operation.

# Prefill phase computation (pseudocode)
input_tokens = [token_1, token_2, ..., token_n]
# Process all prompt tokens at once, passing hidden states from layer to layer
hidden_states = embed(input_tokens)
for layer in model.layers:
    Q, K, V = layer.compute_qkv(hidden_states)
    attention_output = attention(Q, K, V)
    hidden_states = layer.feedforward(attention_output)

The prefill phase produces the first output token and builds the KV cache, which we will discuss shortly. Time to First Token (TTFT) measures how long this phase takes, directly impacting user experience since this is the wait time before seeing any output.

The decode phase begins after the first token is generated. The model then produces tokens one at a time, autoregressively. Each new token is computed based on all previous tokens, but only the latest token needs fresh Q, K, V computations.

This phase is memory-bound, not compute-bound. The GPU spends most of its time loading data from memory rather than performing calculations. Each iteration involves matrix-vector operations instead of matrix-matrix ones, which provides far too little computational work to saturate the GPU.

# Decode phase computation (pseudocode)
current_token = first_generated_token
while not done:
    # Compute Q, K, V only for the new token
    q_new, k_new, v_new = compute_qkv(current_token)

    # Append the new K, V to the cache and fetch the full history
    kv_cache.update(k_new, v_new)
    k_all, v_all = kv_cache.get()

    # Attention of the new token over all tokens seen so far
    attention_output = attention(q_new, k_all, v_all)

    next_token = generate_token(attention_output)
    current_token = next_token

Inter-Token Latency (ITL) measures the time between consecutive token generations in the decode phase. This metric determines how fast text streams to the user after generation begins.

The KV Cache

The KV cache represents one of the most important optimizations in transformer inference. Without it, generating 100 tokens would require recomputing attention for all previous tokens 100 times, wasting enormous computational resources.

During autoregressive generation, the Key and Value matrices for previously processed tokens never change. Only the Query matrix for the new token needs computation. By caching the K and V matrices from all previous tokens, we avoid recomputing them.

A simplified implementation, assuming PyTorch tensors of shape [batch, seq_len, dim], makes this concrete.

import torch

class KVCache:
    def __init__(self):
        self.cache_k = None
        self.cache_v = None

    def update(self, new_k, new_v):
        if self.cache_k is None:
            self.cache_k = new_k
            self.cache_v = new_v
        else:
            # Concatenate new K, V with cached values along the sequence dimension
            self.cache_k = torch.cat([self.cache_k, new_k], dim=1)
            self.cache_v = torch.cat([self.cache_v, new_v], dim=1)

    def get(self):
        return self.cache_k, self.cache_v

For each transformer layer and each attention head, the model maintains separate KV caches. When generating the nth token, the cache stores K and V matrices for all n-1 previous tokens.

The speedup from KV caching can be dramatic. Empirical tests show that generating 1,000 tokens with KV caching takes roughly 10 seconds, while without caching the same task takes roughly 50 seconds, nearly a 5x difference.

However, the KV cache comes with a memory cost, and it grows linearly with sequence length. For a 13-billion parameter model like LLaMA-2, each cached token requires approximately 1 MB of storage, so a 4,000 token context needs about 4 GB just for the cache, a substantial chunk of GPU memory on top of the weights themselves.
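A back-of-the-envelope calculation behind these numbers, using approximate Llama-2-13B dimensions (40 layers, hidden size 5120) and FP16 values:

num_layers  = 40       # transformer layers (approximate Llama-2-13B figures)
hidden_dim  = 5120     # model dimension
bytes_fp16  = 2        # bytes per FP16 value
context_len = 4000

# One K vector and one V vector per layer, per token
bytes_per_token = 2 * num_layers * hidden_dim * bytes_fp16
print(bytes_per_token / 1e6)                # ~0.8 MB per token
print(bytes_per_token * context_len / 1e9)  # roughly 3-4 GB for a 4,000-token context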

This memory pressure becomes severe with long contexts or large batch sizes. Modern systems employ several strategies to manage KV cache memory: quantizing the cache to lower precision (4-bit or 2-bit keys and values), using sliding window attention that only retains recent tokens, or implementing attention approximations that reduce cache requirements.
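As an illustration of the sliding-window idea, here is a minimal variant of the cache update from the sketch above; window_size is an assumed parameter, and production implementations are considerably more involved.

import torch

def update_sliding_window(cache_k, cache_v, new_k, new_v, window_size=4096):
    # Append the new K, V, then keep only the most recent window_size positions
    cache_k = torch.cat([cache_k, new_k], dim=1)[:, -window_size:]
    cache_v = torch.cat([cache_v, new_v], dim=1)[:, -window_size:]
    return cache_k, cache_v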

When I first experimented with running a model myself, I blamed the GPU for slow responses before noticing that the KV cache kept spilling out of GPU memory.

Every time a user typed a long prompt, latency skyrocketed. A single fix - reducing precision from FP16 to INT8 - cut our response times by more than half.

Matrix Multiplication

Matrix multiplication forms the computational heart of transformer inference. Every layer performs multiple matrix multiplications: computing Q, K, V from inputs, applying attention, and running the feed-forward network.

On GPUs, efficient matrix multiplication employs a tiling strategy. The large matrix operation gets divided into smaller tiles that fit in shared memory, reducing expensive global memory accesses.

Each thread block computes one output tile, stepping through the K dimension in tiles. This maximizes data reuse: once data loads into shared memory, all threads in the block can access it without additional global memory traffic.

Tensor Cores further accelerate this by performing entire small matrix multiplications in hardware. The programming model exposes 16x16x16 operations, but hardware executes them as multiple 4x4x4 operations automatically.

# Using Tensor Core operations
for each 16x16 output tile:
    for each 16x16 input tile along K dimension:
        # This becomes a single Tensor Core instruction
        output_tile += tensor_core_mma_16x16x16(A_tile, B_tile)

Precision and Quantization in Inference

LLM inference often operates at reduced precision compared to training. While training typically uses FP32 or BF16 precision, inference can use FP16, INT8, or even INT4 with minimal quality loss.

FP16 (16-bit floating point) cuts memory usage and bandwidth requirements in half compared to FP32. Tensor Cores achieve maximum throughput at FP16, making it the default precision for many inference deployments.

# Precision formats
FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits
FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits  
BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits
INT8: 8 bits for integer representation
INT4: 4 bits for integer representation

Quantization converts model weights and activations to lower precision formats. This requires careful calibration to maintain model quality. Post-training quantization analyzes activation distributions on representative data to determine optimal scaling factors.
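A minimal sketch of symmetric, per-tensor INT8 quantization to show the role of the scaling factor; production methods add calibration data, per-channel or per-group scales, and outlier handling.

import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0   # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale)).mean())  # small average reconstruction error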

A 7-billion parameter model at FP16 precision requires approximately 14 GB of memory (7B parameters × 2 bytes per parameter). Quantizing to INT4 reduces this to 3.5 GB, enabling inference on consumer hardware.

Quantization techniques like GPTQ and AWQ apply different scaling factors per channel or per group, preserving more information from the original weights. Some methods quantize weights but keep activations at higher precision, balancing quality and performance.

End-to-end Inference Flow

Step 1: Tokenization. Your prompt “Explain how transformers work” gets converted to token IDs by the tokenizer. The BPE algorithm splits this into subword units, producing something like [Explain, how, transform, ers, work].

prompt = "Explain how transformers work"
token_ids = tokenizer.encode(prompt)
# [22163, 703, 4659, 364, 990]

Step 2: Embedding lookup. Each token ID indexes into the embedding matrix, retrieving its corresponding embedding vector. If the model has 4096 dimensions, each token becomes a vector of 4096 floating-point numbers.

embeddings = embedding_matrix[token_ids]
# Shape: [5, 4096]

Step 3: Add positional encodings. The model adds positional information to the embeddings so the attention mechanism knows the order of tokens.

positions = [0, 1, 2, 3, 4]
positional_embeddings = positional_encoding[positions]
input_embeddings = embeddings + positional_embeddings

Step 4: Prefill phase. The input embeddings flow through each transformer layer. For a 32-layer model, this happens 32 times.

hidden_states = input_embeddings
for layer in model.layers:
    # Multi-head self-attention
    Q = hidden_states @ W_query
    K = hidden_states @ W_key
    V = hidden_states @ W_value
    
    attention_scores = (Q @ K.T) / sqrt(dim)
    # Causal mask: each token attends only to itself and earlier tokens
    attention_scores = apply_causal_mask(attention_scores)
    attention_probs = softmax(attention_scores)
    attention_output = attention_probs @ V
    
    # Store K, V in cache for this layer
    kv_cache[layer].update(K, V)
    
    # Residual connection and layer norm
    hidden_states = layer_norm(hidden_states + attention_output)
    
    # Feed-forward network
    ffn_output = feed_forward(hidden_states)
    
    # Residual connection and layer norm
    hidden_states = layer_norm(hidden_states + ffn_output)

Step 5: Generate first token. After the final layer, the hidden states get projected to vocabulary size through a linear layer, then softmax converts these logits to probabilities over all possible next tokens.
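A sketch of this projection and sampling step, in the same illustrative pseudocode style as the steps above (W_vocab, temperature, and sample are assumed names):

last_hidden = hidden_states[-1]          # hidden state of the final position
logits = last_hidden @ W_vocab           # Shape: [vocab_size]
probs = softmax(logits / temperature)    # temperature controls randomness
first_token = sample(probs)              # greedy decoding would take argmax(probs)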

Step 6: Decode phase. Now we generate tokens one at a time. For each new token, we only compute fresh Q, K, V for that token, retrieving cached values for all previous tokens.

Step 7: Detokenization. Finally, the sequence of token IDs gets converted back to text using the tokenizer’s vocabulary.

output_text = tokenizer.decode(generated_tokens)

This entire process repeats for every token generated, with the KV cache growing at each step. The decode phase continues until the model generates a stop token or reaches a maximum length limit.

Inference Serving Frameworks

Production LLM inference relies on specialized serving frameworks that handle batching, memory management, and optimization automatically.

vLLM implements PagedAttention for efficient KV cache management and continuous batching for high throughput. It achieves 2-4x higher throughput than naive implementations on the same hardware.

TensorRT-LLM from NVIDIA provides highly optimized kernels specific to NVIDIA GPUs, achieving near-theoretical peak performance. It includes techniques like in-flight batching and FP8 quantization support.

# Simplified vLLM usage
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    dtype="float16"
)

outputs = llm.generate(
    prompts=["Explain transformers", "What is AI?"],
    sampling_params=SamplingParams(max_tokens=100)
)

Text Generation Inference (TGI) from Hugging Face offers broad model support and features like continuous batching and token streaming. It provides a production-ready HTTP API for deploying models.

Each framework makes different tradeoffs between ease of use, performance, and model support. Choosing the right one depends on your specific requirements, hardware, and model architecture.

Performance Metrics and Monitoring

Understanding and monitoring inference performance requires tracking several key metrics.

Time to First Token (TTFT) measures prefill phase latency. This directly impacts user experience since users wait this long before seeing any output. Optimizing TTFT means efficient prompt processing, often through batch prefill or speculative decoding techniques.

Inter-Token Latency (ITL) measures the time between consecutive tokens during decode. Low ITL creates smooth streaming experiences. This metric depends heavily on memory bandwidth and KV cache efficiency.

Throughput, measured in tokens per second, indicates overall system capacity. High throughput means serving more users concurrently. Batching strategies significantly impact throughput.

# Performance monitoring (model.generate_* methods are illustrative placeholders)
import time
from statistics import mean

start_time = time.perf_counter()
first_token = model.generate_first_token(prompt)
ttft = time.perf_counter() - start_time

token_times = []
for i in range(num_tokens):
    token_start = time.perf_counter()
    token = model.generate_next_token()
    token_times.append(time.perf_counter() - token_start)

itl = mean(token_times)                       # Inter-Token Latency
throughput = num_tokens / sum(token_times)    # tokens per second

GPU utilization indicates how effectively the hardware is being used. Low utilization during decode suggests memory bottlenecks. Monitoring tools like nvidia-smi show GPU usage, memory consumption, and power draw in real-time.

Memory pressure, especially KV cache size, affects maximum context length and batch size. Tracking cache memory helps prevent out-of-memory errors and guides quantization decisions.
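If the serving stack is PyTorch-based, a rough view of GPU memory use is available directly from the framework; a sketch, assuming CUDA is available:

import torch

# Memory currently held by tensors vs. reserved by PyTorch's caching allocator
print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved")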

Footnote

LLM inference transforms text prompts into responses through a process involving tokenization, transformer layers with self-attention mechanisms, and autoregressive token generation.

There are two stages in practice: the model first handles your full prompt in parallel, then switches to generating tokens one by one, which shifts the bottleneck from math to memory access.

Key optimizations include KV caching to avoid redundant computation, batching to improve GPU utilization, and quantization to reduce memory pressure.

Thanks for reading. I hope the breakdown made the inner workings of inference a little clearer.

