If you have used ChatGPT, Gemini, or Claude, you have already formed an intuition about what these systems do. You type something in, and text comes back that feels coherent, knowledgeable, and sometimes eerily human. But the machinery underneath is simultaneously simpler and stranger than most people expect.
This article tears open that machinery and explains what a language model is doing at a mechanical level - why it produces the outputs it does, why identical inputs produce different outputs on different runs, and what “temperature” actually means beyond “a creativity dial.”
Next-token Prediction Machine
A large language model (LLM) is, at its most fundamental level, a function that takes a sequence of tokens as input and outputs a probability distribution over its entire vocabulary for what the next token should be. That is the complete description of the core operation. Everything else - the apparent reasoning, the conversational ability, the code generation - emerges from doing this one thing at enormous scale, across an enormous amount of training data.
Concretely, imagine you feed the model the tokens for “The quick brown fox”. The model does not produce the word “jumps”. It produces a table of probabilities: “jumps” might have a 42% chance, “sat” a 12% chance, “leaped” an 8% chance, and every other token in a 100,000-token vocabulary gets some non-zero slice of the remaining probability mass. The model then samples from that distribution to pick the next token. That token gets appended to the sequence, and the whole process repeats until a stop condition is reached.
This is called autoregressive generation. Each token generated becomes part of the input for the next prediction. The model is always asking the same question: “given everything I have seen so far, what token is most likely to come next?”
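The loop described above can be sketched in a few lines of Python. Here `model` is a stand-in for a real LLM forward pass, assumed to return a mapping from candidate tokens to probabilities; the shape of the loop, not the model, is the point.

```python
import random

def generate(model, tokens, max_new_tokens=50, stop_token=None):
    """Autoregressive generation: sample one token, append it, repeat.

    `model(tokens)` is a hypothetical stand-in for an LLM forward pass;
    it is assumed to return a dict mapping tokens to probabilities.
    """
    for _ in range(max_new_tokens):
        probs = model(tokens)                       # distribution over the vocabulary
        candidates, weights = zip(*probs.items())
        next_token = random.choices(candidates, weights=weights, k=1)[0]
        if next_token == stop_token:                # stop condition reached
            break
        tokens = tokens + [next_token]              # output becomes part of the input
    return tokens
```

Note how the sampled token is fed straight back in: the model never sees its own output as anything other than more input.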
What Training Actually Does
The model learns to produce these probability distributions by training on a massive corpus of text - essentially a large fraction of the written internet, books, code, and academic papers. During training, the model sees a sequence of tokens and tries to predict the next one.
When it is wrong, the error signal flows backward through the network (via backpropagation), nudging billions of internal parameters - the model’s “weights” - very slightly in the direction that would have made the correct prediction more probable.
After trillions of these updates, the model’s weights encode something remarkable: a compressed statistical model of how language works. It learns that “The Eiffel Tower is located in” is very frequently followed by “Paris,” that Python function definitions start with “def,” and that a sentence starting “To be or not to” almost certainly continues with “be.”
Crucially, the model does not have a memory of individual training examples. It has internalized statistical patterns. This is why it can generalise to novel inputs - it is not retrieving stored sentences, it is sampling from learned distributions.
Logits, Softmax, and Why Probabilities Matter
Before the model produces those clean probabilities, it produces raw scores called logits - one real number per token in the vocabulary. These logits are the raw output of the final linear layer in the neural network.
To convert logits to a probability distribution, the model applies the softmax function:

$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

where $z_i$ is the logit for token $i$ and the sum runs over every token in the vocabulary.
Softmax does two things. First, it exponentiates each logit, which amplifies differences: a fixed gap between two logits becomes a multiplicative factor of $e^{\text{gap}}$ between their probabilities. Second, it normalizes everything so that all probabilities sum to 1. The result is a valid probability distribution over the entire vocabulary.
To see this in action, imagine the model is predicting the next word after “The quick brown fox”. It generates raw logits for a tiny vocabulary of four words:
| Token | Logit ($z_i$) | Exponent ($e^{z_i}$) | Probability |
|---|---|---|---|
| “jumps” | 8.3 | 4023.9 | 90.7% |
| “leaped” | 6.0 | 403.4 | 9.1% |
| “sat” | 2.1 | 8.2 | 0.18% |
| “sleeps” | -1.5 | 0.2 | 0.005% |
| Sum | | 4435.7 | 100% |
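The table's arithmetic is easy to reproduce. A minimal softmax over those four logits, using only the standard library (real implementations subtract the maximum logit first for numerical stability, omitted here for clarity):

```python
import math

def softmax(logits):
    """Exponentiate each logit, then normalize so probabilities sum to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits for "jumps", "leaped", "sat", "sleeps"
logits = [8.3, 6.0, 2.1, -1.5]
probs = softmax(logits)
for token, p in zip(["jumps", "leaped", "sat", "sleeps"], probs):
    print(f"{token}: {p:.3%}")
```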
This distribution is what the model actually hands you before sampling. The entire drama of temperature, top-k, and nucleus sampling happens here, in how this distribution is manipulated before a token is drawn from it.
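Top-k and nucleus (top-p) sampling, mentioned above, are both filters applied to this distribution before drawing. A sketch of each, assuming the distribution is a plain dict of token-to-probability pairs:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p_threshold):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches p_threshold, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= p_threshold:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}
```

Both truncate the long tail of the vocabulary; nucleus sampling adapts the cutoff to how concentrated the distribution happens to be, while top-k uses a fixed count.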
Temperature
Temperature is the most misunderstood parameter in prompting. It is commonly described as “creativity” or “randomness,” which is technically correct but obscures exactly how it works. Understanding it precisely lets you use it deliberately.
Temperature $T$ is a scalar that divides the logits before the softmax is applied:

$$P(\text{token}_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

When $T = 1$, nothing changes. The probabilities are exactly what the raw softmax produced in our previous example.

By adjusting $T$, we can either “sharpen” or “flatten” the distribution:
| Token | Logit | Prob ($T = 1$) | Prob ($T = 0.5$) | Prob ($T = 2$) |
|---|---|---|---|---|
| “jumps” | 8.3 | 90.7% | ~99.0% | ~73.0% |
| “leaped” | 6.0 | 9.1% | ~1.0% | ~23.1% |
| “sat” | 2.1 | 0.18% | ~0.0% | ~3.3% |
| “sleeps” | -1.5 | 0.005% | ~0.0% | ~0.5% |
When $T < 1$ (e.g., $T = 0.5$), dividing by a fraction magnifies the logits. By cooling the temperature, the already-large difference between the top tokens becomes enormous. The model becomes nearly deterministic, overwhelmingly picking the single most likely token.
When $T > 1$ (e.g., $T = 2$), dividing by a large number shrinks the logits toward each other and flattens the distribution. By turning up the heat, the probability mass is spread more evenly. Previously unlikely tokens become plausible candidates, meaning the model will sample more surprising continuations.
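The table's sharpening and flattening can be checked directly. A minimal temperature-scaled softmax over the same four logits:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before softmax; T < 1 sharpens, T > 1 flattens."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.3, 6.0, 2.1, -1.5]   # "jumps", "leaped", "sat", "sleeps"
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: top token gets {probs[0]:.1%}")
```

The same model, the same logits: only the shape of the distribution you sample from changes.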
This has a practical implication that is easy to miss: temperature does not change what the model knows or how it reasons. It changes which region of the probability distribution you sample from. At low temperature you are exploiting the model’s most confident predictions. At high temperature you are exploring the tail of the distribution, which contains valid but unusual continuations - as well as incoherent ones.
A sensible mental model:
- $T = 0$ to $0.3$: near-deterministic output, good for code generation, factual Q&A, structured data extraction
- $T = 0.4$ to $0.8$: balanced, good for chat, summarisation, general-purpose use
- $T = 1.0$ to $2.0$: high diversity, good for brainstorming, creative writing, exploring unusual phrasings - but outputs become increasingly unreliable at the high end
Why Outputs are Inherently Probabilistic
If you ask a language model “What is 2 + 2?” you will get “4” back every time regardless of temperature, because the probability mass is so concentrated on that token that even high-temperature sampling almost never picks anything else. But for any prompt where multiple continuations are plausible, the model’s outputs are drawn from a probability distribution. Run the same prompt a hundred times and you will get a hundred slightly different outputs, sometimes substantively different.
This is not a bug. It is a direct consequence of how the model was trained. The training data contains enormous variation: different people express the same idea in thousands of different ways. The model has learned this variation. When you ask it to write an email or summarise a document, many different phrasings are reasonable, and the model reflects that.
The probabilistic nature of outputs has several practical consequences that experienced engineers learn the hard way:
- You cannot assume the model will always produce the same structure in its output, even with the same prompt. Strict output parsing must handle variation.
- The model can contradict itself across separate calls even with identical input. For anything requiring consistency, either use temperature 0 or implement validation logic.
- “It gave me a wrong answer” and “it gives wrong answers reliably” are very different failure modes. Always test across multiple runs before concluding a prompt works.
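One simple way to act on the last point is to run the same prompt several times and compare answers. A sketch, where `call_model` is a hypothetical stand-in for whatever client function you use to query an LLM (assumed to return a string):

```python
from collections import Counter

def majority_answer(call_model, prompt, n_runs=5):
    """Query the model n_runs times and keep the most common answer.

    `call_model(prompt)` is a hypothetical stand-in for a real LLM client
    call; it is assumed to return the model's answer as a string.
    """
    answers = [call_model(prompt) for _ in range(n_runs)]
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / n_runs   # the answer plus its agreement rate
```

A low agreement rate is itself a useful signal: it tells you the prompt sits in a region where many continuations are plausible, before any single wrong answer does.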
No Meta Knowledge
One thing that trips up newcomers: the model does not “decide” what to say in any cognitive sense. There is no inner monologue, no planning step where it outlines a response before writing it. Each token is generated one at a time, left to right, with no ability to revise earlier tokens once they are committed.
This is why “chain-of-thought prompting” - asking the model to reason step by step before giving a final answer - actually improves accuracy on complex tasks. By generating intermediate reasoning tokens, the model conditions later tokens on that reasoning. The scratch space is real and functional: writing “let me think step by step” into the output genuinely changes the distribution over subsequent tokens in a way that improves correctness. It is not theatrical.
It also explains why the model can “hallucinate” - generate confident-sounding but false text. Given a prompt that contextually expects a specific detail (an author name, a statistic, a URL), the model samples a plausible-sounding continuation from its learned distribution. That distribution was built on real text, but it was not indexed for factual accuracy. A plausible token is not the same as a true one.
What “the model knows” Actually Means
When engineers say a language model “knows” something, they mean the training corpus contained many examples where that piece of information appeared in context, causing the model’s weights to encode a strong prior toward continuations that express it. The model does not have a database of facts. It has a compressed, lossy encoding of co-occurrence statistics across hundreds of billions of tokens.
This matters in practice. The model is confident and coherent about things it has seen many times in training. It is unreliable about things that appeared rarely or were expressed inconsistently. It will confidently make up details in domains that are underrepresented in its training data, because the token-prediction machinery does not distinguish between “I learned this” and “I am pattern-matching to something plausible.”
Understanding this helps you design prompts appropriately. For tasks grounded in common knowledge, the model is a powerful accelerant. For tasks requiring precise factual recall, especially of specific numbers, citations, or recent events, the model needs to be treated as a starting point that requires verification.
Footnote
A language model is a next-token prediction machine trained by minimizing next-token prediction error over a large corpus. It outputs a probability distribution over its vocabulary at each step, and temperature controls how sharply peaked that distribution is before sampling.
Outputs are probabilistic because the model has learned the natural variation in human language. Understanding this - rather than treating the model as a search engine or a knowledge base - is the foundation for using LLMs effectively in production systems.