If you have only ever interacted with a language model through a chat interface, you have seen one layer of abstraction that hides a lot of engineering. Behind the friendly chat window, every interaction with a modern LLM is structured as a list of messages, each tagged with a role.
That role tagging is not cosmetic. It shapes how the model responds, how context is managed across multiple turns, and how application developers constrain and direct model behaviour at a structural level. Understanding this format is the difference between using an LLM and building reliably on top of one.
Why Roles Exist at All
Base language models - the kind trained purely on next-token prediction over raw text - do not have a natural concept of “conversation.” They continue text. If you feed a base model the string “What is the capital of France?”, it might continue with “What is the capital of Germany? What is the capital of Spain?” because that pattern appears frequently in quiz and FAQ content. The model is doing exactly what it was trained to do: predict plausible continuations.
Instruction-following models (the kind you interact with in production APIs) are fine-tuned on data formatted as conversations. During this fine-tuning, the model sees thousands of examples where a system context is followed by a user request and then a high-quality assistant response. The model learns to treat these structural cues as meaningful. It learns that text following a system prefix should be treated as persistent instructions, that text following a user prefix is a request to respond to, and that it is generating the text that follows the assistant prefix.
The three-role format is therefore not arbitrary. It emerged from how instruction tuning works, and every production-grade model from OpenAI, Google, Anthropic, and Meta has been trained to respect it.
The System Prompt
The system prompt is the foundational instruction layer of a conversation. It is written by the application developer, not the end user, and it sits at the top of the context before any user input arrives.
A well-crafted system prompt does several things:
- Defines the model’s persona and role (“You are a senior data analyst…”).
- Specifies output format constraints (“Always respond in valid JSON with the schema: …”).
- Establishes scope boundaries (“Only answer questions about our product documentation. Politely decline off-topic requests.”).
- Sets behavioural rules (“Never speculate. If you are uncertain, say so explicitly.”).
- Injects background context the model needs (“The current date is… The user’s subscription tier is…”).
The system prompt is processed before the first user message and its content persists through the entire conversation in the model’s context window. It is the most reliable lever you have for controlling model behaviour consistently across all turns.
One critical insight: the system prompt does not have magic authority in the way a configuration file has authority over software. The model has learned to attend to system content heavily because of how it was trained, but it is ultimately still performing token prediction.
A sufficiently adversarial user prompt can sometimes cause the model to deviate from system instructions - this is the class of vulnerabilities known as prompt injection. Never trust that a system prompt alone is a security boundary. Validate and sanitize outputs programmatically when the stakes are high.
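To make that concrete, here is a minimal sketch of programmatic output validation for an application that asks the model for JSON. The field name is hypothetical; the point is that the application code, not the prompt, enforces the contract.

import json

def validate_reply(raw: str) -> dict:
    # Reject anything that is not well-formed JSON carrying the field we expect.
    # json.loads raises json.JSONDecodeError on malformed output.
    data = json.loads(raw)
    if not isinstance(data, dict) or "reply" not in data:
        raise ValueError("model output missing required 'reply' field")
    return data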
Here is a minimal but structurally sound system prompt for a customer support application:
You are a support assistant for Acme Corp. Your job is to help customers with questions about their orders and account settings.
Rules:
- Only discuss topics related to Acme Corp products and services.
- If you cannot answer with certainty, say "I am not sure - let me connect you with a human agent."
- Never disclose internal pricing strategies or supplier information.
- Always address the customer by their first name if provided.
- Respond concisely. Aim for 2-4 sentences unless the customer asks for detail.
Notice that it defines role, scope, fallback behaviour, confidentiality constraints, and style. These five categories cover most of what a useful system prompt needs to specify.
The User Turn
The user turn is the input from the person or the system acting as a person. In a simple chatbot, this is what the human typed. In a programmatic pipeline, this is often constructed by application code - injecting a retrieved document, formatted data, or a templated instruction.
A common mistake is treating the user turn as a place to put everything. Developers sometimes cram persona, instructions, data, and the actual question into a single user message because they are not using the system prompt at all.
This works, to a point, but it conflates different layers of intent. The model is somewhat sensitive to where instructions come from, and instructions in the user turn carry less persistent authority than those in the system prompt. More importantly, when you start managing multi-turn conversations, conflation becomes a maintenance problem.
The user turn should contain:
- The actual request or question.
- Any data or documents that are specific to this request (e.g. “Here is the PDF text - summarise it.”).
- Context that is specific to this turn (e.g. “Given the plan we discussed above…”).
It should not contain:
- Persistent behavioural instructions. Those belong in the system prompt.
- Security-sensitive constraints. A user can modify their own messages; they cannot modify the system prompt (in a properly built application).
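In a programmatic pipeline, the user turn is typically assembled from a template. Here is a minimal sketch, where the helper name and the <document> delimiters are illustrative choices rather than anything the API requires:

def build_user_turn(question: str, document_text: str) -> dict:
    # Request-specific data plus the actual question go in the user turn.
    # Persistent behavioural rules stay in the system prompt.
    content = (
        "Here is the document relevant to this request:\n"
        "<document>\n"
        f"{document_text}\n"
        "</document>\n\n"
        f"Question: {question}"
    )
    return {"role": "user", "content": content}

Explicit delimiters around injected data make it easier for the model to distinguish the document from the instructions that surround it.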
The Assistant Turn
The assistant turn is the model’s previous response, injected back into the conversation for the next request. This is the mechanism that gives a language model what looks like memory in a multi-turn conversation.
Here is the part that surprises many developers: the model has no persistent state between API calls. Every call is stateless. The model does not remember the previous turn - you have to send it back. When you make a second API call in a conversation, your application must include the entire conversation history: system prompt, first user message, first assistant response, second user message, and so on. The model attends to all of it to generate the next response.
This has immediate engineering consequences:
- Token costs grow with conversation length. By the 20th turn you are sending roughly 20x the tokens of a single-turn call on every request, because the entire history rides along each time; summed over the whole conversation, cost therefore grows roughly quadratically.
- Context windows are finite budgets. Once the cumulative history exceeds the model's context window (measured in tokens), something has to give. Some APIs silently truncate the oldest messages; others return an error. Your application needs a strategy - sliding window, summarization, or selective pruning - before it hits that limit in production (a minimal sliding-window sketch follows this list).
- You control the history. Nothing forces you to inject the exact unmodified model response from the previous turn. Sophisticated applications summarize, compress, or filter history before injecting it. You can also inject synthetic assistant turns to steer the model’s subsequent behavior - a technique sometimes called “prefilling.”
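Here is a minimal sliding-window sketch for the context-budget problem. It approximates token counts with a characters-per-token heuristic so the example stays self-contained; a production system would use the model's actual tokenizer.

def trim_history(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    # Keep the system prompt, then drop the oldest user/assistant pairs
    # until the rough token estimate fits the budget.
    def estimate(msgs):
        return sum(len(m["content"]) for m in msgs) // 4  # ~4 chars per token

    system, rest = messages[:1], messages[1:]  # assumes messages[0] is the system prompt
    while rest and estimate(system + rest) > max_tokens:
        rest = rest[2:]  # drop the oldest user/assistant pair together
    return system + rest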
Here is what the message list looks like at the API level for a two-turn conversation:
messages = [
    {
        "role": "system",
        "content": "You are a helpful coding assistant. Be concise."
    },
    {
        "role": "user",
        "content": "What does the 'yield' keyword do in Python?"
    },
    {
        "role": "assistant",
        "content": "yield turns a function into a generator. Instead of returning a value and exiting, it pauses execution and hands a value back to the caller, resuming from that point on the next iteration."
    },
    {
        "role": "user",
        "content": "Can you show me a simple example?"
    }
]
The model receives all four messages as context. Its response to the final user message will be informed by everything above it - including the definition it already gave. This is why follow-up questions work at all.
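To make the statelessness concrete, here is a sketch of the application code around that list, assuming the official OpenAI Python client (the model name is illustrative):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full history above is sent on every call; the API does not remember
# the previous turn for you.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=messages,
)
reply = response.choices[0].message.content

# To continue the conversation, append the reply and the next user turn,
# then send the whole list again on the next call.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Does it work inside a for loop?"})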
How Format Maps to Raw Text
Models do not natively understand JSON or Python data structures. Before the model ever sees the message list, the API serializes it into a flat text sequence using a chat template. The format varies by model family. OpenAI’s ChatML format looks like this:
<|im_start|>system
You are a helpful coding assistant. Be concise.<|im_end|>
<|im_start|>user
What does the 'yield' keyword do in Python?<|im_end|>
<|im_start|>assistant
yield turns a function into a generator...<|im_end|>
<|im_start|>user
Can you show me a simple example?<|im_end|>
<|im_start|>assistant
The final <|im_start|>assistant header with no closing tag is the generation prompt - the cue that tells the model to start producing the assistant’s response. The model continues the text from this point.
Llama 2-style models use [INST] and [/INST] markers, while later Llama releases switched to header tokens. Anthropic's Claude historically used \n\nHuman: and \n\nAssistant: delimiters in its text-completion interface. The principle is the same: structured markers that the model was trained to respect, serialized into the flat token sequence the model actually sees.
When you use a hosted API, all of this serialization happens invisibly. When you run models locally using tools like llama.cpp or Ollama, applying the correct chat template yourself is your responsibility. Getting it wrong does not produce an error - it produces subtly degraded output, because the model’s behavior was fine-tuned against a specific format.
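If you do run models locally, the Hugging Face transformers library ships each model's chat template alongside its tokenizer, which is usually the safest way to apply it. A minimal sketch; the model name is illustrative:

from transformers import AutoTokenizer

# The tokenizer carries the chat template the model was fine-tuned on.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompt = tokenizer.apply_chat_template(
    messages,                    # the same list-of-dicts structure as above
    tokenize=False,              # return the serialized string, not token IDs
    add_generation_prompt=True,  # append the assistant header so the model responds
)
print(prompt)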
Practical Patterns for Production
A few patterns that experienced practitioners use consistently:
Separate persona from constraints. A system prompt that mixes “you are a friendly assistant” with “never discuss competitor products” is harder to maintain and debug than one with explicit sections. Use clear structural separation, even in plain text.
Test system prompt changes in isolation. The system prompt is a shared dependency for every conversation in your application. Changes to it are breaking changes. Version-control your system prompts and evaluate them on a representative set of test prompts before deploying.
Treat the user turn as untrusted input. Everything in the user turn could, in principle, be an attempt to override system instructions. This is not paranoia - it is the correct security model. Never interpolate user input directly into your system prompt. If you need to include user-provided data in the system prompt (a document they uploaded, for example), validate and sanitize it first.
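A minimal sanitization sketch; the marker list is illustrative and should match whatever control tokens your model family uses:

SPECIAL_MARKERS = ["<|im_start|>", "<|im_end|>", "[INST]", "[/INST]"]  # illustrative

def sanitize(user_text: str, max_chars: int = 20_000) -> str:
    # Strip markers the model could treat as structural, and cap length so a
    # single upload cannot consume the whole context budget.
    cleaned = user_text
    for marker in SPECIAL_MARKERS:
        cleaned = cleaned.replace(marker, "")
    return cleaned[:max_chars]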
Keep context history manageable. A context window of 128,000 tokens sounds generous until you realize that 20 turns of a rich conversation, with a substantial system prompt and retrieved documents, can fill it. Build context management into your architecture from the start, not as a retrofit.
Use assistant prefilling deliberately. You can inject the beginning of the assistant response to constrain the model’s output format. For example, if you need the model to always start with a JSON object, begin the assistant turn with { in your API call. The model will continue from that starting point. This is a low-overhead way to enforce structure without relying entirely on instruction following.
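A minimal prefilling sketch using the Anthropic Messages API, where a trailing assistant message is treated as the start of the model's response (the model name is illustrative):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=512,
    system="Extract the order details as a JSON object.",
    messages=[
        {"role": "user", "content": "I'd like to return order #4521, the blue kettle."},
        # Prefill: the model continues from this opening brace.
        {"role": "assistant", "content": "{"},
    ],
)
print("{" + message.content[0].text)  # the response does not repeat the prefilled prefix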
Footnote
Every interaction with a production LLM is a structured list of messages with roles - system, user, and assistant. The system prompt is the developer’s persistent instruction layer. The user turn is the request. The assistant turn is previous model output re-injected as context, because the model is stateless between calls.
Understanding this format and its constraints - token costs, context limits, injection risks - is foundational to building reliable applications on top of language models.