Part 2 of 2. How the base model from the previous article is turned into the assistant you use every day.

In the first article, the model stopped in a strange place: it masters the statistics of language but is good for almost nothing. If you ask “What’s 2 + 2?”, it might keep inventing new questions instead of answering, repeat chunks of the dataset, or simply ramble. It completes sequences, that’s all. Nothing in it knows what an instruction is, what an assistant role is, or what an answer that ends at the right moment looks like.

Turning this base model into ChatGPT is the work of two more stages: supervised fine-tuning, which teaches assistant behavior, and reinforcement learning, which refines reasoning. Together with the pre-training we already saw, this is the full cycle for building a modern LLM.

LLM building pipeline: pre-training produces the base model, SFT teaches instruction-following and the assistant role, and reinforcement learning refines reasoning
The three stages that turn the base model into the assistant: pre-training, SFT, and reinforcement learning.

SFT: teaching the model to be an assistant

Supervised Fine-Tuning (SFT) is the first stage that turns the generic prediction engine into a system that follows instructions. It leverages something the base model already knows how to do, in-context learning, but in a systematic way, using curated examples to induce a consistent response pattern.

For this to work, the model needs to see explicit examples of how an assistant should behave. In 2022 OpenAI described this process in the paper “Training Language Models to Follow Instructions with Human Feedback”. The idea was to build a specialized dataset, much smaller than the pre-training material but far more qualified. It gathers human-written queries, hand-written model answers, complete structured dialogues between user and assistant, and examples that demonstrate the format, style, and organization expected of a good answer. Much of this material was collected by contracted freelancers, whose role was to serve as an explicit reference for ideal behavior.

There’s an important structural detail here. For the model to tell who is speaking and know where each message begins and ends, special tokens come in that act as markers. They indicate where a message starts and stops, which agent is speaking, how to separate metadata from content, and when to stop generating. In a simplified format, a conversation looks like this, with markers surrounding each turn:

SFT dataset example: a user turn and an assistant turn surrounded by markers such as im_start and im_end, followed by fine-tuning that adjusts the parameters to reproduce the reference answer
An SFT dataset example: special markers delimit each turn and fine-tuning teaches where the answer itself ends.

In practice these markers appear tokenized as im_start, im_sep, and im_end, and during fine-tuning they become part of the vocabulary. Systems like ChatGPT rely on them to format the prompt, monitor generation, and cut off the output when the end token appears. Without these markers, the base model has no way of knowing where the answer ends or how to separate its own speech from the user’s, which is why it tends to keep generating text endlessly, even inventing fictitious messages attributed to the user.

The training itself consists of presenting the model with the formatted prompt and the human reference answer, and adjusting the parameters to reduce the difference between what the model predicts and the target answer. This teaches the LLM to take on the assistant role, structure answers clearly, stay consistent throughout the conversation, respect the dialogue format, and reflect the style present in the dataset. It’s worth retaining a point many people get wrong: the model doesn’t create a personality, it replicates the personality implicit in the SFT examples. If you swap the dataset, you change the assistant’s behavior. OpenAI’s original dataset is closed, but there are public initiatives with the same logic, like OpenAssistant, which make exactly this clear: the assistant’s behavior comes from SFT, not from pre-training.

Reducing hallucination and connecting tools

SFT doesn’t just tune general behavior. It’s also the main stage for tackling two concrete problems: hallucination and the lack of access to external information.

Even after SFT, the model is still a token predictor. When it doesn’t have enough data about a subject, its natural behavior is to complete with something that looks plausible, even if it’s wrong. In smaller models this is even more common, because the capacity to store knowledge is reduced. There’s nothing intentional about it, it’s just the model fulfilling its basic function, generating the most likely continuation. Take a real example from the material: ask “Who is Joaquim Tadewald?” and the model might return an entire, detailed, convincing biography of a person who doesn’t exist.

The way to reduce this is to train the model with examples that show how to act when it doesn’t know. The process is direct: you build a set of questions for which the model would normally invent an answer, produce for them a correct and conservative answer along the lines of “I don’t have enough information about this”, and adjust the model to imitate that behavior. It learns a pattern, which is to answer conservatively when confidence is low. This doesn’t eliminate all hallucination, but it drops the frequency a lot and improves safety. Larger models trained this way learn to recognize when they don’t have reliable information and answer something like “I couldn’t find data on this person”.

The second skill taught here is knowing when to ask for help from an external tool, like a search engine or a calculator. And there’s a subtlety that matters a lot for anyone building AI systems: the model doesn’t run the tool, it only signals that it needs one.

Tool use flow: faced with a question that depends on an external source, the model emits a special token like search_tool, the system detects the token and runs the real action, the result returns to the context, and the model produces the final answer
The model doesn't run the tool: it emits a special token, the system runs the real action and returns the result to the context.

The model receives examples where the right answer depends on an external source, and in those examples the answer includes a special token, for instance search_tool. The system running the model detects that token and performs the real action, searching the internet or calculating. The result returns to the model, which then produces the final answer. What it learns is when to ask for help, not how to perform the action. Handling hallucination and integrating tools aren’t natural capabilities of an LLM, they’re behaviors taught during SFT, and that’s what sets the stage for the more advanced alignment that comes with reinforcement learning.

Tokens are the model’s reasoning

There’s a point that completely changes how you write prompts, and it starts from a simple question: how does the model “think”? The counterintuitive answer is that it doesn’t work out the entire answer internally and then write it. Reasoning happens as it generates the tokens. Each token produced is influenced by the ones that came before, and that chain works as its actual line of thought.

The example from the material makes this concrete. Consider the problem: “Emily buys 3 apples and 2 oranges. Each orange costs 2 reais. She paid 13 in total. How much does each apple cost?”. Compare two ways of answering.

Comparison between a short answer, which bets everything on a single probabilistic step with almost no intermediate context, and a structured answer, where each calculation step generates context that supports the next up to the final token
A short answer bets everything on a single step; a structured answer builds context that supports the final token.

The second form is almost always more accurate, even though it’s longer. The reason is that the model is autoregressive: each token depends only on the previous tokens. In the short answer, it has to produce the “3” with virtually no context, so the entire decision falls on a single probabilistic step. In the structured answer, the whole sequence of calculations and numerical relationships builds a robust context that pushes the probability toward the right token. The auxiliary reasoning isn’t decoration, it’s part of the prediction mechanism.

That’s where the effectiveness of instructions like “think step by step”, “show your work”, or “solve it in stages” comes from, no magic involved. They work because they increase the amount of useful tokens that support the result. In any task involving logic, calculation, or problem decomposition, asking the model to make its reasoning explicit before the conclusion improves accuracy considerably, reduces error paths, and even makes it easier for the model to detect its own inconsistencies along the way. You can sum up the core idea like this: the model doesn’t think before generating tokens, it thinks by generating tokens.

Reinforcement Learning: letting the model discover on its own

Even after pre-training and SFT, models still make mistakes on tasks that require technical reasoning, especially in STEM. They memorize a lot of knowledge but struggle to organize reasoning chains consistently, tending to cut corners, skip steps, and prioritize a fast answer. This isn’t a lack of data, it’s a limitation of the autoregressive dynamic itself. To tackle this point, Reinforcement Learning (RL) comes in.

The logic resembles how we learn a difficult subject. First you study the theory, then you see a worked example, and finally you consolidate by practicing on your own, making mistakes and correcting them. The model has already completed the first two stages: pre-training was theory at scale, SFT was learning by demonstration. What’s missing is active practice, and that’s the space RL fills.

The challenge is defining what counts as a good answer. We know a structured answer tends to work better, but to the model both the long and the short one are just token sequences. You can’t simply impose human preference, because what seems better to us isn’t always the most efficient pattern for the model’s internal dynamic. RL’s solution is elegant: let the model itself discover.

Reinforcement learning flow: for a problem with a verifiable answer, the model generates several alternative answers, the incorrect ones are discarded, among the correct ones it selects those with the leanest structure, and those become new examples that reinforce the behavior
In RL the model generates several answers, discards the wrong ones, and reinforces the leanest correct ones as new examples.

For each technical problem, the LLM generates several answers. Some are right, some aren’t. The wrong ones are dropped. Among the correct ones, training selects those that reach the result with the leanest, most efficient structure, and these become a new set of examples to reinforce the behavior. In practice it’s like instructing the model: “this style of reasoning led to the correct solution, replicate this strategy”. The power of this lies in letting the model learn reasoning patterns that nobody wrote by hand, patterns that emerge from its own behavior when exposed to structured challenges.

The preference for the most token-economical answer has nothing to do with aesthetics. In an autoregressive model, a shorter sequence usually corresponds to a more direct reasoning chain, with less statistical dispersion, which reduces variance and cuts down the alternative paths that lead to error. The end user doesn’t need to receive a short answer, the point is to train the model to internally organize cleaner reasoning. RL doesn’t replace the earlier stages, it’s the final refinement: it’s where the model starts analyzing its own performance, recognizing what works, and adjusting the parameters to repeat it, moving from merely reproducing observed patterns to developing a more reliable reasoning structure.

The current state and where to keep up

While pre-training and SFT are already well-standardized procedures in the industry, RL applied to LLMs is still rapidly changing terrain. There’s no dominant method or settled consensus, each lab develops its own approach, often novel and poorly documented.

One of the most striking advances in this landscape came from DeepSeek R1, released by a Chinese company. It demonstrated reasoning performance superior to that of several much larger commercial alternatives, despite having around 600 billion parameters, a number substantially smaller than that of top proprietary models. This shook the perception that only math scaled, the model performs a step of internal reflection before answering, generating a sequence of tokens that represents its reasoning process. This connects directly to the principle from the previous section: an LLM reasons by generating tokens, so letting it “think” more before answering increases accuracy. And the crucial detail is that these chains weren’t provided by hand, the model learned to produce them via RL, reinforcing the internal processes that led to correct answers.

A curious phenomenon observed in DeepSeek R1 was what the authors called the “aha moment”. During internal reasoning, the model tried multiple approaches, explored alternative paths, and, at a certain point, abruptly changed direction, as if it had noticed an important relationship or spotted an error in its own reasoning. Nobody programmed this, it emerged from the RL iterations. This ability to reorganize its own process let the model beat very demanding benchmarks like AIME 2024 and Codeforces, reinforcing in practice the principle that runs through both articles: the model reasons better when it has room to generate longer internal sequences and explore alternatives.

To keep up with this evolution, which changes almost every week, a few sources are worth it. lmarena.ai maintains a leaderboard where users rate answers without knowing which model produced each one, which yields a reliable ranking per category, text, vision, code, multimodality. The news.smol.ai newsletter brings frequent updates on releases, training techniques, and benchmark analysis. GitHub’s Trending tab, filtered by Python, shows which LLM-related projects are gaining traction, usually agent frameworks, local execution libraries, and automation tools. And the r/LocalLLaMA subreddit concentrates technical discussion about open-source models running locally, quantization, performance on consumer hardware, and custom finetuning.

Understanding this entire cycle, from Common Crawl to RL, takes AI out of the black-box category. Once you know that the model predicts token by token, that assistant behavior came from SFT, and that reasoning is built during generation, your prompt and architecture decisions stop being trial and error. That’s the next step I recommend: take a real problem of yours and test the ideas here, compare a direct answer with a structured one, watch when the model asks for a tool, measure where it hallucinates.

Next steps

This series covered the full path of an LLM, from raw data to assistant. Part 1 showed how the model goes from Common Crawl to the base model, through the Transformer, tokenization, and pre-training. This second part took that base model and showed how SFT teaches the assistant role, how hallucination and tool use are trained, and how reinforcement learning refines reasoning.

If you want to see this kind of model inside a real software architecture, it’s also worth reading Amazon Bedrock in practice: AI as part of the architecture, where AI stops being a concept and becomes part of a system in production.


This is the closing of a two-article series on the inner workings of LLMs. If the content helped you, follow me on LinkedIn and GitHub.