Why RL Forgets Less: The Training Method Behind Better AI Models
Writing Team: Jun 22, 2026 8:00:01 AM
Every major AI model you use was shaped after pretraining by a post-training pipeline. The choices made in that pipeline — which methods, in what order, with what signals — determine whether the model that emerges is capable across a broad range of tasks or has quietly lost earlier abilities in exchange for the new ones it gained.
A recent technical analysis by researcher W.H. offers one of the clearer accounts of why these methods differ, what each one is actually doing to a model's probability distribution, and why a finding that initially seemed surprising — that students can outperform their teachers — turns out to follow logically from the geometry of on-policy training.
Key Points
- The core question is about distributions: Different post-training methods reshape a language model's probability distribution differently. Understanding which direction they pull — and from where — explains most of the qualitative differences in how they perform.
- SFT pulls toward an external target regardless of starting position: Supervised Fine-Tuning optimizes toward a fixed dataset with no built-in mechanism to preserve existing capabilities, which is why it forgets.
- RL stays close to what the model already knows: Because reinforcement learning trains on the model's own outputs, it implicitly constrains updates to stay near the current policy — finding the nearest task-solving solution rather than the closest match to an external dataset.
- On-Policy Distillation students outperform their teachers: In controlled experiments, students trained via on-policy distillation from an SFT teacher — a degraded model — forgot less than the teacher itself and matched performance of students trained from an RL teacher.
- The load-bearing ingredient is on-policy data: It isn't that RL is uniquely special or that KL penalties are doing the heavy lifting. The thing that pushes capability without erasing prior knowledge is training on the model's own outputs.
- No perfect algorithm exists yet: The ideal training method would combine the density of distillation signals, the unbiasedness of RL rewards, and the on-policy property of both. What that looks like as a concrete algorithm remains an open problem.
The Distributional View
The framing that makes the analysis work is treating a language model as a distribution over sequences. Every completion a model generates is a sample from that distribution. Post-training is the process of reshaping that distribution toward some desired behavior. The critical question becomes: what is the target distribution, and how directly does the training method define it?
This framing is not meant to be mathematically rigorous. The author presents it explicitly as a useful mental model. But it makes the differences between Supervised Fine-Tuning, Reinforcement Learning, and On-Policy Distillation intuitive in a way that technical descriptions alone often do not.
Supervised Fine-Tuning: A Pull Toward an External Target
SFT is the simplest case. You have a dataset — human-written, model-generated, or both — that exists before training begins. You run cross-entropy training, which pulls the model's distribution toward the dataset distribution. The starting model's distribution does not influence the direction of the pull. The loss function does not know where the model started, and it does not care.
This is why SFT has a natural catastrophic forgetting failure mode. If the dataset distribution is far from what the model already knew how to do, the model has no built-in mechanism to prefer a nearby solution. It is simply pulled toward the demonstrated tokens. Every token in the dataset receives uniform gradient pressure regardless of whether it is task-critical or stylistic noise. A mathematical operator and a filler phrase like "we can see that" receive equivalent treatment. Prior capabilities that were not represented in the training data face collateral pressure from updates aimed at unrelated tokens.
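To make the uniform-pressure point concrete, here is a minimal sketch. It assumes a toy three-token vocabulary and made-up logits; none of the numbers come from the analysis. It uses the standard fact that the cross-entropy gradient with respect to the logits is the predicted distribution minus the one-hot label. Nothing in that expression knows whether the label is a task-critical operator or a filler phrase.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_logit_gradient(logits, label):
    """Gradient of cross-entropy w.r.t. the logits: softmax(logits) - one_hot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

def grad_norm(grad):
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical positions: the model prefers token 0 at both, with identical logits,
# but the dataset demands a different token at each position.
operator_position = {"logits": [2.0, 0.0, 0.0], "label": 1}  # task-critical token
filler_position = {"logits": [2.0, 0.0, 0.0], "label": 2}    # stylistic filler

g_op = ce_logit_gradient(**operator_position)
g_fill = ce_logit_gradient(**filler_position)

# The two gradients have the same magnitude: the loss has no notion of importance.
print(grad_norm(g_op), grad_norm(g_fill))
```

In this toy setup the only thing that sets the size of the update is how far the model's prediction sits from the label, not how much the token matters for the task.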
Research by Diao et al. found that SFT datasets contain many tokens where the model is confident in its own prediction but is forced to fit a diverging ground-truth label, creating the broad, redundant updates that erode existing capabilities. Mukherjee et al. found that SFT induces dense parameter updates, and Yuan et al. found those updates to be more redundant: when parameters are pruned, SFT performance degrades more slowly than RL performance, which suggests that each updated parameter carries less information.
The author is careful to note that none of this makes SFT bad. It is highly effective for cold-start tasks where the expected output format needs to change drastically. The point is that its geometry — a direct pull toward an external target with no regard for the starting policy — creates specific failure modes that matter in specific contexts.
Reinforcement Learning: The Nearest Task-Solving Policy
Online RL looks different. The model generates samples from itself, those samples are scored by a reward function, and the model is updated to increase expected reward via policy gradient. There is no fixed external dataset. The target distribution is not defined arbitrarily in advance.
The author's preferred explanation for why RL forgets less draws on work by Shenfeld et al. Consider a simplified version with binary 0/1 rewards. A reward of 1 contributes positive training signal. A reward of 0 contributes nothing. In that case, reward acts as a filter — it looks very similar to rejection sampling. The implicit target distribution is the closest optimal policy to the current policy. Because training happens on samples generated by the current model, the update at every step constrains the training procedure to stay near where the model already is.
The practical consequence is that RL finds the nearest task-solving policy rather than the globally closest match to an external distribution. This is what produces the anti-forgetting behavior. Updates are concentrated in regions the model already visits, rather than spreading across the full parameter space in the direction of an external dataset.
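The filtering argument can be sketched with a toy example. It assumes a policy over five candidate answers with invented probabilities and an invented 0/1 reward, and it measures distance as KL from the target to the current policy (a choice made for this illustration). Keeping only the rewarded samples amounts to conditioning the current policy on success, and among all distributions supported on successful answers, that conditioned policy has the smallest KL to the current one.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def filtered_target(policy, reward):
    """Condition the policy on reward == 1, as rejection sampling would."""
    mass = sum(p for p, r in zip(policy, reward) if r == 1)
    return [p / mass if r == 1 else 0.0 for p, r in zip(policy, reward)]

# Hypothetical current policy over five candidate answers, and a binary reward.
policy = [0.50, 0.30, 0.15, 0.04, 0.01]
reward = [0, 1, 1, 0, 1]

# Implicit RL target: the current policy restricted to answers that solve the task.
rl_target = filtered_target(policy, reward)

# Hypothetical SFT target: an external dataset that only demonstrates answer 4,
# which also solves the task but is rare under the current policy.
sft_target = [0.0, 0.0, 0.0, 0.0, 1.0]

kl_rl = kl(rl_target, policy)
kl_sft = kl(sft_target, policy)
print(f"KL(RL target || policy)  = {kl_rl:.3f}")
print(f"KL(SFT target || policy) = {kl_sft:.3f}")
```

Both targets solve the task, but the filtered one keeps the relative preferences the model already had among correct answers, so it sits much closer to the starting point.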
Two other mechanisms reinforce this. First, RL has data-dependent regularization through advantage estimation: when rewards are normalized within a group of samples drawn for the same prompt, a group whose rewards vary more divides each advantage by a larger standard deviation, so its samples receive smaller updates and gradient magnitude falls when the model is uncertain. Second, as Mukherjee et al. found, RL updates a small but important subnetwork via sparse updates. When those parameters are pruned, performance degrades rapidly, suggesting the updates are informationally dense. Both effects concentrate learning on what matters rather than distributing it uniformly.
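The normalization effect can be sketched as follows. The sketch assumes the common GRPO-style form, reward minus group mean divided by group standard deviation plus a small epsilon; the reward values and the `eps` default are illustrative. The same raw reward gap above the group mean yields a smaller advantage in the group with more spread.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Two hypothetical groups of four samples each, both with mean reward 0.5.
low_spread = [0.4, 0.5, 0.5, 0.6]
high_spread = [0.0, 0.4, 0.6, 1.0]

adv_low = group_advantages(low_spread)
adv_high = group_advantages(high_spread)

# The sample with reward 0.6 sits 0.1 above the mean in both groups,
# but its advantage is scaled down in the group with more reward variance.
print("low-spread group: ", adv_low[3])
print("high-spread group:", adv_high[2])
```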
The author acknowledges that explicit KL regularization is often cited as the explanation, but finds it incomplete. RL retains its anti-forgetting behavior even when KL penalties are removed or significantly reduced, as is common in Reinforcement Learning with Verifiable Rewards (RLVR) settings, where practitioners frequently use GRPO rather than PPO and apply weaker constraints. The explanation that holds across those settings is the on-policy nature of the data, not the explicit penalty.
On-Policy Distillation: Where the Experiment Gets Interesting
On-Policy Distillation sits between SFT and RL. Like SFT, it uses a teacher signal. Like RL, the training data comes from the student's own outputs. At each step, the student generates completions from its current policy, and the teacher provides a learning signal by indicating how its own distribution differs from the student's.
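One way to picture a single OPD step is the sketch below. It assumes the per-token signal is the reverse KL between the student's and the teacher's next-token distributions, evaluated on prefixes the student sampled. This is a common formulation, but the analysis does not spell out the exact loss. The `student_probs` and `teacher_probs` functions are hypothetical lookup tables standing in for real models.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

def student_probs(prefix):
    """Hypothetical student: a fixed next-token distribution, for simplicity."""
    return [0.6, 0.3, 0.1]

def teacher_probs(prefix):
    """Hypothetical teacher: prefers 'b' right after an 'a', otherwise agrees with the student."""
    if prefix and prefix[-1] == "a":
        return [0.1, 0.8, 0.1]
    return [0.6, 0.3, 0.1]

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) over the next-token distribution."""
    return sum(s * math.log(s / t) for s, t in zip(p_student, p_teacher) if s > 0)

def sample_rollout(rng, max_len=8):
    """Sample a completion from the student's own policy (the on-policy part)."""
    prefix = []
    while len(prefix) < max_len:
        token = rng.choices(VOCAB, weights=student_probs(prefix))[0]
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix

def opd_token_losses(rollout):
    """Score every position of the student's rollout against the teacher."""
    losses = []
    for t in range(len(rollout)):
        prefix = rollout[:t]  # the prefix the student actually produced
        losses.append(reverse_kl(student_probs(prefix), teacher_probs(prefix)))
    return losses

rng = random.Random(0)
rollout = sample_rollout(rng)
losses = opd_token_losses(rollout)
for token, loss in zip(rollout, losses):
    print(f"{token!r}: {loss:.3f}")
```

The teacher is never asked to generate anything here. It only scores states the student reached on its own, so the loss is nonzero exactly where the student's behavior and the teacher's preferences diverge.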
The author trained two teacher models — one via SFT, one via RL — on a minimal code editing task where the model must fix a bug in a function without rewriting any other part of the code. The task was designed to test both generalization (training on one type of corruption, evaluating on different types) and catastrophic forgetting (measuring degradation on general code generation via LiveCodeBench).
The results were unexpected. Both OPD students, one trained from the SFT teacher and one from the RL teacher, performed nearly identically. More surprising, both slightly outperformed the RL teacher and significantly outperformed the SFT teacher on out-of-domain corruptions. Most striking of all, the OPD student trained from the degraded SFT teacher forgot less than the SFT teacher itself.
If the teacher's distribution were the primary driver of student behavior, the student trained from the SFT teacher should have inherited more of the SFT teacher's forgetting. It did not. The source of the training data — on-policy sampling from the student's own current distribution — mattered more than the teacher from which the signal came.
This result has practical implications. It means you can aggressively specialize a model through brute-force SFT, producing a degraded but highly capable specialist, and then use OPD to transfer that capability into a student model that preserves substantially more of its original abilities. The teacher does not need to be clean.
Why Students Can Outperform Teachers
This phenomenon is not new — Agarwal et al. showed that distillation students surpass teacher performance on mathematical benchmarks — but the explanation the author offers is useful. In traditional distillation, supervision comes from teacher-generated trajectories. The student may receive guidance on parts of the distribution it rarely visits, creating a mismatch between where supervision occurs and where the student's mistakes actually happen.
In OPD, the teacher gives advice on the student's own prefixes. The student generates a partial completion from its current state, and the teacher provides signal conditioned on that prefix. Supervision is concentrated on the mistakes the student actually makes, not the mistakes the teacher would make. The signal is more targeted than it would be from static teacher-generated data.
The second mechanism is that KL matching reshapes the distribution in ways that go beyond reproducing the teacher's individual token choices. The teacher distribution carries information about style, uncertainty, alternative continuations, and reasoning structure. Matching it can improve the student's overall sampling behavior without exactly copying the teacher's greedy outputs. This is easier to reason about in distributional terms than in terms of individual token predictions.
The Full Pipeline and What It Tells Us
Most state-of-the-art open models follow a Pretrain-SFT-RL-OPD pipeline. SFT is necessary after pretraining to establish adherence to format and basic instruction-following, without which RL cannot run efficiently. RL then develops specialized capabilities in domains where verifiable rewards exist. OPD merges those specialist capabilities into the final model.
The MiMo-V2-Flash technical report explicitly documents this pattern, comparing post-training approaches across domains. Math and code tasks favor RL, where verifiable rewards are clean and unambiguous. Creative writing and knowledge-heavy benchmarks benefit more from distillation-style methods, because reward there is noisier and an LLM judge introduces bias that RL's optimization pressure would amplify. In domains where the teacher was trained with RL, the final merged model almost always outperforms the teacher. In domains where the teacher itself was produced by distillation, the student occasionally underperforms.
Both GLM 5 and DeepSeek v4 have moved to using OPD to merge expert capabilities into the final model, with the final checkpoint never going through RL directly. The convergence toward expert merging as the final pipeline stage is notable across major model releases.
What This Means for AI Development
The analysis closes with an honest statement about where the field stands. The ideal post-training algorithm would combine the dense per-token signal of distillation, the low-bias reward structure of verifiable RL, and the on-policy property that both share. Nothing currently does all three cleanly.
Outcome-based RL rewards are sparse, which is why RL is computationally expensive. Process Reward Models that provide denser intermediate feedback have not trained efficiently at scale. Logit distillation from a teacher provides dense per-token signal but introduces bias that forces clipping mechanisms and messy hyperparameter tuning.
The shape of the problem is clearer than the solution. On-policy data is the load-bearing ingredient that allows capability improvement without catastrophic forgetting. The author's contribution is to pull together research across SFT, RL, and OPD to make that common thread visible and to show through experiments that the teacher matters less than the data-generation process — a finding with practical consequences for how specialized models are trained and how those specializations are transferred.
For organizations building or deploying AI systems, the practical takeaway is not that one training method is universally superior. It is that the choice of post-training method has specific, predictable effects on what a model retains versus loses, and that those effects follow from the geometry of how each method defines its target. Understanding that geometry is increasingly relevant as enterprises build specialized AI systems on top of foundation models and discover that what a model learned in fine-tuning is not always what it kept.
At Winsome Marketing, our AI strategy work increasingly includes exactly these questions about model behavior — not at the research level, but at the practical level of what it means for how AI tools are selected, customized, and evaluated. If you want help thinking through the AI infrastructure decisions that actually affect performance, we can help.


