System 2 Distillation: Porting DeepSeek-R1 Reasoning via Probability Mapping
The secret to 2026 AI is not running GPT-5. It is copying its brain into a 1.5B parameter model that runs locally on battery power.
Model distillation transfers the reasoning capabilities of a frontier teacher model into a smaller student model by mapping probability distributions instead of just final answers. Most distillation is surface-level imitation. To build a small model that actually thinks, we need System 2 distillation: moving the slow, deliberate reasoning of a frontier model into the fast weights of a Small Language Model (SLM). We aren't just copying what the teacher said; we are productizing the teacher's thought process.
Why Logits Matter More Than Labels
Most people think distillation means asking a massive model a million questions and training a small model on the answers. That is just supervised fine-tuning. True distillation maps the full probability distribution.
Stop training on hard labels. When you train an SLM on a single correct answer, you lose the signal found in the teacher's uncertainty. We use logit-based KL divergence to map the entire probability distribution from the teacher to the student.
This preserves the dark knowledge. The relative probabilities of every possible next token matter. By minimizing the KL divergence between the teacher's output distribution and the student's, the SLM learns not just what the answer is, but which wrong answers were almost right and why. Geoffrey Hinton mapped this out in his 2015 distillation paper. When a teacher model evaluates an image of a dog, it might output 90% dog, 9% cat, and 1% car. That 9% cat signal matters. It teaches the student that dogs and cats share structural features. We force the small model to mimic that exact distribution, capturing the nuance instead of just the correct label.
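A minimal sketch of that loss in plain Python, assuming toy logits over three classes. The temperature scaling and the T-squared correction follow the convention from Hinton's 2015 paper; the specific logit values are made up for illustration. (In a real training loop this is a one-liner on tensors, but the mechanics are clearer spelled out.)

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T softens the distribution,
    amplifying the 'dark knowledge' in the small probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): information lost when q is used to approximate p."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label distillation loss: push the student's softened
    distribution toward the teacher's. Scaled by T^2 to keep gradient
    magnitudes comparable across temperatures (Hinton et al., 2015)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p_teacher, p_student)

# Hypothetical teacher: confident but not certain over [dog, cat, car].
teacher = [4.5, 2.2, 0.0]   # roughly the 90% / 9% / 1% split above
aligned = [4.4, 2.1, 0.1]   # student that mimics the teacher's ranking
wrong   = [0.0, 0.0, 4.0]   # student that prefers "car"

assert distillation_loss(teacher, aligned) < distillation_loss(teacher, wrong)
```

Note that the loss rewards matching the whole distribution, including the 9% cat signal, not just getting the argmax right.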
Why SLMs Need Internal Reasoning Traces
I needed this for the Voice Keyboard. The hardware captures spoken dictation. Spoken dictation is messy. People stutter, restart sentences, and use filler words. Sending every keystroke to a cloud API introduces lag. It ruins the typing experience.
I needed a local 1.5B model to clean up the text instantly. Base 1.5B models fail at this task. They just transcribe the stutter.
The real leverage is distilling internal reasoning traces. Beyond mapping probability distributions, you extract the reasoning trace itself. The DeepSeek-R1 technical report outlines the exact mechanics. They force the teacher model to expose its internal logic inside explicit <think> tags before generating a final answer.
For example: <think> The user said 'I want to... no wait, let's go to the store'. The first part is a stutter. I should remove it. </think> Let's go to the store.
Distilling this teaches the student model the actual step-by-step cognitive process. You stop copying the output and start copying the work.
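Packed as training data, that example might look like the sketch below. I'm assuming the R1-style `<think>` tag convention; `build_trace_example` and the prompt wording are illustrative placeholders, not the actual Voice Keyboard pipeline.

```python
def build_trace_example(messy_input: str, thought: str, clean_output: str) -> dict:
    """Pack one dictation-cleanup pair into a trace-distillation SFT
    example: the completion contains the teacher's reasoning inside
    <think> tags, followed by the final cleaned text."""
    return {
        "prompt": f"Clean up this dictation: {messy_input}",
        "completion": f"<think>{thought}</think>{clean_output}",
    }

example = build_trace_example(
    "I want to... no wait, let's go to the store",
    "The first part is a stutter. I should remove it.",
    "Let's go to the store.",
)
```

The student is trained on the full completion, so it learns to produce the reasoning before the answer, and at inference time the tags can be stripped before the text reaches the user.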
To ensure the student isn't just hallucinating a logic path, we use Process-supervised Reward Models (PRMs). Unlike standard reward models that grade the outcome, PRMs grade the individual steps of the reasoning chain. This forces the SLM to maintain logical consistency through the entire thought process.
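The gating logic can be sketched like this. `score_chain` and the toy scorer are my own illustrative stand-ins; a real PRM is a trained model that scores (problem, partial-solution) pairs.

```python
def score_chain(steps, step_scorer, threshold=0.5):
    """Process supervision sketch: grade every intermediate step of a
    reasoning chain and gate on the weakest one, since a chain is only
    as sound as its worst step. `step_scorer` stands in for a trained
    process reward model."""
    scores = [step_scorer(step) for step in steps]
    return min(scores) >= threshold, scores

# Toy stand-in scorer for illustration only.
def toy_scorer(step: str) -> float:
    return 0.2 if "unsupported" in step else 0.9

steps = [
    "The user said 'I want to... no wait'. That is a restart.",
    "Drop the restarted fragment and keep the final intent.",
]
passed, scores = score_chain(steps, toy_scorer)
```

An outcome reward model would only look at the final answer; gating on the minimum step score is what catches a chain that reaches the right answer through a hallucinated middle step.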
Building a Cold-Start Curriculum
You cannot just dump advanced reasoning traces into a tiny model and expect it to learn. It usually collapses.
Getting started requires high-quality cold-start data for reasoning SFT. You don't need millions of examples. You need a few thousand perfect chains of thought. Use a frontier model to generate these traces, then filter them through your PRM. You can orchestrate the synthetic data generation pipeline with Distilabel and run the actual fine-tuning through Axolotl. This builds your foundation.
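The curation loop itself is simple. In this sketch, `generate_trace` and `prm_score` are placeholders for a frontier-model call and a trained PRM; in practice the orchestration maps onto a Distilabel pipeline, but the filtering logic is the same.

```python
def build_cold_start_set(prompts, generate_trace, prm_score, min_score=0.8):
    """Curate cold-start SFT data: generate one teacher trace per prompt
    and keep only the chains the PRM rates highly. Quality over volume:
    a few thousand survivors beat millions of unfiltered examples."""
    dataset = []
    for prompt in prompts:
        trace = generate_trace(prompt)       # frontier-model call (stub)
        if prm_score(trace) >= min_score:    # PRM filter (stub)
            dataset.append({"prompt": prompt, "trace": trace})
    return dataset
```

The filtered output is what you hand to the fine-tuning stage (Axolotl, in my setup).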
Recent research on pedagogical synthesis (arXiv:2602.12172) shows that small models need a curriculum. You have to build foundational logic before introducing complex formatting tasks. Once the SLM understands the format of the thought tags, you can scale the distillation through synthetic data generation.
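A curriculum can be as simple as staging the curated traces from easy to hard before SFT. In this sketch I use trace length as a crude complexity proxy; the staging idea comes from the curriculum claim above, but the proxy is an assumption and a real pipeline might rank by PRM score or step count instead.

```python
import math

def stage_curriculum(examples, n_stages=3):
    """Split curated trace examples into training stages, simple to
    complex. Complexity is proxied by trace length (an assumption);
    earlier stages build foundational logic before later ones introduce
    longer, formatting-heavy chains."""
    ranked = sorted(examples, key=lambda ex: len(ex["trace"]))
    size = math.ceil(len(ranked) / n_stages)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Each stage is then fine-tuned in order, so the SLM has learned the shape of a thought tag before it ever sees a long multi-step chain.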
The ROI of System 2 Distillation
The difference is night and day.
Before distillation, running a 1.5B model locally produced useless literal transcriptions. Sending the audio to a cloud provider took about 1.2 seconds per sentence.
Now the distilled 1.5B model runs entirely on-device. It cleans messy dictation with DeepSeek-R1 level logic. Latency dropped from roughly 1.2 seconds to 450 milliseconds, a reduction of about 62 percent. The recurring cloud cost is exactly zero.
The leverage here is purely structural. Distilling the 671B parameter DeepSeek-R1 model into a 1.5B parameter architecture yields a 99.7% reduction in total size. Yet according to the R1 benchmarks, that tiny 1.5B student model scores 83.9% on the MATH-500 reasoning benchmark. You get a model that runs locally on battery power at hundreds of tokens per second while retaining massive cognitive capability. This means native execution on an Apple Silicon M3 or a Snapdragon X Elite NPU. You effectively move the heavy compute cost from inference time to training time.