DPO Fine-Tuning a 1.2B Model: What Worked, What Broke, What I'd Skip

I spent two weeks fine-tuning a 1.2B parameter model with DPO and LoRA. Four iterations. ~400 training samples. The model got better at style and worse at reasoning. The base model won.
Here's everything I learned.
The goal
LFM2.5-1.2B-Thinking is good at reasoning. But its <think> blocks are verbose, sometimes circular, and occasionally drift off-topic. I wanted to fine-tune it to produce tighter reasoning — more focused chains of thought that still arrive at correct answers.
Secondary goal: improving tool call format compliance. The base model sometimes outputs tool calls in slightly wrong formats. I wanted to bake the correct format into the weights instead of relying on post-processing to fix it.
The setup
- Method: DPO (Direct Preference Optimization)
- Framework: TRL 0.27.2 + Transformers 5.1.0 + PyTorch 2.5.1
- Adapter: LoRA with rank 16

```json
{
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "target_modules": ["q_proj", "in_proj", "out_proj", "v_proj", "k_proj"]
}
```

Training params:
- Batch size: 5
- Epochs: 3
- Max steps: 9
- Learning rate: default TRL DPO settings
Hardware: Not on the Pi — fine-tuning ran on a GPU instance. The Pi is for inference only.
The training data
~400 DPO pairs in JSONL format. Each pair:
- Chosen: The response I wanted (tight reasoning, correct format, useful output)
- Rejected: The response I didn't want (verbose, circular, format violations)
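Concretely, each line of the JSONL file is one pair with `prompt`/`chosen`/`rejected` keys — the column layout TRL's DPOTrainer expects for preference data. The pair below is an invented illustration, not a sample from the real dataset:

```python
import json

# One DPO preference pair; the content is illustrative, not from the real dataset
pair = {
    "prompt": "What's 15% of 240?",
    "chosen": "<think>10% of 240 is 24 and 5% is 12, so 36.</think>The answer is 36.",
    "rejected": (
        "<think>Let me think about this. 15% means 15 per 100. So I need... "
        "actually, let me restart. 10% of 240 is 24. And 5% is half of that, "
        "which is 12. So 24 + 12 = 36. Let me double-check that...</think>"
        "The answer is 36."
    ),
}

# Each line of the JSONL file is one such pair
line = json.dumps(pair)
```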
Most pairs were hand-curated from real conversations with the base model. I'd run a query, get a mediocre response, then write a better version. Tedious but precise.
Categories covered:
- Tool call routing (calculator, search, memory)
- Multi-step reasoning (trip planning, cost estimation)
- Context resolution (pronouns, references to previous conversation)
- Direct questions (facts, definitions, how-tos)
- Negation handling ("don't search, just tell me")
Iterations 1-3: gradual improvement
The first three iterations followed the same loop:
- Train on the current sample set
- Evaluate on 40 test queries
- Fix bad outputs manually
- Add the fixed versions to training data
- Repeat
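The loop is easy to mechanize. A minimal harness sketch, with `generate` and `is_correct` as hypothetical stand-ins for the model call and the answer checker:

```python
def evaluate(generate, test_queries, is_correct):
    """Run the test suite and return failures for manual curation.
    `generate` and `is_correct` are placeholders for the model call
    and the answer checker."""
    failures = []
    for query in test_queries:
        response = generate(query["prompt"])
        if not is_correct(response, query["expected"]):
            failures.append({"prompt": query["prompt"], "rejected": response})
    return failures

def to_dpo_pair(failure, chosen_text):
    """A failure becomes a DPO pair once the 'chosen' side is hand-written."""
    return {"prompt": failure["prompt"], "chosen": chosen_text,
            "rejected": failure["rejected"]}
```

Every failure the harness collects is half of a new DPO pair; the hand-written fix becomes the chosen side.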
By iteration 3, the style was noticeably better:
- Shorter <think> blocks
- Less repetition
- Cleaner transitions from reasoning to answer
- More natural language (fewer bullet lists in thinking)
Tool call format compliance also improved. The model was hitting the exact JSON schema I wanted about 90% of the time, up from maybe 75% with the base.
This felt like progress. It was.
Iteration 4: where accuracy dropped
Iteration 4 added more aggressive preference pairs — cases where the base model's reasoning was okay but not great, and I wrote tighter alternatives.
The model learned the style perfectly. <think> blocks were concise, well-structured, and read like I wrote them.
But reasoning accuracy dropped.
On my 40-query test suite:
- Base model: 39/40 correct (with post-processing)
- Iteration 3: 38/40 correct
- Iteration 4: 34/40 correct
Six regressions. All in multi-step reasoning. The model would produce beautiful, confident reasoning chains that arrived at wrong answers. It looked smarter while being dumber.
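Worth noting: on a 40-query suite, a drop like this is only borderline distinguishable from noise. A quick one-sided Fisher exact test (assuming independent queries), comparing iteration 4's 6 failures against the base model's 1:

```python
from math import comb

def fisher_one_sided(wrong_a, n_a, wrong_b, n_b):
    """One-sided Fisher exact test: probability of seeing >= wrong_a
    failures in group A by chance, given wrong_a + wrong_b failures total."""
    total_wrong = wrong_a + wrong_b
    total = n_a + n_b
    denom = comb(total, n_a)
    p = 0.0
    for k in range(wrong_a, total_wrong + 1):
        if n_a - k < 0 or total_wrong - k > n_b:
            continue
        p += comb(total_wrong, k) * comb(total - total_wrong, n_a - k) / denom
    return p

# Iteration 4: 6 wrong of 40; base model: 1 wrong of 40
p = fisher_one_sided(6, 40, 1, 40)
```

The p-value lands around 0.05 — suggestive of a real regression, but also a reminder that a 40-query suite is small for detecting drops of this size.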
What went wrong
DPO optimizes for preference. If your "chosen" responses are better-written but contain slightly different reasoning paths than what the model would naturally produce, you're training it to imitate your writing style at the expense of its own reasoning patterns.
At 1.2B parameters, the model doesn't have the capacity to learn new style AND maintain all its reasoning ability. Something gives. In my case, what gave was multi-step math and complex tool orchestration.
This is the alignment tax at small scale. The model gets more pleasant to interact with but loses edge-case accuracy. At 70B parameters, you probably wouldn't notice. At 1.2B, it's measurable.
Why more DPO fine-tuning doesn't mean better
There's an optimal point where the model has absorbed your style preferences but hasn't yet had its reasoning distorted. For this model and this dataset, that point was somewhere around iteration 2-3. I overshot it.
I converted iteration 4 to GGUF anyway (iteration_4_grpo_dpo.gguf) and tested it on the Pi. The style improvements were real and the regressions were in cases that my post-processing sanitizer could catch. So the fine-tuned model plus post-processing was slightly better overall than the base model plus post-processing.
But "slightly better" after two weeks of work isn't a great return.
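The post-processing sanitizer does format rescue, not reasoning rescue. A minimal sketch of the idea — the `name`/`arguments` schema and the tool list here are assumptions, not the production code:

```python
import json
import re

VALID_TOOLS = {"calculator", "search", "memory"}  # from the routing categories

def sanitize_tool_call(raw: str):
    """Extract the first JSON object from model output and validate it.
    Returns a clean dict, or None to fall back to a plain-text answer."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if call.get("name") not in VALID_TOOLS or not isinstance(call.get("arguments"), dict):
        return None
    return call
```

Anything that fails validation falls back to treating the output as a plain-text answer.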
Why I went back to base
I went back to the base LFM2.5-1.2B-Thinking model for production. Four reasons:
- The base model + post-processing hits 97% accuracy. Hard to improve on that with fine-tuning alone.
- Liquid AI updates the base model. When they release a new version, I'd need to re-fine-tune. That's a maintenance burden I don't want.
- The style improvements weren't worth the accuracy risk. Users care about correct answers. They don't notice if the reasoning block is 20% shorter.
- Prompt engineering was more effective per hour. The improvements from better tool names and simpler instructions were larger than the improvements from fine-tuning, and took hours instead of weeks.
What I'd do differently
Fewer samples, higher quality. 400 samples was too many. The model started overfitting to my writing patterns around sample 200. I'd aim for 100-150 very clean DPO pairs next time.
Never fine-tune reasoning by rewriting it. Instead of writing new reasoning chains (which teaches the model to think like me), I should have only fine-tuned on format and style cues — leaving the model's own reasoning process intact. This is the key mistake I made.
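In pair terms, that means chosen and rejected share the model's own reasoning text verbatim and differ only in the output format. An invented example:

```python
# The model's own reasoning, kept identical on both sides of the pair
reasoning = "<think>Unit conversion; route this to the calculator tool.</think>"

format_pair = {
    "prompt": "Convert 5 miles to kilometers.",
    # chosen: correct JSON tool-call format (the schema is an assumption)
    "chosen": reasoning + '{"name": "calculator", "arguments": {"expression": "5 * 1.609"}}',
    # rejected: same reasoning, wrong call format
    "rejected": reasoning + "calculator(5 * 1.609)",
}
```

The only gradient signal in a pair like this is the format; the model's reasoning distribution is left alone.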
Test every iteration on the full suite. I tested iterations 1 and 2 on a subset. If I'd run the full 40-query suite on each, I would have caught the accuracy drop earlier and stopped at iteration 2.
Use DPO for format, GRPO for reasoning. DPO is good at "make this look like that." It's not good at "reason better." For reasoning improvements, GRPO (Group Relative Policy Optimization) rewards correct answers rather than stylistic preferences. I didn't have the compute budget for GRPO at the time, but that's where a future attempt would start.
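GRPO needs a reward function instead of preference pairs. A sketch of a correctness reward — the `</think>`-splitting and exact-match check are placeholder answer-checking, not a real grader:

```python
import re

def correctness_reward(completions, expected_answers):
    """Reward 1.0 when the text after the </think> block matches the
    expected answer, else 0.0. The matching logic is a placeholder --
    real tasks need a proper checker (numeric tolerance, normalization, ...)."""
    rewards = []
    for completion, expected in zip(completions, expected_answers):
        answer = re.split(r"</think>", completion, maxsplit=1)[-1].strip()
        rewards.append(1.0 if answer == expected.strip() else 0.0)
    return rewards
```

Recent TRL versions ship a GRPOTrainer that accepts reward functions like this via its `reward_funcs` argument; check the docs for the exact signature on your version.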
DPO + LoRA training config
Here's the minimal setup that worked:
```python
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "in_proj", "out_proj", "v_proj", "k_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    per_device_train_batch_size=5,
    num_train_epochs=3,
    output_dir="./grpo_checkpoints/iteration_4",
    save_steps=500,
    logging_steps=10,
)

# Wiring it together (model, tokenizer, and dataset loading omitted):
# trainer = DPOTrainer(model, args=training_args, train_dataset=dataset,
#                      processing_class=tokenizer, peft_config=lora_config)
# trainer.train()
```

Training takes about 20 minutes per iteration on a single GPU. Convert to GGUF with llama.cpp's conversion scripts, then deploy to the Pi.
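One prerequisite the conversion scripts won't do for you: the LoRA adapter has to be merged back into the base weights first. A sketch using peft's `merge_and_unload` — the repo id and paths are placeholders, not the exact ones I used:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "LiquidAI/LFM2.5-1.2B-Thinking"      # placeholder model id
ADAPTER = "./grpo_checkpoints/iteration_4"  # LoRA checkpoint from training

# Merge the adapter into the base weights, then save a plain HF checkpoint
base = AutoModelForCausalLM.from_pretrained(BASE)
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()
merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("./merged_model")

# Then, from a llama.cpp checkout:
#   python convert_hf_to_gguf.py ./merged_model --outfile iteration_4.gguf
```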
DPO fine-tuning at 1.2B: the takeaway
Fine-tuning a 1.2B model is cheap and fast. But at this scale, the model's capacity is limited enough that style improvements come at the cost of reasoning. Test aggressively. Stop early. And seriously consider whether prompt engineering gets you there faster — in my case, it did.
For the voice keyboard, the base model won. The fine-tuned versions taught me what DPO can and can't do at small scale, but the shipping product runs on unmodified LFM2.5-1.2B-Thinking with good prompts and a solid post-processing layer.