LFM2.5-1.2B vs LFM2-2.6B: Why We Chose the Smaller Model

Bigger isn't better. I ran the numbers. The 1.2B model wins.
I'm building a voice keyboard that runs on a Raspberry Pi 5. Fully local, voice-first. The constraint: a model that reasons well and fits under ~1GB RAM, because I'm also running Whisper and a second model on the same 4GB board. That left two Liquid AI options — LFM2-2.6B (technical report) and LFM2.5-1.2B-Thinking (Hugging Face).
On paper, the 2.6B sounds stronger. Half my brain assumed more parameters = better. In practice, the 1.2B scores higher on everything I care about and runs more than twice as fast.
Here's the comparison.
Quick background on Liquid AI
Liquid AI builds the LFM (Liquid Foundation Models) family. Their whole thing is models that run well on-device — phones, laptops, embedded hardware — under tight latency and memory budgets. Not datacenter-first. Edge-first. They co-design architecture, training, and deployment (llama.cpp, ExecuTorch, vLLM, open weights).
When I went looking for something that could reason on a Pi without blowing the RAM budget, their stack was the obvious starting point.
The two models
| | LFM2-2.6B | LFM2.5-1.2B-Thinking |
|---|---|---|
| Parameters | 2.6B | 1.17B |
| Context | 32K | 32K |
| Training | 11T tokens (LFM2 pipeline) | 28T tokens + multi-stage RL |
| Variant | General Instruct (SFT + preference + merge) | Reasoning / thinking model |
LFM2.5 is the newer family. Same edge-first architecture, way more pre-training, plus reinforcement learning. Liquid positions the Thinking variant as on-device reasoning under 1GB — fits in ~900 MB on a phone. What used to need a datacenter now runs offline in your pocket.
They trained it with curriculum RL and doom-loop mitigation so it actually finishes reasoning instead of getting stuck repeating itself. That's a real problem with small thinking models — I've seen it happen. The Thinking variant is built for chain-of-thought and tool use, which is exactly how I use it.
Benchmarks: the 1.2B wins where it matters
I care about instruction following, reasoning, and enough knowledge to be useful. Not raw knowledge at any cost. Numbers below: LFM2-2.6B from the technical report (Tables 6 & 7), LFM2.5-1.2B-Thinking from the Hugging Face card.
Instruction following and reasoning
| Benchmark | LFM2-2.6B | LFM2.5-1.2B-Thinking |
|---|---|---|
| IFEval | 79.56 | 88.42 |
| IFBench | 22.19 | 44.85 |
| Multi-IF | 60.26 | 69.33 |
| GSM8K | 82.41 | 85.60 |
| MATH-500 | 63.60 | 87.96 |
The 1.2B is ahead on every single one. The gaps that jumped out: IFBench (+22.66), MATH-500 (+24.36), IFEval (+8.86). For a voice assistant that has to follow instructions precisely and reason step-by-step, that's the signal I needed.
Knowledge
| Benchmark | LFM2-2.6B | LFM2.5-1.2B-Thinking |
|---|---|---|
| MMLU | 64.42 | — |
| MMLU-Pro | 25.96 | 49.65 |
| GPQA Diamond | 26.57 | 37.86 |
No standard MMLU on the HF card for the 1.2B. But it leads on MMLU-Pro and GPQA Diamond. Where I can compare, the smaller model holds up or wins.
Bottom line: Instruction following, math, knowledge — the 1.2B takes it. For what I'm building, it's the better performer despite being less than half the size.
Inference: speed and memory on the Pi
On a Pi 5, throughput and memory are hard limits. Not preferences. Hard limits.
The LFM2 report gives the 2.6B numbers on CPU (llama.cpp, Q4_0). The 1.2B numbers come from the Hugging Face model card and Liquid's launch post — same table, same methodology. Both use llama.cpp, Q4_0, CPU, with 1K prefill and 100 decode tokens.
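That methodology maps onto a standard `llama-bench` run. Here's a small sketch that builds the matching invocation; the model filename and thread count are placeholders, not values from either source.

```python
# Build a llama-bench command matching the reported methodology:
# CPU backend, Q4_0 quant, 1K prompt (prefill) tokens, 100 generated (decode) tokens.
def llama_bench_cmd(model_path: str, threads: int = 4) -> list[str]:
    return [
        "llama-bench",
        "-m", model_path,     # GGUF model file (Q4_0 quantization)
        "-p", "1024",         # prefill: 1K prompt tokens
        "-n", "100",          # decode: 100 generated tokens
        "-t", str(threads),   # CPU threads (the Pi 5 has 4 cores)
    ]

cmd = llama_bench_cmd("lfm2.5-1.2b-thinking-q4_0.gguf")
print(" ".join(cmd))
```

Running this on the Pi itself is the way to get real numbers for your own board rather than extrapolating from the phone and laptop figures below.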
Decode speed — what you feel when the model is talking
| Device | LFM2-2.6B (tok/s) | LFM2.5-1.2B-Thinking (tok/s) |
|---|---|---|
| Samsung Galaxy S25 (CPU, 4K prefix) | 30.0 | 70 |
| AMD Ryzen AI 9 HX 370 (CPU, 4K prefix) | 46.8 | 116 |
2.3-2.5x faster decode for the 1.2B. The Pi 5 is CPU-bound too, so the direction holds: the smaller model runs at comfortable latency. The bigger one would feel sluggish.
Prefill speed
| Device | LFM2-2.6B (tok/s) | LFM2.5-1.2B-Thinking (tok/s) |
|---|---|---|
| Samsung Galaxy S25 (CPU) | 116 | 335 |
| AMD Ryzen AI 9 HX 370 (CPU) | 1,171 | 2,975 (1K) |
The 1.2B wins by a lot. Faster prefill means the first token shows up sooner after you stop speaking. On a voice-first device, that gap is the difference between "responsive" and "is it frozen?"
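To make that concrete, here's a rough time-to-first-token and full-reply estimate using the S25 CPU numbers from the tables above as a stand-in (the Pi 5 will be slower in absolute terms, but the ratio is what matters). The 1K-token prompt matches the benchmark setup; the 60-token reply length is my own assumption.

```python
# Rough response-latency sketch using the S25 CPU numbers above.
# Assumptions: 1K-token prompt (the benchmark setup), 60-token reply.
PROMPT_TOKENS = 1000
REPLY_TOKENS = 60

def latency_s(prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time to first token, total reply time) in seconds."""
    ttft = PROMPT_TOKENS / prefill_tps        # prompt must be prefilled first
    total = ttft + REPLY_TOKENS / decode_tps  # then the reply streams out
    return ttft, total

ttft_26b, total_26b = latency_s(prefill_tps=116, decode_tps=30.0)
ttft_12b, total_12b = latency_s(prefill_tps=335, decode_tps=70.0)
print(f"2.6B: first token {ttft_26b:.1f}s, full reply {total_26b:.1f}s")
print(f"1.2B: first token {ttft_12b:.1f}s, full reply {total_12b:.1f}s")
```

Even on the phone, the 2.6B takes over eight seconds before the first token appears; the 1.2B is around three. On a slower board, that gap only widens.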
Memory
- LFM2-2.6B (Q4_0): ~1.5-1.6 GB for 32K context.
- LFM2.5-1.2B-Thinking (Q4_0): 719 MB (S25), 856 MB (Ryzen, full context).
I need to fit Whisper, the app, and a second model (Instruct) on a 4GB Pi. The 1.2B fits with headroom. The 2.6B would be tight or over budget.
Picking a model here is like picking a car for a narrow garage. Horsepower doesn't help if it doesn't fit.
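Here's the back-of-envelope budget I mean. The two model sizes are the Q4_0 numbers above; the Whisper and OS/app figures are my own rough assumptions, not measured values.

```python
# Back-of-envelope RAM budget for a 4GB Pi 5 (all figures in MB).
# Model sizes are the Q4_0 numbers above; Whisper and OS/app overhead
# are rough assumptions of mine, not measured values.
BUDGET_MB = 4096

stack = {
    "LFM2.5-1.2B-Thinking (Q4_0)": 856,   # worst case from the table (full context)
    "LFM2.5-1.2B-Instruct (Q4_0)": 856,   # assumed same footprint as Thinking
    "Whisper (assumed)": 500,
    "OS + app overhead (assumed)": 900,
}

used = sum(stack.values())
print(f"used {used} MB of {BUDGET_MB} MB, headroom {BUDGET_MB - used} MB")

# Swapping the Thinking model for the 2.6B (~1600 MB at Q4_0, 32K context)
# eats nearly all the remaining headroom:
with_26b = used - 856 + 1600
print(f"with 2.6B: {with_26b} MB, headroom {BUDGET_MB - with_26b} MB")
```

Under these assumptions, the 1.2B pair leaves roughly a gigabyte free; the 2.6B leaves a couple hundred megabytes, which is exactly the "tight or over budget" territory.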
How it runs in practice
The voice keyboard runs two models:
- Instruct (LFM2.5-1.2B-Instruct) for direct chat and Clarify mode.
- Thinking (LFM2.5-1.2B-Thinking) for Converse mode when the user turns on reasoning or tools.
That split — which I arrived at after trying the alternative — matches Liquid's own guidance. They recommend Thinking for agentic and reasoning-heavy work (tool use, math, planning sequences) and Instruct for chat and creative writing.
Compared to Instruct, the Thinking variant jumps on exactly the benchmarks I need in Converse: math (63 to 88 on MATH-500), instruction following (61 to 69 on Multi-IF), tool use (49 to 57 on BFCLv3). I'm not running two models for the sake of it. Each one is built for its job.
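The split above boils down to a tiny routing decision. This is a minimal sketch of that logic; the `Mode` enum and function name are my own framing of the setup, not an actual API from the app or from Liquid.

```python
# Minimal sketch of the two-model split: route each request to Instruct
# or Thinking based on mode. The Mode enum and pick_model are illustrative,
# not a real API.
from enum import Enum

class Mode(Enum):
    CHAT = "chat"          # direct chat -> Instruct
    CLARIFY = "clarify"    # Clarify mode -> Instruct
    CONVERSE = "converse"  # Converse mode -> Thinking when reasoning/tools are on

def pick_model(mode: Mode, reasoning_or_tools: bool = False) -> str:
    if mode is Mode.CONVERSE and reasoning_or_tools:
        return "LFM2.5-1.2B-Thinking"
    return "LFM2.5-1.2B-Instruct"

print(pick_model(Mode.CHAT))                               # -> LFM2.5-1.2B-Instruct
print(pick_model(Mode.CONVERSE, reasoning_or_tools=True))  # -> LFM2.5-1.2B-Thinking
```

The point of keeping the router this dumb is that each model only ever sees the workload it was trained for.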
The Thinking model has to: (1) follow instructions precisely, (2) reason step-by-step and use tools, (3) run at low latency with 32K-65K context, (4) stay under ~1GB so both models and Whisper can coexist. On (1) and (2), the 1.2B beats the 2.6B on every benchmark I care about. On (3) and (4), it's faster and lighter.
The smaller model wins on every axis that matters for this device.
Why the 1.2B
- Better quality where it counts. Instruction following and reasoning ahead of the 2.6B. Knowledge holds up or wins.
- 2x+ faster inference. Decode and prefill on CPU. Responses start sooner, stream faster.
- Under 1GB memory. Leaves headroom for the second model and the rest of the stack.
- Same 32K context. No tradeoff on length.
- Built for reasoning. Chain-of-thought, tool use, the whole thinking pipeline. I'm working with the design, not against it.
- Efficient at test time. Liquid reports fewer output tokens for the same or better quality vs. other thinking-mode models. Less verbosity, faster answers.
I'm not choosing smaller for the sake of smaller. I'm choosing the model that scores higher on my target benchmarks and runs better on my hardware. That model happens to be half the size.
For the full breakdown of how I run this on a Pi 5 — quantization, context window, inference settings — see the llama.cpp optimization guide.
References
- LFM2 Technical Report — LFM2-2.6B architecture, training, benchmarks (Tables 2, 3, 6, 7).
- LFM2.5-1.2B-Thinking: On-Device Reasoning Under 1GB — Liquid's launch post: positioning, training recipe (curriculum RL, doom-loop mitigation), when to use Thinking vs Instruct, inference table.
- LFM2.5-1.2B-Thinking on Hugging Face — Benchmarks, inference tables, recommended use (agentic tasks, RAG, data extraction).