LFM2.5-1.2B vs LFM2-2.6B: Why We Chose the Smaller Model

Bigger isn't better. I ran the numbers. The 1.2B model wins.

I'm building a voice keyboard that runs on a Raspberry Pi 5. Fully local, voice-first. The constraint: a model that reasons well and fits under ~1GB RAM, because I'm also running Whisper and a second model on the same 4GB board. That left two Liquid AI options — LFM2-2.6B (technical report) and LFM2.5-1.2B-Thinking (Hugging Face).

On paper, the 2.6B sounds stronger, and I half-assumed more parameters meant better results. In practice, the 1.2B scores higher on everything I care about and runs more than twice as fast.

Here's the comparison.

Quick background on Liquid AI

Liquid AI builds the LFM (Liquid Foundation Models) family. Their whole thing is models that run well on-device — phones, laptops, embedded hardware — under tight latency and memory budgets. Not datacenter-first. Edge-first. They co-design architecture, training, and deployment (llama.cpp, ExecuTorch, vLLM, open weights).

When I went looking for something that could reason on a Pi without blowing the RAM budget, their stack was the obvious starting point.

The two models

| | LFM2-2.6B | LFM2.5-1.2B-Thinking |
|---|---|---|
| Parameters | 2.6B | 1.17B |
| Context | 32K | 32K |
| Training | 11T tokens (LFM2 pipeline) | 28T tokens + multi-stage RL |
| Variant | General Instruct (SFT + preference + merge) | Reasoning / thinking model |

LFM2.5 is the newer family. Same edge-first architecture, way more pre-training, plus reinforcement learning. Liquid positions the Thinking variant as on-device reasoning under 1GB — fits in ~900 MB on a phone. What used to need a datacenter now runs offline in your pocket.

They trained it with curriculum RL and doom-loop mitigation so it actually finishes reasoning instead of getting stuck repeating itself. That's a real problem with small thinking models — I've seen it happen. The Thinking variant is built for chain-of-thought and tool use, which is exactly how I use it.

Benchmarks: the 1.2B wins where it matters

I care about instruction following, reasoning, and enough knowledge to be useful. Not raw knowledge at any cost. Numbers below: LFM2-2.6B from the technical report (Tables 6 & 7), LFM2.5-1.2B-Thinking from the Hugging Face card.

Instruction following and reasoning

| Benchmark | LFM2-2.6B | LFM2.5-1.2B-Thinking |
|---|---|---|
| IFEval | 79.56 | 88.42 |
| IFBench | 22.19 | 44.85 |
| Multi-IF | 60.26 | 69.33 |
| GSM8K | 82.41 | 85.60 |
| MATH-500 | 63.60 | 87.96 |

Ahead on every single one. The gaps that jumped out: IFBench (+22.66), MATH-500 (+24.36), IFEval (+8.86). For a voice assistant that has to follow instructions precisely and reason step-by-step, that's the signal I needed.

Knowledge

| Benchmark | LFM2-2.6B | LFM2.5-1.2B-Thinking |
|---|---|---|
| MMLU | 64.42 | — |
| MMLU-Pro | 25.96 | 49.65 |
| GPQA Diamond | 26.57 | 37.86 |

No standard MMLU on the HF card for the 1.2B. But it leads on MMLU-Pro and GPQA Diamond. Where I can compare, the smaller model holds up or wins.

Bottom line: Instruction following, math, knowledge — the 1.2B takes it. For what I'm building, it's the better performer despite being less than half the size.

Inference: speed and memory on the Pi

On a Pi 5, throughput and memory are hard limits. Not preferences. Hard limits.

The LFM2 report gives the 2.6B numbers on CPU (llama.cpp, Q4_0). The 1.2B numbers come from the Hugging Face model card and Liquid's launch post — same table, same methodology. Both use llama.cpp, Q4_0, on CPU, with 1K prefill and 100 decode tokens.

Decode speed — what you feel when the model is talking

| Device | LFM2-2.6B (tok/s) | LFM2.5-1.2B-Thinking (tok/s) |
|---|---|---|
| Samsung Galaxy S25 (CPU, 4K prefix) | 30.0 | 70 |
| AMD Ryzen AI 9 HX 370 (CPU, 4K prefix) | 46.8 | 116 |

2.3x faster decode for the 1.2B. The Pi 5 is CPU-bound too, so the direction holds: the smaller model runs at comfortable latency. The bigger one would feel sluggish.
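To make those rates concrete, here's a quick back-of-the-envelope calculation using the S25 decode figures above and an assumed 100-token spoken reply (the reply length is my assumption, not a measured value):

```python
# Rough response-streaming time from the decode rates above.
# 100 output tokens is an assumed length for a short voice reply.
def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to stream `tokens` output tokens at a given decode rate."""
    return tokens / tok_per_s

reply_tokens = 100
t_small = decode_seconds(reply_tokens, 70.0)  # LFM2.5-1.2B-Thinking, S25 CPU
t_large = decode_seconds(reply_tokens, 30.0)  # LFM2-2.6B, S25 CPU

print(f"1.2B: {t_small:.1f}s   2.6B: {t_large:.1f}s")  # 1.2B: 1.4s   2.6B: 3.3s
```

Roughly a second and a half versus over three seconds for the same reply. That's the latency gap you actually hear.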

Prefill speed

| Device | LFM2-2.6B (tok/s) | LFM2.5-1.2B-Thinking (tok/s) |
|---|---|---|
| Samsung S25 (CPU) | 116 | 335 |
| AMD Ryzen HX 370 (CPU) | 1,171 | 2,975 (1K prefill) |

The 1.2B wins by a lot. Faster prefill means the first token shows up sooner after you stop speaking. On a voice-first device, that gap is the difference between "responsive" and "is it frozen?"

Memory

  • LFM2-2.6B (Q4_0): ~1.5-1.6 GB for 32K context.
  • LFM2.5-1.2B-Thinking (Q4_0): 719 MB (S25), 856 MB (Ryzen, full context).

I need to fit Whisper, the app, and a second model (Instruct) on a 4GB Pi. The 1.2B fits with headroom. The 2.6B would be tight or over budget.
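The budget math, as a sketch. The model footprints are the figures above; the Whisper and app/OS numbers are rough assumptions of mine, not measurements:

```python
# RAM budget sketch for the 4 GB Pi 5. Model sizes come from the
# figures above; the Whisper and app/OS entries are assumptions.
BUDGET_MB = 4096

def fits(components: dict[str, int], headroom_mb: int = 256) -> bool:
    """True if the components sum under budget with headroom to spare."""
    return sum(components.values()) + headroom_mb <= BUDGET_MB

with_12b = {
    "thinking_model": 856,   # LFM2.5-1.2B-Thinking, Q4_0, full context
    "instruct_model": 719,   # second model (assumed similar footprint)
    "whisper": 500,          # assumption: small Whisper variant
    "app_and_os": 1024,      # assumption: OS + app overhead
}
with_26b = dict(with_12b, thinking_model=1600)  # swap in LFM2-2.6B, Q4_0

print(fits(with_12b), fits(with_26b))  # True False
```

Under these assumptions the 1.2B stack clears the budget with room to spare, while swapping in the 2.6B pushes it over.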

Picking a model here is like picking a car for a narrow garage. Horsepower doesn't help if it doesn't fit.

How it runs in practice

The voice keyboard runs two models:

  • Instruct (LFM2.5-1.2B-Instruct) for direct chat and Clarify mode.
  • Thinking (LFM2.5-1.2B-Thinking) for Converse mode when the user turns on reasoning or tools.

That split — which I arrived at after trying the alternative — matches Liquid's own guidance. They recommend Thinking for agentic and reasoning-heavy work (tool use, math, planning sequences) and Instruct for chat and creative writing.

Compared to Instruct, the Thinking variant jumps on exactly the benchmarks I need in Converse: math (63 to 88 on MATH-500), instruction following (61 to 69 on Multi-IF), tool use (49 to 57 on BFCLv3). I'm not running two models for the sake of it. Each one is built for its job.
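The routing itself is simple enough to sketch in a few lines. This is a minimal illustration of the split described above; the mode names and flags are my own labels, not a real API:

```python
# Minimal model-routing sketch for the two-model setup above.
# Mode names and function signature are illustrative, not a real API.
def pick_model(mode: str, reasoning: bool = False, tools: bool = False) -> str:
    """Route a request to the Instruct or Thinking variant."""
    if mode == "converse" and (reasoning or tools):
        # Reasoning-heavy or tool-using requests go to the Thinking model.
        return "LFM2.5-1.2B-Thinking"
    # Direct chat and Clarify mode stay on the Instruct model.
    return "LFM2.5-1.2B-Instruct"

print(pick_model("clarify"))                   # LFM2.5-1.2B-Instruct
print(pick_model("converse", reasoning=True))  # LFM2.5-1.2B-Thinking
```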

The Thinking model has to: (1) follow instructions precisely, (2) reason step-by-step and use tools, (3) run at low latency with 32K-65K context, (4) stay under ~1GB so both models and Whisper can coexist. On (1) and (2), the 1.2B beats the 2.6B on every benchmark I care about. On (3) and (4), it's faster and lighter.

The smaller model wins on every axis that matters for this device.

Why the 1.2B

  • Better quality where it counts. Instruction following and reasoning ahead of the 2.6B. Knowledge holds up or wins.
  • 2x+ faster inference. Decode and prefill on CPU. Responses start sooner, stream faster.
  • Under 1GB memory. Leaves headroom for the second model and the rest of the stack.
  • Same 32K context. No tradeoff on length.
  • Built for reasoning. Chain-of-thought, tool use, the whole thinking pipeline. I'm working with the design, not against it.
  • Efficient at test time. Liquid reports fewer output tokens for the same or better quality vs. other thinking-mode models. Less verbosity, faster answers.

I'm not choosing smaller for the sake of smaller. I'm choosing the model that scores higher on my target benchmarks and runs better on my hardware. That model happens to be half the size.

For the full breakdown of how I run this on a Pi 5 — quantization, context window, inference settings — see the llama.cpp optimization guide.

References