LFM2.5-1.2B on Raspberry Pi 5: llama.cpp Optimization Guide

Most llama.cpp guides assume you have a beefy desktop or a Mac with 32 GB of unified memory. I'm running on a Raspberry Pi 5 with 8 GB of RAM and four ARM cores. Every setting matters.

This post covers the specific llama.cpp configuration I use for LFM2.5-1.2B on a Pi 5, why each value is what it is, and what happens when you get them wrong.

The baseline config

from llama_cpp import Llama
 
llm = Llama(
    model_path="../llama.cpp/models/LFM2.5-1.2B-Instruct-Q5_K_M.gguf",
    n_ctx=65536,
    n_threads=4,
    verbose=False,
)

Four parameters. Let me walk through each one.

Quantization: Q5_K_M

The model is LFM2.5-1.2B at Q5_K_M quantization — 5-bit with k-quant mixed precision. Some layers get higher precision where it matters for quality.

Quantization   File size   RAM (model only)   Quality
Q4_0           ~670 MB     ~720 MB            Baseline
Q4_K_M         ~700 MB     ~750 MB            Better than Q4_0
Q5_K_M         ~800 MB     ~850 MB            Noticeably better instruction following
Q8_0           ~1.2 GB     ~1.3 GB            Best quality, tight on memory

I started with Q4_0 because Liquid AI's published benchmarks use it. But when I tested Q5_K_M, the difference in instruction following was clear — especially for structured output like tool call routing. The model was more likely to output the exact format I asked for. Worth the extra 130 MB.

Q8_0 would be ideal for quality but loading two models (Instruct + Thinking) at Q8_0 eats 2.6 GB just for weights. Add KV cache and Whisper and you're over budget on 8 GB. Q5_K_M is the sweet spot.

Context window: 65,536 tokens

n_ctx=65536

LFM2.5-1.2B was trained with up to 128K context, but I cap it at 65K. Three reasons.

Memory. KV cache scales linearly with context length. At 65K tokens, the KV cache for a 1.2B model is roughly 1-2 GB (varies with actual fill). At 128K, it doubles. With two models loaded, that's the difference between comfortable and swapping.

Practical usage. Voice conversations rarely exceed 10K tokens. The 65K window is for when the system injects conversation history, vault summaries, and tool results into the context. Even in heavy sessions, I've never hit 65K.

Latency. Larger context windows increase prefill time. At 65K, the first token after a long context injection takes a few seconds. At 128K, it's noticeably worse on ARM.

If I were on a 4 GB Pi, I'd drop to 32K or even 16K. The context window is the biggest memory lever you have after quantization.
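The linear scaling is easy to sanity-check with a back-of-the-envelope estimator. The architecture numbers below (layer count, KV heads, head dimension) are placeholders, not LFM2.5's real values; llama.cpp prints the actual ones at model load time, so plug those in:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Upper bound: one K and one V tensor per attention layer, each
    holding n_ctx * n_kv_heads * head_dim elements, fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Placeholder architecture numbers, NOT LFM2.5's real values.
print(kv_cache_bytes(n_ctx=65536, n_layers=16, n_kv_heads=4, head_dim=64) / 2**30)  # prints 1.0 (GiB)
```

Whatever the real per-layer numbers are, halving n_ctx halves the result, which is why the context window is such a large lever.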

Thread count: 4

n_threads=4

The Pi 5 has four Cortex-A76 cores. Setting n_threads=4 uses all of them for inference.

I tested n_threads=2 and n_threads=3. Both were measurably slower. No benefit to leaving cores free — Whisper isn't running during inference (they're sequential), and the UI thread is lightweight enough to share.

Don't set n_threads higher than your physical core count. On a Pi 5, that means 4. The Cortex-A76 has no SMT, so there are no extra logical cores to exploit; asking for 8 threads just adds scheduling overhead.
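If you want to verify this on your own board, a minimal timing harness is enough. Nothing below is llama.cpp-specific; it times any generation callable, and the commented loop is a sketch (MODEL is a placeholder path):

```python
import time

def bench_tokens_per_sec(generate, n_tokens=64):
    """Time a generation callable and return tokens/second.
    `generate(n)` should produce n tokens (e.g. a llama_cpp call)."""
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Sketch: reload the model at each thread count and compare.
# for n in (2, 3, 4):
#     llm = Llama(model_path=MODEL, n_ctx=4096, n_threads=n)
#     print(n, bench_tokens_per_sec(lambda k: llm("Hello", max_tokens=k)))
```

Use a prompt long enough that prefill doesn't dominate the measurement, and run each count a few times; thermal throttling on the Pi can skew a single run.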

Inference parameters

These vary by use case:

response = llm.create_chat_completion(
    messages=messages,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repeat_penalty=1.05,
    max_tokens=8192,
    stop=["<|im_end|>"],
    stream=True,
)

Temperature: 0.1

Low temperature for almost everything. Voice assistants need predictable, correct responses. I bump it to 0.3 for creative writing in Compose mode, but for Converse and Clarify, 0.1 keeps the model focused.

At 1.2B parameters, higher temperatures cause more format violations. The model has less headroom for randomness than a 7B or 70B model.

Top-p: 0.1

Aggressive nucleus sampling. Combined with low temperature, this makes output very deterministic. For tool routing and structured output, that's what I want.

Repeat penalty: 1.05

Light penalty. Small models are prone to repetition loops — especially in longer outputs. 1.05 is enough to prevent "the the the" without distorting the distribution.

Max tokens by mode

Mode       Max tokens   Why
Converse   8192         Long-form responses, tool results, reasoning
Clarify    2048         Rewrites and summaries of short text
Reasoner   1024         Analysis phase should be concise

Setting max_tokens too high doesn't cost anything if the model stops naturally (via <|im_end|>). But on a Pi, if the model gets confused and generates garbage, you want a ceiling. I learned this the hard way — an early version generated 16K tokens of repeated tool calls before I added the cap.
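One way to keep these caps in a single place is a per-mode settings helper. The names here are hypothetical, not the app's actual code; the values mirror the table and sampling settings above:

```python
# Hypothetical per-mode ceilings, mirroring the table above.
MAX_TOKENS = {"converse": 8192, "clarify": 2048, "reasoner": 1024}

def completion_kwargs(mode):
    """Shared sampling settings plus the mode-specific output ceiling."""
    return {
        "temperature": 0.1,
        "top_k": 50,
        "top_p": 0.1,
        "repeat_penalty": 1.05,
        "max_tokens": MAX_TOKENS[mode],
        "stop": ["<|im_end|>"],
    }

# Usage: llm.create_chat_completion(messages=messages, stream=True,
#                                   **completion_kwargs("clarify"))
```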

KV cache management

The most important optimization for multi-turn conversations:

llm.reset()  # Clear the KV cache

Between conversation turns? Don't reset. Let the cache accumulate — that's how the model maintains context.

Between roles in the thinking pipeline? Always reset. The Reasoner's chain-of-thought in the KV cache confuses the Planner. Clean slate for each phase.

Between modes? Reset. Switching from Clarify to Converse should start fresh.

The rule: reset when the system prompt changes. Keep when it doesn't.
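That rule is simple enough to encode in a thin wrapper. A minimal sketch (not the app's actual code), assuming llama-cpp-python's `reset()` and `create_chat_completion()`:

```python
class CacheAwareChat:
    """Reset the KV cache only when the system prompt changes;
    keep it across ordinary conversation turns."""

    def __init__(self, llm):
        self.llm = llm
        self._last_system = None

    def ask(self, system_prompt, messages, **kwargs):
        if system_prompt != self._last_system:
            self.llm.reset()  # mode/role switch: clean slate
            self._last_system = system_prompt
        full = [{"role": "system", "content": system_prompt}] + messages
        return self.llm.create_chat_completion(messages=full, **kwargs)
```

Centralizing the decision means a new mode or pipeline role gets the right cache behavior for free, instead of scattering `reset()` calls around the codebase.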

ChatML format

LFM2.5 uses ChatML tokens:

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_input}<|im_end|>
<|im_start|>assistant

The stop token <|im_end|> tells the model when to stop generating. Without it, the model keeps going — sometimes generating fake user turns and responding to itself. Always set your stop tokens.
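`create_chat_completion()` applies the chat template for you, but if you ever call the model with a raw prompt (`llm(prompt)`), you assemble the string yourself. A minimal helper for the template above:

```python
def chatml_prompt(system_prompt, user_input):
    """Assemble a raw ChatML prompt that ends at the open assistant
    turn, so generation continues as the assistant."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_input}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

Note the prompt deliberately leaves the assistant turn unclosed; the model supplies the closing <|im_end|>, which is exactly why it must be in your stop list.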

Python bindings vs llama-server

I use llama-cpp-python (direct Python bindings) instead of running llama-server as a separate process.

Why direct bindings:

  • No HTTP overhead (latency matters on Pi)
  • Direct KV cache control (llm.reset())
  • Single process — simpler to manage
  • Model stays loaded in the app's memory space

When llama-server makes sense:

  • Multiple clients need access to the same model
  • You want an OpenAI-compatible API
  • You're integrating with tools that expect HTTP endpoints

I do run llama-server for one thing: the SearXNG integration uses it on port 8081 for web answer synthesis. But the main voice pipeline uses direct bindings.

Streaming

Always stream. On a Pi, generation speed is 10-20 tokens/second. If you wait for the full response before displaying, the user stares at a blank screen for 5-15 seconds. Streaming shows tokens as they arrive — the first word appears in under a second after prefill.

for chunk in llm.create_chat_completion(..., stream=True):
    # The first delta carries the role rather than content; .get() covers it.
    token = chunk["choices"][0]["delta"].get("content", "")
    display(token)  # Push to UI immediately

Even more important for the Thinking model, which generates <think> blocks before the visible response. Streaming those reasoning tokens to a collapsible UI section gives the user something to watch while the model works.
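A sketch of that routing, assuming the <think> and </think> tags arrive as whole tokens (an assumption; if your tokenizer splits them across chunks, you'd need a small buffer before matching):

```python
def route_stream(tokens, show_answer, show_thinking):
    """Send tokens inside <think>...</think> to the reasoning sink,
    everything else to the visible answer."""
    thinking = False
    for tok in tokens:
        if tok == "<think>":
            thinking = True   # entering the reasoning block
        elif tok == "</think>":
            thinking = False  # back to the visible response
        elif thinking:
            show_thinking(tok)
        else:
            show_answer(tok)
```

The two sinks map naturally onto the collapsible UI section and the main response area.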

What I'd tell someone starting out

  1. Start with Q5_K_M. Not Q4_0. The quality difference is worth the extra memory.
  2. Set n_ctx to what you actually need. 65K is generous. 32K is fine for most use cases.
  3. Always set stop tokens. Small models will hallucinate extra turns without them.
  4. Reset KV cache between role changes. Not between turns.
  5. Stream everything. Perceived latency matters more than actual latency.
  6. Profile your memory. Run htop while the app is running. Know where your RAM goes.

The Pi 5 is surprisingly capable for inference. The trick isn't raw speed — it's making the right tradeoffs so the experience feels responsive even at 10-20 tokens per second.