From 78% to 97%: Making a 1.2B Model Actually Work

Small language models are stubborn. They know what you want. They just won't do it.
I spent a weekend fighting with a 1.2 billion parameter model, trying to get it to pick the right tool for the job. Weather question? Use Lookup. Math question? Use Calculate. Memory question? Use Remember.
Simple. Except my model was right 78% of the time. That's a failing grade.
Here's how I got it to 97%.
The Setup
I'm building a voice-operated AI keyboard that runs entirely on a Raspberry Pi 5. No cloud. Everything local. Part of it is a "thinking mode" where the AI can use tools — search the web, do math, recall past conversations.
The model is LFM2.5-1.2B-Instruct from Liquid AI. It's part of their LFM2.5 family — models designed to run on-device under tight memory and latency budgets. 1.17B parameters, 32K context window, trained on 28 trillion tokens. The whole thing fits in under 1GB quantized (Q4_0), which is critical when you're sharing RAM with Whisper on a 4GB Pi. Liquid also offers a Thinking variant built for chain-of-thought reasoning and tool use, but for this post we're working with the Instruct model.
The architecture is simple:
- Planner generates a plan with tool calls
- Worker executes the tools
- Solver synthesizes the answer
The Planner's job is routing. User asks "what's 15% of 847?" → Planner outputs Calculate(expr="15% of 847"). User asks "what's the weather?" → Planner outputs Lookup(query="weather").
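The plan itself is plain text, so routing starts with parsing those one-line calls. A minimal sketch (the call syntax follows the examples above; the regex and helper function are illustrative, not the project's actual code):

```python
import re

# Matches one-line tool calls like: 1. Calculate(expr="15% of 847")
# The leading "1." step number is optional. Single string argument only;
# this mirrors the examples in the post, not a full grammar.
CALL_RE = re.compile(r'^\s*(?:\d+\.\s*)?(\w+)\((\w+)="([^"]*)"\)\s*$')

def parse_call(line: str):
    """Return (tool, {arg: value}), or None if the line isn't a tool call."""
    m = CALL_RE.match(line)
    if not m:
        return None
    tool, arg, value = m.groups()
    return tool, {arg: value}
```

The Worker only needs the tool name and its argument; everything else in the planner's output can be ignored.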
I built a test suite with 40 queries. 10 per tool category. Ranging from easy ("what is 7 times 8?") to hard ("compound interest on $1000 at 5% for 3 years").
First run: 78%.
What Went Wrong
I started with technical tool names:
- SearchMemory - search past conversations
- Python - execute calculations
- WebAnswer - search the web
- Fetch - read a URL
The model kept confusing them. "What did we talk about yesterday?" would route to WebAnswer. "Calculate my age" would sometimes hit SearchMemory.
Looking at the failures, I noticed a pattern. The model understood what to do. It just picked the wrong label.
Fix 1: Rename Everything
Small models learn from how humans talk. We don't say "please SearchMemory for our last conversation." We say "do you remember what we discussed?"
So I renamed the tools:
- SearchMemory → Remember
- Python → Calculate
- WebAnswer + Fetch → Lookup (merged into one)
That's it. Same functionality. Different names.
Result: 78% → 86%.
Eight points from renaming. The model was already trying to use human words - I just had to meet it halfway.
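Concretely, the rename is nothing but a different tool spec handed to the Planner. An illustrative before/after (names from this post; the descriptions and prompt rendering are my own paraphrase, not the real prompt wording):

```python
# Old API-style names mapped to the human-verb names that replaced them.
RENAMES = {
    "SearchMemory": "Remember",
    "Python": "Calculate",
    "WebAnswer": "Lookup",
    "Fetch": "Lookup",   # merged into one tool
}

# Tool descriptions as they might appear in the system prompt (paraphrased).
TOOLS = {
    "Remember": "recall past conversations and notes",
    "Calculate": "do math: tips, splits, dates, unit conversions",
    "Lookup": "find current information on the web",
}

def tools_prompt() -> str:
    """Render the tool list as a bullet list for the system prompt."""
    return "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
```

Same three capabilities underneath; the only thing that changed is the vocabulary the model sees.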
Fix 2: The Calibration Line
I was still getting waste. The model would over-plan. Simple weather question? It would output three tools when one would do.
I ran a sweep of 10 different "calibration lines" - single sentences added to the system prompt telling the model how to choose tools.
| Line | Accuracy | Waste |
|---|---|---|
| (none) | 85% | 35% |
| "Use your best judgment" | 90% | 18% |
| "Be precise. Pick the right tool(s)" | 85% | 32% |
| "Match the tool to the task" | 95% | 14% |
| "Wrong tools waste time. Choose wisely" | 95% | 30% |
The winner: "Match the tool to the task."
Simple language. Not threatening. Not vague. Just clear direction.
Fear-based prompts ("Wrong tools waste time") achieved high accuracy but caused over-caution: the model would add backup tools "just in case," keeping waste at 30%. And the abstract directive ("Be precise") didn't help at all - too vague for a small model to act on.
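The sweep itself is easy to automate. A sketch of the harness, with the model call stubbed out as a `run_planner` callable so it runs standalone; waste is counted here as extra tools per query beyond the first, which is my assumption about the metric:

```python
# Score each candidate calibration line on (accuracy, waste).
# `run_planner` takes (calibration_line, query) and returns the list of
# tools the planner chose - in practice a model call, here pluggable.
def sweep(lines, cases, run_planner):
    results = {}
    for line in lines:
        correct = extra = 0
        for query, expected in cases:
            tools = run_planner(line, query)
            if expected in tools:
                correct += 1                 # right tool appears in the plan
            extra += max(0, len(tools) - 1)  # tools beyond the one needed
        results[line] = (correct / len(cases), extra / len(cases))
    return results
```

Ten candidate lines times 40 queries is 400 cheap planner calls, which is what makes this kind of sweep practical even on a Pi.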
Fix 3: The Sanitizer Trick
Here's the thing nobody tells you about small models: sometimes they just give up on specifics.
My model would occasionally output this:
1. ToolName(query="weather in Tokyo")
Not Lookup. Not Calculate. Just... ToolName. The model understood it needed a tool. It just didn't commit to which one.
Instead of fighting this, I built a post-processor. When the planner outputs a generic ToolName, I infer the correct tool from the user's query using keyword patterns:
patterns = {
    'Calculate': [r'\d+\s*%', r'percent', r'split', r'convert', r'days until'],
    'Remember': [r'\bmy\b.*notes', r'yesterday', r'what did i', r'we discussed'],
    'Lookup': [r'weather', r'news', r'price of', r'who is', r'current'],
    'LLM': [r'explain', r'write', r'compare', r'help me'],
}

User asks "what's the weather?" → ToolName gets replaced with Lookup.
This sounds hacky. It is hacky. But it works because the query already contains the signal. I'm not guessing - I'm reading what the user explicitly asked for.
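Wrapped as a function, the sanitizer is only a few lines. A sketch (patterns abbreviated from the table above; `LLM` is the no-match fallback, meaning answer directly with no tool):

```python
import re

# If the planner emits the generic "ToolName", infer the real tool from
# the user's query; otherwise leave the planner's choice alone.
PATTERNS = {
    'Calculate': [r'\d+\s*%', r'percent', r'split', r'convert'],
    'Remember':  [r'\bmy\b.*notes', r'yesterday', r'what did i'],
    'Lookup':    [r'weather', r'news', r'price of', r'current'],
}

def sanitize(tool: str, query: str) -> str:
    if tool != 'ToolName':
        return tool                      # planner committed; don't touch it
    q = query.lower()
    for name, pats in PATTERNS.items():
        if any(re.search(p, q) for p in pats):
            return name
    return 'LLM'                         # nothing matched: answer directly
```

Crucially, it only fires on the generic `ToolName`; committed choices pass through untouched.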
Result: ~90% → 97%.
The Final Numbers
| Change | Accuracy |
|---|---|
| Original (technical names) | 78% |
| After renaming tools | 86% |
| After calibration line | ~90% |
| After sanitizer | 97% |
40 test cases. 39 correct. The one failure: "What did we discuss yesterday?" sometimes triggers multiple tools, with Remember not listed first.
What I Learned
1. Tool names are prompts. The name you give a function is part of the instruction. "Remember" triggers different associations than "SearchMemory" - and at 1.2B parameters, those associations matter. This is prompt engineering at the API design level — the interface is the prompt.
2. Simple calibration beats complex instructions. "Match the tool to the task" outperformed paragraphs of detailed guidance. Small models respond to clear, concrete direction. If you're doing prompt engineering for small models, less is more.
3. Post-processing is underrated. Sometimes the fix isn't better prompts - it's a fallback layer. The sanitizer catches cases where the model is 80% there and just needs a nudge. (Though you can overdo it — I found the hard limit of rule-based fixes when I tried to handle negation.)
4. Test like software. I wouldn't ship code without tests. Why ship prompts without them? The test suite caught regressions I would have missed.
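Point 4 in miniature: pair each spoken-style query with its expected tool, then score any routing function against the suite. A sketch (cases sampled from this post's suite; the `route` callable stands in for the real planner so this runs offline):

```python
# A tiny regression harness for tool routing: each case pairs a
# spoken-style query with the tool it should hit.
CASES = [
    ("What's the tip on a $67 bill at 18%?", "Calculate"),
    ("What's the weather like right now?", "Lookup"),
    ("What did I ask you earlier?", "Remember"),
    ("Explain what an API is in simple terms", "LLM"),
]

def score(route) -> float:
    """Fraction of cases where route(query) picks the expected tool."""
    hits = sum(1 for query, expected in CASES if route(query) == expected)
    return hits / len(CASES)
```

Run it after every prompt tweak; a one-line change to the system prompt can silently regress a whole category, and this catches it in seconds.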
The Actual Test Cases
For anyone building something similar, here's what realistic voice queries look like:
Calculate:
- "What's the tip on a $67 bill at 18%?"
- "Split a $234 dinner bill between 5 people"
- "If I was born in 1987 how old am I?"
Lookup:
- "What's the weather like right now?"
- "Current stock price of Apple"
- "Latest headlines about climate change"
Remember:
- "What did I ask you earlier?"
- "What were the action items from our last chat?"
- "Find where I talked about the marketing plan"
LLM (no tools needed):
- "Explain what an API is in simple terms"
- "Compare renting versus buying a home"
- "Draft an email declining a meeting politely"
The test suite forces you to handle the messy, ambiguous queries people actually say out loud. Not "Please calculate 15% of 847" but "What's the tip on sixty-seven dollars at eighteen percent?"