Multi-Model Routing: Instruct vs Thinking on Edge Devices

I run two language models simultaneously on a Raspberry Pi 5. Same architecture (LFM2.5-1.2B), same quantization (Q5_K_M), but different personalities. One follows instructions. The other reasons through problems.

Deciding which model handles which request is the routing problem. Here's how I solved it — and why the obvious solution (let the AI decide) was the wrong call.

The two models

Instruct: LFM2.5-1.2B-Instruct. Trained for direct instruction following. "Summarize this." "Rewrite this email." "What's 15% of $67?" Fast, predictable, no wasted tokens on reasoning it doesn't need.

Thinking: LFM2.5-1.2B-Thinking. Trained for step-by-step reasoning with <think> blocks. Breaks problems down before answering. Better for multi-step tasks, tool orchestration, and anything that benefits from planning before acting.

I chose LFM2.5-1.2B over LFM2-2.6B for both. And I split into two models instead of one because a single model kept compromising at both jobs.

The routing matrix

The router is a 2x2 matrix based on two user-controlled toggles:

Tools  Reasoning  Model     Behavior
OFF    OFF        Instruct  Direct chat. Fast answers.
ON     OFF        Instruct  Chat with tools exposed. Model can call calculator, search, memory.
ON     ON         Thinking  Full reasoning pipeline with tools. Observe, Plan, Execute, Synthesize.
OFF    ON         Thinking  Reasoning without tools. Step-by-step analysis, no external calls.

In code, it's one function:

def _get_converse_llm(self, enable_thinking: bool):
    """Return the LLM instance to use for converse mode."""
    return self.llm_thinking if enable_thinking else self.llm_instruct

The enable_tools flag doesn't change which model runs — it changes what the model sees in its system prompt. Tools ON means function definitions are injected. Tools OFF means the model doesn't know tools exist.
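A minimal sketch of that prompt gating. The names here (TOOL_DEFS, build_system_prompt, the tool schemas) are illustrative, not the project's actual code:

```python
import json

# Hypothetical tool definitions; the real project exposes calculator,
# search, and memory tools with its own schemas.
TOOL_DEFS = [
    {"name": "calculator", "description": "Evaluate an arithmetic expression",
     "parameters": {"expression": "string"}},
    {"name": "search", "description": "Search the local knowledge base",
     "parameters": {"query": "string"}},
]

def build_system_prompt(enable_tools: bool) -> str:
    base = "You are a helpful assistant running on an edge device."
    if not enable_tools:
        # Tools OFF: the model never learns that tools exist
        return base
    # Tools ON: function definitions are injected into the system prompt
    return base + "\n\nAvailable tools:\n" + json.dumps(TOOL_DEFS, indent=2)
```

The model itself never changes; only the text it sees does.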

Why user-controlled, not automatic

The first version tried automatic routing. Analyze the query, classify its complexity, pick the right model. It failed for three reasons.

Classification costs time. On a Pi, running a classifier before routing adds 2-3 seconds of latency. For voice input where responsiveness matters, that's too much. The user already waited for Whisper to transcribe.

Classification at 1.2B is unreliable. A small model classifying "What's the weather?" as simple and "Plan a road trip considering gas prices and hotel costs" as complex? It gets it wrong often enough to be annoying. And a misroute costs more than a slow answer: the Thinking model is overkill for simple queries (wasted tokens), and the Instruct model is inadequate for complex ones (bad answers).

Users have better intuition than classifiers. When someone toggles Reasoning ON, they're saying "I want you to think about this." That's a signal no classifier can extract from text alone. The user knows their intent. Let them express it.

The toggle approach works because it's honest. The device doesn't pretend to be smart about routing. It gives you a switch and lets you flip it.

Mode-specific routing

Not all modes use the router. Some have fixed assignments:

Compose (Clarify) — Always Instruct. Summarization, outlining, and text cleanup don't benefit from reasoning. They need fast, predictable formatting. The Instruct model is better at "rewrite this as bullet points" than the Thinking model, which tends to over-explain.

Converse — Uses the router. Both toggles visible.

Background summarization — Always Instruct. Summaries need consistent formatting (abstract + extractive). The Thinking model's reasoning tokens are wasted here — you don't need chain-of-thought to write a two-sentence summary.

Tool execution (internal) — Always Instruct. When the Thinking model's plan calls a tool, the tool executor uses Instruct for any internal LLM calls (like parsing results). Deterministic behavior matters for reliability.
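The fixed assignments above collapse into a small lookup, so only Converse makes a runtime decision. A sketch, with illustrative mode names rather than the project's real identifiers:

```python
# Modes with a fixed model assignment; only "converse" consults the toggles.
FIXED_MODEL = {
    "compose": "instruct",        # formatting tasks: fast, predictable
    "summarize_bg": "instruct",   # background summaries: consistent output
    "tool_internal": "instruct",  # internal LLM calls during tool execution
}

def pick_model(mode: str, enable_thinking: bool = False) -> str:
    if mode == "converse":
        # The only runtime routing decision, driven by the user toggle
        return "thinking" if enable_thinking else "instruct"
    return FIXED_MODEL[mode]
```

Keeping the mapping in data rather than scattered conditionals makes the "fewer runtime decisions" policy auditable at a glance.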

The Thinking model pipeline

When Reasoning is ON and Tools are ON, the Thinking model runs a multi-phase pipeline:

1. Observer  — Extract user intent and context
2. Planner   — Generate a tool-call plan (ReWOO style)
3. Workers   — Execute tool calls sequentially
4. Solver    — Synthesize results into a response

Each phase gets its own system prompt. The KV cache resets between phases (a hard-won optimization — leaving the Reasoner's chain-of-thought in the cache confuses the Planner).

Evidence flows through placeholders: #E-01, #E-02, etc. The Planner writes a plan referencing future evidence, Workers fill in the evidence, and the Solver sees the complete picture.

This is the ReWOO pattern adapted for a 1.2B model. It works because each phase is simple enough for a small model to handle, even though the composite task would be too complex for one pass.
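The evidence flow can be sketched in a few lines. This is a toy reproduction of the placeholder mechanics, not the project's pipeline; the tool names and plan contents are invented:

```python
import re

# Plan written by the Planner: each step names a tool, its argument,
# and the evidence tag its result will fill.
plan = [
    ("#E-01", "search", "weather Berlin Saturday"),
    ("#E-02", "calendar", "free slots Saturday"),
]
solver_prompt = "Given weather #E-01 and availability #E-02, suggest a plan."

def run_tool(tool: str, arg: str) -> str:
    # Stand-in for real tool execution
    return f"<result of {tool}({arg})>"

evidence = {}
for tag, tool, arg in plan:
    # Workers may reference earlier evidence in their arguments
    arg = re.sub(r"#E-\d+", lambda m: evidence.get(m.group(0), m.group(0)), arg)
    evidence[tag] = run_tool(tool, arg)

# The Solver sees the prompt with every placeholder resolved
resolved = re.sub(r"#E-\d+", lambda m: evidence[m.group(0)], solver_prompt)
```

Because the Planner commits to the whole plan up front, the model never has to hold intermediate results in context while deciding what to do next, which is exactly what makes the pattern workable at 1.2B.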

When Reasoning is ON but Tools are OFF

This is the underrated combination. The Thinking model generates <think> blocks — visible reasoning — but can't call external tools. Pure analysis.

Good for:

  • "Help me think through whether to refinance"
  • "What are the tradeoffs of X vs Y?"
  • "Explain this concept step by step"

The model's reasoning tokens stream to a collapsible section on the touchscreen. You can watch it think, then read the final answer. Slower than Instruct (more tokens generated), but the reasoning quality is measurably better for anything requiring multi-step logic.
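Separating the reasoning from the answer for that collapsible UI is a simple parse. A sketch assuming the model emits at most one <think>...</think> block, as described above:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a Thinking-model response."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        # No reasoning block: the whole text is the answer
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the think block
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

A streaming UI would do the same thing incrementally, hiding tokens between the tags behind the collapsible section as they arrive.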

Latency comparison

Measured on Pi 5, 8 GB, Q5_K_M, typical queries:

Configuration         Time to first token  Total response
Instruct, no tools    ~0.5s                3-8s
Instruct, with tools  ~0.5s                5-12s (includes tool execution)
Thinking, no tools    ~0.5s + think time   8-20s
Thinking, with tools  ~0.5s + pipeline     15-30s

The Thinking pipeline is 2-4x slower than Instruct. That's the tradeoff. For "what time is it" you don't need 20 seconds of reasoning. For "plan my week based on my calendar and the weather," the extra time produces a much better answer.

Advice for building multi-model routing

Start with one model. Add the second only when you have clear evidence that one model can't serve both needs. For me, that evidence was tool call accuracy — the Instruct model was great at tools but mediocre at reasoning, and vice versa.

User-controlled routing is not a cop-out. It's a design choice. Automatic routing adds complexity, latency, and failure modes. Toggles add a UI element and zero latency.

Don't route by model size. Both my models are 1.2B. The routing is about training objective (instruct vs reasoning), not capability. A 7B instruct model wouldn't be better for my reasoning tasks than a 1.2B thinking model — it would just be slower.

Fix routing to specific modes where possible. Compose always uses Instruct. Background tasks always use Instruct. The fewer decisions the system makes at runtime, the fewer things go wrong.

Loading two models costs about 1.7 GB of RAM. On an 8 GB Pi, that's affordable. The payoff is a device that's fast when you need fast and thoughtful when you need thoughtful.
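A back-of-envelope check on that 1.7 GB figure, assuming Q5_K_M averages roughly 5.5 bits per weight (an approximation; it's a mixed-precision format) and ignoring KV cache and runtime overhead:

```python
# Rough weight-memory estimate for two 1.2B models at Q5_K_M.
params = 1.2e9
bits_per_weight = 5.5               # approximate average for Q5_K_M
one_model_gb = params * bits_per_weight / 8 / 1e9   # ~0.83 GB per model
two_models_gb = 2 * one_model_gb                    # ~1.65 GB total
```

That lines up with the observed ~1.7 GB once per-model overhead is included, and leaves plenty of headroom on an 8 GB board.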