Why We Stopped Forcing One Model to Do Everything

Two weeks ago I wrote about "thinking without thinking models."
At the time, that was the right call. I was trying to get strong reasoning behavior out of small local models on a Pi 5. So I built a process: Reasoner -> Planner -> Worker -> Solver. One model. Multiple roles.
It worked. But it also came with hidden costs.
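For concreteness, the role chaining looked roughly like this. This is a hypothetical sketch, not the project's actual code: function names, role prompts, and the `model` callable are all placeholders for one small model prompted four times in sequence.

```python
# Hypothetical sketch of the original role-chained pipeline:
# ONE model, prompted once per role, each stage feeding the next.
ROLES = ["reasoner", "planner", "worker", "solver"]

ROLE_PROMPTS = {
    "reasoner": "Think through the problem step by step.",
    "planner": "Turn the reasoning into a concrete plan.",
    "worker": "Execute the plan, calling tools if needed.",
    "solver": "Write the final answer for the user.",
}

def run_pipeline(model, user_input: str) -> str:
    """Chain one small model through four roles; the solver's
    output is what the user actually hears."""
    context = user_input
    for role in ROLES:
        prompt = f"{ROLE_PROMPTS[role]}\n\nContext:\n{context}"
        context = model(prompt)  # model: any callable str -> str
    return context
```

Four model calls per turn, every turn, whether the question needed them or not.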
This post is about why I moved the project to a simpler setup:
- base Instruct model for direct chat and clarify flows
- Thinking model for native reasoning in Converse
- tools exposed only when the user enables tools
No forced orchestration when the model can reason natively.
The original architecture solved a real problem
The earlier pipeline gave me control. It let me structure how the model thought. It reduced some failure modes. And it made small models more usable under tight hardware constraints.
That was important for this project:
- Raspberry Pi 5
- fully local
- voice-first interaction
- low-friction UX
The pipeline was a good bridge. But a bridge is not always the final road.
Where it started to hurt
Three problems kept showing up.
1) Natural reasoning felt unnatural
Reasoning models have a superpower: they lay out their chain of thought in their own native format. Verbose, yes. But that verbosity is often how they get to the right answer.
When I forced reasoning through a strict multi-stage framework, I got structure. I also lost some native "figure it out" behavior.
You can feel this in voice products. The response may be correct. But it can feel mechanical.
2) UI and state complexity exploded
Once you split reasoning into phases, the UI must mirror those phases:
- reasoning chips
- planning chips
- tool chips
- solver streaming
- persisted replay of all of the above
That is a lot of state for a plug-and-play voice device. The architecture became harder to reason about than the user requests it served.
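To show what that state looked like, here is an illustrative sketch of the per-turn event log the old UI had to track and replay. The phase names and types are hypothetical; the point is how many distinct kinds of event a single voice turn could produce.

```python
# Illustrative sketch (hypothetical names): every turn could emit
# events from four phases, and all of them had to be persisted so
# the UI could replay the turn later.
from dataclasses import dataclass, field
from enum import Enum, auto

class Phase(Enum):
    REASONING = auto()      # reasoning chips
    PLANNING = auto()       # planning chips
    TOOL_CALL = auto()      # tool chips
    SOLVER_STREAM = auto()  # streamed final answer

@dataclass
class TurnState:
    events: list = field(default_factory=list)  # (Phase, payload)

    def record(self, phase: Phase, payload: str) -> None:
        self.events.append((phase, payload))

    def replay(self) -> list:
        # Persisted replay: re-emit every phase event in order.
        return list(self.events)
```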
3) Training loops did not improve reliability
I tried to improve this through data and tuning loops:
- SFT iterations
- GRPO-style shaping
- DPO preference passes
What I found was uncomfortable but clear: these loops can erode capability at this scale.
Not always. But often enough that I stopped trusting "more tuning" as the default answer.
Some runs improved style while hurting actual reasoning. Some improved tool format while hurting judgment. Some looked better in narrow evals and worse in real voice sessions.
The decision: stop over-engineering the model role
So I made a policy change.
Use the model that is designed for the job.
New runtime policy
In Converse:
- tools OFF, thinking OFF -> Instruct model, no tools in prompt
- tools ON, thinking OFF -> Instruct model, tools exposed
- tools OFF, thinking ON -> Thinking model, no tools in prompt
- tools ON, thinking ON -> Thinking model, tools exposed
For Clarify flows:
- use base Instruct model
That is it. No synthetic "thinking pipeline" required for normal operation.
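The whole policy fits in one small routing function. A minimal sketch, assuming placeholder model identifiers (the real model names depend on the deployment):

```python
# Minimal sketch of the runtime routing policy.
# Model ids are placeholders, not actual model names.
INSTRUCT = "instruct-model"  # base Instruct model
THINKING = "thinking-model"  # native reasoning model

def route(mode: str, tools_on: bool, thinking_on: bool):
    """Return (model, expose_tools) for a request.

    Converse: pick the model by the thinking toggle, and expose
    tools only when the user enabled them.
    Clarify: always the base Instruct model, no tools.
    """
    if mode == "clarify":
        return INSTRUCT, False
    model = THINKING if thinking_on else INSTRUCT
    return model, tools_on
```

Two user-facing toggles, one table of outcomes, nothing hidden.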
Why this is better for the device
Better fit for voice UX
Voice hardware should feel immediate. People should not have to understand your internal architecture.
They should just choose: "Do I want tools?" "Do I want deeper thinking?"
Then the system should do the obvious thing.
Better fit for local constraints
This approach is still local-first. Still Pi-first. Still privacy-first.
But now complexity sits where it belongs: model selection + prompt discipline + tool exposure rules.
Not a heavyweight orchestration layer for every request.
Better fit for reliability
I now trust base model behavior more than fragile over-tuning loops.
So the strategy is:
- use strong base instruct + thinking models
- prompt clearly
- expose tools deliberately
- keep architecture legible
That gives better behavior than trying to "force intelligence" with layers.
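"Expose tools deliberately" reduces to one rule in practice: the tool schema only enters the prompt when the user turned tools on. A hedged sketch with a hypothetical prompt format (the real schema shape is deployment-specific):

```python
# Hypothetical sketch: tool exposure as a prompt-building rule.
# When tools are off, the prompt contains no tool schema at all,
# so the model cannot be tempted into tool-shaped output.
def build_system_prompt(base: str, tools: list, expose: bool) -> str:
    if not expose:
        return base
    schema = "\n".join(f"- {name}" for name in tools)
    return f"{base}\n\nAvailable tools:\n{schema}"
```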
What this does not mean
It does not mean the previous work was a mistake.
That framework taught me what matters:
- event ordering is critical
- tool-call shape matters
- replay state matters
- UI truth must match model truth
Those lessons still power the current system. The difference is where I draw the boundary now.
The bigger lesson
There is a temptation in local AI to compensate for model limits with architecture.
Sometimes you should. Sometimes that is exactly right.
But there is a point where architecture starts fighting the model instead of helping it.
When that happens, simpler is smarter.
For this project, the pragmatic path is now: native Instruct + native Thinking, routed by user intent.
And honestly, that is closer to the original product goal anyway: plug it in, talk to it, trust it.