Why We Stopped Forcing One Model to Do Everything

Two weeks ago I wrote about "thinking without thinking models."
At the time, that was the right call. I was trying to get strong reasoning behavior out of small local models on a Pi 5. So I built a process: Reasoner -> Planner -> Worker -> Solver. One model. Multiple roles.
It worked. But it also came with hidden costs.
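For concreteness, the role chaining looked roughly like this. This is a hypothetical sketch, not the project's actual code: function names, role prompts, and the `model` callable are all placeholders for one small model prompted four times in sequence.

```python
# Hypothetical sketch of the original role-chained pipeline:
# ONE model, prompted once per role, each stage feeding the next.
ROLES = ["reasoner", "planner", "worker", "solver"]

ROLE_PROMPTS = {
    "reasoner": "Think through the problem step by step.",
    "planner": "Turn the reasoning into a concrete plan.",
    "worker": "Execute the plan, calling tools if needed.",
    "solver": "Write the final answer for the user.",
}

def run_pipeline(model, user_input: str) -> str:
    """Chain one small model through four roles; the solver's
    output is what the user actually hears."""
    context = user_input
    for role in ROLES:
        prompt = f"{ROLE_PROMPTS[role]}\n\nContext:\n{context}"
        context = model(prompt)  # model: any callable str -> str
    return context
```

Four model calls per turn, every turn, whether the question needed them or not.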
This post is about why I moved the project to a simpler setup:
- base Instruct model for direct chat and clarify flows
- Thinking model for native reasoning in Converse
- tools exposed only when the user enables tools
No forced orchestration when the model can reason natively.
The original architecture solved a real problem
The earlier pipeline gave me control. It let me structure how the model thought. It reduced some failure modes. And it made small models more usable under tight hardware constraints.
That was important for this project:
- Raspberry Pi 5
- fully local
- voice-first interaction
- low-friction UX
The pipeline was a good bridge. But a bridge is not always the final road.
Where it started to hurt
Three problems kept showing up.
1) Natural reasoning felt unnatural
Reasoning models have a superpower: they lay out their chain of thought in their own native format. Verbose, yes. But that verbosity is often how they get to the right answer.
When I forced reasoning through a strict multi-stage framework, I got structure. I also lost some native "figure it out" behavior.
You can feel this in voice products. The response may be correct. But it can feel mechanical.
2) UI and state complexity exploded
Once you split reasoning into phases, the UI must mirror those phases:
- reasoning chips
- planning chips
- tool chips
- solver streaming
- persisted replay of all of the above
That is a lot of state for a plug-and-play voice device. The architecture became harder to reason about than the user requests it served.
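To show what that state looked like, here is an illustrative sketch of the per-turn event log the old UI had to track and replay. The phase names and types are hypothetical; the point is how many distinct kinds of event a single voice turn could produce.

```python
# Illustrative sketch (hypothetical names): every turn could emit
# events from four phases, and all of them had to be persisted so
# the UI could replay the turn later.
from dataclasses import dataclass, field
from enum import Enum, auto

class Phase(Enum):
    REASONING = auto()      # reasoning chips
    PLANNING = auto()       # planning chips
    TOOL_CALL = auto()      # tool chips
    SOLVER_STREAM = auto()  # streamed final answer

@dataclass
class TurnState:
    events: list = field(default_factory=list)  # (Phase, payload)

    def record(self, phase: Phase, payload: str) -> None:
        self.events.append((phase, payload))

    def replay(self) -> list:
        # Persisted replay: re-emit every phase event in order.
        return list(self.events)
```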
3) Training loops did not improve reliability
I tried to improve this through data and tuning loops:
- SFT iterations
- GRPO-style shaping
- DPO preference passes
What I found was uncomfortable but clear: these loops can erode capability at this scale.
Not always. But often enough that I stopped trusting "more tuning" as the default answer.
Some runs improved style while hurting actual reasoning. Some improved tool format while hurting judgment. Some looked better in narrow evals and worse in real voice sessions.
The decision: stop over-engineering the model role
So I made a policy change.
Use the model that is designed for the job.
New runtime policy
In Converse:
- tools OFF, thinking OFF -> Instruct model, no tools in prompt
- tools ON, thinking OFF -> Instruct model, tools exposed
- tools OFF, thinking ON -> Thinking model, no tools in prompt
- tools ON, thinking ON -> Thinking model, tools exposed
For Clarify flows:
- use base Instruct model
That is it. No synthetic "thinking pipeline" required for normal operation.
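The whole policy fits in one small routing function. A minimal sketch, assuming placeholder model identifiers (the real model names depend on the deployment):

```python
# Minimal sketch of the runtime routing policy.
# Model ids are placeholders, not actual model names.
INSTRUCT = "instruct-model"  # base Instruct model
THINKING = "thinking-model"  # native reasoning model

def route(mode: str, tools_on: bool, thinking_on: bool):
    """Return (model, expose_tools) for a request.

    Converse: pick the model by the thinking toggle, and expose
    tools only when the user enabled them.
    Clarify: always the base Instruct model, no tools.
    """
    if mode == "clarify":
        return INSTRUCT, False
    model = THINKING if thinking_on else INSTRUCT
    return model, tools_on
```

Two user-facing toggles, one table of outcomes, nothing hidden.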
Why this is better for the device
Better fit for voice UX
Voice hardware should feel immediate. People should not have to understand your internal architecture.
They should just choose: "Do I want tools?" "Do I want deeper thinking?"
Then the system should do the obvious thing.
Better fit for local constraints
This approach is still local-first. Still Pi-first. Still privacy-first.
But now complexity sits where it belongs: model selection + prompt discipline + tool exposure rules.
Not a heavyweight orchestration layer for every request.
Better fit for reliability
I now trust base model behavior more than fragile over-tuning loops.
So the strategy is:
- use strong base instruct + thinking models
- prompt clearly
- expose tools deliberately
- keep architecture legible
That gives better behavior than trying to "force intelligence" with layers.
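"Expose tools deliberately" reduces to one rule in practice: the tool schema only enters the prompt when the user turned tools on. A hedged sketch with a hypothetical prompt format (the real schema shape is deployment-specific):

```python
# Hypothetical sketch: tool exposure as a prompt-building rule.
# When tools are off, the prompt contains no tool schema at all,
# so the model cannot be tempted into tool-shaped output.
def build_system_prompt(base: str, tools: list, expose: bool) -> str:
    if not expose:
        return base
    schema = "\n".join(f"- {name}" for name in tools)
    return f"{base}\n\nAvailable tools:\n{schema}"
```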
What this does not mean
It does not mean the previous work was a mistake.
That framework taught me what matters:
- event ordering is critical
- tool-call shape matters
- replay state matters
- UI truth must match model truth
Those lessons still power the current system. The difference is where I draw the boundary now.
The bigger lesson
There is a temptation in local AI to compensate for model limits with architecture.
Sometimes you should. Sometimes that is exactly right.
But there is a point where architecture starts fighting the model instead of helping it.
When that happens, simpler is smarter.
For this project, the pragmatic path is now: native Instruct + native Thinking, routed by user intent.
And honestly, that is closer to the original product goal anyway: plug it in, talk to it, trust it.