Thinking Without Thinking Models

I'm obsessed with making tiny models think.
Not "think" as in generate tokens that look thoughtful. Actually reason through multi-step problems - the kind where you need to pull context from memory, do some math, and synthesize an answer.
I'm building a voice-operated AI device — think a keyboard you talk to instead of type on. It runs entirely on a Raspberry Pi 5. No cloud. No API calls. Everything happens locally on an $80 computer with 8GB of RAM. The goal is an accessibility-friendly input device that travels with you, works offline, and keeps your data on your device.
But here's the thing: sometimes you need more than a single-shot response. "What did we talk about last week, and how does that affect this decision?" That's a multi-step problem. Search memory, extract relevant info, reason about it, give me an answer.
The obvious solution is a thinking model. o1, DeepSeek R1, QwQ. Let the model reason longer, get better answers.
I tried this. At 1-3B parameters, thinking models are... not good. The reasoning traces read like drunk philosophy. Lots of words, wrong conclusions. And they're slow - we're talking minutes per response on a Pi.
So I went looking for another way.
The Paper That Changed My Approach
I found ReWOO (Xu et al., 2023) — "Reasoning Without Observations." The core insight hit me immediately: you don't need a thinking model. You need a thinking process.
Most AI agents work like this:
- Think → Act → Observe
- Think again (with observation) → Act → Observe
- Repeat until done
The problem? Token consumption grows roughly quadratically with step count. Each loop feeds the entire history back to the model, so every call is bigger than the last. For a 5-step task, that's 5 LLM calls, each one re-reading everything before it. On my hardware, that means 4-5 minutes for one question.
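A toy model makes the blowup concrete. Assume (purely for illustration) each step adds about k tokens of thought plus observation; the constants are made up, but the shapes are the point:

```python
# Toy token-count model. k is an assumed per-step token cost; the
# absolute numbers are illustrative, only the growth curves matter.

def interleaved_tokens(steps: int, k: int = 400) -> int:
    # Call i re-reads the history of steps 1..i: k + 2k + ... + nk = k*n(n+1)/2
    return sum(k * i for i in range(1, steps + 1))

def rewoo_tokens(steps: int, k: int = 400) -> int:
    # One Planner call emits the plan, one Solver call reads plan + evidence:
    # roughly linear in steps, and always exactly 2 LLM calls.
    return 2 * k * steps

print(interleaved_tokens(5))  # 6000 -- and it keeps accelerating
print(rewoo_tokens(5))        # 4000 -- linear, 2 calls total
```

At 5 steps the gap is modest; at 10 steps the interleaved loop costs 22,000 tokens to ReWOO's 8,000, which is where a Pi-sized context window starts to hurt.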
ReWOO does something different:
PLANNER (1 LLM call) → WORKER (tool execution) → SOLVER (1 LLM call)
The Planner generates a complete blueprint before seeing any results. It uses placeholder variables that get filled in later:
```
User: "What's the weather where I had my last meeting?"

Plan: Search memory for last meeting location
#E1 = SearchMemory[last meeting location]
Plan: Extract city name
#E2 = LLM[What city is mentioned in: #E1]
Plan: Get weather
#E3 = SearchWeb[weather in #E2]
```
The Worker executes each step, substituting real values for the placeholders. No LLM calls - just rule-based execution.
The Solver gets all the plans plus evidence and synthesizes an answer.
Total LLM calls: 2, regardless of step count.
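The Worker's substitution step is simple enough to sketch in a few lines. The function name `run_worker` and the stub tool implementations are mine, not from the paper; the tool names match the plan above:

```python
import re

# Minimal Worker sketch: execute a ReWOO-style plan in order, replacing
# #E placeholders with evidence gathered in earlier steps. No LLM calls
# happen here except where the plan explicitly invokes the LLM tool.

def run_worker(plan, tools):
    """plan: list of (evidence_id, tool_name, raw_args) tuples."""
    evidence = {}
    for eid, tool, args in plan:
        # Substitute any #E reference with the evidence collected so far.
        filled = re.sub(r"#E\d+",
                        lambda m: evidence.get(m.group(0), m.group(0)),
                        args)
        try:
            evidence[eid] = tools[tool](filled)
        except Exception as exc:
            evidence[eid] = f"[tool error: {exc}]"  # failure stored as evidence
    return evidence

# Stand-in tools so the sketch runs end to end:
tools = {
    "SearchMemory": lambda q: "Met the team in Lisbon on Tuesday.",
    "LLM": lambda p: "Lisbon",  # stand-in for a local model call
    "SearchWeb": lambda q: f"Sunny, 21C ({q})",
}
plan = [
    ("#E1", "SearchMemory", "last meeting location"),
    ("#E2", "LLM", "What city is mentioned in: #E1"),
    ("#E3", "SearchWeb", "weather in #E2"),
]
print(run_worker(plan, tools)["#E3"])  # Sunny, 21C (weather in Lisbon)
```

Note that a failed tool call doesn't abort the run: the error string becomes the evidence for that slot, which matters later.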
Why This Makes Sense to Me
I was skeptical at first. How can you plan without seeing results?
But then I realized - I do this constantly. "I'll check my calendar, and if I'm free, I'll message Sarah." I don't need to see my calendar before forming that plan. I'm planning based on expected outcomes.
The paper calls this "foreseeable reasoning." The model predicts what tool results will probably contain and references them with placeholder variables.
It's not perfect. Sometimes the Planner assumes a tool will return something it doesn't. But it eliminates the quadratic blowup that makes interleaved reasoning impossible on my hardware.
The Numbers That Convinced Me
From the ReWOO paper, on HotpotQA:
| Approach | Tokens per query |
|---|---|
| ReAct (interleaved) | ~9,795 |
| ReWOO | ~1,986 |
5x token reduction.
My estimates for the device on Pi 5:
| Metric | Interleaved | ReWOO |
|---|---|---|
| LLM calls (5-step task) | 5 | 2 |
| Time estimate | 4-5 min | 1-2 min |
| Token consumption | ~10K | ~2K |
Still not instant. But usable for "let me think about this" queries.
What I'm Building
I'm adapting ReWOO for this project with some constraints:
One model, three roles. I don't have GPT-3.5 for planning and a smaller model for solving. I have one 1.2B model. So I use the same model with different prompts for each phase.
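In practice that looks like a single generate function wrapped with per-role system prompts. Everything here (the prompt wording, the `call_role` helper) is my own sketch of the idea, not code from the paper:

```python
# One model, three roles: same weights, different system prompts.
# The Worker needs no prompt at all -- it's rule-based execution.

ROLE_PROMPTS = {
    "planner": "You are a planner. Break the task into tool calls, "
               "referencing future results as #E1, #E2, ...",
    "solver": "You are a solver. Answer the question from the plans and "
              "evidence provided. Use evidence with caution.",
}

def call_role(model, role, user_text):
    # `model` is any callable (system_prompt, user_text) -> str.
    # On the device it would wrap the local 1.2B model; here it's a stub.
    return model(ROLE_PROMPTS[role], user_text)

# Stub model that just echoes its inputs, to show the wiring:
echo = lambda system, user: f"[{system.split('.')[0]}] {user}"
print(call_role(echo, "planner", "weather at last meeting?"))
```

Swapping prompts instead of models keeps the weights resident in RAM, which matters on 8GB.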
Minimal tools. Four to start:
- `SearchMemory[query]` - vector search over past conversations
- `LLM[prompt]` - use the model for reasoning
- `Calculator[expr]` - math
- `CurrentTime[]` - date/time
The constraint forces focus. I can add more later if needed.
Structured output. Instead of parsing free-form text, I'm using JSON schemas for the Planner output. More reliable, especially with a smaller model.
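For concreteness, here's the kind of shape I mean. The schema below is my guess at a minimal Planner output format, not something the ReWOO paper prescribes:

```python
import json

# Assumed Planner output schema: a flat list of steps, each naming its
# evidence variable, tool, and raw argument string.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "evidence": {"type": "string", "pattern": "^#E\\d+$"},
                    "tool": {"enum": ["SearchMemory", "LLM",
                                      "Calculator", "CurrentTime"]},
                    "args": {"type": "string"},
                },
                "required": ["evidence", "tool", "args"],
            },
        }
    },
    "required": ["steps"],
}

# What a well-behaved small model would be constrained to emit:
raw = ('{"steps": [{"evidence": "#E1", "tool": "SearchMemory", '
       '"args": "last meeting location"}]}')
plan = json.loads(raw)
print(plan["steps"][0]["tool"])  # SearchMemory
```

The `enum` on the tool name is the useful part: with grammar-constrained decoding, a 1.2B model physically cannot hallucinate a tool that doesn't exist.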
The Part I Didn't Expect
ReWOO handles failures better than the interleaved approach.
When a tool fails in ReAct-style systems, the model often loops: try tool A → fail → try tool B → fail → back to tool A → repeat forever.
With ReWOO, the plan is already made. If a tool fails, you store the failure as evidence and keep going. The Solver is told to "use evidence with caution" - it can work around missing pieces.
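The "use evidence with caution" instruction is just part of the Solver's prompt. The wording and the `build_solver_prompt` helper below are my paraphrase of the idea, not the paper's actual prompt:

```python
# Sketch of Solver prompt assembly: all plans plus all evidence, including
# any tool-error strings the Worker recorded, go in as one context.

def build_solver_prompt(question, plans, evidence):
    lines = ["Solve the task using the plans and evidence below.",
             "Some evidence may be missing or wrong; use it with caution.",
             ""]
    for step, eid in zip(plans, evidence):
        lines.append(f"Plan: {step}")
        lines.append(f"{eid} = {evidence[eid]}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_solver_prompt(
    "What's the weather where I had my last meeting?",
    ["Search memory for last meeting location", "Get weather"],
    {"#E1": "Met the team in Lisbon.", "#E2": "[tool error: timeout]"},
)
print("#E2 = [tool error: timeout]" in prompt)  # True
```

The error lands in the prompt like any other evidence, so the Solver can route around it instead of the loop spinning on a retry.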
From the paper's failure analysis:
| Failure Type | ReAct | ReWOO |
|---|---|---|
| Token Overflow | 18% | 0% |
| Bad Reasoning | 76% | 51% |
You trade some "answer miss" errors for complete elimination of token overflow. On constrained hardware, that's a trade I'll take.
What's Next
This is the architecture I'm committing to. The implementation roadmap:
- Core pipeline — Planner, Worker, Solver classes
- Integration — hook into conversation mode, add a UI toggle
- Optimization — tune prompts for my specific model, add caching
Target: 30-60 seconds for a 3-step reasoning task. Not instant, but acceptable when you're asking something that genuinely requires thinking.
I'll write more as I build it. The first thing I tackled was getting one model to play all three roles — same weights, different prompts.