Thinking Without Thinking Models

I'm obsessed with making tiny models think.
Not "think" as in generate tokens that look thoughtful. Actually reason through multi-step problems - the kind where you need to pull context from memory, do some math, and synthesize an answer.
I'm building a voice-operated AI device — think a keyboard you talk to instead of type on. It runs entirely on a Raspberry Pi 5. No cloud. No API calls. Everything happens locally on an $80 computer with 8GB of RAM. The goal is an accessibility-friendly input device that travels with you, works offline, and keeps your data on your device.
But here's the thing: sometimes you need more than a single-shot response. "What did we talk about last week, and how does that affect this decision?" That's a multi-step problem. Search memory, extract relevant info, reason about it, give me an answer.
The obvious solution is a thinking model. o1, DeepSeek R1, QwQ. Let the model reason longer, get better answers.
I tried this. At 1-3B parameters, thinking models are... not good. The reasoning traces read like drunk philosophy. Lots of words, wrong conclusions. And they're slow - we're talking minutes per response on a Pi.
So I went looking for another way.
The Paper That Changed My Approach
I found ReWOO (Xu et al., 2023) — "Reasoning Without Observations." The core insight hit me immediately: you don't need a thinking model. You need a thinking process.
Most AI agents work like this:
- Think → Act → Observe
- Think again (with observation) → Act → Observe
- Repeat until done
The problem? Token consumption grows roughly quadratically with step count. Each loop feeds the entire history back to the model, so every call is bigger than the last. For a 5-step task, that's 5 LLM calls, each one re-reading everything before it. On my hardware, that means 4-5 minutes for one question.
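A toy model makes the blowup concrete. Assume (purely for illustration) each step adds about k tokens of thought plus observation; the constants are made up, but the shapes are the point:

```python
# Toy token-count model. k is an assumed per-step token cost; the
# absolute numbers are illustrative, only the growth curves matter.

def interleaved_tokens(steps: int, k: int = 400) -> int:
    # Call i re-reads the history of steps 1..i: k + 2k + ... + nk = k*n(n+1)/2
    return sum(k * i for i in range(1, steps + 1))

def rewoo_tokens(steps: int, k: int = 400) -> int:
    # One Planner call emits the plan, one Solver call reads plan + evidence:
    # roughly linear in steps, and always exactly 2 LLM calls.
    return 2 * k * steps

print(interleaved_tokens(5))  # 6000 -- and it keeps accelerating
print(rewoo_tokens(5))        # 4000 -- linear, 2 calls total
```

At 5 steps the gap is modest; at 10 steps the interleaved loop costs 22,000 tokens to ReWOO's 8,000, which is where a Pi-sized context window starts to hurt.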
ReWOO does something different:
PLANNER (1 LLM call) → WORKER (tool execution) → SOLVER (1 LLM call)
The Planner generates a complete blueprint before seeing any results. It uses placeholder variables that get filled in later:
```
User: "What's the weather where I had my last meeting?"

Plan: Search memory for last meeting location
#E1 = SearchMemory[last meeting location]
Plan: Extract city name
#E2 = LLM[What city is mentioned in: #E1]
Plan: Get weather
#E3 = SearchWeb[weather in #E2]
```
The Worker executes each step, substituting real values for the placeholders. No LLM calls - just rule-based execution.
The Solver gets all the plans plus evidence and synthesizes an answer.
Total LLM calls: 2, regardless of step count.
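The Worker's substitution step is simple enough to sketch in a few lines. The function name `run_worker` and the stub tool implementations are mine, not from the paper; the tool names match the plan above:

```python
import re

# Minimal Worker sketch: execute a ReWOO-style plan in order, replacing
# #E placeholders with evidence gathered in earlier steps. No LLM calls
# happen here except where the plan explicitly invokes the LLM tool.

def run_worker(plan, tools):
    """plan: list of (evidence_id, tool_name, raw_args) tuples."""
    evidence = {}
    for eid, tool, args in plan:
        # Substitute any #E reference with the evidence collected so far.
        filled = re.sub(r"#E\d+",
                        lambda m: evidence.get(m.group(0), m.group(0)),
                        args)
        try:
            evidence[eid] = tools[tool](filled)
        except Exception as exc:
            evidence[eid] = f"[tool error: {exc}]"  # failure stored as evidence
    return evidence

# Stand-in tools so the sketch runs end to end:
tools = {
    "SearchMemory": lambda q: "Met the team in Lisbon on Tuesday.",
    "LLM": lambda p: "Lisbon",  # stand-in for a local model call
    "SearchWeb": lambda q: f"Sunny, 21C ({q})",
}
plan = [
    ("#E1", "SearchMemory", "last meeting location"),
    ("#E2", "LLM", "What city is mentioned in: #E1"),
    ("#E3", "SearchWeb", "weather in #E2"),
]
print(run_worker(plan, tools)["#E3"])  # Sunny, 21C (weather in Lisbon)
```

Note that a failed tool call doesn't abort the run: the error string becomes the evidence for that slot, which matters later.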
Why This Makes Sense to Me
I was skeptical at first. How can you plan without seeing results?
But then I realized - I do this constantly. "I'll check my calendar, and if I'm free, I'll message Sarah." I don't need to see my calendar before forming that plan. I'm planning based on expected outcomes.
The paper calls this "foreseeable reasoning." The model predicts what tool results will probably contain and references them with placeholder variables.
It's not perfect. Sometimes the Planner assumes a tool will return something it doesn't. But it eliminates the quadratic blowup that makes interleaved reasoning impossible on my hardware.
The Numbers That Convinced Me
From the ReWOO paper, on HotpotQA:
| Approach | Tokens per query |
|---|---|
| ReAct (interleaved) | ~9,795 |
| ReWOO | ~1,986 |
5x token reduction.
My estimates for the device on Pi 5:
| Metric | Interleaved | ReWOO |
|---|---|---|
| LLM calls (5-step task) | 5 | 2 |
| Time estimate | 4-5 min | 1-2 min |
| Token consumption | ~10K | ~2K |
Still not instant. But usable for "let me think about this" queries.
What I'm Building
I'm adapting ReWOO for this project with some constraints:
One model, three roles. I don't have GPT-3.5 for planning and a smaller model for solving. I have one 1.2B model. So I use the same model with different prompts for each phase.
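In practice that looks like a single generate function wrapped with per-role system prompts. Everything here (the prompt wording, the `call_role` helper) is my own sketch of the idea, not code from the paper:

```python
# One model, three roles: same weights, different system prompts.
# The Worker needs no prompt at all -- it's rule-based execution.

ROLE_PROMPTS = {
    "planner": "You are a planner. Break the task into tool calls, "
               "referencing future results as #E1, #E2, ...",
    "solver": "You are a solver. Answer the question from the plans and "
              "evidence provided. Use evidence with caution.",
}

def call_role(model, role, user_text):
    # `model` is any callable (system_prompt, user_text) -> str.
    # On the device it would wrap the local 1.2B model; here it's a stub.
    return model(ROLE_PROMPTS[role], user_text)

# Stub model that just echoes its inputs, to show the wiring:
echo = lambda system, user: f"[{system.split('.')[0]}] {user}"
print(call_role(echo, "planner", "weather at last meeting?"))
```

Swapping prompts instead of models keeps the weights resident in RAM, which matters on 8GB.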
Minimal tools. Four to start:
- `SearchMemory[query]` - vector search over past conversations
- `LLM[prompt]` - use the model for reasoning
- `Calculator[expr]` - math
- `CurrentTime[]` - date/time
The constraint forces focus. I can add more later if needed.
Structured output. Instead of parsing free-form text, I'm using JSON schemas for the Planner output. More reliable, especially with a smaller model.
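For concreteness, here's the kind of shape I mean. The schema below is my guess at a minimal Planner output format, not something the ReWOO paper prescribes:

```python
import json

# Assumed Planner output schema: a flat list of steps, each naming its
# evidence variable, tool, and raw argument string.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "evidence": {"type": "string", "pattern": "^#E\\d+$"},
                    "tool": {"enum": ["SearchMemory", "LLM",
                                      "Calculator", "CurrentTime"]},
                    "args": {"type": "string"},
                },
                "required": ["evidence", "tool", "args"],
            },
        }
    },
    "required": ["steps"],
}

# What a well-behaved small model would be constrained to emit:
raw = ('{"steps": [{"evidence": "#E1", "tool": "SearchMemory", '
       '"args": "last meeting location"}]}')
plan = json.loads(raw)
print(plan["steps"][0]["tool"])  # SearchMemory
```

The `enum` on the tool name is the useful part: with grammar-constrained decoding, a 1.2B model physically cannot hallucinate a tool that doesn't exist.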
The Part I Didn't Expect
ReWOO handles failures better than the interleaved approach.
When a tool fails in ReAct-style systems, the model often loops: try tool A → fail → try tool B → fail → back to tool A → repeat forever.
With ReWOO, the plan is already made. If a tool fails, you store the failure as evidence and keep going. The Solver is told to "use evidence with caution" - it can work around missing pieces.
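The "use evidence with caution" instruction is just part of the Solver's prompt. The wording and the `build_solver_prompt` helper below are my paraphrase of the idea, not the paper's actual prompt:

```python
# Sketch of Solver prompt assembly: all plans plus all evidence, including
# any tool-error strings the Worker recorded, go in as one context.

def build_solver_prompt(question, plans, evidence):
    lines = ["Solve the task using the plans and evidence below.",
             "Some evidence may be missing or wrong; use it with caution.",
             ""]
    for step, eid in zip(plans, evidence):
        lines.append(f"Plan: {step}")
        lines.append(f"{eid} = {evidence[eid]}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_solver_prompt(
    "What's the weather where I had my last meeting?",
    ["Search memory for last meeting location", "Get weather"],
    {"#E1": "Met the team in Lisbon.", "#E2": "[tool error: timeout]"},
)
print("#E2 = [tool error: timeout]" in prompt)  # True
```

The error lands in the prompt like any other evidence, so the Solver can route around it instead of the loop spinning on a retry.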
From the paper's failure analysis:
| Failure Type | ReAct | ReWOO |
|---|---|---|
| Token Overflow | 18% | 0% |
| Bad Reasoning | 76% | 51% |
You trade some "answer miss" errors for complete elimination of token overflow. On constrained hardware, that's a trade I'll take.
What's Next
This is the architecture I'm committing to. The implementation roadmap:
- Core pipeline — Planner, Worker, Solver classes
- Integration — hook into conversation mode, add a UI toggle
- Optimization — tune prompts for my specific model, add caching
Target: 30-60 seconds for a 3-step reasoning task. Not instant, but acceptable when you're asking something that genuinely requires thinking.
I'll write more as I build it. The first thing I tackled was getting one model to play all three roles — same weights, different prompts.