DIY Local Voice Assistant: Building a Raspberry Pi 5 AI Device

I'm building a voice assistant that runs entirely on a Raspberry Pi 5. No cloud. No API calls. No data leaving the device.
This post isn't about the AI models (I've written about the voice stack and LLM optimization separately). This is about the product — the hardware decisions, the software architecture, and what I've learned building something that people actually touch.
Why build hardware
I was tired of building AI things that lived in terminal windows. I wanted something physical. Something my family could walk up to and talk to without opening a laptop.
The constraint was privacy. Everything said to this device stays on this device. Local speech-to-text, local language models, local storage. No cloud fallback.
The hardware
| Part | Spec | Cost |
|---|---|---|
| Raspberry Pi 5 | 8 GB RAM, 4x Cortex-A76 | ~$60 |
| Touchscreen | 5" 800x480 IPS | ~$30 |
| Microphone | HyperX USB condenser | ~$30 |
| Cooling | Active fan + heatsink | ~$8 |
| Case | 3D printed + standoffs | ~$5 |
| SD card | 64 GB A2 | ~$10 |
| Total | | ~$143 |
The Pi 5 was a given — it's the only single-board computer with enough RAM and CPU to run a 1.2B parameter model at acceptable speed. The 4 GB version would work with compromises, but 8 GB lets you load two models simultaneously.
The touchscreen size matters. 800x480 is small. Every pixel counts. No room for decorative UI — everything on screen needs to be functional. This constraint turned out to be a feature. It forced me to simplify.
The microphone matters more than I expected. The Pi 5 has no audio input at all: the 3.5mm jack is gone, and even on earlier models it was output-only. A USB condenser mic with decent noise rejection made transcription accuracy jump noticeably over a cheap lapel mic.
The software architecture
The app is built in Pygame. Yes, Pygame — the game library. Sounds weird, but for a touchscreen kiosk app on a Pi, it's ideal:
- Direct framebuffer access (no X11 or Wayland needed)
- Touch events map to mouse events
- Low overhead
- Python-native (everything else in the stack is Python)
I tried PyQt and Kivy first. Both were heavier than necessary for what's fundamentally a single-screen kiosk app. Pygame at 30 FPS is smooth enough for text rendering and button presses.
The five modes
The UI has five tabs across the top of the screen:
HOME — System status. CPU/RAM usage (via psutil), model info, quick actions. The device's dashboard.
SCRIBE — Pure transcription. Press record, speak, see text. No AI processing. Useful for taking notes without the latency of LLM inference.
COMPOSE — Transcription + AI clarification. Speak rough thoughts, then the Instruct model rewrites them based on a mode:
- Summarize — condense
- Outline — structure into bullets
- Generate — expand into full text
Each mode has verbosity (Concise/Balanced/Detailed) and creativity (Low 0.1 / Mid 0.3 / High 0.6) controls. These map directly to temperature and max_tokens on the LLM.
CONVERSE — Open-ended conversation with memory. This is the core product. You talk, the device responds, and it remembers what you've discussed. Two toggles:
- Tools — enables search, calculator, memory recall
- Reasoning — switches to the Thinking model for step-by-step reasoning
VAULT — Browse saved conversations and notes. Everything stored as markdown files organized by date and mode.
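The COMPOSE controls translate into sampling parameters in a straightforward way. A minimal sketch: the temperatures come from the creativity settings above, but the max_tokens values and function name are illustrative placeholders, not the app's actual numbers.

```python
# Map the COMPOSE controls to LLM sampling parameters.
# Temperatures (0.1 / 0.3 / 0.6) are the values from the post;
# the max_tokens numbers are illustrative placeholders.
CREATIVITY_TEMP = {"low": 0.1, "mid": 0.3, "high": 0.6}
VERBOSITY_TOKENS = {"concise": 128, "balanced": 256, "detailed": 512}

def sampling_params(verbosity: str, creativity: str) -> dict:
    """Translate UI settings into the kwargs passed to the LLM call."""
    return {
        "temperature": CREATIVITY_TEMP[creativity.lower()],
        "max_tokens": VERBOSITY_TOKENS[verbosity.lower()],
    }

params = sampling_params("Balanced", "High")
# params == {"temperature": 0.6, "max_tokens": 256}
```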
The recording interaction
I went back and forth on this for weeks. Two options:
Hold-to-record: Press and hold, release to stop. Natural for walkie-talkie style. Problems: your thumb gets tired, and you can't look at the screen while pressing.
Tap-to-toggle: Tap once to start, tap again to stop. More accessible. Problems: people forget to tap stop. The device keeps recording silence.
Solution: make it a user preference. The recording mode selector is in settings. Default is Hold, but you can switch to Tap. Different people prefer different things. Let them choose.
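Both interactions fit in one tiny state machine; only the press/release handling differs. A hypothetical sketch (the real app wires this to Pygame touch events):

```python
class Recorder:
    """Recording state machine with two user-selectable modes.

    mode="hold": record only while the button is held down.
    mode="tap":  first tap starts recording, second tap stops it.
    Illustrative sketch; class and method names are not from the app.
    """

    def __init__(self, mode: str = "hold"):  # "hold" is the default
        self.mode = mode
        self.recording = False

    def on_press(self) -> None:
        if self.mode == "hold":
            self.recording = True
        else:  # tap-to-toggle flips state on each press
            self.recording = not self.recording

    def on_release(self) -> None:
        if self.mode == "hold":
            self.recording = False
        # tap mode ignores release events entirely
```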
The waveform visualizer
While recording, the screen shows a real-time waveform of the audio input. This wasn't in the original plan — I added it because without visual feedback, people weren't sure if the device was listening. They'd start over mid-sentence thinking it froze.
Small thing. Big difference in usability. When users can see their voice moving on the display, they trust that the device is listening.
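The visualizer only needs per-bar peak levels, not a full FFT. A minimal sketch, assuming 16-bit mono PCM chunks; the function name and bar count are illustrative:

```python
import array

def waveform_levels(pcm_bytes: bytes, bars: int = 32) -> list[float]:
    """Reduce a chunk of 16-bit mono PCM to per-bar peak levels in [0, 1].

    The renderer would call this on each audio callback and draw one
    vertical bar per level.
    """
    samples = array.array("h", pcm_bytes)  # signed 16-bit samples
    if not samples:
        return [0.0] * bars
    step = max(1, len(samples) // bars)
    levels = []
    for i in range(0, step * bars, step):
        chunk = samples[i:i + step]
        peak = max((abs(s) for s in chunk), default=0)
        levels.append(peak / 32768.0)  # normalize to 0..1
    return levels
```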
The vault: markdown all the way down
Every conversation is saved as a markdown file with YAML front matter:
```
---
date: 2026-02-14 10:30:00
mode: converse
turns: 3
abstract_summary: |
  Discussed meal planning for the week...
extractive_summary: |
  - Decided on pasta for Monday
  - Need to buy groceries Saturday
---

# Converse Session

## Raw Input
What should we have for dinner this week?

## Output
Here are some ideas based on what's in season...
```

The file structure:
```
~/tuon_vault/
├── conversations/        # Timestamped markdown files
├── notes/
│   ├── projects/
│   ├── ideas/
│   └── reference/
└── memory/
    ├── context.md        # Long-term memory
    └── recent_context.md # Recent interactions
```
Why markdown? Because it's readable without the device. I can SSH in and cat any conversation. I can search with grep. I can back up the whole vault by copying a directory. No database. No proprietary format. Just text files.
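Writing a session file takes a few lines of stdlib Python. A hypothetical helper mirroring the example above; the summary fields are filled in later by the background summarizer, so they are omitted here, and the filename scheme is illustrative:

```python
from datetime import datetime
from pathlib import Path

def save_session(vault: Path, mode: str, raw: str, output: str) -> Path:
    """Write one session as a markdown file with YAML front matter.

    Hypothetical helper; field names mirror the example file above.
    """
    now = datetime.now()
    text = (
        "---\n"
        f"date: {now:%Y-%m-%d %H:%M:%S}\n"
        f"mode: {mode}\n"
        "---\n\n"
        f"# {mode.title()} Session\n\n"
        f"## Raw Input\n{raw}\n\n"
        f"## Output\n{output}\n"
    )
    path = vault / "conversations" / f"{now:%Y%m%d-%H%M%S}-{mode}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Because it is plain text, backing up or grepping the vault needs nothing beyond the filesystem.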
Two-part summarization
After a conversation ends, the system runs a background summarization (using the Instruct model) that produces two summaries:
- Abstract — 1-3 sentences. "What was this conversation about?"
- Extractive — Key facts, decisions, numbers. "What specific things were said?"
When a new conversation starts, the system injects recent summaries (not full transcripts) into the context. This gives the LLM conversational memory in about 4K characters instead of 400K characters of raw history.
Not perfect memory. The model sometimes misses details that were in a summary but not emphasized. But it's dramatically better than no memory, and it fits comfortably in a 65K context window.
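The injection step is essentially a character-budgeted concatenation of recent summaries. A sketch under the assumption that summaries are stored oldest-first; the function name and default budget are illustrative:

```python
def build_memory_context(summaries: list[str], budget: int = 4000) -> str:
    """Concatenate recent session summaries, newest first, until the
    character budget is spent.

    Sketch of the idea described above; the real system reads these
    from the vault's memory/ files.
    """
    parts: list[str] = []
    used = 0
    for summary in reversed(summaries):  # newest summary is stored last
        if used + len(summary) > budget:
            break  # stop before blowing the context budget
        parts.append(summary)
        used += len(summary)
    return "\n\n".join(parts)
```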
Lessons from building voice AI hardware
Touch targets need to be huge. My first UI had small buttons like a phone app. But this device often sits on a counter and gets poked with a finger from arm's length. Minimum touch target: 48 pixels tall. I ended up at 60+ for most buttons.
Streaming matters more in hardware. On a web app, users are used to waiting. On a physical device, a blank screen after speaking feels broken. Streaming tokens to the display — even at 10-20 tokens per second — makes the interaction feel alive.
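Streaming the display needs little more than a token buffer plus wrapped lines for the renderer to draw each frame. A minimal sketch; the class name, width, and line count are illustrative, not the app's actual values:

```python
import textwrap

class StreamingDisplay:
    """Accumulate LLM tokens as they arrive and expose the wrapped
    lines the renderer should draw each frame.

    Illustrative sketch; the real app draws these lines at 30 FPS.
    """

    def __init__(self, width: int = 40, max_lines: int = 6):
        self.width = width
        self.max_lines = max_lines
        self.text = ""

    def push(self, token: str) -> None:
        """Append one streamed token to the buffer."""
        self.text += token

    def visible_lines(self) -> list[str]:
        """Wrap the buffer and keep only the newest lines (scrolling)."""
        lines = textwrap.wrap(self.text, self.width) or [""]
        return lines[-self.max_lines:]
```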
Audio feedback is overrated. I spent a week adding TTS output. Turned it off after two days. Reading is faster than listening, and the device's speaker wasn't great. The touchscreen is the output. Save the complexity.
Cooling is non-optional. Sustained inference on all four cores pushes the Pi past 80 °C. Without active cooling, the CPU throttles and inference speed drops about 40%. An $8 fan-and-heatsink combo solved it.
People try things you don't expect. My wife's first question to the device: "What's my name?" It didn't know. That became the catalyst for the memory system. Users don't care about your architecture — they care about whether the thing is useful.
What's next for the voice assistant
Right now this is a working prototype I use daily. It handles voice memos, quick calculations, conversation-based thinking, and acts as a kitchen assistant that doesn't phone home to anyone.
Next: better model routing, fine-tuning experiments for specific use cases, and eventually a form factor that doesn't look like a science project on the counter.
The bet is that good-enough AI running locally, with real privacy, is more interesting than perfect AI running in someone else's data center. So far, that bet is holding.