Raspberry Pi 5 Local Voice Assistant: A Whisper and Liquid LFM AI Stack

Here's what's running simultaneously on my Raspberry Pi 5 right now: a speech-to-text model, two language models, a touchscreen UI, and a conversation memory system. Total cost: ~$180 in hardware.

This post is about the voice stack — how audio goes from microphone to transcription to LLM response, what fits in memory, and where the sharp edges are.

The hardware

  • Raspberry Pi 5, 8GB RAM
  • 800x480 touchscreen (Pygame, 30 FPS)
  • HyperX USB condenser mic (device index 2, with fallback detection)
  • Active cooling (required — inference gets warm)

That's it. No GPU. No NPU. Everything runs on four Cortex-A76 cores.

The software stack

Three main components share the Pi's resources:

Component        Library            Model                          Memory
Speech-to-text   faster-whisper     base.en (int8)                 ~75 MB
LLM (Instruct)   llama-cpp-python   LFM2.5-1.2B-Instruct Q5_K_M    ~850 MB
LLM (Thinking)   llama-cpp-python   LFM2.5-1.2B-Thinking Q5_K_M    ~850 MB

Both LLMs are Liquid AI's LFM2.5-1.2B family. The Instruct variant handles chat and clarification. The Thinking variant handles reasoning and tool use. Both load at startup and stay resident.

Total model memory: ~1.75 GB out of 8 GB available. That leaves headroom for the OS, Pygame, PyAudio, the vault system, and KV cache.

Audio capture

Recording is straightforward PyAudio:

import pyaudio

RATE = 16000      # 16 kHz — what Whisper expects
CHANNELS = 1      # mono
FORMAT = pyaudio.paInt16  # 16-bit PCM
CHUNK = 1024      # frames per buffer

The app records while you hold the button (or tap to toggle — user preference). Audio saves to phrase.wav as a standard WAV file. Nothing fancy. WAV is lossless and Whisper wants 16kHz mono anyway.

One thing that tripped me up: microphone detection. The HyperX mic shows up as device index 2, but that can shift if you plug in other USB devices. So the app tries index 2 first, then falls back to scanning for any available input device.
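The fallback can be sketched as a small helper that queries PyAudio's device table, preferring index 2 (the function name is mine):

```python
PREFERRED_INDEX = 2  # where the HyperX mic usually lands

def find_input_device(pa, preferred=PREFERRED_INDEX):
    """Return a usable input device index: the preferred one if it still
    accepts input, otherwise the first device that does."""
    def is_input(idx):
        try:
            info = pa.get_device_info_by_index(idx)
            return info.get("maxInputChannels", 0) > 0
        except (IOError, OSError):
            return False  # index no longer exists

    if is_input(preferred):
        return preferred
    for idx in range(pa.get_device_count()):
        if is_input(idx):
            return idx
    raise RuntimeError("no input device found")
```

Pass it a pyaudio.PyAudio() instance; the returned index goes into input_device_index when opening the stream.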

Transcription

from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cpu", compute_type="int8")

segments, _ = model.transcribe(
    "phrase.wav",
    beam_size=1,
    language="en",
    vad_filter=True,
    hotwords=vault_vocabulary,  # domain-specific terms from VOCAB.md
)

Key choices:

  • base.en — the English-only base model. Faster than small, good enough for clear speech with a decent mic. The int8 quantization keeps it under 75 MB.
  • beam_size=1 — greedy decoding. Faster, and for voice input the quality difference from beam search is negligible.
  • vad_filter=True — Voice Activity Detection. Strips silence, which matters when people pause mid-thought.
  • hotwords — domain-specific vocabulary loaded from ~/tuon_vault/VOCAB.md. One word per line. Whisper uses these to bias toward correct transcriptions of project-specific terms.

Transcription takes about 1-2 seconds for a typical voice query on CPU. Not instant, but acceptable for a "speak, wait, get answer" flow.

The spelled-out word problem

Voice input has a quirk: sometimes you need to spell things out. "Capital D, O, Capital D, A" should become "DoDA", not "doda" or "d o d a."

I built a post-processor that detects capitalization markers and letter sequences, then collapses them into the intended word. Regex-based, handles hyphenated sequences too. Not elegant, but it works for the 95% case.
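
A simplified sketch of the idea (this is not the exact post-processor; the regex here handles "capital X" markers plus comma-, space-, or hyphen-separated single letters, and everything beyond that is my assumption):

```python
import re

# A run of spelled-out letters: optional "capital" marker, then one letter,
# separated by commas, spaces, or hyphens, e.g. "capital t, u, o, n"
_SPELLED = re.compile(
    r"\b((?:capital\s+)?[a-zA-Z](?:[\s,-]+(?:capital\s+)?[a-zA-Z]\b){1,})",
    re.IGNORECASE,
)

def collapse_spelling(text):
    """Collapse spelled-out sequences like 'capital t, u, o, n' into 'Tuon'."""
    def join(match):
        letters = []
        for token in re.findall(r"(?:capital\s+)?[a-zA-Z]\b",
                                match.group(1), re.IGNORECASE):
            letter = token[-1]
            is_upper = token.lower().startswith("capital")
            letters.append(letter.upper() if is_upper else letter.lower())
        return "".join(letters)
    return _SPELLED.sub(join, text)
```

The word-boundary anchors keep it from firing inside normal words, though isolated single-letter pairs ("a, b") will still collapse; that is the 5% the real version has to care about.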

LLM inference

Both models use the same configuration:

from llama_cpp import Llama

llm = Llama(
    model_path=model_path,
    n_ctx=65536,    # 65K token context window
    n_threads=4,    # all four Cortex-A76 cores
    verbose=False,
)

Inference parameters vary by mode:

Mode                Temperature   Max tokens   Model
Converse (chat)     0.1           8192         Instruct or Thinking
Clarify (rewrite)   0.1           2048         Instruct
Reasoner phase      0.1           1024         Thinking

Everything streams. The UI shows tokens as they arrive. Stop token is <|im_end|> (ChatML format).
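
With llama-cpp-python, streaming means passing stream=True and pulling content deltas out of each chunk. A sketch (the function and callback names are mine):

```python
def stream_reply(llm, messages, on_token, temperature=0.1, max_tokens=8192):
    """Stream a chat completion token by token so the UI can render text
    as it arrives instead of waiting for the full response."""
    reply = []
    for chunk in llm.create_chat_completion(
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        stop=["<|im_end|>"],   # ChatML end-of-turn marker
        stream=True,
    ):
        delta = chunk["choices"][0]["delta"]
        token = delta.get("content")
        if token:
            on_token(token)    # e.g. append to the Pygame text surface
            reply.append(token)
    return "".join(reply)
```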

The LLM lock

Both models are loaded simultaneously, but only one runs at a time. An LLM lock prevents concurrent access:

# Background summarization yields to user interaction
with self.llm_lock:
    # Only one LLM call at a time
    response = llm.create_chat_completion(...)

Background tasks (like conversation summarization) check the lock before starting and yield immediately if the user starts talking. User interaction always wins. This was a hard lesson — early versions would block on a background summary and the UI would freeze for 10 seconds. Users don't care about your background tasks. They care about responsiveness.
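
The yielding behavior comes down to two moves: background work acquires the lock non-blockingly, and it re-checks a flag between steps. A sketch of the pattern (class and method names are mine, not the app's actual API):

```python
import threading

class LLMGate:
    """Background work takes the lock only if it's free and bails out
    between steps; user requests block until they get the lock."""
    def __init__(self):
        self.llm_lock = threading.Lock()
        self.user_waiting = threading.Event()

    def run_user_request(self, fn):
        self.user_waiting.set()        # tell background work to yield
        with self.llm_lock:            # the user is willing to wait briefly
            self.user_waiting.clear()
            return fn()

    def run_background_task(self, steps):
        # Don't even start if someone holds the lock
        if not self.llm_lock.acquire(blocking=False):
            return False
        try:
            for step in steps:         # e.g. summarize one chunk at a time
                if self.user_waiting.is_set():
                    return False       # yield to the user mid-task
                step()
            return True
        finally:
            self.llm_lock.release()
```

Splitting the background job into small steps is what bounds the worst-case wait: the user never blocks longer than one step plus one LLM call.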

The memory budget in practice

On a fresh boot with both models loaded:

What                             Estimated RAM
OS + system services             ~500 MB
Whisper base.en (int8)           ~75 MB
LFM2.5-1.2B Instruct (Q5_K_M)    ~850 MB
LFM2.5-1.2B Thinking (Q5_K_M)    ~850 MB
KV cache (active context)        500 MB - 2 GB
Pygame + PyAudio + app           ~100 MB
Total                            ~3-4.5 GB

That leaves 3.5-5 GB free on an 8 GB Pi. Comfortable. The KV cache is the variable — it grows with conversation length. At 65K context fully utilized, it can eat 2+ GB. In practice, conversations rarely fill the window.

With a 4 GB Pi, you'd need to drop to one model or use Q4_0 quantization. Doable, but tight. 8 GB gives breathing room.

What I'd do differently

Start with Q5_K_M, not Q4_0. I started with Q4_0 because the Liquid AI benchmarks use it. But Q5_K_M gives noticeably better output quality — especially for instruction following — and fits fine with 8 GB. The speed difference is minimal.

Use faster-whisper, not whisper.cpp. I originally tried the C++ Whisper implementation for speed. faster-whisper (CTranslate2 backend) was easier to integrate with the Python app and int8 quantization makes it fast enough. Save yourself the build headaches.

Don't load models lazily. Load everything at startup. The 15-second boot time is worth it to avoid 5-second pauses when you switch modes mid-conversation.

The full flow

1. User presses record button
2. PyAudio captures 16kHz mono audio → phrase.wav
3. faster-whisper transcribes (beam=1, VAD, hotwords)  ~1-2s
4. Text goes to LLM (Instruct or Thinking based on mode)
5. LLM streams response to touchscreen UI              ~5-15s
6. Conversation saved to vault as markdown
7. Background summarization runs (if idle)

End-to-end: about 8-20 seconds from releasing the record button to seeing a complete response. Not instant. But for a fully local, fully private voice assistant on ~$180 in hardware, it works.

The llama.cpp optimization choices that make this possible are their own post.