Voice Transcription Market: Current State and the Missing Piece

Voice dictation is finally usable. The core technical hurdles, latency and word error rate, have mostly been solved. But the day-to-day experience of using these tools is still a mess of fragments.

I spent the last year running field research to find the friction points. I built a note-taking app with real-time transcription, then moved to local hardware builds and on-device voice stacks. Shipping these tools exposed a fundamental flaw in the market.

Right now, the user has to adapt to the platform. You get locked into specific ecosystems or forced into awkward workflows just to get words on a screen.

This is a deep dive into the current landscape: the major players, the underlying builder's stack, and the missing piece that still keeps voice from being a universal replacement for the keyboard.

The current state: who's shipping what

Monologue

Monologue (from Every) is a big player. Mac and iOS, 100+ languages, context-aware modes that change how it writes based on the app you're in. Zendesk gets support-speak, Cursor gets code. Personal dictionary, auto formatting, privacy-first (no audio or transcripts saved on servers, screenshots for context deleted immediately). Built on open models. Julien Chaumond from Hugging Face has praised it partly because of that. Ben Tossell (Ben's Bites): "Monologue has replaced Wispr Flow for me." Early bird Pro plans have hovered around $100–$140/year.

Limitation: Mac and iOS only.

Wispr Flow

Wispr Flow is the most mature cross-platform option. Mac, Windows, iPhone, with Android on the waitlist. Claims 4x faster than typing. AI auto-edits: rambled thoughts become clean text, filler words stripped. Personal dictionary, snippet library, tone that adapts to the app. Works in any text field (Notion, Gmail, Cursor, WhatsApp). Free to start, enterprise and team plans available. Case studies from Rahul Vohra (Superhuman), Suzanne Xie (Neo), Jeff Seibert (Digits AI).

Flow is the one that actually works across devices. If you switch from Mac to Windows, you can keep using it.

Hey Lemon

Hey Lemon takes a different angle. It's not just dictation, it's an AI agent that turns voice into completed tasks. Reply to emails, create docs, run searches, all from the fn key. "Press fn, say what you want, watch it get done." Built for knowledge workers juggling apps and messages. Claims you'll type 5x faster, reply to emails 12x faster, and open 70% fewer tabs. Mac only. Very early.

Built-in AI voice

ChatGPT, Claude, and Gemini all have voice modes. You speak, text appears. Useful for long-form thinking and drafts. But they live inside their own apps. Dictate a long email into ChatGPT and you're stuck in that tab until you copy the output out. No universal dictation into every field. They're a separate experience, not a replacement for your keyboard everywhere.

What builders use: the underlying stack

If you're building instead of buying, you're working with a different layer.

whisper.cpp

whisper.cpp is a C/C++ port of OpenAI's Whisper. Plain C/C++, no Python, no PyTorch. Compiles to a single binary. CPU-first (AVX2, NEON, Metal on Apple Silicon), with optional GPU acceleration (CUDA, Vulkan, OpenVINO). Supports quantization. tiny.en uses ~75 MB RAM, large-v3 uses ~4 GB. Runs on Mac, Windows, Linux, iOS, Android, Raspberry Pi, WebAssembly, Docker.

This is what many embedded voice apps use. It's production-grade and widely deployed. But it's not plug and play. You compile, download ggml-format models, wire up bindings (Python, Rust, Swift, etc.), and handle audio capture yourself. The Pi 5 voice stack I built uses faster-whisper, not whisper.cpp. Different tradeoffs, same family.

OpenAI Whisper (original) and the API

The original Whisper is Python + PyTorch. Reference implementation. Heavier to run, needs the Python ecosystem. Good for experimentation and fine-tuning.

The OpenAI Whisper API is the cloud option: $0.006/min, high accuracy on clean audio. Minimal setup. API key, send audio, get text. No model downloads, no local compute. Tradeoff: your audio leaves the device, and you're locked into OpenAI's pipeline.
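The API path really is minimal. A sketch using the official openai Python SDK (the filename is illustrative; the helper just applies the $0.006/min rate above):

```python
def transcribe_cloud(path: str) -> str:
    # Requires the openai SDK and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as f:
        # whisper-1 is the hosted Whisper model
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def whisper_api_cost(minutes: float) -> float:
    # $0.006 per minute of audio
    return round(minutes * 0.006, 4)
```

An hour-long meeting comes out to about $0.36, which is why the cloud option is hard to beat until privacy or offline requirements kick in.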

AssemblyAI, ElevenLabs, and MiniMax

AssemblyAI offers a streaming speech-to-text API that's built for live transcription (voice agents, meetings, accessibility). I've used it for live speech-to-text and had a great experience. Tuon Scribe uses it under the hood. ~300ms word latency, >91% accuracy, $0.15/hr for streaming. Turn-based transcription, customizable end-of-utterance detection, multilingual streaming (English, Spanish, French, German, Italian, Portuguese). Python and JavaScript SDKs. Good fit if you're building a voice assistant or real-time captions.
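For batch jobs, the official assemblyai SDK is a few lines; the streaming product uses a separate realtime client with event callbacks, so treat this as the simplest case. The API key is a placeholder, and the helper applies the $0.15/hr streaming rate above:

```python
def transcribe_batch(path_or_url: str) -> str:
    # Requires the assemblyai SDK and an API key.
    import assemblyai as aai
    aai.settings.api_key = "YOUR_KEY"  # placeholder
    transcript = aai.Transcriber().transcribe(path_or_url)
    return transcript.text

def streaming_cost(hours: float) -> float:
    # $0.15 per hour of streamed audio
    return round(hours * 0.15, 2)
```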

ElevenLabs is best known for text-to-speech and voice cloning, but they also ship Scribe for speech-to-text. Scribe v2 for batch and Scribe v2 Realtime (~150ms latency, 90+ languages, speaker diarization). Popular choice when you want both TTS and STT in one vendor.

MiniMax offers a voice and audio API: text-to-speech (300+ voices, voice cloning, 40+ languages), streaming output, and music generation. It's TTS and audio synthesis, not transcription. But worth naming if you're building a full voice pipeline and need the generation side.

faster-whisper and Distil-Whisper

faster-whisper is a CTranslate2 reimplementation. Same Whisper models, but optimized for inference. Lighter and faster than the PyTorch version. I use base.en (int8) on a Pi 5. ~75 MB, 1–2 seconds per phrase on CPU. Good for resource-constrained builds.
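The setup I run on the Pi 5 is a few lines with the faster-whisper library; the audio filename is illustrative, and the timestamp formatter is a small helper of my own:

```python
def transcribe_local(wav_path: str, model_size: str = "base.en") -> list:
    # Requires the faster-whisper package; int8 keeps base.en around 75 MB.
    from faster_whisper import WhisperModel
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path, beam_size=5)
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]

def fmt_segment(start: float, end: float, text: str) -> str:
    # Render one segment as a timestamped caption line.
    return f"[{start:.1f}s-{end:.1f}s] {text}"
```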

Distil-Whisper is a distilled variant: 6x faster, ~49% smaller, within 1% word error rate of large-v3 on English. English only. Available as Systran/faster-distil-whisper-large-v3 on Hugging Face. For multilingual, you use the full Whisper or large-v3-turbo.

Hugging Face: the wild west

Hugging Face hosts dozens of speech-to-text models: official Whisper checkpoints, faster-whisper variants, Distil-Whisper, Wav2Vec2, Seamless M4T, Kyutai STT, and many community forks. The catalog is huge. The quality and behavior vary wildly.

Here's the catch: it's not plug and play. Different architectures (CTC vs encoder-decoder vs streaming). Different chunking behavior: Wav2Vec2 handles long files well with stride; Seamless M4T has documented issues where chunk length tanks quality (4–5x worse WER). Different formats (PyTorch, CTranslate2, ggml). Different language support. You have to read the model card, check the benchmark, understand the inference library, and test on your own audio. One model might work great for 30-second clips and fall apart on hour-long meetings. Another might hallucinate on silence (Whisper is known to output "Thank you for watching" or "Support me on Patreon" on empty audio). You're assembling a stack, not dropping in a component.
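"Test on your own audio" can be as simple as a word error rate check against a reference transcript you trust. A minimal WER implementation, word-level edit distance with no external dependencies:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Run a candidate model over a handful of your own clips, compare against transcripts you've checked by hand, and you'll know in an afternoon whether a Hugging Face checkpoint is worth building on.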

If you're building something serious, stick to Whisper (original, whisper.cpp, or faster-whisper) or Distil-Whisper for English. The rest are experiments until you've validated them yourself.

The problem: fragmentation and lock-in

All of this is valuable. Monologue's context awareness is sharp. Wispr Flow's cross-platform support is real. Hey Lemon's task-focused approach is a different wedge. But here's what breaks:

Platform lock-in. Monologue is Mac/iOS. Hey Lemon is Mac. Wispr Flow runs on more platforms, but your personal dictionary is stuck inside the app. Switch from Mac to Windows and the dictionary you maintained stays behind. Want to build your own? You're stuck with whatever API or SDK that vendor exposes. There's no universal layer.

Device sprawl. Users report inconsistent behavior across machines; the experience isn't universal. Some tools handle microphones differently on macOS vs Windows. Local vs cloud processing differs by platform too: SayToType, for example, does offline transcription only on Apple Silicon Macs, not Windows. Your setup on one machine doesn't carry over. You adapt to the software; the software doesn't adapt to you.

No single voice layer. We have keyboards that work everywhere. We don't have a voice-equivalent layer that sits between you and every app, that follows you across devices and platforms. Instead we have five good products that don't talk to each other.

The missing piece: a universal input layer

The gap isn't better transcription. Current models are already good enough. The real missing piece is portability. We need a voice experience that isn't locked to one vendor, one OS, or one device.

We need AI hardware that unifies the experience across devices.

To get there, the focus needs to shift from software to tools that sit directly between users and machines. This requires:

  1. Hardware-level integration: Using USB HID so the device speaks the same peripheral standard as any keyboard and works on any host without drivers.
  2. Decoupled intelligence: A system where the user chooses the model based on the task, not the app constraints. Local for privacy, cloud for scale.
  3. Persistent identity: A voice interface that recognizes your technical shorthand regardless of the computer you are using.
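To make the first point concrete: a dictation device that enumerates as a USB HID boot keyboard can "type" transcribed text into any host OS with zero drivers, which is exactly the platform-agnosticism a keyboard has. Keycodes come from the published HID Usage Tables ('a'-'z' are 0x04-0x1D, '1'-'9' are 0x1E-0x26, '0' is 0x27). A sketch that builds the standard 8-byte keyboard report for one character, letters, digits, and space only for brevity:

```python
def hid_report(char: str) -> bytes:
    """Build an 8-byte USB HID boot-keyboard report for one character."""
    SHIFT = 0x02  # left-shift modifier bit
    mod = 0
    if char.isalpha():
        if char.isupper():
            mod = SHIFT
        code = 0x04 + (ord(char.lower()) - ord("a"))
    elif char.isdigit():
        code = 0x27 if char == "0" else 0x1E + (ord(char) - ord("1"))
    elif char == " ":
        code = 0x2C
    else:
        raise ValueError(f"no mapping for {char!r}")
    # Report layout: [modifiers, reserved, keycode1..keycode6]
    return bytes([mod, 0, code, 0, 0, 0, 0, 0])
```

The firmware side (presenting as a HID device, pacing keystrokes) is real engineering, but the protocol itself is this simple, and every OS already understands it.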

Software provided the foundation. But for voice to become a legitimate replacement for typing, we need a device as platform-agnostic as a standard keyboard.