# Custom Blocks in Obsidian: The Scribe Block

Obsidian gives you markdown, code blocks, and embeds. That's it. No custom block types. No component model. No "insert a widget here" API.
So when I wanted a block that could record audio, display a live transcript, and let you summarize it in place — all inside a regular note — I had to build it out of parts Obsidian already has.
The result is the Scribe block. It works. But the block itself is just plumbing. What makes the plugin actually useful are two features bolted onto it: a plain-text vocabulary file that teaches the speech API your jargon, and a selection overlay that makes AI edits feel like normal typing.
## The block trick
Obsidian lets plugins register a MarkdownCodeBlockProcessor for any language tag. So I invented one: tuon-voice. The registration is one call:
```ts
this.registerMarkdownCodeBlockProcessor("tuon-voice", (source, el, ctx) => {
  // source = the raw YAML inside the fence
  // el = the DOM element to render into
  // ctx = file path, addChild, etc.
  renderScribeBlock({ source, el, sourcePath: ctx.sourcePath });
});
```

When the renderer hits a fenced code block with that tag, my plugin takes over and renders a UI instead of syntax-highlighted text.
Inside the fence is just YAML metadata. Block ID, title, timestamps, recording mode. It's lightweight and human-readable if you switch to source view. The actual content — transcript, summary, prettified text — lives elsewhere in the same note as a hidden <div> with a matching ID.
In source mode, the note looks like this:
```tuon-voice
id: a1b2c3d4-...
title: Scribe
createdAt: 2026-03-03T12:00:00.000Z
recordingMode: stream
```
<div id="tuon-data-a1b2c3d4-..." style="display:none">
{"version":1,"transcript":"SGVsbG8gd29ybGQ=","summary":"...","pretty":"..."}
</div>

The long strings are base64-encoded so the raw markdown doesn't become an unreadable wall of text (transcripts get long). Dumping raw JSON (with quotes and newlines) inside a raw HTML `<div>` also breaks Obsidian's markdown parser and puts you in escaping hell. Base64 isn't for security; it's just to give the parser a safe string it won't choke on. The plugin finds the div by ID, decodes the string, and populates the UI.
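The encode/decode step is only a few lines. A sketch with hypothetical helper names (not the plugin's actual API); plain `btoa` chokes on non-Latin-1 characters, so the text goes through `TextEncoder` first:

```typescript
// Hypothetical helpers — the real plugin's names may differ.
// btoa/atob only handle Latin-1, so the text is converted to
// UTF-8 bytes before base64 encoding.
function encodeField(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

function decodeField(b64: string): string {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}
```

Round-tripping through bytes is what keeps accented names and emoji in transcripts from corrupting the stored payload.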
Why not put everything inside the code fence? I tried that first. It broke immediately. Obsidian re-renders a code block every time its content changes. If you update a transcript mid-recording, you trigger an infinite render loop. Separating the stable metadata (the fence) from the volatile content (the storage div) fixed it.
Why not front matter? Front matter is per-file. A meeting note might have three separate recordings for three different agenda items. Each needs its own transcript and summary.
So: fence for identity, hidden div for data, plugin for UI. It's a hack. But it composes perfectly with the rest of Obsidian.
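Updating a block then reduces to string surgery on the note's source. A sketch, with an invented helper name and a regex that assumes the div layout shown above:

```typescript
// Hypothetical helper: swap the JSON payload stored in a block's hidden div.
// `noteText` is the full markdown source; `blockId` matches the fence's id
// (UUIDs are alphanumeric plus hyphens, so no regex escaping is needed).
function replaceBlockData(noteText: string, blockId: string, json: string): string {
  const pattern = new RegExp(
    `(<div id="tuon-data-${blockId}"[^>]*>)[\\s\\S]*?(</div>)`
  );
  return noteText.replace(pattern, `$1\n${json}\n$2`);
}
```

Because the fence's metadata never changes during a recording, only this div gets rewritten, which is what avoids the re-render loop.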
## Teaching the speech API your words
Generic speech-to-text mangles names. It doesn't know your coworkers, your project names, or your acronyms. Every time I said "Tuon", the AI wrote "tune" or "two on."
AssemblyAI supports keyterms — a list of words you send with the streaming config to bias the recognition engine. The question was where to store that list. A settings panel? A JSON config file? Both feel disconnected from the actual writing process.
I put it right in the vault. A standard markdown file: `scribe/VOCAB.md`.
One term per line. The plugin creates it on first load and reads it before every transcription session. I capped it at 100 terms and 50 characters each, because the API has limits and a massive vocabulary just introduces noise.
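For reference, a VOCAB.md might look like this (terms invented for illustration) — plain lines and list items count; markdown structure doesn't:

```markdown
# Scribe vocabulary

Tuon
AssemblyAI
- OpenRouter
<!-- notes to self are ignored -->
```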
When the plugin connects to AssemblyAI, it maps this list directly to the API's word_boost parameter and sets the boost weight to "high". That bridges the gap between a simple local text file and the recognition engine actually prioritizing your jargon.
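A sketch of wiring the list into the streaming session. I'm assuming AssemblyAI's real-time WebSocket endpoint, where `word_boost` is passed as a JSON-encoded query parameter; check the current API docs before copying:

```typescript
// Sketch: build the streaming URL with the vocab list boosted.
// Endpoint shape assumed from AssemblyAI's real-time docs; verify
// against the current API version before relying on it.
function buildRealtimeUrl(terms: string[], sampleRate = 16000): string {
  const params = new URLSearchParams({
    sample_rate: String(sampleRate),
    word_boost: JSON.stringify(terms), // JSON array, URL-encoded
  });
  return `wss://api.assemblyai.com/v2/realtime/ws?${params}`;
}
```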
The workflow is simple. You dictate something. The transcript says "tune" instead of "Tuon." You highlight the wrong word, right-click, and hit "Add to vocab." Done. Next time you record, AssemblyAI gets it right.
No separate admin panel. No import/export. It lives next to your notes because it's part of your knowledge base, not part of the plugin's configuration.
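The "Add to vocab" action itself is just a guarded append to the file. A sketch (the helper name is mine, not the plugin's):

```typescript
// Hypothetical helper: append a term to the vocab file's contents,
// skipping duplicates and over-long terms.
function addVocabTerm(content: string, term: string): string {
  const cleaned = term.trim();
  if (!cleaned || cleaned.length > 50) return content;
  const existing = new Set(
    content.split(/\r?\n/).map((l) => l.trim().toLowerCase())
  );
  if (existing.has(cleaned.toLowerCase())) return content;
  const sep = content.endsWith("\n") || content === "" ? "" : "\n";
  return content + sep + cleaned + "\n";
}
```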
The parser is about twenty lines. It walks the file, skips markdown structure, and collects the terms:
````ts
function parseVocabTerms(content: string): string[] {
  const terms: string[] = [];
  const seen = new Set<string>();
  let inCodeFence = false;
  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line) continue;
    if (line.startsWith("```")) { inCodeFence = !inCodeFence; continue; }
    if (inCodeFence) continue;
    if (line.startsWith("#") || line.startsWith(">") || line.startsWith("<!--")) continue;
    const term = normalize(line);
    if (!term || term.length > 50 || seen.has(term)) continue;
    seen.add(term);
    terms.push(term);
    if (terms.length >= 100) break; // the API cap mentioned above
  }
  return terms;
}

// Strip list markers ("-", "*", "1.") and collapse internal whitespace.
function normalize(line: string): string {
  return line.replace(/^(?:[-*+]|\d+\.)\s+/, "").replace(/\s+/g, " ").trim();
}
````

Skip headings, blockquotes, comments, and code fences. Everything else is a term. Dedupe it, enforce the caps, and ship it to the API. Small surface area, massive difference in transcript quality.
## In-place AI: the selection overlay
AI commands that dump text into a sidebar are useless. The whole point of working in Obsidian is that your notes are right in front of you. You don't want to copy-paste a polished summary of a 30-minute design sync from a chat window back into the middle of your meeting note. The AI should edit in place, like a collaborator.
You select text and hit "Summarize" from the command palette. The plugin immediately:
- Captures the selection's start and end offsets.
- Shows a visual overlay on that specific text ("Summarizing...").
- Fires the request to OpenRouter (specifically routed to a fast model because inline UI requires speed to feel native; waiting 5 seconds for a larger model breaks the illusion).
When the response comes back, it checks if the text at those offsets still matches the original selection:
```ts
// Before the request: capture the exact range
const fromOffset = editor.posToOffset(editor.getCursor("from"));
const toOffset = editor.posToOffset(editor.getCursor("to"));
const originalText = editor.getSelection().trim();
const hideOverlay = showSelectionOverlay(editorView, fromOffset, toOffset, "Summarizing...");

// ... AI request happens ...

// After the response: check if the range is still intact
const currentText = editor.getRange(
  editor.offsetToPos(fromOffset),
  editor.offsetToPos(toOffset)
).trim();

if (currentText === originalText) {
  // Safe to replace in place
  editor.replaceRange(result, editor.offsetToPos(fromOffset), editor.offsetToPos(toOffset));
} else {
  // Selection changed — don't overwrite the wrong text
  editor.replaceSelection(result);
  new Notice("Selection changed; inserted at cursor instead.");
}
hideOverlay();
```

If you didn't touch anything, it replaces the text in place. If you edited the note while waiting, it gracefully inserts the result at your cursor and warns you. No silent data loss. No overwriting the wrong paragraph.
The visual overlay is a CodeMirror StateField with two effects: show and hide.
```ts
import { StateEffect, StateField } from "@codemirror/state";
import { Decoration, EditorView } from "@codemirror/view";

const showOverlay = StateEffect.define<{ from: number; to: number; label: string }>();
const hideOverlay = StateEffect.define<void>();

const overlayField = StateField.define({
  create: () => ({ overlay: null, decorations: Decoration.none }),
  update(value, tr) {
    let overlay = value.overlay;
    for (const effect of tr.effects) {
      if (effect.is(showOverlay)) overlay = effect.value;
      else if (effect.is(hideOverlay)) overlay = null;
    }
    if (overlay) {
      // Map positions through any edits that happened while we were waiting
      const from = tr.changes.mapPos(overlay.from, 1);
      const to = tr.changes.mapPos(overlay.to, -1);
      return {
        overlay: { ...overlay, from, to },
        decorations: Decoration.set([
          Decoration.mark({ class: "selection-overlay" }).range(from, to),
        ]),
      };
    }
    return { overlay: null, decorations: Decoration.none };
  },
  provide: (f) => EditorView.decorations.from(f, (v) => v.decorations),
});
```

The magic happens at `tr.changes.mapPos`. That's CodeMirror tracking position shifts from intermediate edits. If you type a new paragraph above the selection while the AI request is in flight, the overlay moves down with the text.
This sounds like a lot of machinery just to show a loading state. But without it, users don't trust the feature. They select text, run a command, and then nervously click around wondering if it broke. The visual overlay is a trust signal. It says "I have this exact text, I'm working on it, and I'll put the result right here." That's what gets people to actually use it daily.
## Behaviors > Containers
The Scribe block alone is a parlor trick. It's just a custom renderer.
The vocabulary file is what makes the transcript accurate enough to keep. The selection overlay is what makes AI cleanup fast enough to bother with. Without either of those, the block is just a clunky dictation tool. With them, it becomes a surface where you can talk, review, clean up, and move on.
If you're building custom UI inside Obsidian, the lesson isn't "use a code fence and a hidden div." Those are just implementation details. The lesson is that the container is only as good as the workflows attached to it.