Handling Chain-of-Thought for GPT-OSS-120b: Harmony Format and Provider Quirks

OpenAI's GPT-OSS-120b is an open-weight reasoning model available on OpenRouter. It supports tools and chain-of-thought, but it doesn't speak the same language as Claude or o4-mini. It uses OpenAI's Harmony format, a structured token protocol that separates reasoning from user-facing output. If you treat it like other reasoning models, raw "analysis:" prefixes leak into your UI and reasoning text ends up mixed into the visible answer.

Personally, I love working with the GPT-OSS models. They follow instructions well and they are snappy. Aside from the structural quirks, I think they are great all-around models.

Here's what Harmony is, why it exists, and how we handle GPT-OSS models.

What is Harmony format?

Harmony is OpenAI's response format for the GPT-OSS family. It was built so open-weight models can behave like the Responses API: structured messages, clear roles, and explicit channels for different kinds of output.

From the OpenAI Harmony cookbook:

"GPT-OSS should not be used without using the harmony format, as it will not work correctly."

Harmony isn't just a formatting convention. It's a token protocol. The model is trained to emit special tokens like <|start|>, <|channel|>, <|message|>, <|end|>, <|return|>, and <|call|>. Those tokens define message boundaries and output channels.

Channels: analysis, commentary, final

Assistant output is split into three channels:

Channel | Purpose
analysis | Chain-of-thought. Internal reasoning. Not held to the same safety standards as final output, so don't show this to users.
commentary | Tool calls, preambles, "I'm about to do X."
final | The answer the user should see.

The model streams into analysis first, then emits a final message. The CoT lives in analysis; the answer lives in final.
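To make the channel split concrete, here is a minimal sketch of how a raw Harmony completion could be separated into channels. It is illustrative only: parseHarmony and visibleText are hypothetical names, the sample string is hand-written, and the pattern deliberately ignores recipients, constraints, and tool-call payloads that a real Harmony renderer has to handle. Through OpenRouter you never see these tokens directly.

```typescript
// Simplified sketch: split a raw Harmony completion into per-channel messages.
// Assumption: messages look like <|channel|>NAME<|message|>TEXT<|end|>.
type HarmonyChannel = 'analysis' | 'commentary' | 'final';

interface HarmonyMessage {
    channel: HarmonyChannel;
    text: string;
}

function parseHarmony(raw: string): HarmonyMessage[] {
    // A message body ends at <|end|>, <|return|>, <|call|>, or end of input.
    const pattern =
        /<\|channel\|>(analysis|commentary|final)[^<]*<\|message\|>([\s\S]*?)(?:<\|end\|>|<\|return\|>|<\|call\|>|$)/g;
    const messages: HarmonyMessage[] = [];
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(raw)) !== null) {
        messages.push({ channel: match[1] as HarmonyChannel, text: match[2] });
    }
    return messages;
}

// Only the final channel is meant for the user.
function visibleText(raw: string): string {
    return parseHarmony(raw)
        .filter((m) => m.channel === 'final')
        .map((m) => m.text)
        .join('');
}

// Hand-written sample of what the model might emit for one turn.
const sampleCompletion =
    '<|channel|>analysis<|message|>User wants 2+2.<|end|>' +
    '<|start|>assistant<|channel|>final<|message|>2 + 2 = 4<|return|>';
```

Calling visibleText(sampleCompletion) keeps only the text from the final channel and drops the analysis message.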

Why Harmony exists

Without Harmony, GPT-OSS doesn't know how to structure multi-turn conversations, tool calls, or reasoning. The format gives it:

  • Clear separation of reasoning vs user-facing text
  • A stable way to represent tool calls and results
  • Compatibility with the Responses API so providers (OpenRouter, Ollama, Hugging Face) can handle it without you wiring it yourself

When you use GPT-OSS through OpenRouter, the provider handles Harmony. You send normal messages; they translate to and from Harmony tokens. You don't implement the renderer. But you do need to handle what comes back — and that's where the quirks show up.

How GPT-OSS differs from other reasoning models

We support several reasoning models: Claude 3.7, o4-mini, gpt-5-mini, Grok, and GPT-OSS-120b. Each emits CoT differently:

  • Claude / o4-mini: Reasoning in a dedicated reasoning content part. Clean separation.
  • GPT-OSS-120b: Reasoning can leak into the main content as raw "analysis:" or "analysis -" prefixes. Harmony's analysis channel is meant to be stripped, but sometimes it bleeds through.

We hit this in production. Users saw things like:

analysis: Let me think about this...
analysis - The user wants X. I'll do Y.

That's Harmony's analysis channel showing up in the visible stream. We had to add sanitization — a familiar pattern at this point.

Sanitizing GPT-OSS CoT leakage

In our chat API, we run a sanitizeAssistantVisibleText helper in the onFinish callback. It strips analysis prefixes before we save or display the message:

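// From the chat route: strip "analysis:" style prefixes from visible text.
// modelId comes from the enclosing request scope.
const sanitizeAssistantVisibleText = (content: string): string => {
    // Compare case-insensitively so 'openai/gpt-oss-120b' also matches.
    const isGptOss = modelId.toLowerCase() === 'openai/gpt-oss-120b';
    // Assumed heuristic (not shown in the original excerpt): some line
    // starts with an "analysis:" or "analysis -" prefix.
    const looksLikeRawCot = /^[ \t]*analysis\s*[:\-]/im.test(content);
    // Only apply to GPT-OSS or when content looks like raw CoT
    if (!isGptOss && !looksLikeRawCot) return content;
    // Strip analysis: / analysis - prefixes from each line. The separator is
    // required so ordinary sentences starting with "Analysis" stay intact.
    return content
        .split('\n')
        .map((line) => line.replace(/^\s*analysis\s*[:\-]\s*/i, ''))
        .join('\n');
};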

We only apply this for GPT-OSS (or when the content clearly looks like raw CoT). Other models don't need it.

Persisting reasoning separately

We persist reasoning in message.metadata.reasoning for models that support it, including GPT-OSS-120b. The extraction logic looks for a reasoning part in the response content. In Harmony terms, that reasoning is what the model emitted on the analysis channel; OpenRouter normalizes it into a structure we can read. We store it for replay and display it in our ReasoningDisplay component, but we keep it out of the user-facing final content.
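A rough sketch of that separation, assuming the normalized response arrives as a list of typed parts. The AssistantPart shape and the toStoredMessage helper are hypothetical and stand in for whatever your SDK actually returns; only the metadata.reasoning field name comes from our setup.

```typescript
// Hypothetical part shapes; check what your SDK actually emits.
type AssistantPart =
    | { type: 'text'; text: string }
    | { type: 'reasoning'; text: string };

interface StoredAssistantMessage {
    content: string; // user-facing text only
    metadata: { reasoning?: string }; // CoT, kept out of content
}

function toStoredMessage(parts: AssistantPart[]): StoredAssistantMessage {
    // Concatenate the text of every part of the given kind, in order.
    const collect = (kind: AssistantPart['type']): string =>
        parts
            .filter((part) => part.type === kind)
            .map((part) => part.text)
            .join('');

    const reasoning = collect('reasoning');
    return {
        content: collect('text'),
        // Omit the field entirely when the model returned no reasoning.
        metadata: reasoning ? { reasoning } : {},
    };
}
```

Keeping the reasoning in metadata rather than in content means a replayed or re-rendered message can never show CoT by accident; the display component has to ask for it explicitly.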

Provider routing for GPT-OSS

OpenRouter can route GPT-OSS to different backends. We found that some backends (e.g. Azure) enforce stricter tool-schema validation and reject valid tool calls. For gpt-5-mini we had to add provider options to avoid Azure for that model. GPT-OSS has similar quirks: we enable reasoning mode and use provider options so OpenRouter sends it to backends that handle Harmony and tools correctly.
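As a sketch of what such a request might look like: the builder below produces a chat-completions body with a reasoning setting and a provider block. The field names (reasoning.effort, provider.ignore, provider.require_parameters) follow OpenRouter's request options as we understand them, so verify them against the current API reference. buildGptOssRequest is a hypothetical helper, and the provider slug in the usage example is a placeholder, not a recommendation.

```typescript
interface ChatMessage {
    role: 'system' | 'user' | 'assistant';
    content: string;
}

interface GptOssRequestBody {
    model: string;
    messages: ChatMessage[];
    reasoning: { effort: 'low' | 'medium' | 'high' };
    provider: { ignore: string[]; require_parameters: boolean };
}

// Hypothetical helper: build the JSON body for an OpenRouter chat request.
function buildGptOssRequest(
    messages: ChatMessage[],
    ignoredProviders: string[] = [],
): GptOssRequestBody {
    return {
        model: 'openai/gpt-oss-120b',
        messages,
        // Ask for reasoning so CoT comes back as a separate field.
        reasoning: { effort: 'medium' },
        provider: {
            // Backends that mishandled tools or Harmony in your own testing.
            ignore: ignoredProviders,
            // Only route to backends that support every parameter sent.
            require_parameters: true,
        },
    };
}

// Usage: 'example-provider' is a placeholder slug.
const gptOssBody = buildGptOssRequest(
    [{ role: 'user', content: 'Summarize Harmony in one sentence.' }],
    ['example-provider'],
);
```

Building the body in one place keeps the model-specific routing rules out of the rest of the chat route, which helps when a second model needs a different ignore list.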

Handling CoT in subsequent turns

Harmony has a rule for multi-turn: if the assistant's last output was on the final channel, you should drop the previous CoT when building the next prompt. Otherwise the context grows and the model can get confused. For tool-calling turns, you keep the CoT — the model uses it to decide what to do next.
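The rule can be sketched as a history-pruning step. This is a simplified illustration, not what OpenRouter runs internally: the Turn shape and pruneReasoning are hypothetical, and it assumes each stored turn is tagged with the channel it came from.

```typescript
interface Turn {
    role: 'user' | 'assistant' | 'tool';
    channel?: 'analysis' | 'commentary' | 'final';
    content: string;
}

// Drop CoT from turns that already ended with a final answer; keep CoT that
// belongs to a tool-calling exchange still in progress.
function pruneReasoning(history: Turn[]): Turn[] {
    // Index of the most recent assistant message on the final channel.
    let lastFinal = -1;
    history.forEach((turn, index) => {
        if (turn.role === 'assistant' && turn.channel === 'final') {
            lastFinal = index;
        }
    });
    // Analysis before that point is stale; analysis after it is still needed.
    return history.filter(
        (turn, index) => !(turn.channel === 'analysis' && index < lastFinal),
    );
}

const exampleHistory: Turn[] = [
    { role: 'user', content: 'What is 2 + 2?' },
    { role: 'assistant', channel: 'analysis', content: 'Simple arithmetic.' },
    { role: 'assistant', channel: 'final', content: '4' },
    { role: 'user', content: 'What is the weather in Paris?' },
    { role: 'assistant', channel: 'analysis', content: 'Need the weather tool.' },
    { role: 'assistant', channel: 'commentary', content: '{"city":"Paris"}' },
    { role: 'tool', content: '{"temp_c":18}' },
];
const prunedHistory = pruneReasoning(exampleHistory);
```

In prunedHistory the "Simple arithmetic." analysis is gone because that turn finished with a final answer, while "Need the weather tool." stays because the model has not produced its final message for that turn yet.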

We don't implement Harmony ourselves; OpenRouter does. But we had to handle the outputs: sanitize visible text, persist reasoning, and route to compatible providers.

Summary

  • Harmony is OpenAI's token protocol for GPT-OSS. It uses channels (analysis, commentary, final) to separate reasoning from user-facing output.
  • OpenRouter handles Harmony for you. You don't render or parse the raw tokens.
  • GPT-OSS can leak analysis prefixes into visible content. We sanitize those before display and persistence.
  • Provider routing matters — some backends reject valid tool schemas. We configure OpenRouter to use compatible providers for GPT-OSS.

If you're adding GPT-OSS-120B to a chat product, expect to add model-specific handling for CoT leakage and provider routing. The model is powerful and free; the integration is a bit more involved than "drop-in replacement."