| --- |
| summary: "How inbound audio/voice notes are downloaded, transcribed, and injected into replies" |
| read_when: |
| - Changing audio transcription or media handling |
| title: "Audio and Voice Notes" |
| --- |
| |
| # Audio / Voice Notes — 2026-01-17 |
|
|
| ## What works |
|
|
| - **Media understanding (audio)**: If audio understanding is enabled (or auto‑detected), OpenClaw: |
| 1. Locates the first audio attachment (local path or URL) and downloads it if needed. |
| 2. Enforces `maxBytes` before sending to each model entry. |
| 3. Runs the first eligible model entry in order (provider or CLI). |
| 4. If it fails or skips (size/timeout), it tries the next entry. |
| 5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`. |
| - **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work. |
| - **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body. |
|
|
| ## Auto-detection (default) |
|
|
| If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`, |
| OpenClaw auto-detects in this order and stops at the first working option: |
|
|
| 1. **Local CLIs** (if installed) |
| - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens) |
| - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model) |
| - `whisper` (Python CLI; downloads models automatically) |
| 2. **Gemini CLI** (`gemini`) using `read_many_files` |
| 3. **Provider keys** (OpenAI → Groq → Deepgram → Google) |
|
|
| To disable auto-detection, set `tools.media.audio.enabled: false`. |
| To customize, set `tools.media.audio.models`. |
| Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path. |
|
|
| ## Config examples |
|
|
| ### Provider + CLI fallback (OpenAI + Whisper CLI) |
|
|
| ```json5 |
| { |
| tools: { |
| media: { |
| audio: { |
| enabled: true, |
| maxBytes: 20971520, |
| models: [ |
| { provider: "openai", model: "gpt-4o-mini-transcribe" }, |
| { |
| type: "cli", |
| command: "whisper", |
| args: ["--model", "base", "{{MediaPath}}"], |
| timeoutSeconds: 45, |
| }, |
| ], |
| }, |
| }, |
| }, |
| } |
| ``` |
|
|
| ### Provider-only with scope gating |
|
|
| ```json5 |
| { |
| tools: { |
| media: { |
| audio: { |
| enabled: true, |
| scope: { |
| default: "allow", |
| rules: [{ action: "deny", match: { chatType: "group" } }], |
| }, |
| models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }], |
| }, |
| }, |
| }, |
| } |
| ``` |
|
|
| ### Provider-only (Deepgram) |
|
|
| ```json5 |
| { |
| tools: { |
| media: { |
| audio: { |
| enabled: true, |
| models: [{ provider: "deepgram", model: "nova-3" }], |
| }, |
| }, |
| }, |
| } |
| ``` |
|
|
| ### Provider-only (Mistral Voxtral) |
|
|
| ```json5 |
| { |
| tools: { |
| media: { |
| audio: { |
| enabled: true, |
| models: [{ provider: "mistral", model: "voxtral-mini-latest" }], |
| }, |
| }, |
| }, |
| } |
| ``` |
|
|
| ### Echo transcript to chat (opt-in) |
|
|
| ```json5 |
| { |
| tools: { |
| media: { |
| audio: { |
| enabled: true, |
| echoTranscript: true, // default is false |
| echoFormat: '📝 "{transcript}"', // optional, supports {transcript} |
| models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }], |
| }, |
| }, |
| }, |
| } |
| ``` |
|
|
| ## Notes & limits |
|
|
| - Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`). |
| - Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used. |
| - Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram). |
| - Mistral setup details: [Mistral](/providers/mistral). |
| - Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`. |
| - Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried. |
| - Tiny/empty audio files below 1024 bytes are skipped before provider/CLI transcription. |
| - Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output. |
| - OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy. |
| - Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`). |
| - Transcript is available to templates as `{{Transcript}}`. |
| - `tools.media.audio.echoTranscript` is off by default; enable it to send transcript confirmation back to the originating chat before agent processing. |
| - `tools.media.audio.echoFormat` customizes the echo text (placeholder: `{transcript}`). |
| - CLI stdout is capped (5MB); keep CLI output concise. |
|
|
| ### Proxy environment support |
|
|
| Provider-based audio transcription honors standard outbound proxy env vars: |
|
|
| - `HTTPS_PROXY` |
| - `HTTP_PROXY` |
| - `https_proxy` |
| - `http_proxy` |
|
|
| If no proxy env vars are set, direct egress is used. If proxy config is malformed, OpenClaw logs a warning and falls back to direct fetch. |
|
|
| ## Mention Detection in Groups |
|
|
| When `requireMention: true` is set for a group chat, OpenClaw now transcribes audio **before** checking for mentions. This allows voice notes to be processed even when they contain mentions. |
|
|
| **How it works:** |
|
|
| 1. If a voice message has no text body and the group requires mentions, OpenClaw performs a "preflight" transcription. |
| 2. The transcript is checked for mention patterns (e.g., `@BotName`, emoji triggers). |
| 3. If a mention is found, the message proceeds through the full reply pipeline. |
| 4. The transcript is used for mention detection so voice notes can pass the mention gate. |
|
|
| **Fallback behavior:** |
|
|
| - If transcription fails during preflight (timeout, API error, etc.), the message is processed based on text-only mention detection. |
| - This ensures that mixed messages (text + audio) are never incorrectly dropped. |
|
|
| **Opt-out per Telegram group/topic:** |
|
|
| - Set `channels.telegram.groups.<chatId>.disableAudioPreflight: true` to skip preflight transcript mention checks for that group. |
| - Set `channels.telegram.groups.<chatId>.topics.<threadId>.disableAudioPreflight` to override per-topic (`true` to skip, `false` to force-enable). |
| - Default is `false` (preflight enabled when mention-gated conditions match). |
|
|
| **Example:** A user sends a voice note saying "Hey @Claude, what's the weather?" in a Telegram group with `requireMention: true`. The voice note is transcribed, the mention is detected, and the agent replies. |
|
|
| ## Gotchas |
|
|
| - Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`. |
| - Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`. |
| - For `parakeet-mlx`, if you pass `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when `--output-format` is `txt` (or omitted); non-`txt` output formats fall back to stdout parsing. |
| - Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue. |
| - Preflight transcription only processes the **first** audio attachment for mention detection. Additional audio is processed during the main media understanding phase. |
|
|