---
summary: "How inbound audio/voice notes are downloaded, transcribed, and injected into replies"
read_when:
  - Changing audio transcription or media handling
title: "Audio and Voice Notes"
---

# Audio / Voice Notes — 2026-01-17
## What works

- **Media understanding (audio)**: If audio understanding is enabled (or auto-detected), OpenClaw:
  1. Locates the first audio attachment (local path or URL) and downloads it if needed.
  2. Enforces `maxBytes` before sending to each model entry.
  3. Runs the first eligible model entry in order (provider or CLI).
  4. If it fails or skips (size/timeout), it tries the next entry.
  5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
## Auto-detection (default)

If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`, OpenClaw auto-detects in this order and stops at the first working option:

1. **Local CLIs** (if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
2. **Gemini CLI** (`gemini`) using `read_many_files`
3. **Provider keys** (OpenAI → Groq → Deepgram → Google)

To disable auto-detection, set `tools.media.audio.enabled: false`. To customize, set `tools.media.audio.models`.

Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
## Config examples

### Provider + CLI fallback (OpenAI + Whisper CLI)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```
### Provider-only with scope gating

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [{ action: "deny", match: { chatType: "group" } }],
        },
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```
### Provider-only (Deepgram)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }],
      },
    },
  },
}
```
## Notes & limits

- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram).
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- The default size cap is 20 MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model entry and the next entry is tried.
- The default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- The OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- The transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped at 5 MB; keep CLI output concise.
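Several of the knobs above can be combined in one config; a sketch (the specific values here are illustrative, not defaults):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxChars: 4000, // trim long transcripts
        attachments: { mode: "all", maxAttachments: 3 }, // process up to 3 voice notes
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```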
## Gotchas

- Scope rules are first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; if it emits JSON, pipe the output through `jq -r .text` (or similar) to extract the transcript.
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) so transcription doesn't block the reply queue.
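One way to adapt a JSON-only CLI is to wrap it in `sh -c` and extract the text field with `jq`. A sketch, assuming a hypothetical `my-transcriber` binary that emits `{"text": "..."}`:

```json5
{
  type: "cli",
  command: "sh",
  args: ["-c", "my-transcriber --json {{MediaPath}} | jq -r .text"],
  timeoutSeconds: 45,
}
```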