| --- |
| summary: "Text-to-speech (TTS) for outbound replies" |
| read_when: |
| - Enabling text-to-speech for replies |
| - Configuring TTS providers or limits |
| - Using /tts commands |
| title: "Text-to-Speech" |
| --- |
| |
| # Text-to-speech (TTS) |
|
|
| OpenClaw can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS. |
| It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble. |
|
|
| ## Supported services |
|
|
| - **ElevenLabs** (primary or fallback provider) |
| - **OpenAI** (primary or fallback provider; also used for summaries) |
| - **Edge TTS** (primary or fallback provider; uses `node-edge-tts`, the default when no API keys are configured) |
|
|
| ### Edge TTS notes |
|
|
| Edge TTS uses Microsoft Edge's online neural TTS service via the `node-edge-tts` |
| library. It's a hosted service (not local), uses Microsoft’s endpoints, and does |
| not require an API key. `node-edge-tts` exposes speech configuration options and |
| output formats, but not all options are supported by the Edge service. |
|
|
| Because Edge TTS is a public web service without a published SLA or quota, treat it |
| as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs. |
| Microsoft's Speech REST API documents a 10‑minute audio limit per request; Edge TTS |
| does not publish limits, so assume similar or lower limits. |
|
|
| ## Optional keys |
|
|
| To use OpenAI or ElevenLabs, set the matching key: |
|
|
| - `ELEVENLABS_API_KEY` (or `XI_API_KEY`) |
| - `OPENAI_API_KEY` |
|
|
| Edge TTS does **not** require an API key. If no API keys are found, OpenClaw defaults |
| to Edge TTS (unless disabled via `messages.tts.edge.enabled=false`). |
|
|
| If multiple providers are configured, the selected provider is used first and the others are fallback options. |
| Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`), |
| so that provider must also be authenticated if you enable summaries. |
|
|
| ## Service links |
|
|
| - [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech) |
| - [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio) |
| - [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech) |
| - [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication) |
| - [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts) |
| - [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs) |
|
|
| ## Is it enabled by default? |
|
|
| No. Auto‑TTS is **off** by default. Enable it in config with |
| `messages.tts.auto` or per session with `/tts always` (alias: `/tts on`). |
|
|
| Edge TTS **is** enabled by default once TTS is on, and is used automatically |
| when no OpenAI or ElevenLabs API keys are available. |
|
|
| ## Config |
|
|
| TTS config lives under `messages.tts` in `openclaw.json`. |
| Full schema is in [Gateway configuration](/gateway/configuration). |
|
|
| ### Minimal config (enable + provider) |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       auto: "always", |
|       provider: "elevenlabs", |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ### OpenAI primary with ElevenLabs fallback |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       auto: "always", |
|       provider: "openai", |
|       summaryModel: "openai/gpt-4.1-mini", |
|       modelOverrides: { |
|         enabled: true, |
|       }, |
|       openai: { |
|         apiKey: "openai_api_key", |
|         baseUrl: "https://api.openai.com/v1", |
|         model: "gpt-4o-mini-tts", |
|         voice: "alloy", |
|       }, |
|       elevenlabs: { |
|         apiKey: "elevenlabs_api_key", |
|         baseUrl: "https://api.elevenlabs.io", |
|         voiceId: "voice_id", |
|         modelId: "eleven_multilingual_v2", |
|         seed: 42, |
|         applyTextNormalization: "auto", |
|         languageCode: "en", |
|         voiceSettings: { |
|           stability: 0.5, |
|           similarityBoost: 0.75, |
|           style: 0.0, |
|           useSpeakerBoost: true, |
|           speed: 1.0, |
|         }, |
|       }, |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ### Edge TTS primary (no API key) |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       auto: "always", |
|       provider: "edge", |
|       edge: { |
|         enabled: true, |
|         voice: "en-US-MichelleNeural", |
|         lang: "en-US", |
|         outputFormat: "audio-24khz-48kbitrate-mono-mp3", |
|         rate: "+10%", |
|         pitch: "-5%", |
|       }, |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ### Disable Edge TTS |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       edge: { |
|         enabled: false, |
|       }, |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ### Custom limits + prefs path |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       auto: "always", |
|       maxTextLength: 4000, |
|       timeoutMs: 30000, |
|       prefsPath: "~/.openclaw/settings/tts.json", |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ### Only reply with audio after an inbound voice note |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       auto: "inbound", |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ### Disable auto-summary for long replies |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       auto: "always", |
|     }, |
|   }, |
| } |
| ``` |
|
|
| Then run: |
|
|
| ``` |
| /tts summary off |
| ``` |
|
|
| ### Notes on fields |
|
|
| - `auto`: auto‑TTS mode (`off`, `always`, `inbound`, `tagged`). |
|   - `inbound` only sends audio after an inbound voice note. |
|   - `tagged` only sends audio when the reply includes `[[tts]]` tags. |
| - `enabled`: legacy toggle (doctor migrates this to `auto`). |
| - `mode`: `"final"` (default) or `"all"` (includes tool/block replies). |
| - `provider`: `"elevenlabs"`, `"openai"`, or `"edge"` (fallback is automatic). |
|   - If `provider` is **unset**, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key), |
|     otherwise `edge`. |
| - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`. |
|   - Accepts `provider/model` or a configured model alias. |
| - `modelOverrides`: allow the model to emit TTS directives (on by default). |
|   - `allowProvider` defaults to `false` (provider switching is opt-in). |
| - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded. |
| - `timeoutMs`: request timeout (ms). |
| - `prefsPath`: override the local prefs JSON path (provider/limit/summary). |
| - `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`). |
| - `elevenlabs.baseUrl`: override ElevenLabs API base URL. |
| - `openai.baseUrl`: override the OpenAI TTS endpoint. |
|   - Resolution order: `messages.tts.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1` |
|   - Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted. |
| - `elevenlabs.voiceSettings`: |
|   - `stability`, `similarityBoost`, `style`: `0..1` |
|   - `useSpeakerBoost`: `true|false` |
|   - `speed`: `0.5..2.0` (1.0 = normal) |
| - `elevenlabs.applyTextNormalization`: `auto|on|off` |
| - `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`) |
| - `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism) |
| - `edge.enabled`: allow Edge TTS usage (default `true`; no API key). |
| - `edge.voice`: Edge neural voice name (e.g. `en-US-MichelleNeural`). |
| - `edge.lang`: language code (e.g. `en-US`). |
| - `edge.outputFormat`: Edge output format (e.g. `audio-24khz-48kbitrate-mono-mp3`). |
|   - See Microsoft Speech output formats for valid values; not all formats are supported by Edge. |
| - `edge.rate` / `edge.pitch` / `edge.volume`: percent strings (e.g. `+10%`, `-5%`). |
| - `edge.saveSubtitles`: write JSON subtitles alongside the audio file. |
| - `edge.proxy`: proxy URL for Edge TTS requests. |
| - `edge.timeoutMs`: request timeout override (ms). |
|
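As a rough illustration of the provider and base-URL resolution rules in the notes above, the TypeScript sketch below builds a provider try-order and resolves the OpenAI endpoint. The types and the functions `usableProviders`, `providerOrder`, and `openAiBaseUrl` are hypothetical names for this example, not OpenClaw internals. It also assumes that a selected provider without credentials is skipped and that fallbacks keep the `openai`, `elevenlabs`, `edge` order; the notes do not spell out either point.

```typescript
// Hypothetical types: only the fields this sketch needs.
type Provider = "openai" | "elevenlabs" | "edge";

interface TtsConfig {
  provider?: Provider;
  openai?: { apiKey?: string; baseUrl?: string };
  elevenlabs?: { apiKey?: string };
  edge?: { enabled?: boolean };
}

type Env = Record<string, string | undefined>;

// Providers that could be used at all, in the documented preference order.
function usableProviders(cfg: TtsConfig, env: Env): Provider[] {
  const usable: Provider[] = [];
  if (cfg.openai?.apiKey || env["OPENAI_API_KEY"]) usable.push("openai");
  if (cfg.elevenlabs?.apiKey || env["ELEVENLABS_API_KEY"] || env["XI_API_KEY"]) {
    usable.push("elevenlabs");
  }
  // Edge needs no key and is allowed unless explicitly disabled.
  if (cfg.edge?.enabled !== false) usable.push("edge");
  return usable;
}

// Primary provider first, remaining usable providers as fallbacks.
function providerOrder(cfg: TtsConfig, env: Env): Provider[] {
  const usable = usableProviders(cfg, env);
  const primary = cfg.provider;
  // Assumption: a selected provider without credentials is skipped.
  if (primary === undefined || usable.indexOf(primary) === -1) return usable;
  return [primary, ...usable.filter((p) => p !== primary)];
}

// messages.tts.openai.baseUrl -> OPENAI_TTS_BASE_URL -> public default.
function openAiBaseUrl(cfg: TtsConfig, env: Env): string {
  return cfg.openai?.baseUrl || env["OPENAI_TTS_BASE_URL"] || "https://api.openai.com/v1";
}
```

With only `OPENAI_API_KEY` set and no `provider` configured, this sketch yields the order `openai`, then `edge`.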
|
| ## Model-driven overrides (default on) |
|
|
| By default, the model **can** emit TTS directives for a single reply. |
| When `messages.tts.auto` is `tagged`, these directives are required to trigger audio. |
|
|
| When enabled, the model can emit `[[tts:...]]` directives to override the voice |
| for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to |
| provide expressive tags (laughter, singing cues, etc.) that should only appear in |
| the audio. |
|
|
| These inline directives are **not** the canonical persistence surface for user |
| preferences. Persistent provider/voice/language/format changes should be done |
| through `/tts ...` commands or config/prefs, not by relying on model-emitted |
| inline tags. |
|
|
| `provider=...` directives are ignored unless `modelOverrides.allowProvider: true`. |
|
|
| Example reply payload: |
|
|
| ``` |
| Here you go. |
| |
| [[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]] |
| [[tts:text]](laughs) Read the song once more.[[/tts:text]] |
| ``` |
|
|
| Available directive keys (when enabled): |
|
|
| - `provider` (`openai` | `elevenlabs` | `edge`, requires `allowProvider: true`) |
| - `voice` (OpenAI voice) or `voiceId` (ElevenLabs) |
| - `model` (OpenAI TTS model or ElevenLabs model id) |
| - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost` |
| - `applyTextNormalization` (`auto|on|off`) |
| - `languageCode` (ISO 639-1) |
| - `seed` |
|
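To make the directive syntax concrete, here is a minimal parsing sketch in TypeScript. It assumes directives are whitespace-separated `key=value` pairs, as in the example payload; `parseTtsDirectives` and `ParsedReply` are hypothetical names, values are kept as raw strings without range validation, and the bare `[[tts]]` tag used by `tagged` mode is not handled. The real parser may differ.

```typescript
// Hypothetical result shape for this sketch.
interface ParsedReply {
  visibleText: string;               // reply with all directives stripped
  spokenText?: string;               // contents of [[tts:text]]...[[/tts:text]], if present
  overrides: Record<string, string>; // key=value pairs from [[tts:...]]
}

function parseTtsDirectives(reply: string, allowProvider = false): ParsedReply {
  const overrides: Record<string, string> = {};
  let spokenText: string | undefined;

  // 1. Pull out the optional audio-only text block.
  let rest = reply.replace(
    /\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g,
    (_match, body: string) => {
      spokenText = body.trim();
      return "";
    },
  );

  // 2. Collect key=value pairs from every remaining [[tts:...]] directive.
  rest = rest.replace(/\[\[tts:([^\]]*)\]\]/g, (_match, body: string) => {
    for (const pair of body.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq <= 0) continue; // not a key=value pair
      const key = pair.slice(0, eq);
      // provider=... is ignored unless provider switching is allowed.
      if (key === "provider" && !allowProvider) continue;
      overrides[key] = pair.slice(eq + 1);
    }
    return "";
  });

  return { visibleText: rest.trim(), spokenText, overrides };
}
```

Applied to the example payload above, the sketch leaves `Here you go.` as the visible text, captures the `(laughs) ...` line as audio-only text, and records `voiceId`, `model`, and `speed` as overrides.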
|
| Disable all model overrides: |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       modelOverrides: { |
|         enabled: false, |
|       }, |
|     }, |
|   }, |
| } |
| ``` |
|
|
| Optional allowlist (this example enables provider switching and blocks `seed` overrides; the other knobs stay configurable): |
|
|
| ```json5 |
| { |
|   messages: { |
|     tts: { |
|       modelOverrides: { |
|         enabled: true, |
|         allowProvider: true, |
|         allowSeed: false, |
|       }, |
|     }, |
|   }, |
| } |
| ``` |
|
|
| ## Per-user preferences |
|
|
| Slash commands write local overrides to `prefsPath` (default: |
| `~/.openclaw/settings/tts.json`, override with `OPENCLAW_TTS_PREFS` or |
| `messages.tts.prefsPath`). |
|
|
| Stored fields: |
|
|
| - `enabled` |
| - `provider` |
| - `maxLength` (summary threshold; default 1500 chars) |
| - `summarize` (default `true`) |
|
|
| These override `messages.tts.*` for that host. |
|
|
| ## Output formats (fixed) |
|
|
| - **Telegram**: Opus voice note (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). |
|   - 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble. |
| - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). |
|   - 44.1kHz / 128kbps is the default balance for speech clarity. |
| - **Edge TTS**: uses `edge.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`). |
|   - `node-edge-tts` accepts an `outputFormat`, but not all formats are available |
|     from the Edge service. |
|   - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus). |
|   - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need |
|     guaranteed Opus voice notes. |
|   - If the configured Edge output format fails, OpenClaw retries with MP3. |
| |
| OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX. |
|
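The format rules above amount to a small lookup. The TypeScript sketch below restates them. `pickOutputFormat` and the `"telegram"` channel identifier are assumptions made for illustration, and the MP3 retry for failed Edge formats is left out.

```typescript
type TtsProvider = "openai" | "elevenlabs" | "edge";

const DEFAULT_EDGE_FORMAT = "audio-24khz-48kbitrate-mono-mp3";

// Returns the provider-specific format string for a channel.
// `channel` is a hypothetical channel id; only "telegram" is special-cased.
function pickOutputFormat(
  provider: TtsProvider,
  channel: string,
  edgeOutputFormat?: string,
): string {
  // Edge uses its own configured format regardless of channel.
  if (provider === "edge") return edgeOutputFormat || DEFAULT_EDGE_FORMAT;

  const voiceNote = channel === "telegram"; // Opus voice note for the round bubble
  if (provider === "elevenlabs") return voiceNote ? "opus_48000_64" : "mp3_44100_128";
  return voiceNote ? "opus" : "mp3"; // openai
}
```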
|
| ## Auto-TTS behavior |
|
|
| When enabled, OpenClaw: |
|
|
| - skips TTS if the reply already contains media or a `MEDIA:` directive. |
| - skips very short replies (< 10 chars). |
| - summarizes long replies when enabled using `summaryModel` (or `agents.defaults.model.primary` if unset). |
| - attaches the generated audio to the reply. |
|
|
| If the reply exceeds `maxLength` and summary is off (or the summary model has no |
| API key), audio is skipped and the normal text reply is sent. |
|
|
| ## Flow diagram |
|
|
| ``` |
| Reply -> TTS enabled? |
|   no  -> send text |
|   yes -> has media / MEDIA: / short? |
|     yes -> send text |
|     no  -> length > limit? |
|       no  -> TTS -> attach audio |
|       yes -> summary enabled? |
|         no  -> send text |
|         yes -> summarize (summaryModel or agents.defaults.model.primary) |
|                -> TTS -> attach audio |
| ``` |
|
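The same decision flow can be written as a single function. This TypeScript sketch is illustrative only: `decideTts`, `ReplyInfo`, and `AutoTtsSettings` are hypothetical names, `hasMedia` is assumed to cover both attached media and a `MEDIA:` directive, and `summaryModelReady` stands in for "the summary model has credentials".

```typescript
type TtsDecision = "send-text" | "speak" | "summarize-then-speak";

interface ReplyInfo {
  text: string;
  hasMedia: boolean; // reply already carries media or a MEDIA: directive
}

interface AutoTtsSettings {
  enabled: boolean;           // auto-TTS applies to this reply
  maxLength: number;          // summary threshold in chars (prefs default: 1500)
  summarize: boolean;         // summary toggle
  summaryModelReady: boolean; // assumption: summary model is authenticated
}

const MIN_CHARS = 10; // replies shorter than this are skipped

function decideTts(reply: ReplyInfo, s: AutoTtsSettings): TtsDecision {
  if (!s.enabled) return "send-text";
  if (reply.hasMedia) return "send-text";
  if (reply.text.length < MIN_CHARS) return "send-text";
  if (reply.text.length <= s.maxLength) return "speak";
  // Over the limit: only speak if a summary can be produced.
  if (!s.summarize || !s.summaryModelReady) return "send-text";
  return "summarize-then-speak";
}
```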
|
| ## Slash command usage |
|
|
| There is a single command: `/tts`. |
| See [Slash commands](/tools/slash-commands) for enablement details. |
|
|
| Discord note: `/tts` is a built-in Discord command, so OpenClaw registers |
| `/voice` as the native command there. Text `/tts ...` still works. |
|
|
| ``` |
| /tts off |
| /tts always |
| /tts inbound |
| /tts tagged |
| /tts status |
| /tts provider openai |
| /tts limit 2000 |
| /tts summary off |
| /tts audio Hello from OpenClaw |
| ``` |
|
|
| Notes: |
|
|
| - Commands require an authorized sender (allowlist/owner rules still apply). |
| - `commands.text` or native command registration must be enabled. |
| - `off|always|inbound|tagged` are per‑session toggles (`/tts on` is an alias for `/tts always`). |
| - `limit` and `summary` are stored in local prefs, not the main config. |
| - `/tts audio` generates a one-off audio reply (does not toggle TTS on). |
|
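For readers wiring `/tts` into their own tooling, the subcommands listed above can be modeled roughly as follows. This TypeScript sketch is not OpenClaw's command handler: `parseTtsCommand` and the `TtsCommand` union are hypothetical, `/tts summary on` is assumed to be the counterpart of the documented `/tts summary off`, and a bare `/tts` is treated as invalid.

```typescript
type TtsCommand =
  | { kind: "mode"; mode: "off" | "always" | "inbound" | "tagged" }
  | { kind: "status" }
  | { kind: "provider"; provider: string }
  | { kind: "limit"; chars: number }
  | { kind: "summary"; on: boolean }
  | { kind: "audio"; text: string }
  | { kind: "invalid"; reason: string };

const invalid = (reason: string): TtsCommand => ({ kind: "invalid", reason });

function parseTtsCommand(input: string): TtsCommand {
  // "/tts <subcommand> [argument...]"
  const m = /^\/tts(?:\s+(\S+))?(?:\s+([\s\S]+))?$/.exec(input.trim());
  if (!m) return invalid("not a /tts command");
  const sub = (m[1] || "").toLowerCase();
  const arg = (m[2] || "").trim();

  switch (sub) {
    case "off":
      return { kind: "mode", mode: "off" };
    case "on": // alias for "always"
    case "always":
      return { kind: "mode", mode: "always" };
    case "inbound":
      return { kind: "mode", mode: "inbound" };
    case "tagged":
      return { kind: "mode", mode: "tagged" };
    case "status":
      return { kind: "status" };
    case "provider":
      return arg ? { kind: "provider", provider: arg } : invalid("missing provider");
    case "limit":
      if (!/^\d+$/.test(arg)) return invalid("limit expects a number of chars");
      return { kind: "limit", chars: parseInt(arg, 10) };
    case "summary":
      if (arg !== "on" && arg !== "off") return invalid("summary expects on|off");
      return { kind: "summary", on: arg === "on" };
    case "audio":
      return arg ? { kind: "audio", text: arg } : invalid("missing text");
    default:
      return invalid(sub ? "unknown subcommand" : "missing subcommand");
  }
}
```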
|
| ## Agent tool |
|
|
| The `tts` tool converts text to speech and returns a `MEDIA:` path. When the |
| result is Telegram-compatible, the tool includes `[[audio_as_voice]]` so |
| Telegram sends a voice bubble. |
|
|
| ## Gateway RPC |
|
|
| Gateway methods: |
|
|
| - `tts.status` |
| - `tts.enable` |
| - `tts.disable` |
| - `tts.convert` |
| - `tts.setProvider` |
| - `tts.providers` |
|
|