	---
	summary: "Text-to-speech (TTS) for outbound replies"
	read_when:
	- Enabling text-to-speech for replies
	- Configuring TTS providers or limits
	- Using /tts commands
	title: "Text-to-Speech"
	---

	# Text-to-speech (TTS)

	OpenClaw can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS.
	It works anywhere OpenClaw can send audio; Telegram gets a round voice-note bubble.

	## Supported services

	- ElevenLabs (primary or fallback provider)
	- OpenAI (primary or fallback provider; also used for summaries)
	- Edge TTS (primary or fallback provider; uses `node-edge-tts`, default when no API keys)

	### Edge TTS notes

	Edge TTS uses Microsoft Edge's online neural TTS service via the `node-edge-tts`
	library. It's a hosted service (not local), uses Microsoft’s endpoints, and does
	not require an API key. `node-edge-tts` exposes speech configuration options and
	output formats, but not all options are supported by the Edge service.

	Because Edge TTS is a public web service without a published SLA or quota, treat it
	as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs.
	Microsoft's Speech REST API documents a 10‑minute audio limit per request; Edge TTS
	does not publish limits, so assume similar or lower limits.

	## Optional keys

	If you want OpenAI or ElevenLabs:

	- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
	- `OPENAI_API_KEY`

	Edge TTS does not require an API key. If no API keys are found, OpenClaw defaults
	to Edge TTS (unless disabled via `messages.tts.edge.enabled=false`).

	If multiple providers are configured, the selected provider is used first and the others are fallback options.
	Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
	so that provider must also be authenticated if you enable summaries.
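
The provider ordering described above can be sketched as a small function. This is a minimal illustration, not OpenClaw's actual code: `ProviderAvailability`, `isUsable`, and `resolveProviderOrder` are hypothetical names, and skipping a selected provider that has no credentials is an assumption. The `openai`, `elevenlabs`, `edge` preference when nothing is selected follows the `provider` field notes later on this page.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not OpenClaw's API.
type TtsProvider = "openai" | "elevenlabs" | "edge";

interface ProviderAvailability {
  openaiKey?: string; // messages.tts.openai.apiKey or OPENAI_API_KEY
  elevenlabsKey?: string; // messages.tts.elevenlabs.apiKey, ELEVENLABS_API_KEY, or XI_API_KEY
  edgeEnabled: boolean; // messages.tts.edge.enabled (default true)
}

function isUsable(provider: TtsProvider, avail: ProviderAvailability): boolean {
  if (provider === "openai") return Boolean(avail.openaiKey);
  if (provider === "elevenlabs") return Boolean(avail.elevenlabsKey);
  return avail.edgeEnabled;
}

function resolveProviderOrder(
  selected: TtsProvider | undefined,
  avail: ProviderAvailability,
): TtsProvider[] {
  const preference: TtsProvider[] = ["openai", "elevenlabs", "edge"];
  const usable = preference.filter((p) => isUsable(p, avail));
  // Assumption: a selected provider without credentials is skipped.
  if (selected === undefined || usable.indexOf(selected) === -1) return usable;
  // Selected provider first, the remaining usable providers as fallbacks.
  return [selected].concat(usable.filter((p) => p !== selected));
}

const noKeys = resolveProviderOrder(undefined, { edgeEnabled: true });
// noKeys is ["edge"]
const elevenFirst = resolveProviderOrder("elevenlabs", {
  openaiKey: "openai_api_key",
  elevenlabsKey: "elevenlabs_api_key",
  edgeEnabled: true,
});
// elevenFirst is ["elevenlabs", "openai", "edge"]
```

With no keys and Edge TTS disabled, this sketch returns an empty list, which a caller would treat as "send text only".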

	## Service links

	- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
	- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
	- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
	- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
	- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
	- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)

	## Is it enabled by default?

	No. Auto‑TTS is off by default. Enable it in config with
	`messages.tts.auto` or per session with `/tts always` (alias: `/tts on`).

	Edge TTS is enabled by default once TTS is on, and is used automatically
	when no OpenAI or ElevenLabs API keys are available.

	## Config

	TTS config lives under `messages.tts` in `openclaw.json`.
	Full schema is in [Gateway configuration](/gateway/configuration).

	### Minimal config (enable + provider)

	```json5
	{
	  messages: {
	    tts: {
	      auto: "always",
	      provider: "elevenlabs",
	    },
	  },
	}
	```

	### OpenAI primary with ElevenLabs fallback

	```json5
	{
	  messages: {
	    tts: {
	      auto: "always",
	      provider: "openai",
	      summaryModel: "openai/gpt-4.1-mini",
	      modelOverrides: {
	        enabled: true,
	      },
	      openai: {
	        apiKey: "openai_api_key",
	        baseUrl: "https://api.openai.com/v1",
	        model: "gpt-4o-mini-tts",
	        voice: "alloy",
	      },
	      elevenlabs: {
	        apiKey: "elevenlabs_api_key",
	        baseUrl: "https://api.elevenlabs.io",
	        voiceId: "voice_id",
	        modelId: "eleven_multilingual_v2",
	        seed: 42,
	        applyTextNormalization: "auto",
	        languageCode: "en",
	        voiceSettings: {
	          stability: 0.5,
	          similarityBoost: 0.75,
	          style: 0.0,
	          useSpeakerBoost: true,
	          speed: 1.0,
	        },
	      },
	    },
	  },
	}
	```

	### Edge TTS primary (no API key)

	```json5
	{
	  messages: {
	    tts: {
	      auto: "always",
	      provider: "edge",
	      edge: {
	        enabled: true,
	        voice: "en-US-MichelleNeural",
	        lang: "en-US",
	        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
	        rate: "+10%",
	        pitch: "-5%",
	      },
	    },
	  },
	}
	```

	### Disable Edge TTS

	```json5
	{
	  messages: {
	    tts: {
	      edge: {
	        enabled: false,
	      },
	    },
	  },
	}
	```

	### Custom limits + prefs path

	```json5
	{
	  messages: {
	    tts: {
	      auto: "always",
	      maxTextLength: 4000,
	      timeoutMs: 30000,
	      prefsPath: "~/.openclaw/settings/tts.json",
	    },
	  },
	}
	```

	### Only reply with audio after an inbound voice note

	```json5
	{
	  messages: {
	    tts: {
	      auto: "inbound",
	    },
	  },
	}
	```

	### Disable auto-summary for long replies

	```json5
	{
	  messages: {
	    tts: {
	      auto: "always",
	    },
	  },
	}
	```

	Then run:

	```
	/tts summary off
	```

	### Notes on fields

	- `auto`: auto‑TTS mode (`off`, `always`, `inbound`, `tagged`).
	  - `inbound` only sends audio after an inbound voice note.
	  - `tagged` only sends audio when the reply includes `[[tts]]` tags.
	- `enabled`: legacy toggle (doctor migrates this to `auto`).
	- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
	- `provider`: `"elevenlabs"`, `"openai"`, or `"edge"` (fallback is automatic).
	- If `provider` is unset, OpenClaw prefers `openai` (if key), then `elevenlabs` (if key),
	  otherwise `edge`.
	- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
	  - Accepts `provider/model` or a configured model alias.
	- `modelOverrides`: allow the model to emit TTS directives (on by default).
	  - `allowProvider` defaults to `false` (provider switching is opt-in).
	- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
	- `timeoutMs`: request timeout (ms).
	- `prefsPath`: override the local prefs JSON path (provider/limit/summary).
	- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`).
	- `elevenlabs.baseUrl`: override ElevenLabs API base URL.
	- `openai.baseUrl`: override the OpenAI TTS endpoint.
	  - Resolution order: `messages.tts.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
	  - Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
	- `elevenlabs.voiceSettings`:
	  - `stability`, `similarityBoost`, `style`: `0..1`
	  - `useSpeakerBoost`: `true|false`
	  - `speed`: `0.5..2.0` (1.0 = normal)
	- `elevenlabs.applyTextNormalization`: `auto|on|off`
	- `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
	- `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
	- `edge.enabled`: allow Edge TTS usage (default `true`; no API key).
	- `edge.voice`: Edge neural voice name (e.g. `en-US-MichelleNeural`).
	- `edge.lang`: language code (e.g. `en-US`).
	- `edge.outputFormat`: Edge output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
	  - See Microsoft Speech output formats for valid values; not all formats are supported by Edge.
	- `edge.rate` / `edge.pitch` / `edge.volume`: percent strings (e.g. `+10%`, `-5%`).
	- `edge.saveSubtitles`: write JSON subtitles alongside the audio file.
	- `edge.proxy`: proxy URL for Edge TTS requests.
	- `edge.timeoutMs`: request timeout override (ms).
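
The `openai.baseUrl` resolution order from the notes above could look roughly like this. The helper names are hypothetical, the environment is passed in as a plain object so the sketch stays self-contained, and the trailing-slash trimming is an assumption rather than documented behavior.

```typescript
// Hypothetical helper names; only the resolution order comes from the notes above.
const DEFAULT_OPENAI_TTS_BASE_URL = "https://api.openai.com/v1";

function resolveOpenAiTtsBaseUrl(
  configBaseUrl: string | undefined,
  env: Record<string, string | undefined>,
): string {
  // 1. messages.tts.openai.baseUrl, 2. OPENAI_TTS_BASE_URL, 3. built-in default.
  // Empty strings count as unset in this sketch.
  const chosen = configBaseUrl || env["OPENAI_TTS_BASE_URL"] || DEFAULT_OPENAI_TTS_BASE_URL;
  // Assumption: trailing slashes are trimmed so equivalent URLs compare equal.
  return chosen.replace(/\/+$/, "");
}

function isCustomOpenAiEndpoint(baseUrl: string): boolean {
  // Non-default endpoints are treated as OpenAI-compatible servers,
  // so custom model and voice names would be passed through.
  return baseUrl !== DEFAULT_OPENAI_TTS_BASE_URL;
}

const baseUrlFromEnv = resolveOpenAiTtsBaseUrl(undefined, {
  OPENAI_TTS_BASE_URL: "http://localhost:8880/v1/",
});
// baseUrlFromEnv is "http://localhost:8880/v1"
```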

	## Model-driven overrides (default on)

	By default, the model can emit TTS directives for a single reply.
	When `messages.tts.auto` is `tagged`, these directives are required to trigger audio.

	When enabled, the model can emit `[[tts:...]]` directives to override the voice
	for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to
	provide expressive tags (laughter, singing cues, etc) that should only appear in
	the audio.

	These inline directives are not the canonical persistence surface for user
	preferences. Persistent provider/voice/language/format changes should be done
	through `/tts ...` commands or config/prefs, not by relying on model-emitted
	inline tags.

	`provider=...` directives are ignored unless `modelOverrides.allowProvider: true`.

	Example reply payload:

	```
	Here you go.

	[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
	[[tts:text]](laughs) Read the song once more.[[/tts:text]]
	```
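
A rough sketch of how a reply like the one above could be split into visible text, audio-only text, and overrides. `parseTtsDirectives` and its return shape are hypothetical; OpenClaw's real parser may differ in key validation and whitespace handling. The only behavior taken from this page is that `provider=...` is dropped unless `allowProvider` is true.

```typescript
// Hypothetical sketch of directive parsing; not OpenClaw's actual parser.
interface ParsedTtsDirectives {
  visibleText: string; // text shown in chat, with directives removed
  spokenText: string | undefined; // content of [[tts:text]]...[[/tts:text]], if any
  overrides: Record<string, string>; // raw key=value pairs from [[tts:...]]
}

function parseTtsDirectives(reply: string, allowProvider: boolean): ParsedTtsDirectives {
  const overrides: Record<string, string> = {};

  // 1. Pull out the optional audio-only text block.
  const textBlock = /\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/.exec(reply);
  const spokenText = textBlock ? (textBlock[1] || "").trim() : undefined;
  let rest = textBlock ? reply.replace(textBlock[0], "") : reply;

  // 2. Collect key=value pairs from every remaining [[tts:...]] directive.
  rest = rest.replace(/\[\[tts:([^\]]*)\]\]/g, (_match: string, body: string) => {
    for (const pair of body.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq <= 0) continue; // not a key=value pair
      const key = pair.slice(0, eq);
      if (key === "provider" && !allowProvider) continue; // provider switching is opt-in
      overrides[key] = pair.slice(eq + 1);
    }
    return "";
  });

  return { visibleText: rest.trim(), spokenText, overrides };
}

const exampleReply = [
  "Here you go.",
  "",
  "[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]",
  "[[tts:text]](laughs) Read the song once more.[[/tts:text]]",
].join("\n");

const parsedExample = parseTtsDirectives(exampleReply, false);
// parsedExample.visibleText is "Here you go."
// parsedExample.spokenText is "(laughs) Read the song once more."
// parsedExample.overrides has voiceId, model, and speed ("1.1", still a string)
```

Values stay as strings here; converting `speed` or `stability` to numbers and range-checking them would be a separate step.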

	Available directive keys (when enabled):

	- `provider` (`openai` \| `elevenlabs` \| `edge`, requires `allowProvider: true`)
	- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
	- `model` (OpenAI TTS model or ElevenLabs model id)
	- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
	- `applyTextNormalization` (`auto|on|off`)
	- `languageCode` (ISO 639-1)
	- `seed`

	Disable all model overrides:

	```json5
	{
	  messages: {
	    tts: {
	      modelOverrides: {
	        enabled: false,
	      },
	    },
	  },
	}
	```

	Optional allowlist (enable provider switching while keeping other knobs configurable):

	```json5
	{
	  messages: {
	    tts: {
	      modelOverrides: {
	        enabled: true,
	        allowProvider: true,
	        allowSeed: false,
	      },
	    },
	  },
	}
	```

	## Per-user preferences

	Slash commands write local overrides to `prefsPath` (default:
	`~/.openclaw/settings/tts.json`, override with `OPENCLAW_TTS_PREFS` or
	`messages.tts.prefsPath`).

	Stored fields:

	- `enabled`
	- `provider`
	- `maxLength` (summary threshold; default 1500 chars)
	- `summarize` (default `true`)

	These override `messages.tts.*` for that host.
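
As an illustration of that override relationship, local prefs can be thought of as a partial layer on top of the config-derived settings. The types and the `applyTtsPrefs` function below are hypothetical; the field names and defaults come from the list above, but the on-disk shape of the prefs file is not specified here.

```typescript
// Hypothetical shapes; field names follow the "Stored fields" list above.
interface TtsPrefs {
  enabled?: boolean;
  provider?: string;
  maxLength?: number;
  summarize?: boolean;
}

interface EffectiveTtsSettings {
  enabled: boolean;
  provider: string | undefined;
  maxLength: number;
  summarize: boolean;
}

function applyTtsPrefs(fromConfig: EffectiveTtsSettings, prefs: TtsPrefs): EffectiveTtsSettings {
  // Any field present in prefs wins; missing fields fall through to config.
  return {
    enabled: prefs.enabled ?? fromConfig.enabled,
    provider: prefs.provider ?? fromConfig.provider,
    maxLength: prefs.maxLength ?? fromConfig.maxLength,
    summarize: prefs.summarize ?? fromConfig.summarize,
  };
}

// Documented defaults: 1500-char summary threshold, summaries on.
const configDerived: EffectiveTtsSettings = {
  enabled: true,
  provider: "openai",
  maxLength: 1500,
  summarize: true,
};

// State after `/tts limit 2000` and `/tts summary off` on this host.
const afterCommands = applyTtsPrefs(configDerived, { maxLength: 2000, summarize: false });
// afterCommands: enabled true, provider "openai", maxLength 2000, summarize false
```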

	## Output formats (fixed)

	- Telegram: Opus voice note (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
	  - 48 kHz / 64 kbps is a good voice-note tradeoff and required for the round bubble.
	- Other channels: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
	  - 44.1 kHz / 128 kbps is the default balance for speech clarity.
	- Edge TTS: uses `edge.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
	  - `node-edge-tts` accepts an `outputFormat`, but not all formats are available
	    from the Edge service.
	  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
	  - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
	    guaranteed Opus voice notes.
	  - If the configured Edge output format fails, OpenClaw retries with MP3.

	OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.
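
The format mapping above can be summarized as a lookup. This is a sketch with hypothetical names (`pickOutputFormat`, `OutputChannel`); the format strings are the ones listed in this section, and the Edge retry-with-MP3 behavior is left out.

```typescript
// Hypothetical sketch of the documented provider/channel format mapping.
type OutputProvider = "openai" | "elevenlabs" | "edge";
type OutputChannel = "telegram" | "other";

const DEFAULT_EDGE_FORMAT = "audio-24khz-48kbitrate-mono-mp3";

function pickOutputFormat(
  provider: OutputProvider,
  channel: OutputChannel,
  edgeOutputFormat?: string,
): string {
  // Edge TTS always uses edge.outputFormat, regardless of channel.
  if (provider === "edge") return edgeOutputFormat || DEFAULT_EDGE_FORMAT;
  // Telegram voice notes use Opus; every other channel gets MP3.
  if (channel === "telegram") return provider === "elevenlabs" ? "opus_48000_64" : "opus";
  return provider === "elevenlabs" ? "mp3_44100_128" : "mp3";
}

const telegramEleven = pickOutputFormat("elevenlabs", "telegram");
// telegramEleven is "opus_48000_64"
const otherOpenAi = pickOutputFormat("openai", "other");
// otherOpenAi is "mp3"
```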

	## Auto-TTS behavior

	When enabled, OpenClaw:

	- skips TTS if the reply already contains media or a `MEDIA:` directive.
	- skips very short replies (< 10 chars).
	- summarizes long replies (when summaries are enabled) using `summaryModel`, or `agents.defaults.model.primary` if it is unset.
	- attaches the generated audio to the reply.

	If the reply exceeds `maxLength` and summary is off (or the summary model has no
	API key), audio is skipped and the normal text reply is sent.

	## Flow diagram

	```
	Reply -> TTS enabled?
	  no  -> send text
	  yes -> has media / MEDIA: / short?
	    yes -> send text
	    no  -> length > limit?
	      no  -> TTS -> attach audio
	      yes -> summary enabled?
	        no  -> send text
	        yes -> summarize (summaryModel or agents.defaults.model.primary)
	               -> TTS -> attach audio
	```
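
The same flow, written as a decision function. This is a sketch under stated assumptions: `decideAutoTts` and its option names are hypothetical, the short-reply check uses the raw character count, and "no API key for the summary model" is modeled as a boolean the caller supplies.

```typescript
// Hypothetical sketch of the auto-TTS decision flow shown above.
type AutoTtsDecision = "send-text" | "tts" | "summarize-then-tts";

interface AutoTtsReply {
  text: string;
  hasMedia: boolean; // reply already carries a media attachment
}

interface AutoTtsOptions {
  enabled: boolean; // auto-TTS applies to this reply
  maxLength: number; // summary threshold (default 1500 chars)
  summarize: boolean; // summaries enabled in prefs
  summaryModelAuthenticated: boolean; // summary model has credentials
}

function decideAutoTts(reply: AutoTtsReply, opts: AutoTtsOptions): AutoTtsDecision {
  if (!opts.enabled) return "send-text";
  // Skip replies that already contain media or a MEDIA: directive.
  if (reply.hasMedia || reply.text.indexOf("MEDIA:") !== -1) return "send-text";
  // Skip very short replies (< 10 chars).
  if (reply.text.length < 10) return "send-text";
  if (reply.text.length <= opts.maxLength) return "tts";
  // Over the limit: only proceed if a summary can actually be produced.
  if (!opts.summarize || !opts.summaryModelAuthenticated) return "send-text";
  return "summarize-then-tts";
}

const defaultAutoTtsOptions: AutoTtsOptions = {
  enabled: true,
  maxLength: 1500,
  summarize: true,
  summaryModelAuthenticated: true,
};

const shortReplyDecision = decideAutoTts({ text: "OK", hasMedia: false }, defaultAutoTtsOptions);
// shortReplyDecision is "send-text"
const normalReplyDecision = decideAutoTts(
  { text: "Your build finished without errors.", hasMedia: false },
  defaultAutoTtsOptions,
);
// normalReplyDecision is "tts"
```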

	## Slash command usage

	There is a single command: `/tts`.
	See [Slash commands](/tools/slash-commands) for enablement details.

	Discord note: `/tts` is a built-in Discord command, so OpenClaw registers
	`/voice` as the native command there. Text `/tts ...` still works.

	```
	/tts off
	/tts always
	/tts inbound
	/tts tagged
	/tts status
	/tts provider openai
	/tts limit 2000
	/tts summary off
	/tts audio Hello from OpenClaw
	```

	Notes:

	- Commands require an authorized sender (allowlist/owner rules still apply).
	- `commands.text` or native command registration must be enabled.
	- `off|always|inbound|tagged` are per‑session toggles (`/tts on` is an alias for `/tts always`).
	- `limit` and `summary` are stored in local prefs, not the main config.
	- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
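
For readers wiring up their own tooling around these commands, the subcommands listed above could be parsed along these lines. This is only a sketch: `parseTtsCommand` and the `ParsedTtsCommand` shape are hypothetical, authorization checks are out of scope, and accepting `/tts summary on` as the counterpart of the documented `/tts summary off` is an assumption.

```typescript
// Hypothetical parser for the text form of /tts; not OpenClaw's implementation.
type TtsAutoMode = "off" | "always" | "inbound" | "tagged";

type ParsedTtsCommand =
  | { kind: "mode"; mode: TtsAutoMode }
  | { kind: "status" }
  | { kind: "provider"; provider: string }
  | { kind: "limit"; chars: number }
  | { kind: "summary"; enabled: boolean }
  | { kind: "audio"; text: string }
  | { kind: "invalid"; reason: string };

const TTS_AUTO_MODES: TtsAutoMode[] = ["off", "always", "inbound", "tagged"];

function parseTtsCommand(input: string): ParsedTtsCommand {
  // Group 1: subcommand, group 2: the rest of the line (may contain spaces).
  const match = /^\/tts(?:\s+(\S+))?(?:\s+([\s\S]*))?$/.exec(input.trim());
  if (!match) return { kind: "invalid", reason: "not a /tts command" };
  const sub = (match[1] || "").toLowerCase();
  const arg = (match[2] || "").trim();

  if (sub === "on") return { kind: "mode", mode: "always" }; // documented alias
  for (const mode of TTS_AUTO_MODES) {
    if (mode === sub) return { kind: "mode", mode };
  }
  if (sub === "status") return { kind: "status" };
  if (sub === "provider" && arg) return { kind: "provider", provider: arg };
  if (sub === "limit" && /^\d+$/.test(arg)) return { kind: "limit", chars: Number(arg) };
  // Assumption: "on" is accepted alongside the documented "off".
  if (sub === "summary" && (arg === "on" || arg === "off")) {
    return { kind: "summary", enabled: arg === "on" };
  }
  if (sub === "audio" && arg) return { kind: "audio", text: arg };
  return { kind: "invalid", reason: "unknown or incomplete /tts subcommand" };
}

const audioCommand = parseTtsCommand("/tts audio Hello from OpenClaw");
// audioCommand is { kind: "audio", text: "Hello from OpenClaw" }
```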

	## Agent tool

	The `tts` tool converts text to speech and returns a `MEDIA:` path. When the
	result is Telegram-compatible, the tool includes `[[audio_as_voice]]` so
	Telegram sends a voice bubble.

	## Gateway RPC

	Gateway methods:

	- `tts.status`
	- `tts.enable`
	- `tts.disable`
	- `tts.convert`
	- `tts.setProvider`
	- `tts.providers`