	---
	summary: "Talk mode: continuous speech conversations with ElevenLabs TTS"
	read_when:
	- Implementing Talk mode on macOS/iOS/Android
	- Changing voice/TTS/interrupt behavior
	title: "Talk Mode"
	---

	# Talk Mode

	Talk mode is a continuous voice conversation loop:

	1. Listen for speech
	2. Send the transcript to the model (main session, `chat.send`)
	3. Wait for the response
	4. Speak it via ElevenLabs (streaming playback)
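The four steps above can be sketched as a single turn function driven in a loop. This is a minimal illustration, not the actual implementation: `listen`, `send`, and `speak` are hypothetical callables standing in for speech recognition, `chat.send`, and ElevenLabs playback, and treating a `None` transcript as "Talk mode was switched off" is an assumption made for the example.

```python
from typing import Callable, Optional


def talk_turn(
    listen: Callable[[], Optional[str]],
    send: Callable[[str], str],
    speak: Callable[[str], None],
) -> bool:
    """Run one listen -> send -> wait -> speak cycle.

    Returns False when `listen` yields None (assumed here to mean that
    Talk mode was disabled), True otherwise.
    """
    transcript = listen()            # 1. listen for speech
    if transcript is None:
        return False
    if not transcript.strip():
        return True                  # nothing was said; listen again
    reply = send(transcript)         # 2-3. send the transcript, wait for the reply
    speak(reply)                     # 4. speak the reply
    return True


def run_talk_mode(listen, send, speak) -> None:
    """Repeat turns until the listener signals that Talk mode ended."""
    while talk_turn(listen, send, speak):
        pass
```

Passing the three stages in as callables keeps the loop itself free of platform details, so the same skeleton could wrap different recognizers or players.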

	## Behavior (macOS)

	- Always-on overlay while Talk mode is enabled.
	- Listening → Thinking → Speaking phase transitions.
	- On a short pause (silence window), the current transcript is sent.
	- Replies are written to WebChat (same as typing).
	- Interrupt on speech (default on): if the user starts talking while the assistant is speaking, we stop playback and note the interruption timestamp for the next prompt.
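One way to picture the phase transitions and the interrupt rule is a small state machine. The sketch below is illustrative only: the class, method names, and the choice to store the interruption as a playback position in seconds are assumptions, not details taken from the app.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


@dataclass
class TalkState:
    phase: Phase = Phase.LISTENING
    interrupt_on_speech: bool = True          # mirrors the documented default
    interrupted_at: Optional[float] = None    # assumed: seconds into playback

    def on_silence(self, transcript: str) -> Optional[str]:
        """Silence window elapsed: return the transcript to send, if any."""
        if self.phase is Phase.LISTENING and transcript.strip():
            self.phase = Phase.THINKING
            return transcript.strip()
        return None

    def on_reply(self) -> None:
        """The model replied: start speaking."""
        if self.phase is Phase.THINKING:
            self.phase = Phase.SPEAKING

    def on_playback_finished(self) -> None:
        if self.phase is Phase.SPEAKING:
            self.phase = Phase.LISTENING

    def on_user_speech(self, playback_pos: float) -> bool:
        """Return True when playback should stop because the user spoke."""
        if self.phase is Phase.SPEAKING and self.interrupt_on_speech:
            self.interrupted_at = playback_pos   # kept for the next prompt
            self.phase = Phase.LISTENING
            return True
        return False
```

With `interrupt_on_speech` set to `False`, `on_user_speech` leaves the phase untouched and playback continues.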

	## Voice directives in replies

	The assistant may prefix its reply with a single JSON line to control voice:

	```json
	{ "voice": "<voice-id>", "once": true }
	```

	Rules:

	- First non-empty line only.
	- Unknown keys are ignored.
	- `once: true` applies to the current reply only.
	- Without `once`, the voice becomes the new default for Talk mode.
	- The JSON line is stripped before TTS playback.
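The first-line rule and the stripping step can be sketched as follows. This is a hedged illustration: `split_directive` is a hypothetical helper, and treating any JSON object on the first non-empty line as a directive (even one with no recognized keys) is an assumption.

```python
import json
from typing import Any, Dict, Optional, Tuple


def split_directive(reply: str) -> Tuple[Optional[Dict[str, Any]], str]:
    """Split a reply into (directive, text_to_speak).

    Only the first non-empty line is inspected. If it is not a JSON
    object, the reply is returned unchanged with no directive.
    """
    lines = reply.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue                      # skip leading blank lines
        try:
            parsed = json.loads(line.strip())
        except json.JSONDecodeError:
            return None, reply            # ordinary text, nothing to strip
        if not isinstance(parsed, dict):
            return None, reply            # e.g. a bare number or string
        # Drop the directive line so it is never sent to TTS.
        rest = "\n".join(lines[:i] + lines[i + 1:]).strip()
        return parsed, rest
    return None, reply
```

For example, `split_directive('{ "voice": "abc", "once": true }\nHello there.')` yields the parsed object and the text `Hello there.`, while a JSON line that appears after ordinary text is left in place.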

	Supported keys:

	- `voice` / `voice_id` / `voiceId`
	- `model` / `model_id` / `modelId`
	- `speed`, `rate` (WPM), `stability`, `similarity`, `style`, `speakerBoost`
	- `seed`, `normalize`, `lang`, `output_format`, `latency_tier`
	- `once`
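Because several keys have aliases, a consumer would likely normalize a directive before using it. The sketch below assumes `voiceId` and `modelId` as the canonical names (an arbitrary choice for illustration) and shows how `once` could decide whether the default voice changes; both function names are hypothetical.

```python
from typing import Any, Dict, Optional

# Alias -> canonical name. The canonical spellings are an assumption.
ALIASES = {
    "voice": "voiceId", "voice_id": "voiceId", "voiceId": "voiceId",
    "model": "modelId", "model_id": "modelId", "modelId": "modelId",
}
PLAIN_KEYS = {
    "speed", "rate", "stability", "similarity", "style", "speakerBoost",
    "seed", "normalize", "lang", "output_format", "latency_tier", "once",
}


def normalize_directive(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map aliases to canonical keys and drop unknown keys."""
    out: Dict[str, Any] = {}
    for key, value in raw.items():
        if key in ALIASES:
            out[ALIASES[key]] = value
        elif key in PLAIN_KEYS:
            out[key] = value
        # Anything else is ignored, per the rules above.
    return out


def voice_for_reply(directive: Dict[str, Any], defaults: Dict[str, Any]) -> Optional[str]:
    """Pick the voice for this reply; persist it unless `once` is true."""
    voice = directive.get("voiceId")
    if voice is None:
        return defaults.get("voiceId")
    if directive.get("once") is not True:
        defaults["voiceId"] = voice       # becomes the new Talk mode default
    return voice
```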

	## Config (`~/.openclaw/openclaw.json`)

	```json5
	{
	  talk: {
	    voiceId: "elevenlabs_voice_id",
	    modelId: "eleven_v3",
	    outputFormat: "mp3_44100_128",
	    apiKey: "elevenlabs_api_key",
	    silenceTimeoutMs: 1500,
	    interruptOnSpeech: true,
	  },
	}
	```

	Defaults:

	- `interruptOnSpeech`: true
	- `silenceTimeoutMs`: when unset, Talk keeps the platform default pause window before sending the transcript (700 ms on macOS and Android, 900 ms on iOS)
	- `voiceId`: falls back to `ELEVENLABS_VOICE_ID` / `SAG_VOICE_ID` (or first ElevenLabs voice when API key is available)
	- `modelId`: defaults to `eleven_v3` when unset
	- `apiKey`: falls back to `ELEVENLABS_API_KEY` (or gateway shell profile if available)
	- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)
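The fallback chain above can be read as a small resolution function. This is a sketch under stated assumptions: `resolve_talk_config` and the platform strings are hypothetical, `ELEVENLABS_VOICE_ID` is assumed to win over `SAG_VOICE_ID` because it is listed first, and the network and shell-profile fallbacks (first ElevenLabs voice, gateway shell profile) are left out.

```python
from typing import Any, Dict, Mapping, Optional

# Platform defaults as documented above.
SILENCE_DEFAULT_MS = {"macos": 700, "android": 700, "ios": 900}
OUTPUT_FORMAT_DEFAULT = {
    "macos": "pcm_44100",
    "ios": "pcm_44100",
    "android": "pcm_24000",
}


def _first(*values: Optional[str]) -> Optional[str]:
    """Return the first non-empty value, or None."""
    for value in values:
        if value:
            return value
    return None


def resolve_talk_config(
    talk: Mapping[str, Any], env: Mapping[str, str], platform: str
) -> Dict[str, Any]:
    """Apply the documented defaults to a `talk` config section."""
    return {
        "voiceId": _first(
            talk.get("voiceId"),
            env.get("ELEVENLABS_VOICE_ID"),   # assumed to take precedence
            env.get("SAG_VOICE_ID"),
        ),
        "modelId": talk.get("modelId") or "eleven_v3",
        "apiKey": _first(talk.get("apiKey"), env.get("ELEVENLABS_API_KEY")),
        "outputFormat": talk.get("outputFormat") or OUTPUT_FORMAT_DEFAULT[platform],
        "silenceTimeoutMs": talk.get("silenceTimeoutMs", SILENCE_DEFAULT_MS[platform]),
        "interruptOnSpeech": talk.get("interruptOnSpeech", True),
    }
```

In practice `env` would be `os.environ`; taking it as a parameter keeps the function easy to test.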

	## macOS UI

	- Menu bar toggle: Talk
	- Config tab: Talk Mode group (voice id + interrupt toggle)
	- Overlay:
	  - Listening: cloud pulses with mic level
	  - Thinking: sinking animation
	  - Speaking: radiating rings
	  - Click cloud: stop speaking
	  - Click X: exit Talk mode

	## Notes

	- Requires Speech + Microphone permissions.
	- Uses `chat.send` against session key `main`.
	- TTS uses the ElevenLabs streaming API with `ELEVENLABS_API_KEY`, with incremental playback on macOS/iOS/Android for lower latency.
	- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
	- `latency_tier` is validated to `0..4` when set.
	- Android supports `pcm_16000`, `pcm_22050`, `pcm_24000`, and `pcm_44100` output formats for low-latency AudioTrack streaming.
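The validation notes above can be expressed as simple predicates. This sketch only reports whether a value is acceptable; whether the app rejects or adjusts out-of-range values is not stated here, and the function names, the integer-only reading of `latency_tier`, and the `other_model` placeholder are assumptions.

```python
# Android PCM formats listed in the notes above.
ANDROID_PCM_FORMATS = {"pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100"}


def is_valid_stability(model_id: str, value: float) -> bool:
    """`eleven_v3` accepts only three discrete values; other models accept 0..1."""
    if model_id == "eleven_v3":
        return value in (0.0, 0.5, 1.0)
    return 0.0 <= value <= 1.0


def is_valid_latency_tier(value: object) -> bool:
    """Accept integers 0..4 (assumed integer-only; bools are rejected)."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 4


def is_android_pcm_format(output_format: str) -> bool:
    """True when the format is one of the listed Android PCM formats."""
    return output_format in ANDROID_PCM_FORMATS
```

For example, `is_valid_stability("eleven_v3", 0.3)` is `False`, while the same value passes for a model named `other_model`.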