Fix dropped conversation + add configurable voice volume
Browse filesConversation: the robot woke, listened, then went to sleep without ever
responding. Root cause: while waiting for speech, record_utterance()
called the *blocking* look_at_world() for head tracking, freezing the
audio loop ~0.4 s out of every 0.5 s — so the start of speech was missed
and the idle timeout fired. Fixes:
- head tracking while listening is now non-blocking (set_target)
- VAD threshold auto-calibrated from the ambient noise floor; RMS taken
over the loudest channel (one ReSpeaker channel can be near-silent)
- idle timeout 12 s -> 25 s
- INFO logging of ambient/threshold/max-RMS and speech start/end so the
behaviour can be tuned on-device
Voice volume: play_sound() exposes no volume control, so loudness is
applied at synthesis — edge-tts volume="+X%" and espeak -a. New
tts_volume setting (0-200, default 150 = louder) with a slider in the
web UI:
- talk/config.py: JSON settings store (~/.config/talk/settings.json),
outside the repo so it is never committed/packaged
- POST /set_config; /status now also returns tts_volume
Bump version 0.2.0 -> 0.3.0 so the robot picks up the new build.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- CLAUDE.md +5 -4
- pyproject.toml +1 -1
- talk/config.py +44 -0
- talk/main.py +17 -1
- talk/static/index.html +3 -0
- talk/static/main.js +27 -0
- talk/static/style.css +11 -0
- talk/stt.py +64 -28
- talk/tts.py +31 -6
|
@@ -70,16 +70,17 @@ SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna pres
|
|
| 70 |
- **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
|
| 71 |
- **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
|
| 72 |
- **CONVERSING** (inner loop):
|
| 73 |
-
- **LISTENING**: `record_utterance()` uses RMS-energy VAD
|
| 74 |
- **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
|
| 75 |
- **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
|
| 76 |
- Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
|
| 77 |
|
| 78 |
## Helper Modules
|
| 79 |
|
| 80 |
-
- **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
|
| 81 |
-
- **`talk/stt.py`**: records from ReSpeaker (16 kHz float32),
|
| 82 |
- **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Resolves the API key via `get_api_key()` — the web-UI file (`~/.config/talk/api_key`) first, then `ANTHROPIC_API_KEY`. Also exposes `has_api_key()` and `save_api_key()` for the web UI.
|
|
|
|
| 83 |
|
| 84 |
## Key SDK APIs
|
| 85 |
|
|
@@ -108,4 +109,4 @@ reachy_mini.goto_sleep()
|
|
| 108 |
|
| 109 |
## Settings UI
|
| 110 |
|
| 111 |
-
`talk/static/` polls `GET /status` every second. Returns `{state, last_user, last_assistant, api_key_set}`. Shows colour-coded status chip and conversation bubbles (user on right, assistant on left) during CONVERSING. An *Einstellungen* section lets the user enter the Anthropic API key
|
|
|
|
| 70 |
- **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
|
| 71 |
- **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
|
| 72 |
- **CONVERSING** (inner loop):
|
| 73 |
+
- **LISTENING**: `record_utterance()` uses RMS-energy VAD with a threshold auto-calibrated from the ambient noise floor; head tracking toward the speaker is **non-blocking** (`set_target`) so the audio loop is never frozen. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (25 s) of silence.
|
| 74 |
- **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
|
| 75 |
- **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
|
| 76 |
- Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
|
| 77 |
|
| 78 |
## Helper Modules
|
| 79 |
|
| 80 |
+
- **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Loudness is configurable (0-200, 100 = engine default) via the `tts_volume` setting — applied as edge-tts `volume="+X%"` and espeak `-a`. Blocks for estimated playback duration.
|
| 81 |
+
- **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), loudest-channel RMS-energy VAD with a threshold auto-calibrated from ambient noise (logs ambient/threshold/max-RMS for tuning), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Non-blocking DoA head tracking. Accepts an `idle_timeout` so a silent conversation returns to sleep.
|
| 82 |
- **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Resolves the API key via `get_api_key()` — the web-UI file (`~/.config/talk/api_key`) first, then `ANTHROPIC_API_KEY`. Also exposes `has_api_key()` and `save_api_key()` for the web UI.
|
| 83 |
+
- **`talk/config.py`**: JSON-backed non-secret settings store at `~/.config/talk/settings.json` (`get_setting`/`set_setting`). Holds `tts_volume`. Outside the repo so it is never committed/packaged.
|
| 84 |
|
| 85 |
## Key SDK APIs
|
| 86 |
|
|
|
|
| 109 |
|
| 110 |
## Settings UI
|
| 111 |
|
| 112 |
+
`talk/static/` polls `GET /status` every second. Returns `{state, last_user, last_assistant, api_key_set, tts_volume}`. Shows colour-coded status chip and conversation bubbles (user on right, assistant on left) during CONVERSING. An *Einstellungen* section lets the user (a) enter the Anthropic API key — `POST /set_api_key` (`{api_key}`) → `save_api_key()`, with the `api_key_set` flag driving a "key set?" indicator; and (b) set the voice loudness with a slider — `POST /set_config` (`{tts_volume}`) → `config.set_setting()`.
|
|
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
|
|
| 5 |
|
| 6 |
[project]
|
| 7 |
name = "talk"
|
| 8 |
-
version = "0.
|
| 9 |
description = "Wakes the robot on speech, announces the time, then chats with Claude in German"
|
| 10 |
readme = "README.md"
|
| 11 |
requires-python = ">=3.10"
|
|
|
|
| 5 |
|
| 6 |
[project]
|
| 7 |
name = "talk"
|
| 8 |
+
version = "0.3.0"
|
| 9 |
description = "Wakes the robot on speech, announces the time, then chats with Claude in German"
|
| 10 |
readme = "README.md"
|
| 11 |
requires-python = ">=3.10"
|
|
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""JSON-backed settings store for the Talk app (non-secret settings).
|
| 2 |
+
|
| 3 |
+
Stored at ``~/.config/talk/settings.json`` — outside the package directory so
|
| 4 |
+
it is never committed or packaged into the published Hugging Face Space. The
|
| 5 |
+
secret API key is handled separately in :mod:`talk.llm`.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import json
|
| 9 |
+
import logging
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from typing import Any
|
| 12 |
+
|
| 13 |
+
logger = logging.getLogger(__name__)
|
| 14 |
+
|
| 15 |
+
CONFIG_FILE = Path.home() / ".config" / "talk" / "settings.json"
|
| 16 |
+
|
| 17 |
+
# User-tunable defaults.
|
| 18 |
+
DEFAULTS: dict[str, Any] = {
|
| 19 |
+
"tts_volume": 150, # 0-200, where 100 = the TTS engine's default loudness
|
| 20 |
+
}
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _load() -> dict:
|
| 24 |
+
try:
|
| 25 |
+
if CONFIG_FILE.is_file():
|
| 26 |
+
return json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
|
| 27 |
+
except (OSError, ValueError) as exc:
|
| 28 |
+
logger.warning("Could not read settings %s: %s", CONFIG_FILE, exc)
|
| 29 |
+
return {}
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
def get_setting(key: str, default: Any = None) -> Any:
|
| 33 |
+
"""Return a stored setting, falling back to DEFAULTS then *default*."""
|
| 34 |
+
if default is None:
|
| 35 |
+
default = DEFAULTS.get(key)
|
| 36 |
+
return _load().get(key, default)
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def set_setting(key: str, value: Any) -> None:
|
| 40 |
+
"""Persist a single setting, creating the file/dirs as needed."""
|
| 41 |
+
data = _load()
|
| 42 |
+
data[key] = value
|
| 43 |
+
CONFIG_FILE.parent.mkdir(parents=True, exist_ok=True)
|
| 44 |
+
CONFIG_FILE.write_text(json.dumps(data, indent=2), encoding="utf-8")
|
|
@@ -25,6 +25,7 @@ from fastapi import HTTPException
|
|
| 25 |
from pydantic import BaseModel
|
| 26 |
from reachy_mini import ReachyMini, ReachyMiniApp
|
| 27 |
|
|
|
|
| 28 |
from talk.llm import get_response, has_api_key, save_api_key
|
| 29 |
from talk.stt import record_utterance, transcribe
|
| 30 |
from talk.tts import speak
|
|
@@ -35,7 +36,7 @@ AWAKE_ANTENNAS = [-0.1745, 0.1745]
|
|
| 35 |
AWAKE_PRESS_THRESHOLD = 0.35 # antenna-press exit during conversation (gesture-safe)
|
| 36 |
DEBOUNCE_AFTER_SPEAK = 2.0 # ignore audio this long after the robot speaks
|
| 37 |
DOA_DEBOUNCE = 3 # consecutive speech-detected readings to wake up
|
| 38 |
-
IDLE_TIMEOUT =
|
| 39 |
|
| 40 |
WEEKDAYS_DE = [
|
| 41 |
"Montag", "Dienstag", "Mittwoch", "Donnerstag",
|
|
@@ -114,6 +115,10 @@ class ApiKeyRequest(BaseModel):
|
|
| 114 |
api_key: str
|
| 115 |
|
| 116 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 117 |
class Talk(ReachyMiniApp):
|
| 118 |
custom_app_url: str | None = "http://0.0.0.0:8042"
|
| 119 |
request_media_backend: str | None = None
|
|
@@ -127,6 +132,7 @@ class Talk(ReachyMiniApp):
|
|
| 127 |
with _lock:
|
| 128 |
data = dict(_shared)
|
| 129 |
data["api_key_set"] = has_api_key()
|
|
|
|
| 130 |
return data
|
| 131 |
|
| 132 |
@self.settings_app.post("/set_api_key")
|
|
@@ -140,6 +146,16 @@ class Talk(ReachyMiniApp):
|
|
| 140 |
logger.info("API key updated via web UI")
|
| 141 |
return {"ok": True, "api_key_set": True}
|
| 142 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 143 |
reachy_mini.goto_sleep()
|
| 144 |
state = State.SLEEPING
|
| 145 |
last_spoke_at = 0.0
|
|
|
|
| 25 |
from pydantic import BaseModel
|
| 26 |
from reachy_mini import ReachyMini, ReachyMiniApp
|
| 27 |
|
| 28 |
+
from talk import config
|
| 29 |
from talk.llm import get_response, has_api_key, save_api_key
|
| 30 |
from talk.stt import record_utterance, transcribe
|
| 31 |
from talk.tts import speak
|
|
|
|
| 36 |
AWAKE_PRESS_THRESHOLD = 0.35 # antenna-press exit during conversation (gesture-safe)
|
| 37 |
DEBOUNCE_AFTER_SPEAK = 2.0 # ignore audio this long after the robot speaks
|
| 38 |
DOA_DEBOUNCE = 3 # consecutive speech-detected readings to wake up
|
| 39 |
+
IDLE_TIMEOUT = 25.0 # s of silence in conversation before returning to sleep
|
| 40 |
|
| 41 |
WEEKDAYS_DE = [
|
| 42 |
"Montag", "Dienstag", "Mittwoch", "Donnerstag",
|
|
|
|
| 115 |
api_key: str
|
| 116 |
|
| 117 |
|
| 118 |
+
class ConfigRequest(BaseModel):
|
| 119 |
+
tts_volume: int
|
| 120 |
+
|
| 121 |
+
|
| 122 |
class Talk(ReachyMiniApp):
|
| 123 |
custom_app_url: str | None = "http://0.0.0.0:8042"
|
| 124 |
request_media_backend: str | None = None
|
|
|
|
| 132 |
with _lock:
|
| 133 |
data = dict(_shared)
|
| 134 |
data["api_key_set"] = has_api_key()
|
| 135 |
+
data["tts_volume"] = config.get_setting("tts_volume", 150)
|
| 136 |
return data
|
| 137 |
|
| 138 |
@self.settings_app.post("/set_api_key")
|
|
|
|
| 146 |
logger.info("API key updated via web UI")
|
| 147 |
return {"ok": True, "api_key_set": True}
|
| 148 |
|
| 149 |
+
@self.settings_app.post("/set_config")
|
| 150 |
+
def set_config(body: ConfigRequest):
|
| 151 |
+
vol = max(0, min(200, int(body.tts_volume)))
|
| 152 |
+
try:
|
| 153 |
+
config.set_setting("tts_volume", vol)
|
| 154 |
+
except OSError as exc:
|
| 155 |
+
raise HTTPException(status_code=500, detail=f"Could not save config: {exc}")
|
| 156 |
+
logger.info("tts_volume set to %d via web UI", vol)
|
| 157 |
+
return {"ok": True, "tts_volume": vol}
|
| 158 |
+
|
| 159 |
reachy_mini.goto_sleep()
|
| 160 |
state = State.SLEEPING
|
| 161 |
last_spoke_at = 0.0
|
|
@@ -23,6 +23,9 @@
|
|
| 23 |
<button id="save-key" type="button">Speichern</button>
|
| 24 |
</div>
|
| 25 |
<p id="key-status" class="key-status"></p>
|
|
|
|
|
|
|
|
|
|
| 26 |
</details>
|
| 27 |
|
| 28 |
<script src="/static/main.js"></script>
|
|
|
|
| 23 |
<button id="save-key" type="button">Speichern</button>
|
| 24 |
</div>
|
| 25 |
<p id="key-status" class="key-status"></p>
|
| 26 |
+
|
| 27 |
+
<label for="volume" class="setting-label">Lautstärke: <span id="volume-val">–</span></label>
|
| 28 |
+
<input type="range" id="volume" class="slider" min="0" max="200" step="10" value="150">
|
| 29 |
</details>
|
| 30 |
|
| 31 |
<script src="/static/main.js"></script>
|
|
@@ -28,6 +28,27 @@ function reflectKeySet(isSet) {
|
|
| 28 |
}
|
| 29 |
}
|
| 30 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
saveBtn.addEventListener("click", async () => {
|
| 32 |
const key = keyInput.value.trim();
|
| 33 |
if (!key) {
|
|
@@ -90,6 +111,12 @@ async function poll() {
|
|
| 90 |
}
|
| 91 |
|
| 92 |
reflectKeySet(data.api_key_set);
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
} catch (_) {}
|
| 94 |
}
|
| 95 |
|
|
|
|
| 28 |
}
|
| 29 |
}
|
| 30 |
|
| 31 |
+
// --- Volume setting ---
|
| 32 |
+
const volumeInput = document.getElementById("volume");
|
| 33 |
+
const volumeVal = document.getElementById("volume-val");
|
| 34 |
+
|
| 35 |
+
function renderVolume(v) {
|
| 36 |
+
if (volumeVal) volumeVal.textContent = v + "%";
|
| 37 |
+
}
|
| 38 |
+
|
| 39 |
+
if (volumeInput) {
|
| 40 |
+
volumeInput.addEventListener("input", () => renderVolume(volumeInput.value));
|
| 41 |
+
volumeInput.addEventListener("change", async () => {
|
| 42 |
+
try {
|
| 43 |
+
await fetch("/set_config", {
|
| 44 |
+
method: "POST",
|
| 45 |
+
headers: { "Content-Type": "application/json" },
|
| 46 |
+
body: JSON.stringify({ tts_volume: parseInt(volumeInput.value, 10) }),
|
| 47 |
+
});
|
| 48 |
+
} catch (_) {}
|
| 49 |
+
});
|
| 50 |
+
}
|
| 51 |
+
|
| 52 |
saveBtn.addEventListener("click", async () => {
|
| 53 |
const key = keyInput.value.trim();
|
| 54 |
if (!key) {
|
|
|
|
| 111 |
}
|
| 112 |
|
| 113 |
reflectKeySet(data.api_key_set);
|
| 114 |
+
|
| 115 |
+
if (volumeInput && typeof data.tts_volume === "number"
|
| 116 |
+
&& document.activeElement !== volumeInput) {
|
| 117 |
+
volumeInput.value = data.tts_volume;
|
| 118 |
+
renderVolume(data.tts_volume);
|
| 119 |
+
}
|
| 120 |
} catch (_) {}
|
| 121 |
}
|
| 122 |
|
|
@@ -112,3 +112,14 @@ h1 { margin-bottom: 1rem; }
|
|
| 112 |
|
| 113 |
.key-status.ok { color: #065f46; }
|
| 114 |
.key-status.warn { color: #92400e; }
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
.key-status.ok { color: #065f46; }
|
| 114 |
.key-status.warn { color: #92400e; }
|
| 115 |
+
|
| 116 |
+
.setting-label {
|
| 117 |
+
display: block;
|
| 118 |
+
margin-top: 1rem;
|
| 119 |
+
margin-bottom: 0.35rem;
|
| 120 |
+
color: #555;
|
| 121 |
+
}
|
| 122 |
+
|
| 123 |
+
.slider {
|
| 124 |
+
width: 100%;
|
| 125 |
+
}
|
|
@@ -1,8 +1,10 @@
|
|
| 1 |
"""Speech recording + Google STT for the Talk app.
|
| 2 |
|
| 3 |
-
Records from the robot's ReSpeaker mic array (16 kHz, stereo float32)
|
| 4 |
-
|
| 5 |
-
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
import io
|
|
@@ -16,18 +18,23 @@ import numpy as np
|
|
| 16 |
|
| 17 |
logger = logging.getLogger(__name__)
|
| 18 |
|
| 19 |
-
SAMPLE_RATE = 16000
|
| 20 |
-
SILENCE_DURATION = 1.2
|
| 21 |
-
MAX_DURATION = 20.0
|
| 22 |
-
MIN_SPEECH_DURATION = 0.4
|
| 23 |
-
|
| 24 |
-
|
|
|
|
|
|
|
| 25 |
|
| 26 |
|
| 27 |
def _rms(chunk: np.ndarray) -> float:
|
| 28 |
-
"""RMS energy
|
| 29 |
-
|
| 30 |
-
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
|
| 33 |
def _chunks_to_wav_bytes(chunks: list) -> bytes:
|
|
@@ -44,32 +51,56 @@ def _chunks_to_wav_bytes(chunks: list) -> bytes:
|
|
| 44 |
return buf.getvalue()
|
| 45 |
|
| 46 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
def record_utterance(
|
| 48 |
reachy_mini,
|
| 49 |
stop_event,
|
| 50 |
should_stop: Callable,
|
| 51 |
idle_timeout: Optional[float] = None,
|
| 52 |
) -> tuple[list, float, bool]:
|
| 53 |
-
"""Wait for speech, record until silence
|
| 54 |
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
|
|
|
| 59 |
"""
|
| 60 |
chunks: list = []
|
| 61 |
last_speech_t: Optional[float] = None
|
| 62 |
speech_started_t: Optional[float] = None
|
| 63 |
last_doa_angle: float = math.pi / 2 # default: facing front
|
| 64 |
last_head_update: float = 0.0
|
|
|
|
| 65 |
|
| 66 |
-
# Drain stale audio buffered during TTS playback.
|
| 67 |
drained = 0
|
| 68 |
while reachy_mini.media.get_audio_sample() is not None:
|
| 69 |
drained += 1
|
| 70 |
if drained:
|
| 71 |
logger.debug("Drained %d stale audio chunks", drained)
|
| 72 |
|
|
|
|
|
|
|
|
|
|
| 73 |
wait_start = time.time()
|
| 74 |
|
| 75 |
while not stop_event.is_set():
|
|
@@ -80,20 +111,23 @@ def record_utterance(
|
|
| 80 |
|
| 81 |
if (idle_timeout is not None and speech_started_t is None
|
| 82 |
and now - wait_start > idle_timeout):
|
| 83 |
-
logger.
|
|
|
|
|
|
|
|
|
|
| 84 |
return [], last_doa_angle, False
|
| 85 |
|
| 86 |
-
#
|
|
|
|
| 87 |
doa = reachy_mini.media.get_DoA()
|
| 88 |
if doa is not None:
|
| 89 |
last_doa_angle = doa[0]
|
| 90 |
|
| 91 |
-
# Smooth head tracking toward speaker, throttled to HEAD_UPDATE_INTERVAL.
|
| 92 |
-
# Only while waiting for speech; once recording, head stays put.
|
| 93 |
if speech_started_t is None and now - last_head_update >= HEAD_UPDATE_INTERVAL:
|
| 94 |
y = math.sin(last_doa_angle - math.pi / 2) * 0.6
|
| 95 |
try:
|
| 96 |
-
reachy_mini.look_at_world(1.0, y, 0.0,
|
|
|
|
| 97 |
except Exception:
|
| 98 |
pass
|
| 99 |
last_head_update = now
|
|
@@ -102,9 +136,11 @@ def record_utterance(
|
|
| 102 |
chunk = reachy_mini.media.get_audio_sample()
|
| 103 |
if chunk is not None:
|
| 104 |
rms = _rms(chunk)
|
| 105 |
-
if rms >
|
|
|
|
|
|
|
| 106 |
if speech_started_t is None:
|
| 107 |
-
logger.
|
| 108 |
speech_started_t = now
|
| 109 |
last_speech_t = now
|
| 110 |
if speech_started_t is not None:
|
|
@@ -115,14 +151,14 @@ def record_utterance(
|
|
| 115 |
and now - last_speech_t > SILENCE_DURATION
|
| 116 |
and speech_started_t is not None
|
| 117 |
and now - speech_started_t > MIN_SPEECH_DURATION):
|
| 118 |
-
logger.
|
| 119 |
break
|
| 120 |
|
| 121 |
if speech_started_t is not None and now - speech_started_t > MAX_DURATION:
|
| 122 |
-
logger.
|
| 123 |
break
|
| 124 |
|
| 125 |
-
time.sleep(0.
|
| 126 |
|
| 127 |
return chunks, last_doa_angle, False
|
| 128 |
|
|
|
|
| 1 |
"""Speech recording + Google STT for the Talk app.
|
| 2 |
|
| 3 |
+
Records from the robot's ReSpeaker mic array (16 kHz, stereo float32) using the
|
| 4 |
+
SDK's recording pipeline. VAD is energy-based (RMS) with a threshold that is
|
| 5 |
+
auto-calibrated from the ambient noise floor each time we start listening, so it
|
| 6 |
+
adapts to room noise and mic gain. DoA is used only for (non-blocking) head
|
| 7 |
+
direction. Transcribes via Google Speech Recognition.
|
| 8 |
"""
|
| 9 |
|
| 10 |
import io
|
|
|
|
| 18 |
|
| 19 |
logger = logging.getLogger(__name__)
|
| 20 |
|
| 21 |
+
SAMPLE_RATE = 16000 # ReSpeaker hardware rate
|
| 22 |
+
SILENCE_DURATION = 1.2 # s of silence to end an utterance
|
| 23 |
+
MAX_DURATION = 20.0 # hard cap per utterance
|
| 24 |
+
MIN_SPEECH_DURATION = 0.4 # discard very short sounds (spurious noise)
|
| 25 |
+
RMS_FLOOR = 0.003 # minimum VAD threshold (float32 RMS)
|
| 26 |
+
NOISE_CALIB_DURATION = 0.4 # s of ambient sampling to calibrate the threshold
|
| 27 |
+
NOISE_MULT = 2.5 # speech must exceed (ambient RMS * this)
|
| 28 |
+
HEAD_UPDATE_INTERVAL = 0.6 # s between (non-blocking) head-direction updates
|
| 29 |
|
| 30 |
|
| 31 |
def _rms(chunk: np.ndarray) -> float:
|
| 32 |
+
"""Loudest-channel RMS energy (one ReSpeaker channel can be much quieter)."""
|
| 33 |
+
a = chunk.astype(np.float32)
|
| 34 |
+
if a.ndim > 1:
|
| 35 |
+
per_channel = np.sqrt(np.mean(a ** 2, axis=0))
|
| 36 |
+
return float(per_channel.max())
|
| 37 |
+
return float(np.sqrt(np.mean(a ** 2)))
|
| 38 |
|
| 39 |
|
| 40 |
def _chunks_to_wav_bytes(chunks: list) -> bytes:
|
|
|
|
| 51 |
return buf.getvalue()
|
| 52 |
|
| 53 |
|
| 54 |
+
def _calibrate_threshold(reachy_mini, stop_event) -> float:
|
| 55 |
+
"""Sample ambient noise briefly and derive a VAD threshold from it."""
|
| 56 |
+
samples: list[float] = []
|
| 57 |
+
end = time.time() + NOISE_CALIB_DURATION
|
| 58 |
+
while time.time() < end and not stop_event.is_set():
|
| 59 |
+
chunk = reachy_mini.media.get_audio_sample()
|
| 60 |
+
if chunk is not None:
|
| 61 |
+
samples.append(_rms(chunk))
|
| 62 |
+
time.sleep(0.02)
|
| 63 |
+
if samples:
|
| 64 |
+
samples.sort()
|
| 65 |
+
ambient = samples[len(samples) // 2] # median, robust to a stray sound
|
| 66 |
+
else:
|
| 67 |
+
ambient = 0.0
|
| 68 |
+
threshold = max(ambient * NOISE_MULT, RMS_FLOOR)
|
| 69 |
+
logger.info("VAD calibrated: ambient=%.4f -> threshold=%.4f", ambient, threshold)
|
| 70 |
+
return threshold
|
| 71 |
+
|
| 72 |
+
|
| 73 |
def record_utterance(
|
| 74 |
reachy_mini,
|
| 75 |
stop_event,
|
| 76 |
should_stop: Callable,
|
| 77 |
idle_timeout: Optional[float] = None,
|
| 78 |
) -> tuple[list, float, bool]:
|
| 79 |
+
"""Wait for speech, record until silence; return (chunks, doa_angle, stopped).
|
| 80 |
|
| 81 |
+
Energy-based VAD with a threshold auto-calibrated from the ambient noise
|
| 82 |
+
floor. The head tracks the speaker via DoA using *non-blocking* updates, so
|
| 83 |
+
the audio loop is never frozen while waiting — a blocking head move here used
|
| 84 |
+
to make the robot miss the very start of speech. If ``idle_timeout`` passes
|
| 85 |
+
with no speech, returns empty chunks so the caller can end the conversation.
|
| 86 |
"""
|
| 87 |
chunks: list = []
|
| 88 |
last_speech_t: Optional[float] = None
|
| 89 |
speech_started_t: Optional[float] = None
|
| 90 |
last_doa_angle: float = math.pi / 2 # default: facing front
|
| 91 |
last_head_update: float = 0.0
|
| 92 |
+
max_rms_seen: float = 0.0
|
| 93 |
|
| 94 |
+
# Drain stale audio buffered during TTS playback (echo of our own voice).
|
| 95 |
drained = 0
|
| 96 |
while reachy_mini.media.get_audio_sample() is not None:
|
| 97 |
drained += 1
|
| 98 |
if drained:
|
| 99 |
logger.debug("Drained %d stale audio chunks", drained)
|
| 100 |
|
| 101 |
+
# Calibrate the VAD threshold from a short ambient sample.
|
| 102 |
+
threshold = _calibrate_threshold(reachy_mini, stop_event)
|
| 103 |
+
|
| 104 |
wait_start = time.time()
|
| 105 |
|
| 106 |
while not stop_event.is_set():
|
|
|
|
| 111 |
|
| 112 |
if (idle_timeout is not None and speech_started_t is None
|
| 113 |
and now - wait_start > idle_timeout):
|
| 114 |
+
logger.info(
|
| 115 |
+
"No speech within %.0f s (max RMS %.4f, threshold %.4f) — idle",
|
| 116 |
+
idle_timeout, max_rms_seen, threshold,
|
| 117 |
+
)
|
| 118 |
return [], last_doa_angle, False
|
| 119 |
|
| 120 |
+
# Track DoA direction; nudge the head toward the speaker non-blocking
|
| 121 |
+
# (only while waiting — never freeze the audio loop).
|
| 122 |
doa = reachy_mini.media.get_DoA()
|
| 123 |
if doa is not None:
|
| 124 |
last_doa_angle = doa[0]
|
| 125 |
|
|
|
|
|
|
|
| 126 |
if speech_started_t is None and now - last_head_update >= HEAD_UPDATE_INTERVAL:
|
| 127 |
y = math.sin(last_doa_angle - math.pi / 2) * 0.6
|
| 128 |
try:
|
| 129 |
+
pose = reachy_mini.look_at_world(1.0, y, 0.0, perform_movement=False)
|
| 130 |
+
reachy_mini.set_target(head=pose)
|
| 131 |
except Exception:
|
| 132 |
pass
|
| 133 |
last_head_update = now
|
|
|
|
| 136 |
chunk = reachy_mini.media.get_audio_sample()
|
| 137 |
if chunk is not None:
|
| 138 |
rms = _rms(chunk)
|
| 139 |
+
if rms > max_rms_seen:
|
| 140 |
+
max_rms_seen = rms
|
| 141 |
+
if rms > threshold:
|
| 142 |
if speech_started_t is None:
|
| 143 |
+
logger.info("Speech started (RMS %.4f > %.4f)", rms, threshold)
|
| 144 |
speech_started_t = now
|
| 145 |
last_speech_t = now
|
| 146 |
if speech_started_t is not None:
|
|
|
|
| 151 |
and now - last_speech_t > SILENCE_DURATION
|
| 152 |
and speech_started_t is not None
|
| 153 |
and now - speech_started_t > MIN_SPEECH_DURATION):
|
| 154 |
+
logger.info("Utterance ended (%.1f s)", now - speech_started_t)
|
| 155 |
break
|
| 156 |
|
| 157 |
if speech_started_t is not None and now - speech_started_t > MAX_DURATION:
|
| 158 |
+
logger.info("Utterance hit max duration")
|
| 159 |
break
|
| 160 |
|
| 161 |
+
time.sleep(0.02)
|
| 162 |
|
| 163 |
return chunks, last_doa_angle, False
|
| 164 |
|
|
@@ -15,26 +15,50 @@ import tempfile
|
|
| 15 |
import time
|
| 16 |
from typing import Optional
|
| 17 |
|
|
|
|
|
|
|
| 18 |
logger = logging.getLogger(__name__)
|
| 19 |
|
| 20 |
EDGE_VOICE = "de-DE-KatjaNeural"
|
| 21 |
EDGE_RATE = "-5%" # slightly slower for clarity
|
| 22 |
|
| 23 |
|
| 24 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
import edge_tts
|
| 26 |
-
communicate = edge_tts.Communicate(
|
|
|
|
|
|
|
| 27 |
await communicate.save(path)
|
| 28 |
|
| 29 |
|
| 30 |
-
def speak(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
"""Synthesize *text* and play it through the robot's speakers.
|
| 32 |
|
| 33 |
Tries edge-tts first (neural quality), falls back to espeak-ng.
|
| 34 |
-
Blocks until playback should be complete.
|
|
|
|
| 35 |
"""
|
| 36 |
audio_path: Optional[str] = None
|
| 37 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 38 |
try:
|
| 39 |
# edge-tts outputs MP3; GStreamer playbin handles it natively.
|
| 40 |
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
|
|
@@ -42,7 +66,7 @@ def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de")
|
|
| 42 |
|
| 43 |
success = False
|
| 44 |
try:
|
| 45 |
-
asyncio.run(_edge_synthesize(text, audio_path))
|
| 46 |
success = True
|
| 47 |
except ImportError:
|
| 48 |
logger.warning("edge-tts not installed — falling back to espeak-ng")
|
|
@@ -59,7 +83,8 @@ def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de")
|
|
| 59 |
logger.warning("No TTS engine available. Install edge-tts or espeak-ng.")
|
| 60 |
return
|
| 61 |
subprocess.run(
|
| 62 |
-
[cmd, "-v", lang, "-s", str(words_per_minute),
|
|
|
|
| 63 |
check=True, timeout=15, capture_output=True,
|
| 64 |
)
|
| 65 |
|
|
|
|
| 15 |
import time
|
| 16 |
from typing import Optional
|
| 17 |
|
| 18 |
+
from talk import config
|
| 19 |
+
|
| 20 |
logger = logging.getLogger(__name__)
|
| 21 |
|
| 22 |
EDGE_VOICE = "de-DE-KatjaNeural"
|
| 23 |
EDGE_RATE = "-5%" # slightly slower for clarity
|
| 24 |
|
| 25 |
|
| 26 |
+
def _edge_volume_str(volume_pct: int) -> str:
|
| 27 |
+
"""Map a 0-200 loudness (100 = default) to edge-tts's relative volume."""
|
| 28 |
+
delta = max(-100, min(100, int(volume_pct) - 100))
|
| 29 |
+
return f"+{delta}%" if delta >= 0 else f"{delta}%"
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
async def _edge_synthesize(text: str, path: str, volume_pct: int) -> None:
|
| 33 |
import edge_tts
|
| 34 |
+
communicate = edge_tts.Communicate(
|
| 35 |
+
text, EDGE_VOICE, rate=EDGE_RATE, volume=_edge_volume_str(volume_pct)
|
| 36 |
+
)
|
| 37 |
await communicate.save(path)
|
| 38 |
|
| 39 |
|
| 40 |
+
def speak(
|
| 41 |
+
text: str,
|
| 42 |
+
reachy_mini,
|
| 43 |
+
words_per_minute: int = 130,
|
| 44 |
+
lang: str = "de",
|
| 45 |
+
volume: Optional[int] = None,
|
| 46 |
+
) -> None:
|
| 47 |
"""Synthesize *text* and play it through the robot's speakers.
|
| 48 |
|
| 49 |
Tries edge-tts first (neural quality), falls back to espeak-ng.
|
| 50 |
+
Blocks until playback should be complete. *volume* is 0-200 (100 = engine
|
| 51 |
+
default); when None it is read from the web-UI setting ``tts_volume``.
|
| 52 |
"""
|
| 53 |
audio_path: Optional[str] = None
|
| 54 |
|
| 55 |
+
if volume is None:
|
| 56 |
+
try:
|
| 57 |
+
volume = int(config.get_setting("tts_volume", 150))
|
| 58 |
+
except (TypeError, ValueError):
|
| 59 |
+
volume = 150
|
| 60 |
+
volume = max(0, min(200, volume))
|
| 61 |
+
|
| 62 |
try:
|
| 63 |
# edge-tts outputs MP3; GStreamer playbin handles it natively.
|
| 64 |
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
|
|
|
|
| 66 |
|
| 67 |
success = False
|
| 68 |
try:
|
| 69 |
+
asyncio.run(_edge_synthesize(text, audio_path, volume))
|
| 70 |
success = True
|
| 71 |
except ImportError:
|
| 72 |
logger.warning("edge-tts not installed — falling back to espeak-ng")
|
|
|
|
| 83 |
logger.warning("No TTS engine available. Install edge-tts or espeak-ng.")
|
| 84 |
return
|
| 85 |
subprocess.run(
|
| 86 |
+
[cmd, "-v", lang, "-s", str(words_per_minute),
|
| 87 |
+
"-a", str(volume), "-w", audio_path, "--", text],
|
| 88 |
check=True, timeout=15, capture_output=True,
|
| 89 |
)
|
| 90 |
|