onitsche Claude Opus 4.8 commited on
Commit
6c690a7
·
1 Parent(s): c0b616c

Fix dropped conversation + add configurable voice volume

Browse files

Conversation: the robot woke, listened, then went to sleep without ever
responding. Root cause: while waiting for speech, record_utterance()
called the *blocking* look_at_world() for head tracking, freezing the
audio loop ~0.4 s out of every 0.5 s — so the start of speech was missed
and the idle timeout fired. Fixes:
- head tracking while listening is now non-blocking (set_target)
- VAD threshold auto-calibrated from the ambient noise floor; RMS taken
over the loudest channel (one ReSpeaker channel can be near-silent)
- idle timeout 12 s -> 25 s
- INFO logging of ambient/threshold/max-RMS and speech start/end so the
behaviour can be tuned on-device

Voice volume: play_sound() exposes no volume control, so loudness is
applied at synthesis — edge-tts volume="+X%" and espeak -a. New
tts_volume setting (0-200, default 150 = louder) with a slider in the
web UI:
- talk/config.py: JSON settings store (~/.config/talk/settings.json),
outside the repo so it is never committed/packaged
- POST /set_config; /status now also returns tts_volume

Bump version 0.2.0 -> 0.3.0 so the robot picks up the new build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CLAUDE.md CHANGED
@@ -70,16 +70,17 @@ SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna pres
70
  - **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
71
  - **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
72
  - **CONVERSING** (inner loop):
73
- - **LISTENING**: `record_utterance()` uses RMS-energy VAD, tracks head toward the speaker via DoA. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (12 s) of silence.
74
  - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
75
  - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
76
  - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
77
 
78
  ## Helper Modules
79
 
80
- - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
81
- - **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), uses RMS-energy VAD (DoA is for head direction only), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Accepts an `idle_timeout` so a silent conversation returns to sleep.
82
  - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Resolves the API key via `get_api_key()` — the web-UI file (`~/.config/talk/api_key`) first, then `ANTHROPIC_API_KEY`. Also exposes `has_api_key()` and `save_api_key()` for the web UI.
 
83
 
84
  ## Key SDK APIs
85
 
@@ -108,4 +109,4 @@ reachy_mini.goto_sleep()
108
 
109
  ## Settings UI
110
 
111
- `talk/static/` polls `GET /status` every second. Returns `{state, last_user, last_assistant, api_key_set}`. Shows colour-coded status chip and conversation bubbles (user on right, assistant on left) during CONVERSING. An *Einstellungen* section lets the user enter the Anthropic API key, which is `POST`ed to `/set_api_key` (`{api_key}`) and persisted via `save_api_key()`; the `api_key_set` flag drives a "key set?" indicator.
 
70
  - **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
71
  - **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
72
  - **CONVERSING** (inner loop):
73
+ - **LISTENING**: `record_utterance()` uses RMS-energy VAD with a threshold auto-calibrated from the ambient noise floor; head tracking toward the speaker is **non-blocking** (`set_target`) so the audio loop is never frozen. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (25 s) of silence.
74
  - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
75
  - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
76
  - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
77
 
78
  ## Helper Modules
79
 
80
+ - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Loudness is configurable (0-200, 100 = engine default) via the `tts_volume` setting — applied as edge-tts `volume="+X%"` and espeak `-a`. Blocks for estimated playback duration.
81
+ - **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), loudest-channel RMS-energy VAD with a threshold auto-calibrated from ambient noise (logs ambient/threshold/max-RMS for tuning), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Non-blocking DoA head tracking. Accepts an `idle_timeout` so a silent conversation returns to sleep.
82
  - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Resolves the API key via `get_api_key()` — the web-UI file (`~/.config/talk/api_key`) first, then `ANTHROPIC_API_KEY`. Also exposes `has_api_key()` and `save_api_key()` for the web UI.
83
+ - **`talk/config.py`**: JSON-backed non-secret settings store at `~/.config/talk/settings.json` (`get_setting`/`set_setting`). Holds `tts_volume`. Outside the repo so it is never committed/packaged.
84
 
85
  ## Key SDK APIs
86
 
 
109
 
110
  ## Settings UI
111
 
112
+ `talk/static/` polls `GET /status` every second. Returns `{state, last_user, last_assistant, api_key_set, tts_volume}`. Shows colour-coded status chip and conversation bubbles (user on right, assistant on left) during CONVERSING. An *Einstellungen* section lets the user (a) enter the Anthropic API key `POST /set_api_key` (`{api_key}`) `save_api_key()`, with the `api_key_set` flag driving a "key set?" indicator; and (b) set the voice loudness with a slider — `POST /set_config` (`{tts_volume}`) → `config.set_setting()`.
pyproject.toml CHANGED
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
5
 
6
  [project]
7
  name = "talk"
8
- version = "0.2.0"
9
  description = "Wakes the robot on speech, announces the time, then chats with Claude in German"
10
  readme = "README.md"
11
  requires-python = ">=3.10"
 
5
 
6
  [project]
7
  name = "talk"
8
+ version = "0.3.0"
9
  description = "Wakes the robot on speech, announces the time, then chats with Claude in German"
10
  readme = "README.md"
11
  requires-python = ">=3.10"
talk/config.py ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """JSON-backed settings store for the Talk app (non-secret settings).
2
+
3
+ Stored at ``~/.config/talk/settings.json`` — outside the package directory so
4
+ it is never committed or packaged into the published Hugging Face Space. The
5
+ secret API key is handled separately in :mod:`talk.llm`.
6
+ """
7
+
8
+ import json
9
+ import logging
10
+ from pathlib import Path
11
+ from typing import Any
12
+
13
+ logger = logging.getLogger(__name__)
14
+
15
+ CONFIG_FILE = Path.home() / ".config" / "talk" / "settings.json"
16
+
17
+ # User-tunable defaults.
18
+ DEFAULTS: dict[str, Any] = {
19
+ "tts_volume": 150, # 0-200, where 100 = the TTS engine's default loudness
20
+ }
21
+
22
+
23
+ def _load() -> dict:
24
+ try:
25
+ if CONFIG_FILE.is_file():
26
+ return json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
27
+ except (OSError, ValueError) as exc:
28
+ logger.warning("Could not read settings %s: %s", CONFIG_FILE, exc)
29
+ return {}
30
+
31
+
32
+ def get_setting(key: str, default: Any = None) -> Any:
33
+ """Return a stored setting, falling back to DEFAULTS then *default*."""
34
+ if default is None:
35
+ default = DEFAULTS.get(key)
36
+ return _load().get(key, default)
37
+
38
+
39
+ def set_setting(key: str, value: Any) -> None:
40
+ """Persist a single setting, creating the file/dirs as needed."""
41
+ data = _load()
42
+ data[key] = value
43
+ CONFIG_FILE.parent.mkdir(parents=True, exist_ok=True)
44
+ CONFIG_FILE.write_text(json.dumps(data, indent=2), encoding="utf-8")
talk/main.py CHANGED
@@ -25,6 +25,7 @@ from fastapi import HTTPException
25
  from pydantic import BaseModel
26
  from reachy_mini import ReachyMini, ReachyMiniApp
27
 
 
28
  from talk.llm import get_response, has_api_key, save_api_key
29
  from talk.stt import record_utterance, transcribe
30
  from talk.tts import speak
@@ -35,7 +36,7 @@ AWAKE_ANTENNAS = [-0.1745, 0.1745]
35
  AWAKE_PRESS_THRESHOLD = 0.35 # antenna-press exit during conversation (gesture-safe)
36
  DEBOUNCE_AFTER_SPEAK = 2.0 # ignore audio this long after the robot speaks
37
  DOA_DEBOUNCE = 3 # consecutive speech-detected readings to wake up
38
- IDLE_TIMEOUT = 12.0 # s of silence in conversation before returning to sleep
39
 
40
  WEEKDAYS_DE = [
41
  "Montag", "Dienstag", "Mittwoch", "Donnerstag",
@@ -114,6 +115,10 @@ class ApiKeyRequest(BaseModel):
114
  api_key: str
115
 
116
 
 
 
 
 
117
  class Talk(ReachyMiniApp):
118
  custom_app_url: str | None = "http://0.0.0.0:8042"
119
  request_media_backend: str | None = None
@@ -127,6 +132,7 @@ class Talk(ReachyMiniApp):
127
  with _lock:
128
  data = dict(_shared)
129
  data["api_key_set"] = has_api_key()
 
130
  return data
131
 
132
  @self.settings_app.post("/set_api_key")
@@ -140,6 +146,16 @@ class Talk(ReachyMiniApp):
140
  logger.info("API key updated via web UI")
141
  return {"ok": True, "api_key_set": True}
142
 
 
 
 
 
 
 
 
 
 
 
143
  reachy_mini.goto_sleep()
144
  state = State.SLEEPING
145
  last_spoke_at = 0.0
 
25
  from pydantic import BaseModel
26
  from reachy_mini import ReachyMini, ReachyMiniApp
27
 
28
+ from talk import config
29
  from talk.llm import get_response, has_api_key, save_api_key
30
  from talk.stt import record_utterance, transcribe
31
  from talk.tts import speak
 
36
  AWAKE_PRESS_THRESHOLD = 0.35 # antenna-press exit during conversation (gesture-safe)
37
  DEBOUNCE_AFTER_SPEAK = 2.0 # ignore audio this long after the robot speaks
38
  DOA_DEBOUNCE = 3 # consecutive speech-detected readings to wake up
39
+ IDLE_TIMEOUT = 25.0 # s of silence in conversation before returning to sleep
40
 
41
  WEEKDAYS_DE = [
42
  "Montag", "Dienstag", "Mittwoch", "Donnerstag",
 
115
  api_key: str
116
 
117
 
118
+ class ConfigRequest(BaseModel):
119
+ tts_volume: int
120
+
121
+
122
  class Talk(ReachyMiniApp):
123
  custom_app_url: str | None = "http://0.0.0.0:8042"
124
  request_media_backend: str | None = None
 
132
  with _lock:
133
  data = dict(_shared)
134
  data["api_key_set"] = has_api_key()
135
+ data["tts_volume"] = config.get_setting("tts_volume", 150)
136
  return data
137
 
138
  @self.settings_app.post("/set_api_key")
 
146
  logger.info("API key updated via web UI")
147
  return {"ok": True, "api_key_set": True}
148
 
149
+ @self.settings_app.post("/set_config")
150
+ def set_config(body: ConfigRequest):
151
+ vol = max(0, min(200, int(body.tts_volume)))
152
+ try:
153
+ config.set_setting("tts_volume", vol)
154
+ except OSError as exc:
155
+ raise HTTPException(status_code=500, detail=f"Could not save config: {exc}")
156
+ logger.info("tts_volume set to %d via web UI", vol)
157
+ return {"ok": True, "tts_volume": vol}
158
+
159
  reachy_mini.goto_sleep()
160
  state = State.SLEEPING
161
  last_spoke_at = 0.0
talk/static/index.html CHANGED
@@ -23,6 +23,9 @@
23
  <button id="save-key" type="button">Speichern</button>
24
  </div>
25
  <p id="key-status" class="key-status"></p>
 
 
 
26
  </details>
27
 
28
  <script src="/static/main.js"></script>
 
23
  <button id="save-key" type="button">Speichern</button>
24
  </div>
25
  <p id="key-status" class="key-status"></p>
26
+
27
+ <label for="volume" class="setting-label">Lautstärke: <span id="volume-val">–</span></label>
28
+ <input type="range" id="volume" class="slider" min="0" max="200" step="10" value="150">
29
  </details>
30
 
31
  <script src="/static/main.js"></script>
talk/static/main.js CHANGED
@@ -28,6 +28,27 @@ function reflectKeySet(isSet) {
28
  }
29
  }
30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  saveBtn.addEventListener("click", async () => {
32
  const key = keyInput.value.trim();
33
  if (!key) {
@@ -90,6 +111,12 @@ async function poll() {
90
  }
91
 
92
  reflectKeySet(data.api_key_set);
 
 
 
 
 
 
93
  } catch (_) {}
94
  }
95
 
 
28
  }
29
  }
30
 
31
+ // --- Volume setting ---
32
+ const volumeInput = document.getElementById("volume");
33
+ const volumeVal = document.getElementById("volume-val");
34
+
35
+ function renderVolume(v) {
36
+ if (volumeVal) volumeVal.textContent = v + "%";
37
+ }
38
+
39
+ if (volumeInput) {
40
+ volumeInput.addEventListener("input", () => renderVolume(volumeInput.value));
41
+ volumeInput.addEventListener("change", async () => {
42
+ try {
43
+ await fetch("/set_config", {
44
+ method: "POST",
45
+ headers: { "Content-Type": "application/json" },
46
+ body: JSON.stringify({ tts_volume: parseInt(volumeInput.value, 10) }),
47
+ });
48
+ } catch (_) {}
49
+ });
50
+ }
51
+
52
  saveBtn.addEventListener("click", async () => {
53
  const key = keyInput.value.trim();
54
  if (!key) {
 
111
  }
112
 
113
  reflectKeySet(data.api_key_set);
114
+
115
+ if (volumeInput && typeof data.tts_volume === "number"
116
+ && document.activeElement !== volumeInput) {
117
+ volumeInput.value = data.tts_volume;
118
+ renderVolume(data.tts_volume);
119
+ }
120
  } catch (_) {}
121
  }
122
 
talk/static/style.css CHANGED
@@ -112,3 +112,14 @@ h1 { margin-bottom: 1rem; }
112
 
113
  .key-status.ok { color: #065f46; }
114
  .key-status.warn { color: #92400e; }
 
 
 
 
 
 
 
 
 
 
 
 
112
 
113
  .key-status.ok { color: #065f46; }
114
  .key-status.warn { color: #92400e; }
115
+
116
+ .setting-label {
117
+ display: block;
118
+ margin-top: 1rem;
119
+ margin-bottom: 0.35rem;
120
+ color: #555;
121
+ }
122
+
123
+ .slider {
124
+ width: 100%;
125
+ }
talk/stt.py CHANGED
@@ -1,8 +1,10 @@
1
  """Speech recording + Google STT for the Talk app.
2
 
3
- Records from the robot's ReSpeaker mic array (16 kHz, stereo float32)
4
- using the SDK's recording pipeline. Uses RMS energy for VAD; DoA for
5
- head-direction tracking only. Transcribes via Google Speech Recognition.
 
 
6
  """
7
 
8
  import io
@@ -16,18 +18,23 @@ import numpy as np
16
 
17
  logger = logging.getLogger(__name__)
18
 
19
- SAMPLE_RATE = 16000 # ReSpeaker hardware rate
20
- SILENCE_DURATION = 1.2 # s of silence to end utterance
21
- MAX_DURATION = 20.0 # hard cap per utterance
22
- MIN_SPEECH_DURATION = 0.4 # discard very short sounds (spurious noise)
23
- RMS_SPEECH_THRESHOLD = 0.005 # float32 RMS; tunable
24
- HEAD_UPDATE_INTERVAL = 0.5 # s between head-direction updates while waiting
 
 
25
 
26
 
27
  def _rms(chunk: np.ndarray) -> float:
28
- """RMS energy of channel 0 (handles mono or multi-channel float32)."""
29
- ch0 = chunk[:, 0] if chunk.ndim > 1 else chunk
30
- return float(np.sqrt(np.mean(ch0 ** 2)))
 
 
 
31
 
32
 
33
  def _chunks_to_wav_bytes(chunks: list) -> bytes:
@@ -44,32 +51,56 @@ def _chunks_to_wav_bytes(chunks: list) -> bytes:
44
  return buf.getvalue()
45
 
46
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
47
  def record_utterance(
48
  reachy_mini,
49
  stop_event,
50
  should_stop: Callable,
51
  idle_timeout: Optional[float] = None,
52
  ) -> tuple[list, float, bool]:
53
- """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).
54
 
55
- VAD is energy-based (RMS threshold). DoA is used only for head tracking
56
- while waiting updates are throttled to avoid jerky movement. If
57
- idle_timeout is set and no speech begins within that window, returns empty
58
- chunks (antenna_pressed=False) so the caller can end the conversation.
 
59
  """
60
  chunks: list = []
61
  last_speech_t: Optional[float] = None
62
  speech_started_t: Optional[float] = None
63
  last_doa_angle: float = math.pi / 2 # default: facing front
64
  last_head_update: float = 0.0
 
65
 
66
- # Drain stale audio buffered during TTS playback.
67
  drained = 0
68
  while reachy_mini.media.get_audio_sample() is not None:
69
  drained += 1
70
  if drained:
71
  logger.debug("Drained %d stale audio chunks", drained)
72
 
 
 
 
73
  wait_start = time.time()
74
 
75
  while not stop_event.is_set():
@@ -80,20 +111,23 @@ def record_utterance(
80
 
81
  if (idle_timeout is not None and speech_started_t is None
82
  and now - wait_start > idle_timeout):
83
- logger.debug("No speech within %.1f s — idle", idle_timeout)
 
 
 
84
  return [], last_doa_angle, False
85
 
86
- # Update DoA angle (direction only not used as VAD).
 
87
  doa = reachy_mini.media.get_DoA()
88
  if doa is not None:
89
  last_doa_angle = doa[0]
90
 
91
- # Smooth head tracking toward speaker, throttled to HEAD_UPDATE_INTERVAL.
92
- # Only while waiting for speech; once recording, head stays put.
93
  if speech_started_t is None and now - last_head_update >= HEAD_UPDATE_INTERVAL:
94
  y = math.sin(last_doa_angle - math.pi / 2) * 0.6
95
  try:
96
- reachy_mini.look_at_world(1.0, y, 0.0, duration=0.4)
 
97
  except Exception:
98
  pass
99
  last_head_update = now
@@ -102,9 +136,11 @@ def record_utterance(
102
  chunk = reachy_mini.media.get_audio_sample()
103
  if chunk is not None:
104
  rms = _rms(chunk)
105
- if rms > RMS_SPEECH_THRESHOLD:
 
 
106
  if speech_started_t is None:
107
- logger.debug("Speech started (RMS %.4f)", rms)
108
  speech_started_t = now
109
  last_speech_t = now
110
  if speech_started_t is not None:
@@ -115,14 +151,14 @@ def record_utterance(
115
  and now - last_speech_t > SILENCE_DURATION
116
  and speech_started_t is not None
117
  and now - speech_started_t > MIN_SPEECH_DURATION):
118
- logger.debug("Utterance ended (%.1f s)", now - speech_started_t)
119
  break
120
 
121
  if speech_started_t is not None and now - speech_started_t > MAX_DURATION:
122
- logger.debug("Utterance hit max duration")
123
  break
124
 
125
- time.sleep(0.03)
126
 
127
  return chunks, last_doa_angle, False
128
 
 
1
  """Speech recording + Google STT for the Talk app.
2
 
3
+ Records from the robot's ReSpeaker mic array (16 kHz, stereo float32) using the
4
+ SDK's recording pipeline. VAD is energy-based (RMS) with a threshold that is
5
+ auto-calibrated from the ambient noise floor each time we start listening, so it
6
+ adapts to room noise and mic gain. DoA is used only for (non-blocking) head
7
+ direction. Transcribes via Google Speech Recognition.
8
  """
9
 
10
  import io
 
18
 
19
  logger = logging.getLogger(__name__)
20
 
21
+ SAMPLE_RATE = 16000 # ReSpeaker hardware rate
22
+ SILENCE_DURATION = 1.2 # s of silence to end an utterance
23
+ MAX_DURATION = 20.0 # hard cap per utterance
24
+ MIN_SPEECH_DURATION = 0.4 # discard very short sounds (spurious noise)
25
+ RMS_FLOOR = 0.003 # minimum VAD threshold (float32 RMS)
26
+ NOISE_CALIB_DURATION = 0.4 # s of ambient sampling to calibrate the threshold
27
+ NOISE_MULT = 2.5 # speech must exceed (ambient RMS * this)
28
+ HEAD_UPDATE_INTERVAL = 0.6 # s between (non-blocking) head-direction updates
29
 
30
 
31
  def _rms(chunk: np.ndarray) -> float:
32
+ """Loudest-channel RMS energy (one ReSpeaker channel can be much quieter)."""
33
+ a = chunk.astype(np.float32)
34
+ if a.ndim > 1:
35
+ per_channel = np.sqrt(np.mean(a ** 2, axis=0))
36
+ return float(per_channel.max())
37
+ return float(np.sqrt(np.mean(a ** 2)))
38
 
39
 
40
  def _chunks_to_wav_bytes(chunks: list) -> bytes:
 
51
  return buf.getvalue()
52
 
53
 
54
+ def _calibrate_threshold(reachy_mini, stop_event) -> float:
55
+ """Sample ambient noise briefly and derive a VAD threshold from it."""
56
+ samples: list[float] = []
57
+ end = time.time() + NOISE_CALIB_DURATION
58
+ while time.time() < end and not stop_event.is_set():
59
+ chunk = reachy_mini.media.get_audio_sample()
60
+ if chunk is not None:
61
+ samples.append(_rms(chunk))
62
+ time.sleep(0.02)
63
+ if samples:
64
+ samples.sort()
65
+ ambient = samples[len(samples) // 2] # median, robust to a stray sound
66
+ else:
67
+ ambient = 0.0
68
+ threshold = max(ambient * NOISE_MULT, RMS_FLOOR)
69
+ logger.info("VAD calibrated: ambient=%.4f -> threshold=%.4f", ambient, threshold)
70
+ return threshold
71
+
72
+
73
  def record_utterance(
74
  reachy_mini,
75
  stop_event,
76
  should_stop: Callable,
77
  idle_timeout: Optional[float] = None,
78
  ) -> tuple[list, float, bool]:
79
+ """Wait for speech, record until silence; return (chunks, doa_angle, stopped).
80
 
81
+ Energy-based VAD with a threshold auto-calibrated from the ambient noise
82
+ floor. The head tracks the speaker via DoA using *non-blocking* updates, so
83
+ the audio loop is never frozen while waiting a blocking head move here used
84
+ to make the robot miss the very start of speech. If ``idle_timeout`` passes
85
+ with no speech, returns empty chunks so the caller can end the conversation.
86
  """
87
  chunks: list = []
88
  last_speech_t: Optional[float] = None
89
  speech_started_t: Optional[float] = None
90
  last_doa_angle: float = math.pi / 2 # default: facing front
91
  last_head_update: float = 0.0
92
+ max_rms_seen: float = 0.0
93
 
94
+ # Drain stale audio buffered during TTS playback (echo of our own voice).
95
  drained = 0
96
  while reachy_mini.media.get_audio_sample() is not None:
97
  drained += 1
98
  if drained:
99
  logger.debug("Drained %d stale audio chunks", drained)
100
 
101
+ # Calibrate the VAD threshold from a short ambient sample.
102
+ threshold = _calibrate_threshold(reachy_mini, stop_event)
103
+
104
  wait_start = time.time()
105
 
106
  while not stop_event.is_set():
 
111
 
112
  if (idle_timeout is not None and speech_started_t is None
113
  and now - wait_start > idle_timeout):
114
+ logger.info(
115
+ "No speech within %.0f s (max RMS %.4f, threshold %.4f) — idle",
116
+ idle_timeout, max_rms_seen, threshold,
117
+ )
118
  return [], last_doa_angle, False
119
 
120
+ # Track DoA direction; nudge the head toward the speaker non-blocking
121
+ # (only while waiting — never freeze the audio loop).
122
  doa = reachy_mini.media.get_DoA()
123
  if doa is not None:
124
  last_doa_angle = doa[0]
125
 
 
 
126
  if speech_started_t is None and now - last_head_update >= HEAD_UPDATE_INTERVAL:
127
  y = math.sin(last_doa_angle - math.pi / 2) * 0.6
128
  try:
129
+ pose = reachy_mini.look_at_world(1.0, y, 0.0, perform_movement=False)
130
+ reachy_mini.set_target(head=pose)
131
  except Exception:
132
  pass
133
  last_head_update = now
 
136
  chunk = reachy_mini.media.get_audio_sample()
137
  if chunk is not None:
138
  rms = _rms(chunk)
139
+ if rms > max_rms_seen:
140
+ max_rms_seen = rms
141
+ if rms > threshold:
142
  if speech_started_t is None:
143
+ logger.info("Speech started (RMS %.4f > %.4f)", rms, threshold)
144
  speech_started_t = now
145
  last_speech_t = now
146
  if speech_started_t is not None:
 
151
  and now - last_speech_t > SILENCE_DURATION
152
  and speech_started_t is not None
153
  and now - speech_started_t > MIN_SPEECH_DURATION):
154
+ logger.info("Utterance ended (%.1f s)", now - speech_started_t)
155
  break
156
 
157
  if speech_started_t is not None and now - speech_started_t > MAX_DURATION:
158
+ logger.info("Utterance hit max duration")
159
  break
160
 
161
+ time.sleep(0.02)
162
 
163
  return chunks, last_doa_angle, False
164
 
talk/tts.py CHANGED
@@ -15,26 +15,50 @@ import tempfile
15
  import time
16
  from typing import Optional
17
 
 
 
18
  logger = logging.getLogger(__name__)
19
 
20
  EDGE_VOICE = "de-DE-KatjaNeural"
21
  EDGE_RATE = "-5%" # slightly slower for clarity
22
 
23
 
24
- async def _edge_synthesize(text: str, path: str) -> None:
 
 
 
 
 
 
25
  import edge_tts
26
- communicate = edge_tts.Communicate(text, EDGE_VOICE, rate=EDGE_RATE)
 
 
27
  await communicate.save(path)
28
 
29
 
30
- def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de") -> None:
 
 
 
 
 
 
31
  """Synthesize *text* and play it through the robot's speakers.
32
 
33
  Tries edge-tts first (neural quality), falls back to espeak-ng.
34
- Blocks until playback should be complete.
 
35
  """
36
  audio_path: Optional[str] = None
37
 
 
 
 
 
 
 
 
38
  try:
39
  # edge-tts outputs MP3; GStreamer playbin handles it natively.
40
  with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
@@ -42,7 +66,7 @@ def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de")
42
 
43
  success = False
44
  try:
45
- asyncio.run(_edge_synthesize(text, audio_path))
46
  success = True
47
  except ImportError:
48
  logger.warning("edge-tts not installed — falling back to espeak-ng")
@@ -59,7 +83,8 @@ def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de")
59
  logger.warning("No TTS engine available. Install edge-tts or espeak-ng.")
60
  return
61
  subprocess.run(
62
- [cmd, "-v", lang, "-s", str(words_per_minute), "-w", audio_path, "--", text],
 
63
  check=True, timeout=15, capture_output=True,
64
  )
65
 
 
15
  import time
16
  from typing import Optional
17
 
18
+ from talk import config
19
+
20
  logger = logging.getLogger(__name__)
21
 
22
  EDGE_VOICE = "de-DE-KatjaNeural"
23
  EDGE_RATE = "-5%" # slightly slower for clarity
24
 
25
 
26
+ def _edge_volume_str(volume_pct: int) -> str:
27
+ """Map a 0-200 loudness (100 = default) to edge-tts's relative volume."""
28
+ delta = max(-100, min(100, int(volume_pct) - 100))
29
+ return f"+{delta}%" if delta >= 0 else f"{delta}%"
30
+
31
+
32
+ async def _edge_synthesize(text: str, path: str, volume_pct: int) -> None:
33
  import edge_tts
34
+ communicate = edge_tts.Communicate(
35
+ text, EDGE_VOICE, rate=EDGE_RATE, volume=_edge_volume_str(volume_pct)
36
+ )
37
  await communicate.save(path)
38
 
39
 
40
+ def speak(
41
+ text: str,
42
+ reachy_mini,
43
+ words_per_minute: int = 130,
44
+ lang: str = "de",
45
+ volume: Optional[int] = None,
46
+ ) -> None:
47
  """Synthesize *text* and play it through the robot's speakers.
48
 
49
  Tries edge-tts first (neural quality), falls back to espeak-ng.
50
+ Blocks until playback should be complete. *volume* is 0-200 (100 = engine
51
+ default); when None it is read from the web-UI setting ``tts_volume``.
52
  """
53
  audio_path: Optional[str] = None
54
 
55
+ if volume is None:
56
+ try:
57
+ volume = int(config.get_setting("tts_volume", 150))
58
+ except (TypeError, ValueError):
59
+ volume = 150
60
+ volume = max(0, min(200, volume))
61
+
62
  try:
63
  # edge-tts outputs MP3; GStreamer playbin handles it natively.
64
  with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
 
66
 
67
  success = False
68
  try:
69
+ asyncio.run(_edge_synthesize(text, audio_path, volume))
70
  success = True
71
  except ImportError:
72
  logger.warning("edge-tts not installed — falling back to espeak-ng")
 
83
  logger.warning("No TTS engine available. Install edge-tts or espeak-ng.")
84
  return
85
  subprocess.run(
86
+ [cmd, "-v", lang, "-s", str(words_per_minute),
87
+ "-a", str(volume), "-w", audio_path, "--", text],
88
  check=True, timeout=15, capture_output=True,
89
  )
90