Space: onitsche/talk

onitsche and Claude Opus 4.8 committed 3 days ago

Commit 6c690a7 (1 parent: c0b616c)

Fix dropped conversation + add configurable voice volume

Conversation: the robot woke, listened, then went to sleep without ever
responding. Root cause: while waiting for speech, record_utterance()
called the *blocking* look_at_world() for head tracking, freezing the
audio loop ~0.4 s out of every 0.5 s — so the start of speech was missed
and the idle timeout fired. Fixes:
- head tracking while listening is now non-blocking (set_target)
- VAD threshold auto-calibrated from the ambient noise floor; RMS taken
over the loudest channel (one ReSpeaker channel can be near-silent)
- idle timeout 12 s -> 25 s
- INFO logging of ambient/threshold/max-RMS and speech start/end so the
behaviour can be tuned on-device
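
The "~0.4 s out of every 0.5 s" figure can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the commit: `blocked_fraction` is a hypothetical helper, and it assumes the head-update timer restarts when a move begins, as the old loop did by stamping `last_head_update` with the pre-move time.

```python
def blocked_fraction(move_duration: float, update_interval: float) -> float:
    """Share of each head-update cycle spent inside a blocking move.

    Assumption: the timer restarts when the move begins, so a new move
    starts every `update_interval` seconds (or back-to-back if the move
    takes longer than the interval).
    """
    cycle = max(update_interval, move_duration)
    return move_duration / cycle


# Old behaviour: look_at_world(..., duration=0.4) every 0.5 s.
old = blocked_fraction(0.4, 0.5)

# New behaviour: set_target() is treated here as returning immediately
# (an assumption for this sketch), checked every 0.6 s.
new = blocked_fraction(0.0, 0.6)

print(f"old loop blocked {old:.0%} of the time, new loop {new:.0%}")
```

Under these assumptions the old loop could poll audio only about a fifth of the time while waiting, which is consistent with the missed speech onsets described above.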

Voice volume: play_sound() exposes no volume control, so loudness is
applied at synthesis — edge-tts volume="+X%" and espeak -a. New
tts_volume setting (0-200, default 150 = louder) with a slider in the
web UI:
- talk/config.py: JSON settings store (~/.config/talk/settings.json),
outside the repo so it is never committed/packaged
- POST /set_config; /status now also returns tts_volume

Bump version 0.2.0 -> 0.3.0 so the robot picks up the new build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Files changed (9)

CLAUDE.md +5 -4
pyproject.toml +1 -1
talk/config.py +44 -0
talk/main.py +17 -1
talk/static/index.html +3 -0
talk/static/main.js +27 -0
talk/static/style.css +11 -0
talk/stt.py +64 -28
talk/tts.py +31 -6

CLAUDE.md CHANGED

@@ -70,16 +70,17 @@ SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna pres
 - **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
 - **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
 - **CONVERSING** (inner loop):
-  - **LISTENING**: `record_utterance()` uses RMS-energy VAD, tracks head toward the speaker via DoA. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (12 s) of silence.
   - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
   - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
   - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
 ## Helper Modules
-- **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
-- **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), uses RMS-energy VAD (DoA is for head direction only), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Accepts an `idle_timeout` so a silent conversation returns to sleep.
 - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Resolves the API key via `get_api_key()` — the web-UI file (`~/.config/talk/api_key`) first, then `ANTHROPIC_API_KEY`. Also exposes `has_api_key()` and `save_api_key()` for the web UI.
 ## Key SDK APIs
@@ -108,4 +109,4 @@ reachy_mini.goto_sleep()
 ## Settings UI
-`talk/static/` polls `GET /status` every second. Returns `{state, last_user, last_assistant, api_key_set}`. Shows colour-coded status chip and conversation bubbles (user on right, assistant on left) during CONVERSING. An *Einstellungen* section lets the user enter the Anthropic API key, which is `POST`ed to `/set_api_key` (`{api_key}`) and persisted via `save_api_key()`; the `api_key_set` flag drives a "key set?" indicator.

 - **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
 - **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
 - **CONVERSING** (inner loop):
+  - **LISTENING**: `record_utterance()` uses RMS-energy VAD with a threshold auto-calibrated from the ambient noise floor; head tracking toward the speaker is **non-blocking** (`set_target`) so the audio loop is never frozen. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (25 s) of silence.
   - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
   - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
   - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
 ## Helper Modules
+- **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Loudness is configurable (0-200, 100 = engine default) via the `tts_volume` setting — applied as edge-tts `volume="+X%"` and espeak `-a`. Blocks for estimated playback duration.
+- **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), loudest-channel RMS-energy VAD with a threshold auto-calibrated from ambient noise (logs ambient/threshold/max-RMS for tuning), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Non-blocking DoA head tracking. Accepts an `idle_timeout` so a silent conversation returns to sleep.
 - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Resolves the API key via `get_api_key()` — the web-UI file (`~/.config/talk/api_key`) first, then `ANTHROPIC_API_KEY`. Also exposes `has_api_key()` and `save_api_key()` for the web UI.
+- **`talk/config.py`**: JSON-backed non-secret settings store at `~/.config/talk/settings.json` (`get_setting`/`set_setting`). Holds `tts_volume`. Outside the repo so it is never committed/packaged.
 ## Key SDK APIs
 ## Settings UI
+`talk/static/` polls `GET /status` every second. Returns `{state, last_user, last_assistant, api_key_set, tts_volume}`. Shows colour-coded status chip and conversation bubbles (user on right, assistant on left) during CONVERSING. An *Einstellungen* section lets the user (a) enter the Anthropic API key — `POST /set_api_key` (`{api_key}`) → `save_api_key()`, with the `api_key_set` flag driving a "key set?" indicator; and (b) set the voice loudness with a slider — `POST /set_config` (`{tts_volume}`) → `config.set_setting()`.

pyproject.toml CHANGED

@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "talk"
-version = "0.2.0"
 description = "Wakes the robot on speech, announces the time, then chats with Claude in German"
 readme = "README.md"
 requires-python = ">=3.10"

 [project]
 name = "talk"
+version = "0.3.0"
 description = "Wakes the robot on speech, announces the time, then chats with Claude in German"
 readme = "README.md"
 requires-python = ">=3.10"

talk/config.py ADDED

	@@ -0,0 +1,44 @@

+"""JSON-backed settings store for the Talk app (non-secret settings).
+Stored at ``~/.config/talk/settings.json`` — outside the package directory so
+it is never committed or packaged into the published Hugging Face Space. The
+secret API key is handled separately in :mod:`talk.llm`.
+"""
+import json
+import logging
+from pathlib import Path
+from typing import Any
+logger = logging.getLogger(__name__)
+CONFIG_FILE = Path.home() / ".config" / "talk" / "settings.json"
+# User-tunable defaults.
+DEFAULTS: dict[str, Any] = {
+    "tts_volume": 150,  # 0-200, where 100 = the TTS engine's default loudness
+}
+def _load() -> dict:
+    try:
+        if CONFIG_FILE.is_file():
+            return json.loads(CONFIG_FILE.read_text(encoding="utf-8"))
+    except (OSError, ValueError) as exc:
+        logger.warning("Could not read settings %s: %s", CONFIG_FILE, exc)
+    return {}
+def get_setting(key: str, default: Any = None) -> Any:
+    """Return a stored setting; fall back to *default* if given, else to DEFAULTS."""
+    if default is None:
+        default = DEFAULTS.get(key)
+    return _load().get(key, default)
+def set_setting(key: str, value: Any) -> None:
+    """Persist a single setting, creating the file/dirs as needed."""
+    data = _load()
+    data[key] = value
+    CONFIG_FILE.parent.mkdir(parents=True, exist_ok=True)
+    CONFIG_FILE.write_text(json.dumps(data, indent=2), encoding="utf-8")
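
To make the lookup order of the store concrete, here is a minimal self-contained sketch of the same read/merge/write pattern. It is not the module itself: the functions take an explicit `path` argument (a change made for this sketch so it can run against a temporary directory instead of `~/.config/talk/settings.json`), and `demo` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

DEFAULTS: dict[str, Any] = {"tts_volume": 150}


def load(path: Path) -> dict:
    """Read the JSON file; treat a missing or unreadable file as empty."""
    try:
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        pass
    return {}


def get_setting(path: Path, key: str, default: Any = None) -> Any:
    """Stored value first, then an explicit default, then DEFAULTS."""
    if default is None:
        default = DEFAULTS.get(key)
    return load(path).get(key, default)


def set_setting(path: Path, key: str, value: Any) -> None:
    """Merge one key into the file, creating parent directories as needed."""
    data = load(path)
    data[key] = value
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "talk" / "settings.json"
        before = get_setting(path, "tts_volume")        # nothing stored yet
        explicit = get_setting(path, "tts_volume", 80)  # explicit default wins
        set_setting(path, "tts_volume", 120)
        after = get_setting(path, "tts_volume", 80)     # stored value wins
        return [before, explicit, after]


print(demo())
```

Note that an explicit `default` takes precedence over `DEFAULTS`; the built-in default is used only when the caller passes none.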

talk/main.py CHANGED

@@ -25,6 +25,7 @@ from fastapi import HTTPException
 from pydantic import BaseModel
 from reachy_mini import ReachyMini, ReachyMiniApp
 from talk.llm import get_response, has_api_key, save_api_key
 from talk.stt import record_utterance, transcribe
 from talk.tts import speak
@@ -35,7 +36,7 @@ AWAKE_ANTENNAS = [-0.1745, 0.1745]
 AWAKE_PRESS_THRESHOLD = 0.35   # antenna-press exit during conversation (gesture-safe)
 DEBOUNCE_AFTER_SPEAK = 2.0     # ignore audio this long after the robot speaks
 DOA_DEBOUNCE = 3               # consecutive speech-detected readings to wake up
-IDLE_TIMEOUT = 12.0            # s of silence in conversation before returning to sleep
 WEEKDAYS_DE = [
     "Montag", "Dienstag", "Mittwoch", "Donnerstag",
@@ -114,6 +115,10 @@ class ApiKeyRequest(BaseModel):
     api_key: str
 class Talk(ReachyMiniApp):
     custom_app_url: str | None = "http://0.0.0.0:8042"
     request_media_backend: str | None = None
@@ -127,6 +132,7 @@ class Talk(ReachyMiniApp):
             with _lock:
                 data = dict(_shared)
             data["api_key_set"] = has_api_key()
             return data
         @self.settings_app.post("/set_api_key")
@@ -140,6 +146,16 @@ class Talk(ReachyMiniApp):
             logger.info("API key updated via web UI")
             return {"ok": True, "api_key_set": True}
         reachy_mini.goto_sleep()
         state = State.SLEEPING
         last_spoke_at = 0.0

 from pydantic import BaseModel
 from reachy_mini import ReachyMini, ReachyMiniApp
+from talk import config
 from talk.llm import get_response, has_api_key, save_api_key
 from talk.stt import record_utterance, transcribe
 from talk.tts import speak
 AWAKE_PRESS_THRESHOLD = 0.35   # antenna-press exit during conversation (gesture-safe)
 DEBOUNCE_AFTER_SPEAK = 2.0     # ignore audio this long after the robot speaks
 DOA_DEBOUNCE = 3               # consecutive speech-detected readings to wake up
+IDLE_TIMEOUT = 25.0            # s of silence in conversation before returning to sleep
 WEEKDAYS_DE = [
     "Montag", "Dienstag", "Mittwoch", "Donnerstag",
     api_key: str
+class ConfigRequest(BaseModel):
+    tts_volume: int
 class Talk(ReachyMiniApp):
     custom_app_url: str | None = "http://0.0.0.0:8042"
     request_media_backend: str | None = None
             with _lock:
                 data = dict(_shared)
             data["api_key_set"] = has_api_key()
+            data["tts_volume"] = config.get_setting("tts_volume", 150)
             return data
         @self.settings_app.post("/set_api_key")
             logger.info("API key updated via web UI")
             return {"ok": True, "api_key_set": True}
+        @self.settings_app.post("/set_config")
+        def set_config(body: ConfigRequest):
+            vol = max(0, min(200, int(body.tts_volume)))
+            try:
+                config.set_setting("tts_volume", vol)
+            except OSError as exc:
+                raise HTTPException(status_code=500, detail=f"Could not save config: {exc}") from exc
+            logger.info("tts_volume set to %d via web UI", vol)
+            return {"ok": True, "tts_volume": vol}
         reachy_mini.goto_sleep()
         state = State.SLEEPING
         last_spoke_at = 0.0

talk/static/index.html CHANGED

@@ -23,6 +23,9 @@
             <button id="save-key" type="button">Speichern</button>
         </div>
         <p id="key-status" class="key-status"></p>
     </details>
     <script src="/static/main.js"></script>

             <button id="save-key" type="button">Speichern</button>
         </div>
         <p id="key-status" class="key-status"></p>
+        <label for="volume" class="setting-label">Lautstärke: <span id="volume-val">–</span></label>
+        <input type="range" id="volume" class="slider" min="0" max="200" step="10" value="150">
     </details>
     <script src="/static/main.js"></script>

talk/static/main.js CHANGED

@@ -28,6 +28,27 @@ function reflectKeySet(isSet) {
     }
 }
 saveBtn.addEventListener("click", async () => {
     const key = keyInput.value.trim();
     if (!key) {
@@ -90,6 +111,12 @@ async function poll() {
         }
         reflectKeySet(data.api_key_set);
     } catch (_) {}
 }

     }
 }
+// --- Volume setting ---
+const volumeInput = document.getElementById("volume");
+const volumeVal = document.getElementById("volume-val");
+function renderVolume(v) {
+    if (volumeVal) volumeVal.textContent = v + "%";
+}
+if (volumeInput) {
+    volumeInput.addEventListener("input", () => renderVolume(volumeInput.value));
+    volumeInput.addEventListener("change", async () => {
+        try {
+            await fetch("/set_config", {
+                method: "POST",
+                headers: { "Content-Type": "application/json" },
+                body: JSON.stringify({ tts_volume: parseInt(volumeInput.value, 10) }),
+            });
+        } catch (_) {}
+    });
+}
 saveBtn.addEventListener("click", async () => {
     const key = keyInput.value.trim();
     if (!key) {
         }
         reflectKeySet(data.api_key_set);
+        if (volumeInput && typeof data.tts_volume === "number"
+                && document.activeElement !== volumeInput) {
+            volumeInput.value = data.tts_volume;
+            renderVolume(data.tts_volume);
+        }
     } catch (_) {}
 }

talk/static/style.css CHANGED

@@ -112,3 +112,14 @@ h1 { margin-bottom: 1rem; }
 .key-status.ok { color: #065f46; }
 .key-status.warn { color: #92400e; }

 .key-status.ok { color: #065f46; }
 .key-status.warn { color: #92400e; }
+.setting-label {
+    display: block;
+    margin-top: 1rem;
+    margin-bottom: 0.35rem;
+    color: #555;
+}
+.slider {
+    width: 100%;
+}

talk/stt.py CHANGED

@@ -1,8 +1,10 @@
 """Speech recording + Google STT for the Talk app.
-Records from the robot's ReSpeaker mic array (16 kHz, stereo float32)
-using the SDK's recording pipeline. Uses RMS energy for VAD; DoA for
-head-direction tracking only. Transcribes via Google Speech Recognition.
 """
 import io
@@ -16,18 +18,23 @@ import numpy as np
 logger = logging.getLogger(__name__)
-SAMPLE_RATE = 16000       # ReSpeaker hardware rate
-SILENCE_DURATION = 1.2    # s of silence to end utterance
-MAX_DURATION = 20.0       # hard cap per utterance
-MIN_SPEECH_DURATION = 0.4 # discard very short sounds (spurious noise)
-RMS_SPEECH_THRESHOLD = 0.005  # float32 RMS; tunable
-HEAD_UPDATE_INTERVAL = 0.5    # s between head-direction updates while waiting
 def _rms(chunk: np.ndarray) -> float:
-    """RMS energy of channel 0 (handles mono or multi-channel float32)."""
-    ch0 = chunk[:, 0] if chunk.ndim > 1 else chunk
-    return float(np.sqrt(np.mean(ch0 ** 2)))
 def _chunks_to_wav_bytes(chunks: list) -> bytes:
@@ -44,32 +51,56 @@ def _chunks_to_wav_bytes(chunks: list) -> bytes:
     return buf.getvalue()
 def record_utterance(
     reachy_mini,
     stop_event,
     should_stop: Callable,
     idle_timeout: Optional[float] = None,
 ) -> tuple[list, float, bool]:
-    """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).
-    VAD is energy-based (RMS threshold). DoA is used only for head tracking
-    while waiting — updates are throttled to avoid jerky movement. If
-    idle_timeout is set and no speech begins within that window, returns empty
-    chunks (antenna_pressed=False) so the caller can end the conversation.
     """
     chunks: list = []
     last_speech_t: Optional[float] = None
     speech_started_t: Optional[float] = None
     last_doa_angle: float = math.pi / 2   # default: facing front
     last_head_update: float = 0.0
-    # Drain stale audio buffered during TTS playback.
     drained = 0
     while reachy_mini.media.get_audio_sample() is not None:
         drained += 1
     if drained:
         logger.debug("Drained %d stale audio chunks", drained)
     wait_start = time.time()
     while not stop_event.is_set():
@@ -80,20 +111,23 @@ def record_utterance(
         if (idle_timeout is not None and speech_started_t is None
                 and now - wait_start > idle_timeout):
-            logger.debug("No speech within %.1f s — idle", idle_timeout)
             return [], last_doa_angle, False
-        # Update DoA angle (direction only — not used as VAD).
         doa = reachy_mini.media.get_DoA()
         if doa is not None:
             last_doa_angle = doa[0]
-        # Smooth head tracking toward speaker, throttled to HEAD_UPDATE_INTERVAL.
-        # Only while waiting for speech; once recording, head stays put.
         if speech_started_t is None and now - last_head_update >= HEAD_UPDATE_INTERVAL:
             y = math.sin(last_doa_angle - math.pi / 2) * 0.6
             try:
-                reachy_mini.look_at_world(1.0, y, 0.0, duration=0.4)
             except Exception:
                 pass
             last_head_update = now
@@ -102,9 +136,11 @@ def record_utterance(
         chunk = reachy_mini.media.get_audio_sample()
         if chunk is not None:
             rms = _rms(chunk)
-            if rms > RMS_SPEECH_THRESHOLD:
                 if speech_started_t is None:
-                    logger.debug("Speech started (RMS %.4f)", rms)
                     speech_started_t = now
                 last_speech_t = now
             if speech_started_t is not None:
@@ -115,14 +151,14 @@ def record_utterance(
                 and now - last_speech_t > SILENCE_DURATION
                 and speech_started_t is not None
                 and now - speech_started_t > MIN_SPEECH_DURATION):
-            logger.debug("Utterance ended (%.1f s)", now - speech_started_t)
             break
         if speech_started_t is not None and now - speech_started_t > MAX_DURATION:
-            logger.debug("Utterance hit max duration")
             break
-        time.sleep(0.03)
     return chunks, last_doa_angle, False

 """Speech recording + Google STT for the Talk app.
+Records from the robot's ReSpeaker mic array (16 kHz, stereo float32) using the
+SDK's recording pipeline. VAD is energy-based (RMS) with a threshold that is
+auto-calibrated from the ambient noise floor each time we start listening, so it
+adapts to room noise and mic gain. DoA is used only for (non-blocking) head
+direction. Transcribes via Google Speech Recognition.
 """
 import io
 logger = logging.getLogger(__name__)
+SAMPLE_RATE = 16000        # ReSpeaker hardware rate
+SILENCE_DURATION = 1.2     # s of silence to end an utterance
+MAX_DURATION = 20.0        # hard cap per utterance
+MIN_SPEECH_DURATION = 0.4  # discard very short sounds (spurious noise)
+RMS_FLOOR = 0.003          # minimum VAD threshold (float32 RMS)
+NOISE_CALIB_DURATION = 0.4 # s of ambient sampling to calibrate the threshold
+NOISE_MULT = 2.5           # speech must exceed (ambient RMS * this)
+HEAD_UPDATE_INTERVAL = 0.6 # s between (non-blocking) head-direction updates
 def _rms(chunk: np.ndarray) -> float:
+    """Loudest-channel RMS energy (one ReSpeaker channel can be much quieter)."""
+    a = chunk.astype(np.float32)
+    if a.ndim > 1:
+        per_channel = np.sqrt(np.mean(a ** 2, axis=0))
+        return float(per_channel.max())
+    return float(np.sqrt(np.mean(a ** 2)))
 def _chunks_to_wav_bytes(chunks: list) -> bytes:
     return buf.getvalue()
+def _calibrate_threshold(reachy_mini, stop_event) -> float:
+    """Sample ambient noise briefly and derive a VAD threshold from it."""
+    samples: list[float] = []
+    end = time.time() + NOISE_CALIB_DURATION
+    while time.time() < end and not stop_event.is_set():
+        chunk = reachy_mini.media.get_audio_sample()
+        if chunk is not None:
+            samples.append(_rms(chunk))
+        time.sleep(0.02)
+    if samples:
+        samples.sort()
+        ambient = samples[len(samples) // 2]   # median, robust to a stray sound
+    else:
+        ambient = 0.0
+    threshold = max(ambient * NOISE_MULT, RMS_FLOOR)
+    logger.info("VAD calibrated: ambient=%.4f -> threshold=%.4f", ambient, threshold)
+    return threshold
 def record_utterance(
     reachy_mini,
     stop_event,
     should_stop: Callable,
     idle_timeout: Optional[float] = None,
 ) -> tuple[list, float, bool]:
+    """Wait for speech, record until silence; return (chunks, doa_angle, stopped).
+    Energy-based VAD with a threshold auto-calibrated from the ambient noise
+    floor. The head tracks the speaker via DoA using *non-blocking* updates, so
+    the audio loop is never frozen while waiting — a blocking head move here used
+    to make the robot miss the very start of speech. If ``idle_timeout`` passes
+    with no speech, returns empty chunks so the caller can end the conversation.
     """
     chunks: list = []
     last_speech_t: Optional[float] = None
     speech_started_t: Optional[float] = None
     last_doa_angle: float = math.pi / 2   # default: facing front
     last_head_update: float = 0.0
+    max_rms_seen: float = 0.0
+    # Drain stale audio buffered during TTS playback (echo of our own voice).
     drained = 0
     while reachy_mini.media.get_audio_sample() is not None:
         drained += 1
     if drained:
         logger.debug("Drained %d stale audio chunks", drained)
+    # Calibrate the VAD threshold from a short ambient sample.
+    threshold = _calibrate_threshold(reachy_mini, stop_event)
     wait_start = time.time()
     while not stop_event.is_set():
         if (idle_timeout is not None and speech_started_t is None
                 and now - wait_start > idle_timeout):
+            logger.info(
+                "No speech within %.0f s (max RMS %.4f, threshold %.4f) — idle",
+                idle_timeout, max_rms_seen, threshold,
+            )
             return [], last_doa_angle, False
+        # Track DoA direction; nudge the head toward the speaker non-blocking
+        # (only while waiting — never freeze the audio loop).
         doa = reachy_mini.media.get_DoA()
         if doa is not None:
             last_doa_angle = doa[0]
         if speech_started_t is None and now - last_head_update >= HEAD_UPDATE_INTERVAL:
             y = math.sin(last_doa_angle - math.pi / 2) * 0.6
             try:
+                pose = reachy_mini.look_at_world(1.0, y, 0.0, perform_movement=False)
+                reachy_mini.set_target(head=pose)
             except Exception:
                 pass
             last_head_update = now
         chunk = reachy_mini.media.get_audio_sample()
         if chunk is not None:
             rms = _rms(chunk)
+            if rms > max_rms_seen:
+                max_rms_seen = rms
+            if rms > threshold:
                 if speech_started_t is None:
+                    logger.info("Speech started (RMS %.4f > %.4f)", rms, threshold)
                     speech_started_t = now
                 last_speech_t = now
             if speech_started_t is not None:
                 and now - last_speech_t > SILENCE_DURATION
                 and speech_started_t is not None
                 and now - speech_started_t > MIN_SPEECH_DURATION):
+            logger.info("Utterance ended (%.1f s)", now - speech_started_t)
             break
         if speech_started_t is not None and now - speech_started_t > MAX_DURATION:
+            logger.info("Utterance hit max duration")
             break
+        time.sleep(0.02)
     return chunks, last_doa_angle, False
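
The two VAD ingredients above (loudest-channel RMS and a median-based threshold with a floor) can be tried on synthetic numbers without a robot. This is a pure-Python stand-in for the NumPy code in the diff, written for illustration: `loudest_channel_rms` and `calibrate` are hypothetical names, and the audio is represented as a list of per-sample tuples rather than an array.

```python
import math

RMS_FLOOR = 0.003   # same constants as in the diff
NOISE_MULT = 2.5


def loudest_channel_rms(frames: list) -> float:
    """RMS of the loudest channel; each frame is a tuple with one value per channel."""
    n_channels = len(frames[0])
    per_channel = []
    for ch in range(n_channels):
        mean_sq = sum(frame[ch] ** 2 for frame in frames) / len(frames)
        per_channel.append(math.sqrt(mean_sq))
    return max(per_channel)


def calibrate(ambient_rms_samples: list) -> float:
    """Threshold = max(median ambient RMS * NOISE_MULT, RMS_FLOOR)."""
    if ambient_rms_samples:
        ordered = sorted(ambient_rms_samples)
        ambient = ordered[len(ordered) // 2]   # median ignores a stray loud chunk
    else:
        ambient = 0.0
    return max(ambient * NOISE_MULT, RMS_FLOOR)


# Channel 0 is silent, channel 1 carries the signal: taking channel 0
# alone would report 0.0, while the loudest-channel RMS is about 0.1.
stereo = [(0.0, 0.1), (0.0, -0.1), (0.0, 0.1), (0.0, -0.1)]
print(loudest_channel_rms(stereo))

# One stray loud chunk (0.05) does not move the median of 0.002.
print(calibrate([0.002, 0.002, 0.05, 0.002, 0.002]))

# A very quiet room falls back to the floor.
print(calibrate([0.0005, 0.0005, 0.0005]))
```

The floor matters in a very quiet room: without it, a near-zero ambient level would yield a near-zero threshold and any small noise would count as speech.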

talk/tts.py CHANGED

@@ -15,26 +15,50 @@ import tempfile
 import time
 from typing import Optional
 logger = logging.getLogger(__name__)
 EDGE_VOICE = "de-DE-KatjaNeural"
 EDGE_RATE = "-5%"   # slightly slower for clarity
-async def _edge_synthesize(text: str, path: str) -> None:
     import edge_tts
-    communicate = edge_tts.Communicate(text, EDGE_VOICE, rate=EDGE_RATE)
     await communicate.save(path)
-def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de") -> None:
     """Synthesize *text* and play it through the robot's speakers.
     Tries edge-tts first (neural quality), falls back to espeak-ng.
-    Blocks until playback should be complete.
     """
     audio_path: Optional[str] = None
     try:
         # edge-tts outputs MP3; GStreamer playbin handles it natively.
         with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
@@ -42,7 +66,7 @@ def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de")
         success = False
         try:
-            asyncio.run(_edge_synthesize(text, audio_path))
             success = True
         except ImportError:
             logger.warning("edge-tts not installed — falling back to espeak-ng")
@@ -59,7 +83,8 @@ def speak(text: str, reachy_mini, words_per_minute: int = 130, lang: str = "de")
                 logger.warning("No TTS engine available. Install edge-tts or espeak-ng.")
                 return
             subprocess.run(
-                [cmd, "-v", lang, "-s", str(words_per_minute), "-w", audio_path, "--", text],
                 check=True, timeout=15, capture_output=True,
             )

 import time
 from typing import Optional
+from talk import config
 logger = logging.getLogger(__name__)
 EDGE_VOICE = "de-DE-KatjaNeural"
 EDGE_RATE = "-5%"   # slightly slower for clarity
+def _edge_volume_str(volume_pct: int) -> str:
+    """Map a 0-200 loudness (100 = default) to edge-tts's relative volume."""
+    delta = max(-100, min(100, int(volume_pct) - 100))
+    return f"+{delta}%" if delta >= 0 else f"{delta}%"
+async def _edge_synthesize(text: str, path: str, volume_pct: int) -> None:
     import edge_tts
+    communicate = edge_tts.Communicate(
+        text, EDGE_VOICE, rate=EDGE_RATE, volume=_edge_volume_str(volume_pct)
+    )
     await communicate.save(path)
+def speak(
+    text: str,
+    reachy_mini,
+    words_per_minute: int = 130,
+    lang: str = "de",
+    volume: Optional[int] = None,
+) -> None:
     """Synthesize *text* and play it through the robot's speakers.
     Tries edge-tts first (neural quality), falls back to espeak-ng.
+    Blocks until playback should be complete. *volume* is 0-200 (100 = engine
+    default); when None it is read from the web-UI setting ``tts_volume``.
     """
     audio_path: Optional[str] = None
+    if volume is None:
+        try:
+            volume = int(config.get_setting("tts_volume", 150))
+        except (TypeError, ValueError):
+            volume = 150
+    volume = max(0, min(200, volume))
     try:
         # edge-tts outputs MP3; GStreamer playbin handles it natively.
         with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
         success = False
         try:
+            asyncio.run(_edge_synthesize(text, audio_path, volume))
             success = True
         except ImportError:
             logger.warning("edge-tts not installed — falling back to espeak-ng")
                 logger.warning("No TTS engine available. Install edge-tts or espeak-ng.")
                 return
             subprocess.run(
+                [cmd, "-v", lang, "-s", str(words_per_minute),
+                 "-a", str(volume), "-w", audio_path, "--", text],
                 check=True, timeout=15, capture_output=True,
             )
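
As a quick check of how one 0-200 setting feeds both engines, the sketch below repeats the mapping from `_edge_volume_str` and builds the espeak argument list without running anything. It is illustrative only: `espeak_args` is a hypothetical helper, and the command name `espeak-ng` and the output path are assumptions (the diff resolves the command elsewhere).

```python
def edge_volume_str(volume_pct: int) -> str:
    """Map 0-200 (100 = engine default) to a signed relative percentage string."""
    delta = max(-100, min(100, int(volume_pct) - 100))
    return f"+{delta}%" if delta >= 0 else f"{delta}%"


def espeak_args(text: str, volume_pct: int, lang: str = "de",
                words_per_minute: int = 130, out_path: str = "out.wav") -> list:
    """Build the argument list only; nothing is executed here."""
    volume = max(0, min(200, int(volume_pct)))
    return ["espeak-ng", "-v", lang, "-s", str(words_per_minute),
            "-a", str(volume), "-w", out_path, "--", text]


for v in (0, 100, 150, 200, 250):
    print(v, "->", edge_volume_str(v))

print(espeak_args("Hallo", 150))
```

The two engines take the value differently: edge-tts receives an offset relative to its default, while espeak's `-a` receives the 0-200 amplitude directly, so the same setting of 100 means "unchanged" in both.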