Spaces: onitsche/talk
onitsche Claude Opus 4.8 committed 2 days ago

Commit ec4de34 (1 parent: 035779f)

Fix unresponsive wake: trigger on DoA speech instead of antenna press

The antenna-press wake never fired reliably (motors hold the antennas
stiff and 5 Hz polling misses transient presses), so the robot stayed
asleep and "reacted not at all". Wake on DoA speech_detected instead —
the mechanism already proven in the recognizer app.

- SLEEPING wakes after 3 consecutive speech-detected readings; ignores
audio for 2 s after speaking so its own goodbye can't re-wake it
- Add 12 s idle-timeout exit from CONVERSING (says "Bis bald!"), since
the antenna-press exit shares the same unreliability
- Harden stt: handle mono/stereo chunk shapes; add idle_timeout to
record_utterance
- Update CLAUDE.md state-machine docs to match

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
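
A minimal, hardware-free sketch of the wake debounce described above, mirroring the counter logic in the `talk/main.py` diff. The names `step_wake_counter` and `simulate` are hypothetical and do not appear in the repository; `muted` stands in for the `DEBOUNCE_AFTER_SPEAK` window.

```python
DOA_DEBOUNCE = 3  # speech-detected readings required to wake (value from the diff)


def step_wake_counter(count: int, speech: bool, muted: bool) -> tuple[int, bool]:
    """Advance the wake counter by one poll; return (new_count, should_wake)."""
    if muted:
        # Inside the post-speech window: discard any accumulated evidence.
        return 0, False
    if not speech:
        # A quiet reading decays the counter by one instead of resetting it.
        return max(0, count - 1), False
    count += 1
    if count >= DOA_DEBOUNCE:
        return 0, True
    return count, False


def simulate(readings: list[bool]) -> list[bool]:
    """Feed a sequence of speech flags through the counter; return the wake flags."""
    count = 0
    wakes = []
    for speech in readings:
        count, wake = step_wake_counter(count, speech, muted=False)
        wakes.append(wake)
    return wakes
```

Because a quiet reading only decrements the counter, the sequence speech, speech, quiet, speech, speech also triggers a wake on the fifth poll. "3 consecutive" is therefore an approximation of what the code appears to do, not a strict requirement.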

Files changed (3)

CLAUDE.md +7 -7
talk/main.py +42 -21
talk/stt.py +15 -4

CLAUDE.md CHANGED

@@ -60,21 +60,21 @@ talk = "talk.main:Talk"
 ## State Machine
 ```
-SLEEPING → (antenna press) → TIME → CONVERSING → (antenna press) → SLEEPING
 ```
-- **SLEEPING**: polls antennas at 5 Hz; detects deviation > 0.15 rad from `[-3.05, 3.05]`; 2 s debounce.
-- **TIME**: `wake_up()` → speak German datetime with gesture loop → `start_recording()` → enter CONVERSING.
 - **CONVERSING** (inner loop):
-  - **LISTENING**: `record_utterance()` polls DoA for VAD, tracks head toward speaker, checks for antenna press exit.
   - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
-  - **RESPONDING**: `stop_recording()` → `_speak_with_gestures()` → `start_recording()` → back to LISTENING.
-  - Antenna press: `stop_recording()` → `goto_sleep()` → SLEEPING.
 ## Helper Modules
 - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
-- **`talk/stt.py`**: records from ReSpeaker (16 kHz stereo float32), uses DoA as VAD, converts to mono 16-bit WAV, transcribes via Google Speech Recognition.
 - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Requires `ANTHROPIC_API_KEY`.
 ## Key SDK APIs

 ## State Machine
 ```
+SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna press) → SLEEPING
 ```
+- **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
+- **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
 - **CONVERSING** (inner loop):
+  - **LISTENING**: `record_utterance()` uses RMS-energy VAD, tracks head toward the speaker via DoA. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (12 s) of silence.
   - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
+  - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
+  - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
 ## Helper Modules
 - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
+- **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), uses RMS-energy VAD (DoA is for head direction only), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Accepts an `idle_timeout` so a silent conversation returns to sleep.
 - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Requires `ANTHROPIC_API_KEY`.
 ## Key SDK APIs

talk/main.py CHANGED

@@ -1,13 +1,13 @@
 """Talk app for Reachy Mini wireless.
 State machine:
-  SLEEPING → (antenna press) → TIME → CONVERSING → (antenna press) → SLEEPING
-SLEEPING  – antennas folded; antenna press wakes the robot.
-TIME      – wake up, speak current date/time, start conversation history.
 CONVERSING – multi-turn conversation with Claude via STT → LLM → TTS.
-             Head tracks the speaker via DoA.
-             Antenna press exits back to sleep.
 """
 import logging
@@ -29,11 +29,11 @@ from talk.tts import speak
 logger = logging.getLogger(__name__)
-ANTENNA_PRESS_THRESHOLD = 0.15
-SLEEP_ANTENNAS = [-3.05, 3.05]
 AWAKE_ANTENNAS = [-0.1745, 0.1745]
-AWAKE_PRESS_THRESHOLD = 0.35   # larger: gesture loop animates antennas ±20°
-DEBOUNCE_AFTER_SPEAK = 2.0
 WEEKDAYS_DE = [
     "Montag", "Dienstag", "Mittwoch", "Donnerstag",
@@ -124,6 +124,8 @@ class Talk(ReachyMiniApp):
         reachy_mini.goto_sleep()
         state = State.SLEEPING
         last_spoke_at = 0.0
         messages: list = []
         while not stop_event.is_set():
@@ -133,16 +135,24 @@ class Talk(ReachyMiniApp):
                 with _lock:
                     _shared["state"] = "sleeping"
-                antennas = reachy_mini.get_present_antenna_joint_positions()
                 if time.time() - last_spoke_at > DEBOUNCE_AFTER_SPEAK:
-                    right_dev = abs(antennas[0] - SLEEP_ANTENNAS[0])
-                    left_dev = abs(antennas[1] - SLEEP_ANTENNAS[1])
-                    if right_dev > ANTENNA_PRESS_THRESHOLD or left_dev > ANTENNA_PRESS_THRESHOLD:
-                        logger.info(
-                            "Antenna press detected (R=%.3f rad, L=%.3f rad)",
-                            right_dev, left_dev,
-                        )
-                        state = State.TIME
                 time.sleep(0.2)
@@ -159,7 +169,8 @@ class Talk(ReachyMiniApp):
                 reachy_mini.wake_up()
                 messages = []  # fresh conversation history
-                _speak_with_gestures(text, reachy_mini, look_y=0.0)
                 last_spoke_at = time.time()
                 try:
@@ -179,7 +190,7 @@ class Talk(ReachyMiniApp):
                     return _antenna_pressed_awake(rm)
                 chunks, doa_angle, antenna_pressed = record_utterance(
-                    reachy_mini, stop_event, _should_stop
                 )
                 if antenna_pressed or stop_event.is_set():
@@ -195,7 +206,17 @@ class Talk(ReachyMiniApp):
                     continue
                 if not chunks:
-                    # Timed out with no speech — keep listening.
                     continue
                 with _lock:

 """Talk app for Reachy Mini wireless.
 State machine:
+  SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna press) → SLEEPING
+SLEEPING   – robot asleep; wakes when speech is detected via DoA.
+TIME       – wake up, speak current date/time, start conversation history.
 CONVERSING – multi-turn conversation with Claude via STT → LLM → TTS.
+             Head tracks the speaker via DoA. Returns to sleep after a silence
+             timeout or an antenna press.
 """
 import logging
 logger = logging.getLogger(__name__)
 AWAKE_ANTENNAS = [-0.1745, 0.1745]
+AWAKE_PRESS_THRESHOLD = 0.35   # antenna-press exit during conversation (gesture-safe)
+DEBOUNCE_AFTER_SPEAK = 2.0     # ignore audio this long after the robot speaks
+DOA_DEBOUNCE = 3               # consecutive speech-detected readings to wake up
+IDLE_TIMEOUT = 12.0            # s of silence in conversation before returning to sleep
 WEEKDAYS_DE = [
     "Montag", "Dienstag", "Mittwoch", "Donnerstag",
         reachy_mini.goto_sleep()
         state = State.SLEEPING
         last_spoke_at = 0.0
+        speech_count = 0
+        doa_angle = math.pi / 2   # default: facing front
         messages: list = []
         while not stop_event.is_set():
                 with _lock:
                     _shared["state"] = "sleeping"
+                # Wake on speech via DoA — the same proven mechanism the recognizer
+                # uses. Ignore audio briefly after the robot itself spoke so the tail
+                # of its own goodbye can't immediately re-wake it.
                 if time.time() - last_spoke_at > DEBOUNCE_AFTER_SPEAK:
+                    doa = reachy_mini.media.get_DoA()
+                    if doa is not None:
+                        angle, speech = doa
+                        if speech:
+                            speech_count += 1
+                            if speech_count >= DOA_DEBOUNCE:
+                                doa_angle = angle
+                                speech_count = 0
+                                logger.info("Speech detected (DoA %.2f rad) — waking", angle)
+                                state = State.TIME
+                        else:
+                            speech_count = max(0, speech_count - 1)
+                else:
+                    speech_count = 0
                 time.sleep(0.2)
                 reachy_mini.wake_up()
                 messages = []  # fresh conversation history
+                look_y = math.sin(doa_angle - math.pi / 2) * 0.6
+                _speak_with_gestures(text, reachy_mini, look_y=look_y)
                 last_spoke_at = time.time()
                 try:
                     return _antenna_pressed_awake(rm)
                 chunks, doa_angle, antenna_pressed = record_utterance(
+                    reachy_mini, stop_event, _should_stop, idle_timeout=IDLE_TIMEOUT
                 )
                 if antenna_pressed or stop_event.is_set():
                     continue
                 if not chunks:
+                    # No speech for IDLE_TIMEOUT seconds — end the conversation.
+                    logger.info("Idle timeout — going to sleep")
+                    try:
+                        reachy_mini.media.stop_recording()
+                    except Exception:
+                        pass
+                    _speak_with_gestures("Bis bald!", reachy_mini)
+                    reachy_mini.goto_sleep()
+                    last_spoke_at = time.time()
+                    messages = []
+                    state = State.SLEEPING
                     continue
                 with _lock:
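
The `look_y` expression added in the TIME state can be checked in isolation. The sketch below restates it as a standalone function; `look_y_from_doa` and `LOOK_GAIN` are hypothetical names for illustration, and which sign corresponds to the robot's left or right is not stated in the diff.

```python
import math

LOOK_GAIN = 0.6  # hypothetical name for the 0.6 factor used in the diff


def look_y_from_doa(doa_angle: float) -> float:
    """Map a DoA angle in radians to a lateral look offset.

    pi/2 is treated as "facing front" (the diff's default), so it maps to 0;
    angles of 0 and pi map to the two extremes, -LOOK_GAIN and +LOOK_GAIN.
    """
    return math.sin(doa_angle - math.pi / 2) * LOOK_GAIN
```

With the default `doa_angle = math.pi / 2` the offset is 0, so a wake that never captured an angle leaves the head facing forward.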

talk/stt.py CHANGED

@@ -25,14 +25,15 @@ HEAD_UPDATE_INTERVAL = 0.5    # s between head-direction updates while waiting
 def _rms(chunk: np.ndarray) -> float:
-    """RMS energy of channel 0 of a float32 stereo chunk."""
-    return float(np.sqrt(np.mean(chunk[:, 0] ** 2)))
 def _chunks_to_wav_bytes(chunks: list) -> bytes:
     """Convert (N, 2) float32 chunks to mono 16-bit PCM WAV bytes."""
     audio = np.concatenate(chunks)
-    mono = audio[:, 0]
     int16 = (mono * 32767.0).clip(-32768, 32767).astype(np.int16)
     buf = io.BytesIO()
     with wave.open(buf, "wb") as w:
@@ -47,11 +48,14 @@ def record_utterance(
     reachy_mini,
     stop_event,
     should_stop: Callable,
 ) -> tuple[list, float, bool]:
     """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).
     VAD is energy-based (RMS threshold). DoA is used only for head tracking
-    while waiting — updates are throttled to avoid jerky movement.
     """
     chunks: list = []
     last_speech_t: Optional[float] = None
@@ -66,12 +70,19 @@ def record_utterance(
     if drained:
         logger.debug("Drained %d stale audio chunks", drained)
     while not stop_event.is_set():
         now = time.time()
         if should_stop(reachy_mini):
             return [], last_doa_angle, True
         # Update DoA angle (direction only — not used as VAD).
         doa = reachy_mini.media.get_DoA()
         if doa is not None:

 def _rms(chunk: np.ndarray) -> float:
+    """RMS energy of channel 0 (handles mono or multi-channel float32)."""
+    ch0 = chunk[:, 0] if chunk.ndim > 1 else chunk
+    return float(np.sqrt(np.mean(ch0 ** 2)))
 def _chunks_to_wav_bytes(chunks: list) -> bytes:
     """Convert (N, 2) float32 chunks to mono 16-bit PCM WAV bytes."""
     audio = np.concatenate(chunks)
+    mono = audio[:, 0] if audio.ndim > 1 else audio
     int16 = (mono * 32767.0).clip(-32768, 32767).astype(np.int16)
     buf = io.BytesIO()
     with wave.open(buf, "wb") as w:
     reachy_mini,
     stop_event,
     should_stop: Callable,
+    idle_timeout: Optional[float] = None,
 ) -> tuple[list, float, bool]:
     """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).
     VAD is energy-based (RMS threshold). DoA is used only for head tracking
+    while waiting — updates are throttled to avoid jerky movement. If
+    idle_timeout is set and no speech begins within that window, returns empty
+    chunks (antenna_pressed=False) so the caller can end the conversation.
     """
     chunks: list = []
     last_speech_t: Optional[float] = None
     if drained:
         logger.debug("Drained %d stale audio chunks", drained)
+    wait_start = time.time()
     while not stop_event.is_set():
         now = time.time()
         if should_stop(reachy_mini):
             return [], last_doa_angle, True
+        if (idle_timeout is not None and speech_started_t is None
+                and now - wait_start > idle_timeout):
+            logger.debug("No speech within %.1f s — idle", idle_timeout)
+            return [], last_doa_angle, False
         # Update DoA angle (direction only — not used as VAD).
         doa = reachy_mini.media.get_DoA()
         if doa is not None:
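
The idle-timeout check added to `record_utterance()` applies only while no speech has started. A hardware-free sketch of that rule follows, with an injectable clock so it can be exercised without waiting. `wait_for_speech`, `energies`, and `clock` are hypothetical names, and the real function also handles draining, head tracking, and the antenna check.

```python
from typing import Callable, Iterable, Optional


def wait_for_speech(
    energies: Iterable[float],
    clock: Callable[[], float],
    threshold: float,
    idle_timeout: Optional[float] = None,
) -> bool:
    """Return True once a reading exceeds threshold.

    Returns False if idle_timeout elapses first, or if the input ends
    without any reading above the threshold.
    """
    wait_start = clock()
    for energy in energies:
        # The timeout is only consulted before speech has begun.
        if idle_timeout is not None and clock() - wait_start > idle_timeout:
            return False
        if energy > threshold:
            return True
    return False
```

Passing `idle_timeout=None` keeps the earlier behaviour of waiting indefinitely, which matches the new parameter's default in the diff.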