Fix unresponsive wake: trigger on DoA speech instead of antenna press
The antenna-press wake never fired reliably (motors hold the antennas
stiff and 5 Hz polling misses transient presses), so the robot stayed
asleep and "reacted not at all". Wake on DoA speech_detected instead,
the mechanism already proven in the recognizer app.
- SLEEPING wakes after 3 consecutive speech-detected readings; ignores
audio for 2 s after speaking so its own goodbye can't re-wake it
- Add 12 s idle-timeout exit from CONVERSING (says "Bis bald!"), since
the antenna-press exit shares the same unreliability
- Harden stt: handle mono/stereo chunk shapes; add idle_timeout to
record_utterance
- Update CLAUDE.md state-machine docs to match
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- CLAUDE.md +7 -7
- talk/main.py +42 -21
- talk/stt.py +15 -4
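The wake rule summarized in the bullets above can be sketched in isolation. This is a minimal illustration, not the code from `talk/main.py`: the `WakeDetector` class, its method names, and the injectable `clock` are hypothetical, and the choice to reset the counter when no DoA reading is available is an assumption. Only the two constants and the count-up/count-down behavior are taken from the commit.

```python
import time
from typing import Callable, Optional, Tuple

DOA_DEBOUNCE = 3            # speech-detected readings needed to wake (from the commit)
DEBOUNCE_AFTER_SPEAK = 2.0  # seconds to ignore audio after the robot spoke (from the commit)


class WakeDetector:
    """Hypothetical helper that turns DoA readings into a wake decision."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self.count = 0
        self.last_spoke_at = 0.0

    def mark_spoke(self) -> None:
        """Record that the robot just finished speaking."""
        self.last_spoke_at = self._clock()

    def update(self, doa: Optional[Tuple[float, bool]]) -> Optional[float]:
        """Feed one (angle, speech) reading; return the wake angle or None."""
        # Ignore everything while the robot's own voice may still be audible.
        if self._clock() - self.last_spoke_at <= DEBOUNCE_AFTER_SPEAK:
            return None
        if doa is None:
            self.count = 0  # assumption: no reading resets the counter
            return None
        angle, speech = doa
        if not speech:
            # A single non-speech reading lowers the count instead of clearing it.
            self.count = max(0, self.count - 1)
            return None
        self.count += 1
        if self.count >= DOA_DEBOUNCE:
            self.count = 0
            return angle
        return None
```

A caller polling at 5 Hz would invoke `update()` once per poll and switch to the TIME state as soon as it returns an angle.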
CLAUDE.md
CHANGED
@@ -60,21 +60,21 @@ talk = "talk.main:Talk"
 ## State Machine

 ```
-SLEEPING → (
+SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna press) → SLEEPING
 ```

-- **SLEEPING**: polls
-- **TIME**: `wake_up()` → speak German datetime with gesture loop → `start_recording()` → enter CONVERSING.
+- **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
+- **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
 - **CONVERSING** (inner loop):
-  - **LISTENING**: `record_utterance()`
+  - **LISTENING**: `record_utterance()` uses RMS-energy VAD, tracks head toward the speaker via DoA. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (12 s) of silence.
   - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
-  - **RESPONDING**: `
-  -
+  - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
+  - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.

 ## Helper Modules

 - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
-- **`talk/stt.py`**: records from ReSpeaker (16 kHz
+- **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), uses RMS-energy VAD (DoA is for head direction only), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Accepts an `idle_timeout` so a silent conversation returns to sleep.
 - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Requires `ANTHROPIC_API_KEY`.

 ## Key SDK APIs
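The state diagram documented above can also be read as a small transition table. The sketch below is illustrative only: the `Event` enum, the `TRANSITIONS` table, and `next_state()` are hypothetical names that do not appear in the commit, and the real app drives its transitions inline in the main loop rather than through a lookup.

```python
from enum import Enum, auto


class State(Enum):
    SLEEPING = auto()
    TIME = auto()
    CONVERSING = auto()


class Event(Enum):  # hypothetical event names, for illustration only
    SPEECH_DETECTED = auto()
    GREETING_DONE = auto()
    IDLE_TIMEOUT = auto()
    ANTENNA_PRESS = auto()


# (current state, event) -> next state, following the documented diagram.
TRANSITIONS = {
    (State.SLEEPING, Event.SPEECH_DETECTED): State.TIME,
    (State.TIME, Event.GREETING_DONE): State.CONVERSING,
    (State.CONVERSING, Event.IDLE_TIMEOUT): State.SLEEPING,
    (State.CONVERSING, Event.ANTENNA_PRESS): State.SLEEPING,
}


def next_state(state: State, event: Event) -> State:
    """Return the follow-up state; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

In this reading an antenna press while SLEEPING is simply ignored, which matches the commit's intent of no longer waking on the antennas.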
talk/main.py
CHANGED
@@ -1,13 +1,13 @@
 """Talk app for Reachy Mini wireless.

 State machine:
-SLEEPING → (
+SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna press) → SLEEPING

-SLEEPING
-TIME
+SLEEPING - robot asleep; wakes when speech is detected via DoA.
+TIME - wake up, speak current date/time, start conversation history.
 CONVERSING - multi-turn conversation with Claude via STT → LLM → TTS.
-    Head tracks the speaker via DoA.
-
+    Head tracks the speaker via DoA. Returns to sleep after a silence
+    timeout or an antenna press.
 """

 import logging
@@ -29,11 +29,11 @@ from talk.tts import speak

 logger = logging.getLogger(__name__)

-ANTENNA_PRESS_THRESHOLD = 0.15
-SLEEP_ANTENNAS = [-3.05, 3.05]
 AWAKE_ANTENNAS = [-0.1745, 0.1745]
-AWAKE_PRESS_THRESHOLD = 0.35 #
-DEBOUNCE_AFTER_SPEAK = 2.0
+AWAKE_PRESS_THRESHOLD = 0.35  # antenna-press exit during conversation (gesture-safe)
+DEBOUNCE_AFTER_SPEAK = 2.0  # ignore audio this long after the robot speaks
+DOA_DEBOUNCE = 3  # consecutive speech-detected readings to wake up
+IDLE_TIMEOUT = 12.0  # s of silence in conversation before returning to sleep

 WEEKDAYS_DE = [
     "Montag", "Dienstag", "Mittwoch", "Donnerstag",
@@ -124,6 +124,8 @@ class Talk(ReachyMiniApp):
         reachy_mini.goto_sleep()
         state = State.SLEEPING
         last_spoke_at = 0.0
+        speech_count = 0
+        doa_angle = math.pi / 2  # default: facing front
         messages: list = []

         while not stop_event.is_set():
@@ -133,16 +135,24 @@ class Talk(ReachyMiniApp):
         with _lock:
             _shared["state"] = "sleeping"

-
+        # Wake on speech via DoA, the same proven mechanism the recognizer
+        # uses. Ignore audio briefly after the robot itself spoke so the tail
+        # of its own goodbye can't immediately re-wake it.
         if time.time() - last_spoke_at > DEBOUNCE_AFTER_SPEAK:
-
-
-
-
-
-
-
-
+            doa = reachy_mini.media.get_DoA()
+            if doa is not None:
+                angle, speech = doa
+                if speech:
+                    speech_count += 1
+                    if speech_count >= DOA_DEBOUNCE:
+                        doa_angle = angle
+                        speech_count = 0
+                        logger.info("Speech detected (DoA %.2f rad) - waking", angle)
+                        state = State.TIME
+                else:
+                    speech_count = max(0, speech_count - 1)
+            else:
+                speech_count = 0

         time.sleep(0.2)

@@ -159,7 +169,8 @@ class Talk(ReachyMiniApp):
         reachy_mini.wake_up()
         messages = []  # fresh conversation history

-
+        look_y = math.sin(doa_angle - math.pi / 2) * 0.6
+        _speak_with_gestures(text, reachy_mini, look_y=look_y)
         last_spoke_at = time.time()

         try:
@@ -179,7 +190,7 @@ class Talk(ReachyMiniApp):
             return _antenna_pressed_awake(rm)

         chunks, doa_angle, antenna_pressed = record_utterance(
-            reachy_mini, stop_event, _should_stop
+            reachy_mini, stop_event, _should_stop, idle_timeout=IDLE_TIMEOUT
         )

         if antenna_pressed or stop_event.is_set():
@@ -195,7 +206,17 @@ class Talk(ReachyMiniApp):
             continue

         if not chunks:
-            #
+            # No speech for IDLE_TIMEOUT seconds: end the conversation.
+            logger.info("Idle timeout - going to sleep")
+            try:
+                reachy_mini.media.stop_recording()
+            except Exception:
+                pass
+            _speak_with_gestures("Bis bald!", reachy_mini)
+            reachy_mini.goto_sleep()
+            last_spoke_at = time.time()
+            messages = []
+            state = State.SLEEPING
             continue

         with _lock:
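The `look_y` expression added in the TIME state maps the captured DoA angle onto a bounded sideways offset. The sketch below evaluates that same formula on its own; the function name `doa_to_look_y` and the constant name `LOOK_GAIN` are hypothetical, and which sign corresponds to the robot's left or right is not stated in the diff, so it is left open here.

```python
import math

LOOK_GAIN = 0.6  # scale factor from the diff; the constant name is hypothetical


def doa_to_look_y(doa_angle: float) -> float:
    """Map a DoA angle in radians to a sideways look offset.

    pi/2 is the default "facing front" angle in the diff and maps to 0;
    the result always stays within [-LOOK_GAIN, +LOOK_GAIN].
    """
    return math.sin(doa_angle - math.pi / 2) * LOOK_GAIN


if __name__ == "__main__":
    for deg in (0, 45, 90, 135, 180):
        print(f"{deg:3d} deg -> look_y = {doa_to_look_y(math.radians(deg)):+.3f}")
```

An angle of π/2 yields 0, while 0 and π yield -0.6 and +0.6 respectively, so the sine keeps the head target bounded even for sounds arriving from the extremes of the range.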
talk/stt.py
CHANGED
@@ -25,14 +25,15 @@ HEAD_UPDATE_INTERVAL = 0.5 # s between head-direction updates while waiting


 def _rms(chunk: np.ndarray) -> float:
-    """RMS energy of channel 0
-
+    """RMS energy of channel 0 (handles mono or multi-channel float32)."""
+    ch0 = chunk[:, 0] if chunk.ndim > 1 else chunk
+    return float(np.sqrt(np.mean(ch0 ** 2)))


 def _chunks_to_wav_bytes(chunks: list) -> bytes:
     """Convert (N, 2) float32 chunks to mono 16-bit PCM WAV bytes."""
     audio = np.concatenate(chunks)
-    mono = audio[:, 0]
+    mono = audio[:, 0] if audio.ndim > 1 else audio
     int16 = (mono * 32767.0).clip(-32768, 32767).astype(np.int16)
     buf = io.BytesIO()
     with wave.open(buf, "wb") as w:
@@ -47,11 +48,14 @@ def record_utterance(
     reachy_mini,
     stop_event,
     should_stop: Callable,
+    idle_timeout: Optional[float] = None,
 ) -> tuple[list, float, bool]:
     """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).

     VAD is energy-based (RMS threshold). DoA is used only for head tracking
-    while waiting; updates are throttled to avoid jerky movement.
+    while waiting; updates are throttled to avoid jerky movement. If
+    idle_timeout is set and no speech begins within that window, returns empty
+    chunks (antenna_pressed=False) so the caller can end the conversation.
     """
     chunks: list = []
     last_speech_t: Optional[float] = None
@@ -66,12 +70,19 @@ def record_utterance(
     if drained:
         logger.debug("Drained %d stale audio chunks", drained)

+    wait_start = time.time()
+
     while not stop_event.is_set():
         now = time.time()

         if should_stop(reachy_mini):
             return [], last_doa_angle, True

+        if (idle_timeout is not None and speech_started_t is None
+                and now - wait_start > idle_timeout):
+            logger.debug("No speech within %.1f s - idle", idle_timeout)
+            return [], last_doa_angle, False
+
         # Update DoA angle (direction only, not used as VAD).
         doa = reachy_mini.media.get_DoA()
         if doa is not None:
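The idle-timeout check added to `record_utterance()` follows a common pattern: remember when waiting began, and give up if no speech has started within the window. The sketch below isolates that pattern with only the standard library. The function `wait_for_speech`, its parameters, and the injectable `clock` and `sleep` callables are hypothetical and exist only so the behavior can be exercised without a microphone; the real function also records audio and tracks the head.

```python
import time
from typing import Callable, Optional


def wait_for_speech(
    read_level: Callable[[], float],
    threshold: float,
    idle_timeout: Optional[float] = None,
    should_stop: Callable[[], bool] = lambda: False,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a level exceeds threshold, False on stop or timeout."""
    wait_start = clock()
    while not should_stop():
        if read_level() > threshold:
            return True  # speech began inside the window
        # Mirrors the diff: the timeout applies only while no speech has started.
        if idle_timeout is not None and clock() - wait_start > idle_timeout:
            return False
        sleep(0.05)
    return False
```

Passing `idle_timeout=None` keeps the earlier behavior of waiting until speech or a stop request, which is why the new parameter in the diff defaults to `None`.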