onitsche Claude Opus 4.8 committed on
Commit ec4de34 · 1 Parent(s): 035779f

Fix unresponsive wake: trigger on DoA speech instead of antenna press

The antenna-press wake never fired reliably (motors hold the antennas
stiff and 5 Hz polling misses transient presses), so the robot stayed
asleep and "reacted not at all". Wake on DoA speech_detected instead,
the mechanism already proven in the recognizer app.

- SLEEPING wakes after 3 consecutive speech-detected readings; ignores
audio for 2 s after speaking so its own goodbye can't re-wake it
- Add 12 s idle-timeout exit from CONVERSING (says "Bis bald!", German
for "See you soon!"), since the antenna-press exit shares the same
unreliability
- Harden stt: handle mono/stereo chunk shapes; add idle_timeout to
record_utterance
- Update CLAUDE.md state-machine docs to match

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
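
The wake rule described in the message above can be sketched in isolation. This is a minimal illustration, not the code from the commit: `WakeDetector` and its `update()` method are hypothetical names, while the two constants take their values from the commit message. One detail worth noting is that the diff decrements the counter on a non-speech reading rather than resetting it, so "3 consecutive" is approximate: a single dropped reading costs one count instead of all of them.

```python
from typing import Optional, Tuple

DOA_DEBOUNCE = 3            # speech readings needed to wake (value from the commit)
DEBOUNCE_AFTER_SPEAK = 2.0  # seconds to ignore audio after the robot spoke


class WakeDetector:
    """Hypothetical helper mirroring the counter logic of the SLEEPING branch."""

    def __init__(self) -> None:
        self.speech_count = 0

    def update(
        self,
        doa: Optional[Tuple[float, bool]],
        now: float,
        last_spoke_at: float,
    ) -> Optional[float]:
        """Feed one DoA reading; return the wake angle, or None to stay asleep."""
        if now - last_spoke_at <= DEBOUNCE_AFTER_SPEAK:
            # Still inside the mute window after the robot's own speech.
            self.speech_count = 0
            return None
        if doa is None:
            return None
        angle, speech = doa
        if speech:
            self.speech_count += 1
            if self.speech_count >= DOA_DEBOUNCE:
                self.speech_count = 0
                return angle
        else:
            # Decay by one instead of resetting to zero.
            self.speech_count = max(0, self.speech_count - 1)
        return None
```

Passing `now` and `last_spoke_at` in as arguments, instead of calling `time.time()` inside, keeps the sketch testable without a real clock or robot.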

Files changed (3)
  1. CLAUDE.md +7 -7
  2. talk/main.py +42 -21
  3. talk/stt.py +15 -4
CLAUDE.md CHANGED
@@ -60,21 +60,21 @@ talk = "talk.main:Talk"
60
  ## State Machine
61
 
62
  ```
63
- SLEEPING → (antenna press) → TIME → CONVERSING → (antenna press) → SLEEPING
64
  ```
65
 
66
- - **SLEEPING**: polls antennas at 5 Hz; detects deviation > 0.15 rad from `[-3.05, 3.05]`; 2 s debounce.
67
- - **TIME**: `wake_up()` → speak German datetime with gesture loop → `start_recording()` → enter CONVERSING.
68
  - **CONVERSING** (inner loop):
69
- - **LISTENING**: `record_utterance()` polls DoA for VAD, tracks head toward speaker, checks for antenna press exit.
70
  - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
71
- - **RESPONDING**: `stop_recording()` → `_speak_with_gestures()` → `start_recording()` → back to LISTENING.
72
- - Antenna press: `stop_recording()` → `goto_sleep()` → SLEEPING.
73
 
74
  ## Helper Modules
75
 
76
  - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
77
- - **`talk/stt.py`**: records from ReSpeaker (16 kHz stereo float32), uses DoA as VAD, converts to mono 16-bit WAV, transcribes via Google Speech Recognition.
78
  - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Requires `ANTHROPIC_API_KEY`.
79
 
80
  ## Key SDK APIs
 
60
  ## State Machine
61
 
62
  ```
63
+ SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna press) → SLEEPING
64
  ```
65
 
66
+ - **SLEEPING**: polls `get_DoA()` at 5 Hz; wakes after `DOA_DEBOUNCE` (3) consecutive speech-detected readings (same mechanism as the recognizer). Ignores audio for `DEBOUNCE_AFTER_SPEAK` (2 s) after the robot itself spoke so its own goodbye can't re-wake it.
67
+ - **TIME**: `wake_up()` → speak German datetime with gesture loop, facing the speaker via the captured DoA angle → `start_recording()` → enter CONVERSING.
68
  - **CONVERSING** (inner loop):
69
+ - **LISTENING**: `record_utterance()` uses RMS-energy VAD, tracks head toward the speaker via DoA. Exits on antenna press, or returns empty after `IDLE_TIMEOUT` (12 s) of silence.
70
  - **PROCESSING**: `transcribe(chunks)` → Google STT; `get_response(messages)` → Claude API.
71
+ - **RESPONDING**: `_speak_with_gestures()` → back to LISTENING. Recording runs continuously throughout the conversation; `record_utterance()` drains the echo captured during playback.
72
+ - Exit: antenna press *or* idle timeout → `stop_recording()` → `goto_sleep()` → SLEEPING.
73
 
74
  ## Helper Modules
75
 
76
  - **`talk/tts.py`**: edge-tts (MS neural, `de-DE-KatjaNeural`) → MP3 → `media.play_sound()`. Falls back to espeak-ng. Blocks for estimated playback duration.
77
+ - **`talk/stt.py`**: records from ReSpeaker (16 kHz float32), uses RMS-energy VAD (DoA is for head direction only), converts to mono 16-bit WAV, transcribes via Google Speech Recognition. Accepts an `idle_timeout` so a silent conversation returns to sleep.
78
  - **`talk/llm.py`**: stateless Claude API wrapper. Caller owns `messages` list. Requires `ANTHROPIC_API_KEY`.
79
 
80
  ## Key SDK APIs
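
The transitions documented in the updated CLAUDE.md state machine can be read as a small lookup table. The sketch below is only an illustration under stated assumptions: the event names (`speech_detected`, `time_spoken`, `idle_timeout`, `antenna_press`) and the `next_state()` helper are hypothetical, and the app in this diff assigns `state` inline in its main loop rather than going through a table.

```python
from enum import Enum, auto


class State(Enum):
    SLEEPING = auto()
    TIME = auto()
    CONVERSING = auto()


# (current state, event) -> next state. Event names are hypothetical labels
# for the conditions described in CLAUDE.md.
TRANSITIONS = {
    (State.SLEEPING, "speech_detected"): State.TIME,
    (State.TIME, "time_spoken"): State.CONVERSING,
    (State.CONVERSING, "idle_timeout"): State.SLEEPING,
    (State.CONVERSING, "antenna_press"): State.SLEEPING,
}


def next_state(state: State, event: str) -> State:
    """Return the follow-up state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

For example, `next_state(State.CONVERSING, "idle_timeout")` yields `State.SLEEPING`, which is the exit path this commit adds.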
talk/main.py CHANGED
@@ -1,13 +1,13 @@
1
  """Talk app for Reachy Mini wireless.
2
 
3
  State machine:
4
- SLEEPING → (antenna press) → TIME → CONVERSING → (antenna press) → SLEEPING
5
 
6
- SLEEPING – antennas folded; antenna press wakes the robot.
7
- TIME – wake up, speak current date/time, start conversation history.
8
  CONVERSING – multi-turn conversation with Claude via STT → LLM → TTS.
9
- Head tracks the speaker via DoA.
10
- Antenna press exits back to sleep.
11
  """
12
 
13
  import logging
@@ -29,11 +29,11 @@ from talk.tts import speak
29
 
30
  logger = logging.getLogger(__name__)
31
 
32
- ANTENNA_PRESS_THRESHOLD = 0.15
33
- SLEEP_ANTENNAS = [-3.05, 3.05]
34
  AWAKE_ANTENNAS = [-0.1745, 0.1745]
35
- AWAKE_PRESS_THRESHOLD = 0.35 # larger: gesture loop animates antennas ±20°
36
- DEBOUNCE_AFTER_SPEAK = 2.0
 
 
37
 
38
  WEEKDAYS_DE = [
39
  "Montag", "Dienstag", "Mittwoch", "Donnerstag",
@@ -124,6 +124,8 @@ class Talk(ReachyMiniApp):
124
  reachy_mini.goto_sleep()
125
  state = State.SLEEPING
126
  last_spoke_at = 0.0
 
 
127
  messages: list = []
128
 
129
  while not stop_event.is_set():
@@ -133,16 +135,24 @@ class Talk(ReachyMiniApp):
133
  with _lock:
134
  _shared["state"] = "sleeping"
135
 
136
- antennas = reachy_mini.get_present_antenna_joint_positions()
 
 
137
  if time.time() - last_spoke_at > DEBOUNCE_AFTER_SPEAK:
138
- right_dev = abs(antennas[0] - SLEEP_ANTENNAS[0])
139
- left_dev = abs(antennas[1] - SLEEP_ANTENNAS[1])
140
- if right_dev > ANTENNA_PRESS_THRESHOLD or left_dev > ANTENNA_PRESS_THRESHOLD:
141
- logger.info(
142
- "Antenna press detected (R=%.3f rad, L=%.3f rad)",
143
- right_dev, left_dev,
144
- )
145
- state = State.TIME
 
 
 
 
 
 
146
 
147
  time.sleep(0.2)
148
 
@@ -159,7 +169,8 @@ class Talk(ReachyMiniApp):
159
  reachy_mini.wake_up()
160
  messages = [] # fresh conversation history
161
 
162
- _speak_with_gestures(text, reachy_mini, look_y=0.0)
 
163
  last_spoke_at = time.time()
164
 
165
  try:
@@ -179,7 +190,7 @@ class Talk(ReachyMiniApp):
179
  return _antenna_pressed_awake(rm)
180
 
181
  chunks, doa_angle, antenna_pressed = record_utterance(
182
- reachy_mini, stop_event, _should_stop
183
  )
184
 
185
  if antenna_pressed or stop_event.is_set():
@@ -195,7 +206,17 @@ class Talk(ReachyMiniApp):
195
  continue
196
 
197
  if not chunks:
198
- # Timed out with no speech - keep listening.
 
 
 
 
 
 
 
 
 
 
199
  continue
200
 
201
  with _lock:
 
1
  """Talk app for Reachy Mini wireless.
2
 
3
  State machine:
4
+ SLEEPING → (speech detected) → TIME → CONVERSING → (silence/antenna press) → SLEEPING
5
 
6
+ SLEEPING – robot asleep; wakes when speech is detected via DoA.
7
+ TIME – wake up, speak current date/time, start conversation history.
8
  CONVERSING – multi-turn conversation with Claude via STT → LLM → TTS.
9
+ Head tracks the speaker via DoA. Returns to sleep after a silence
10
+ timeout or an antenna press.
11
  """
12
 
13
  import logging
 
29
 
30
  logger = logging.getLogger(__name__)
31
 
 
 
32
  AWAKE_ANTENNAS = [-0.1745, 0.1745]
33
+ AWAKE_PRESS_THRESHOLD = 0.35 # antenna-press exit during conversation (gesture-safe)
34
+ DEBOUNCE_AFTER_SPEAK = 2.0 # ignore audio this long after the robot speaks
35
+ DOA_DEBOUNCE = 3 # consecutive speech-detected readings to wake up
36
+ IDLE_TIMEOUT = 12.0 # s of silence in conversation before returning to sleep
37
 
38
  WEEKDAYS_DE = [
39
  "Montag", "Dienstag", "Mittwoch", "Donnerstag",
 
124
  reachy_mini.goto_sleep()
125
  state = State.SLEEPING
126
  last_spoke_at = 0.0
127
+ speech_count = 0
128
+ doa_angle = math.pi / 2 # default: facing front
129
  messages: list = []
130
 
131
  while not stop_event.is_set():
 
135
  with _lock:
136
  _shared["state"] = "sleeping"
137
 
138
+ # Wake on speech via DoA - the same proven mechanism the recognizer
139
+ # uses. Ignore audio briefly after the robot itself spoke so the tail
140
+ # of its own goodbye can't immediately re-wake it.
141
  if time.time() - last_spoke_at > DEBOUNCE_AFTER_SPEAK:
142
+ doa = reachy_mini.media.get_DoA()
143
+ if doa is not None:
144
+ angle, speech = doa
145
+ if speech:
146
+ speech_count += 1
147
+ if speech_count >= DOA_DEBOUNCE:
148
+ doa_angle = angle
149
+ speech_count = 0
150
+ logger.info("Speech detected (DoA %.2f rad) - waking", angle)
151
+ state = State.TIME
152
+ else:
153
+ speech_count = max(0, speech_count - 1)
154
+ else:
155
+ speech_count = 0
156
 
157
  time.sleep(0.2)
158
 
 
169
  reachy_mini.wake_up()
170
  messages = [] # fresh conversation history
171
 
172
+ look_y = math.sin(doa_angle - math.pi / 2) * 0.6
173
+ _speak_with_gestures(text, reachy_mini, look_y=look_y)
174
  last_spoke_at = time.time()
175
 
176
  try:
 
190
  return _antenna_pressed_awake(rm)
191
 
192
  chunks, doa_angle, antenna_pressed = record_utterance(
193
+ reachy_mini, stop_event, _should_stop, idle_timeout=IDLE_TIMEOUT
194
  )
195
 
196
  if antenna_pressed or stop_event.is_set():
 
206
  continue
207
 
208
  if not chunks:
209
+ # No speech for IDLE_TIMEOUT seconds - end the conversation.
210
+ logger.info("Idle timeout - going to sleep")
211
+ try:
212
+ reachy_mini.media.stop_recording()
213
+ except Exception:
214
+ pass
215
+ _speak_with_gestures("Bis bald!", reachy_mini)
216
+ reachy_mini.goto_sleep()
217
+ last_spoke_at = time.time()
218
+ messages = []
219
+ state = State.SLEEPING
220
  continue
221
 
222
  with _lock:
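
The `look_y` expression added in the TIME state maps the captured DoA angle to a sideways look offset. The worked sketch below wraps that one-line formula in a hypothetical helper, `look_y_from_doa()`, to show its range. The 0.6 gain and the π/2 "facing front" default come from the diff; the unit of `look_y` and which side counts as positive are not stated there, so nothing is assumed about them here.

```python
import math

LOOK_GAIN = 0.6  # scale factor used in the diff


def look_y_from_doa(doa_angle: float) -> float:
    """Map a DoA angle in radians to a look offset in [-LOOK_GAIN, LOOK_GAIN].

    An angle of pi/2 (the diff's default, "facing front") maps to 0.
    """
    return math.sin(doa_angle - math.pi / 2) * LOOK_GAIN


if __name__ == "__main__":
    for angle in (0.0, math.pi / 2, math.pi):
        print(f"doa={angle:.2f} rad -> look_y={look_y_from_doa(angle):+.2f}")
```

Because the sine saturates near 0 and π, speakers far off to either side produce nearly the same offset, which bounds the offset at ±0.6 however noisy the DoA reading is.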
talk/stt.py CHANGED
@@ -25,14 +25,15 @@ HEAD_UPDATE_INTERVAL = 0.5 # s between head-direction updates while waiting
25
 
26
 
27
  def _rms(chunk: np.ndarray) -> float:
28
- """RMS energy of channel 0 of a float32 stereo chunk."""
29
- return float(np.sqrt(np.mean(chunk[:, 0] ** 2)))
 
30
 
31
 
32
  def _chunks_to_wav_bytes(chunks: list) -> bytes:
33
  """Convert (N, 2) float32 chunks to mono 16-bit PCM WAV bytes."""
34
  audio = np.concatenate(chunks)
35
- mono = audio[:, 0]
36
  int16 = (mono * 32767.0).clip(-32768, 32767).astype(np.int16)
37
  buf = io.BytesIO()
38
  with wave.open(buf, "wb") as w:
@@ -47,11 +48,14 @@ def record_utterance(
47
  reachy_mini,
48
  stop_event,
49
  should_stop: Callable,
 
50
  ) -> tuple[list, float, bool]:
51
  """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).
52
 
53
  VAD is energy-based (RMS threshold). DoA is used only for head tracking
54
- while waiting - updates are throttled to avoid jerky movement.
 
 
55
  """
56
  chunks: list = []
57
  last_speech_t: Optional[float] = None
@@ -66,12 +70,19 @@ def record_utterance(
66
  if drained:
67
  logger.debug("Drained %d stale audio chunks", drained)
68
 
 
 
69
  while not stop_event.is_set():
70
  now = time.time()
71
 
72
  if should_stop(reachy_mini):
73
  return [], last_doa_angle, True
74
 
 
 
 
 
 
75
  # Update DoA angle (direction only - not used as VAD).
76
  doa = reachy_mini.media.get_DoA()
77
  if doa is not None:
 
25
 
26
 
27
  def _rms(chunk: np.ndarray) -> float:
28
+ """RMS energy of channel 0 (handles mono or multi-channel float32)."""
29
+ ch0 = chunk[:, 0] if chunk.ndim > 1 else chunk
30
+ return float(np.sqrt(np.mean(ch0 ** 2)))
31
 
32
 
33
  def _chunks_to_wav_bytes(chunks: list) -> bytes:
34
  """Convert (N, 2) float32 chunks to mono 16-bit PCM WAV bytes."""
35
  audio = np.concatenate(chunks)
36
+ mono = audio[:, 0] if audio.ndim > 1 else audio
37
  int16 = (mono * 32767.0).clip(-32768, 32767).astype(np.int16)
38
  buf = io.BytesIO()
39
  with wave.open(buf, "wb") as w:
 
48
  reachy_mini,
49
  stop_event,
50
  should_stop: Callable,
51
+ idle_timeout: Optional[float] = None,
52
  ) -> tuple[list, float, bool]:
53
  """Wait for speech, record until silence, return (chunks, doa_angle, antenna_pressed).
54
 
55
  VAD is energy-based (RMS threshold). DoA is used only for head tracking
56
+ while waiting - updates are throttled to avoid jerky movement. If
57
+ idle_timeout is set and no speech begins within that window, returns empty
58
+ chunks (antenna_pressed=False) so the caller can end the conversation.
59
  """
60
  chunks: list = []
61
  last_speech_t: Optional[float] = None
 
70
  if drained:
71
  logger.debug("Drained %d stale audio chunks", drained)
72
 
73
+ wait_start = time.time()
74
+
75
  while not stop_event.is_set():
76
  now = time.time()
77
 
78
  if should_stop(reachy_mini):
79
  return [], last_doa_angle, True
80
 
81
+ if (idle_timeout is not None and speech_started_t is None
82
+ and now - wait_start > idle_timeout):
83
+ logger.debug("No speech within %.1f s - idle", idle_timeout)
84
+ return [], last_doa_angle, False
85
+
86
  # Update DoA angle (direction only - not used as VAD).
87
  doa = reachy_mini.media.get_DoA()
88
  if doa is not None:
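
The idle-timeout check added to `record_utterance()` can be illustrated without the robot or audio hardware. The following is a simplified sketch, not the function from the diff: `wait_for_speech()`, its injected `clock`, and the `threshold` value used in the example are all hypothetical, and the real function also records audio, tracks the head, and checks for an antenna press.

```python
from typing import Callable, Iterable, Optional


def wait_for_speech(
    levels: Iterable[float],
    clock: Callable[[], float],
    threshold: float,
    idle_timeout: Optional[float] = None,
) -> bool:
    """Return True once a level exceeds threshold, False on idle timeout.

    `levels` stands in for per-chunk RMS energies; `clock` returns seconds.
    Also returns False if the input runs out before any speech is seen.
    """
    wait_start = clock()
    for level in levels:
        # Same ordering as the diff: check the timeout before looking at audio.
        if idle_timeout is not None and clock() - wait_start > idle_timeout:
            return False
        if level > threshold:
            return True
    return False
```

With `idle_timeout=None` the timeout branch never fires, which matches the diff keeping the new parameter optional so callers that do not pass it see the old behaviour.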