---
sidebar_position: 2
---

# Chapter 1: Voice-to-Action with Whisper

## Learning Objectives

- Understand how speech recognition systems work in robotics
- Learn about OpenAI Whisper and its capabilities
- Implement a voice command pipeline for robot control
- Integrate speech recognition with action execution
- Create a complete voice-to-action system

## Introduction to Voice-to-Action Systems

Voice-to-action systems let users control robots with spoken commands instead of a keyboard, joystick, or app. They are particularly important for humanoid robots, which are designed to work alongside people and are expected to respond to everyday speech.

### Key Components of Voice-to-Action Systems

1. **Speech Recognition**: Convert spoken language to text
2. **Natural Language Understanding**: Interpret the meaning of the text
3. **Action Mapping**: Map understood commands to robot actions
4. **Execution**: Perform the requested robot actions
5. **Feedback**: Provide confirmation of actions to the user

## OpenAI Whisper for Speech Recognition

OpenAI Whisper is an open-source speech recognition model that:

- Supports multiple languages
- Is robust to different accents and to background noise
- Can be fine-tuned for specific applications
- Generalizes well without task-specific fine-tuning, because it was trained on about 680,000 hours of multilingual audio

### Whisper Architecture

Whisper is a transformer-based model that:

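- Uses an encoder-decoder architecture
- Processes audio in 30-second windows, converted to log-Mel spectrograms
- Outputs text in the detected language, or translates the speech into English
- Can be steered toward domain vocabulary with a text prompt (the `initial_prompt` argument of `transcribe`)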
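
The 30-second window can be illustrated with a small pure-Python sketch. It assumes the 16 kHz sample rate used elsewhere in this chapter, and the helper name `fit_to_window` is hypothetical; the `whisper` package ships its own `whisper.pad_or_trim` for this purpose, so treat the code below as an explanation of the idea rather than Whisper's implementation.

```python
SAMPLE_RATE = 16000      # samples per second
WINDOW_SECONDS = 30      # length of one Whisper input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples


def fit_to_window(samples, window=WINDOW_SAMPLES):
    """Trim long audio or zero-pad short audio to exactly one window."""
    if len(samples) >= window:
        return list(samples[:window])
    return list(samples) + [0.0] * (window - len(samples))


# A 5-second command is padded with silence up to 30 seconds
five_seconds = [0.1] * (5 * SAMPLE_RATE)
padded = fit_to_window(five_seconds)
print(len(padded))  # 480000
```

One practical consequence: a short command still costs a full window of encoder work, which is part of why smaller models are preferred when latency matters.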

### Whisper in Robotics Context

For robotics applications, Whisper can be used to:

- Convert voice commands to text that can be processed by NLP systems
- Handle background noise common in robot environments
- Support multiple languages for international applications
- Run in near real time when a small model and adequate hardware (ideally a GPU) are used

## Implementing Voice-to-Action Pipeline

The complete voice-to-action pipeline consists of:

```
[Microphone] → [Audio Preprocessing] → [Whisper ASR] → [NLU] → [Action Mapping] → [Robot Execution]
```
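
The stages in the diagram can be sketched as plain functions passed into one driver. Everything here is a hypothetical illustration: `run_voice_pipeline` and its parameter names are not from any library, and the lambdas stand in for the real Whisper, NLU, and robot calls covered in the following sections.

```python
from typing import Callable, Optional


def run_voice_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    understand: Callable[[str], Optional[str]],
    execute: Callable[[str], None],
) -> Optional[str]:
    """Run one utterance through ASR, NLU, and execution."""
    text = transcribe(audio)      # [Whisper ASR]
    action = understand(text)     # [NLU] + [Action Mapping]
    if action is not None:
        execute(action)           # [Robot Execution]
    return action


# Stub stages so the sketch runs without a microphone or a robot
executed = []
action = run_voice_pipeline(
    b"\x00\x00",
    transcribe=lambda audio: "please stop",
    understand=lambda text: "stop" if "stop" in text else None,
    execute=executed.append,
)
print(action, executed)  # stop ['stop']
```

Keeping each stage behind a function boundary makes it possible to test the NLU and action mapping with typed text before any audio hardware is involved.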

### Audio Preprocessing

Before audio reaches Whisper, it is captured as 16 kHz mono PCM; voice activity detection (VAD) can be used to skip silence:

```python
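import wave

import pyaudio
import webrtcvad

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio
FRAME_MS = 30        # webrtcvad accepts only 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame
RECORD_SECONDS = 5

# Initialize audio stream
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=FRAME_SAMPLES,
)

# Voice activity detection to identify speech segments
vad = webrtcvad.Vad()
vad.set_mode(1)  # Aggressiveness: 0 (least) to 3 (most aggressive filtering)

# Record 5 seconds of audio, keeping only frames that contain speech
frames = []
try:
    for _ in range(RECORD_SECONDS * 1000 // FRAME_MS):
        data = stream.read(FRAME_SAMPLES)
        if vad.is_speech(data, SAMPLE_RATE):
            frames.append(data)
finally:
    # Always release the audio device
    stream.stop_stream()
    stream.close()
    audio.terminate()

# Save the speech frames so Whisper can transcribe the file
with wave.open("audio_file.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(frames))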
```
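
Whisper's `transcribe` also accepts in-memory audio as a float array scaled to the range -1.0 to 1.0, which avoids writing a file. The sketch below shows the scaling step with the standard library only. The function name `pcm16_to_float` is hypothetical, and the code assumes little-endian 16-bit PCM, which is what the capture code produces on common desktop hardware. With NumPy the same conversion is usually written as `np.frombuffer(data, np.int16).astype(np.float32) / 32768.0`.

```python
import struct


def pcm16_to_float(data: bytes) -> list:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    count = len(data) // 2
    samples = struct.unpack(f"<{count}h", data[: count * 2])
    return [s / 32768.0 for s in samples]


# Four hand-made samples: silence, half scale, and both extremes
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
floats = pcm16_to_float(raw)
print(floats[:3])  # [0.0, 0.5, -1.0]
```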

### Integrating Whisper

```python
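import whisper  # pip install openai-whisper (also requires ffmpeg on the PATH)

# Load model ("base" or "small" are the usual choices when latency matters)
model = whisper.load_model("base")

# Transcribe audio
result = model.transcribe("audio_file.wav")
command_text = result["text"].strip()  # Whisper output often starts with a space
print(f"Recognized command: {command_text}")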
```
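
Whisper does not return a single confidence score, but each entry in `result["segments"]` carries an `avg_logprob` and a `no_speech_prob`. The sketch below turns those into a rough 0 to 1 score that a robot could use to reject doubtful commands. The function name `estimate_confidence` and the averaging rule are assumptions made for illustration, not part of Whisper; the 0.6 cutoff mirrors Whisper's default `no_speech_threshold`.

```python
import math


def estimate_confidence(result: dict, no_speech_cutoff: float = 0.6) -> float:
    """Heuristic confidence from a Whisper result dict (0.0 to 1.0)."""
    segments = result.get("segments", [])
    # Ignore segments that Whisper itself rates as probably silence
    spoken = [s for s in segments if s.get("no_speech_prob", 0.0) < no_speech_cutoff]
    if not spoken:
        return 0.0
    mean_logprob = sum(s["avg_logprob"] for s in spoken) / len(spoken)
    # exp() maps the average token log-probability back to a probability
    return min(1.0, math.exp(mean_logprob))


# A hand-written result with the same shape Whisper returns
fake_result = {
    "text": " move forward",
    "segments": [{"avg_logprob": -0.2, "no_speech_prob": 0.01}],
}
confidence = estimate_confidence(fake_result)
print(round(confidence, 3))
```

A threshold on this value is a tuning decision: too strict and the robot ignores valid commands, too lenient and it acts on noise.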

## Natural Language Understanding

Once speech is converted to text, the system must work out which action the user intends.

### Simple Command Recognition

```python
import re

# Define command patterns
COMMAND_PATTERNS = {
    "move_forward": ["move forward", "go forward", "walk forward"],
    "turn_left": ["turn left", "left turn", "rotate left"],
    "turn_right": ["turn right", "right turn", "rotate right"],
    "stop": ["stop", "halt", "freeze"],
    "wave": ["wave", "waving", "wave hello"],
    "dance": ["dance", "dancing", "perform dance"]
}

def extract_command(text):
    """Return the first action whose phrase appears in the text, or None."""
    text_lower = text.lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            # Match whole words only, so "wave" does not fire on "microwave"
            if re.search(rf"\b{re.escape(pattern)}\b", text_lower):
                return action
    return None
```
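
Keyword matching identifies the action but not its parameters. As a hedged sketch, a regular expression can pull a distance or an angle out of the transcript. The pattern `QUANTITY_RE` and the function `extract_quantity` are hypothetical, and the code assumes the transcript contains digits ("2 meters"); spelled-out numbers ("two meters") would need extra handling.

```python
import re

# Hypothetical pattern: a number followed by a unit, e.g. "2 meters" or "90 degrees"
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(meters?|metres?|m|degrees?|deg)\b")


def extract_quantity(text):
    """Return (value, unit) for the first quantity in the text, or None."""
    match = QUANTITY_RE.search(text.lower())
    if match is None:
        return None
    value = float(match.group(1))
    unit = "meters" if match.group(2).startswith("m") else "degrees"
    return value, unit


print(extract_quantity("Move forward 2 meters."))  # (2.0, 'meters')
print(extract_quantity("Turn left 90 degrees"))    # (90.0, 'degrees')
print(extract_quantity("Wave hello"))              # None
```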

### Using LLMs for Understanding

For more complex commands, we can use large language models:

```python
from openai import OpenAI

# Requires openai>=1.0; the client reads the OPENAI_API_KEY environment variable
client = OpenAI()

def parse_complex_command(text):
    prompt = f"""
    Parse the following human command to a robot and return the appropriate action(s):

    Command: "{text}"

    Available actions: move_forward, turn_left, turn_right, stop, wave, dance, pickup_object, place_object, speak_text, navigate_to, follow_person

    Response format:
    - action: <action_name>
    - parameters: <dict with any needed parameters>

    If the command cannot be parsed, respond with:
    - action: "unknown"
    - parameters: {{"text": "<original command>"}}
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    # The reply is plain text in the format requested above
    return response.choices[0].message.content
```
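
The model's reply is free text, so it should be parsed and validated before anything reaches the robot. The sketch below assumes the reply follows the `- action:` / `- parameters:` format requested in the prompt. The names `ALLOWED_ACTIONS` and `parse_llm_response` are hypothetical, and anything that does not match the whitelist is downgraded to `unknown`.

```python
import ast
import json
import re

ALLOWED_ACTIONS = {
    "move_forward", "turn_left", "turn_right", "stop", "wave", "dance",
    "pickup_object", "place_object", "speak_text", "navigate_to", "follow_person",
}


def parse_llm_response(reply: str) -> dict:
    """Extract a validated action and parameter dict from the LLM reply."""
    action_match = re.search(r'action:\s*"?([a-z_]+)"?', reply)
    params_match = re.search(r"parameters:\s*(\{.*\})", reply, re.DOTALL)

    action = action_match.group(1) if action_match else "unknown"
    if action not in ALLOWED_ACTIONS:
        action = "unknown"  # Never pass an unlisted action to the robot

    parameters = {}
    if params_match:
        raw = params_match.group(1)
        try:
            parameters = json.loads(raw)
        except json.JSONDecodeError:
            try:
                # Fall back to a Python-style dict such as {'location': 'kitchen'}
                parameters = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError):
                parameters = {}
    if not isinstance(parameters, dict):
        parameters = {}
    return {"action": action, "parameters": parameters}


reply = '- action: navigate_to\n- parameters: {"location": "kitchen"}'
print(parse_llm_response(reply))
# {'action': 'navigate_to', 'parameters': {'location': 'kitchen'}}
```

Using `ast.literal_eval` rather than `eval` matters here: the reply is untrusted text and must never be executed as code.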

## Voice Command System Architecture

### Complete System Implementation

```python
import queue
import threading
import time  # time.time() supplies VoiceCommand timestamps
from dataclasses import dataclass

import whisper


@dataclass
class VoiceCommand:
    text: str
    timestamp: float
    confidence: float


class VoiceCommandSystem:
    def __init__(self, ros_node):
        self.ros_node = ros_node
        self.command_queue = queue.Queue()
        self.is_listening = False
        self.whisper_model = whisper.load_model("base")

    def start_listening(self):
        self.is_listening = True
        # Daemon threads do not keep the process alive after the main thread exits
        # Start audio capture thread
        threading.Thread(target=self._capture_audio, daemon=True).start()
        # Start processing thread
        threading.Thread(target=self._process_commands, daemon=True).start()

    def stop_listening(self):
        self.is_listening = False

    def _capture_audio(self):
        # Implementation for audio capture would go here: record audio,
        # transcribe it with self.whisper_model, and put a VoiceCommand
        # on self.command_queue
        pass

    def _process_commands(self):
        while self.is_listening:
            try:
                # Block briefly instead of busy-waiting on an empty queue
                command = self.command_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            self._execute_robot_command(command)

    def _execute_robot_command(self, command: VoiceCommand):
        # Map command to robot action (extract_command is defined in the
        # Simple Command Recognition section)
        action = extract_command(command.text)

        if action == "move_forward":
            self.ros_node.move_robot_forward()
        elif action == "turn_left":
            self.ros_node.turn_robot_left()
        elif action == "wave":
            self.ros_node.perform_wave_action()
        # ... additional mappings
```
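
The capture and processing threads communicate through a queue. The following sketch isolates that producer-consumer pattern so it can be run without a microphone, Whisper, or ROS. The function `run_pipeline` is hypothetical, and the list of strings stands in for transcribed speech.

```python
import queue
import threading
import time


def run_pipeline(utterances, handle, poll_timeout=0.1):
    """Feed utterances through a queue from one thread to another."""
    commands = queue.Queue()
    stop_event = threading.Event()

    def capture():
        # Stand-in for microphone capture plus Whisper transcription
        for text in utterances:
            commands.put((text, time.time()))

    def process():
        # Keep going until capture has finished and the queue is drained
        while not (stop_event.is_set() and commands.empty()):
            try:
                text, _timestamp = commands.get(timeout=poll_timeout)
            except queue.Empty:
                continue
            handle(text)

    producer = threading.Thread(target=capture)
    consumer = threading.Thread(target=process)
    producer.start()
    consumer.start()
    producer.join()
    stop_event.set()
    consumer.join()


handled = []
run_pipeline(["move forward", "turn left", "stop"], handled.append)
print(handled)  # ['move forward', 'turn left', 'stop']
```

The blocking `get(timeout=...)` lets the consumer sleep while the queue is empty and still notice a shutdown request within one timeout period.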

## Integration with ROS 2

To integrate with ROS 2, wrap the voice system in a node that publishes velocity commands and feedback messages:

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class VoiceControlNode(Node):
    def __init__(self):
        super().__init__('voice_control_node')

        # Publisher for robot movement commands
        self.cmd_vel_publisher = self.create_publisher(Twist, 'cmd_vel', 10)

        # Publisher for voice feedback
        self.voice_feedback_publisher = self.create_publisher(String, 'voice_feedback', 10)

        # Initialize voice command system
        self.voice_system = VoiceCommandSystem(self)

    def move_robot_forward(self):
        twist = Twist()
        twist.linear.x = 0.5  # Move forward at 0.5 m/s
        self.cmd_vel_publisher.publish(twist)

    def turn_robot_left(self):
        twist = Twist()
        twist.angular.z = 0.5  # Turn left at 0.5 rad/s
        self.cmd_vel_publisher.publish(twist)

    def perform_wave_action(self):
        # Publish to robot's action server
        # Implementation would depend on specific robot capabilities
        feedback_msg = String()
        feedback_msg.data = "Performing wave action"
        self.voice_feedback_publisher.publish(feedback_msg)


def main(args=None):
    rclpy.init(args=args)
    node = VoiceControlNode()
    node.voice_system.start_listening()
    try:
        rclpy.spin(node)  # Keep the node alive while the voice threads run
    except KeyboardInterrupt:
        pass
    finally:
        node.voice_system.is_listening = False
        node.destroy_node()
        if rclpy.ok():
            rclpy.shutdown()


if __name__ == '__main__':
    main()
```
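
As the number of motion commands grows, a lookup table is easier to maintain than one method per action. The sketch below keeps the mapping as plain data so it can be tested without ROS. The forward and left-turn values come from the node above; the `turn_right` and `stop` entries and the names `Velocity`, `ACTION_VELOCITIES`, and `velocity_for` are assumptions added for illustration. In ROS, a positive `angular.z` is a counter-clockwise (left) turn, so the right turn uses a negative value.

```python
from typing import NamedTuple, Optional


class Velocity(NamedTuple):
    linear_x: float   # forward speed in m/s
    angular_z: float  # yaw rate in rad/s, positive = left


ACTION_VELOCITIES = {
    "move_forward": Velocity(0.5, 0.0),
    "turn_left": Velocity(0.0, 0.5),
    "turn_right": Velocity(0.0, -0.5),  # assumed mirror of turn_left
    "stop": Velocity(0.0, 0.0),         # assumed: zero velocity halts the base
}


def velocity_for(action: str) -> Optional[Velocity]:
    """Return the velocity for a motion action, or None for other actions."""
    return ACTION_VELOCITIES.get(action)


print(velocity_for("turn_right"))  # Velocity(linear_x=0.0, angular_z=-0.5)
print(velocity_for("wave"))        # None
```

Inside the node, the returned values would be copied into a `Twist` message (`twist.linear.x`, `twist.angular.z`) before publishing.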

## Challenges in Voice-to-Action Systems

### Noise and Environment

- Background noise can affect recognition accuracy
- Robot's own sounds may interfere with recognition
- Room acoustics affect audio quality
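
One inexpensive mitigation is to discard frames that are too quiet to be speech before they reach the recognizer. The sketch below uses a plain RMS energy threshold, which is a simpler technique than the WebRTC VAD shown earlier and is named here as a stand-in. The function names and the threshold of 500 are assumptions that would need tuning against the robot's own motor and fan noise.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit PCM audio."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_loud_enough(frame: bytes, threshold: float = 500.0) -> bool:
    """Return True when the frame's energy reaches the threshold."""
    return frame_rms(frame) >= threshold


silence = struct.pack("<4h", 0, 0, 0, 0)
speech = struct.pack("<4h", 3000, -3000, 3000, -3000)
print(is_loud_enough(silence), is_loud_enough(speech))  # False True
```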

### Language and Command Complexity

- Natural language varies greatly in how commands are expressed
- Intent recognition requires robust NLU systems
- Ambiguous commands need clarification
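
A simple way to handle ambiguity is to act only when exactly one action matches and to ask the user otherwise. The sketch below uses a small hypothetical pattern table and returns either an action to execute or a clarification question to speak. The names `PATTERNS`, `match_actions`, and `resolve` are illustrative.

```python
import re

# Reduced, hypothetical pattern table for the example
PATTERNS = {
    "turn_left": ["turn left", "rotate left"],
    "turn_right": ["turn right", "rotate right"],
    "stop": ["stop", "halt"],
}


def match_actions(text):
    """Return every action whose phrase appears as whole words in the text."""
    text_lower = text.lower()
    return [
        action
        for action, phrases in PATTERNS.items()
        if any(re.search(rf"\b{re.escape(p)}\b", text_lower) for p in phrases)
    ]


def resolve(text):
    """Return ("execute", action) or ("clarify", question)."""
    matches = match_actions(text)
    if len(matches) == 1:
        return ("execute", matches[0])
    if not matches:
        return ("clarify", "Sorry, I did not understand. Could you rephrase?")
    return ("clarify", f"Did you mean {' or '.join(matches)}?")


print(resolve("Halt!"))                  # ('execute', 'stop')
print(resolve("Don't stop, turn left"))  # ('clarify', 'Did you mean turn_left or stop?')
```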

### Real-time Requirements

- Processing delay affects user experience
- Robot response time should match human expectations
- System should handle interruptions gracefully
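
Two of these requirements can be sketched with timestamps: commands that waited too long are dropped rather than executed late, and a pending `stop` interrupts everything else. The function `next_command` and the 2-second `max_age` are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Command = Tuple[str, float]  # (action, timestamp in seconds)


def next_command(
    pending: List[Command], now: float, max_age: float = 2.0
) -> Optional[str]:
    """Pick the next action to run from the pending commands."""
    # A pending "stop" interrupts everything else, regardless of age
    if any(action == "stop" for action, _ in pending):
        return "stop"
    # Otherwise run the oldest command that is still fresh
    for action, timestamp in pending:
        if now - timestamp <= max_age:
            return action
    return None


pending = [("move_forward", 10.0), ("wave", 12.5)]
print(next_command(pending, now=13.0))                    # wave
print(next_command(pending + [("stop", 12.9)], now=13.0)) # stop
```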

## Summary

Voice-to-action systems provide a natural interface for human-robot interaction, making robots more accessible and intuitive to control. Implementing these systems requires integrating speech recognition, natural language understanding, and robot action execution.

## Exercises

1. Set up a basic audio capture system in Python
2. Install and run Whisper for speech recognition
3. Create a simple command mapping system

## Next Steps

In the next chapter, we'll explore cognitive planning systems that use Large Language Models (LLMs) to decompose complex tasks into executable subtasks.