---
sidebar_position: 2
---
# Chapter 1: Voice-to-Action with Whisper
## Learning Objectives
- Understand how speech recognition systems work in robotics
- Learn about OpenAI Whisper and its capabilities
- Implement a voice command pipeline for robot control
- Integrate speech recognition with action execution
- Create a complete voice-to-action system
## Introduction to Voice-to-Action Systems
Voice-to-action systems enable natural human-robot interaction by letting users control robots with spoken commands. They are particularly important for humanoid robots, which are expected to work alongside people in everyday environments where speech is the most natural interface.
### Key Components of Voice-to-Action Systems
1. **Speech Recognition**: Convert spoken language to text
2. **Natural Language Understanding**: Interpret the meaning of the text
3. **Action Mapping**: Map understood commands to robot actions
4. **Execution**: Perform the requested robot actions
5. **Feedback**: Provide confirmation of actions to the user
## OpenAI Whisper for Speech Recognition
OpenAI Whisper is a state-of-the-art speech recognition model that:
- Supports multiple languages
- Has robust performance across different accents and background noise
- Can be fine-tuned for specific applications
- Performs well out of the box, without task-specific training data
### Whisper Architecture
Whisper is a transformer-based model that:
- Uses an encoder-decoder architecture
- Processes audio in 30-second chunks
- Outputs text in the detected language
- Can be prompted to focus on specific domains, as the sketch below shows
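To make these points concrete, here is a short sketch using Whisper's lower-level Python API to detect a clip's language and decode it with a domain prompt. The filename `command.wav` is a placeholder; the calls shown (`load_audio`, `pad_or_trim`, `log_mel_spectrogram`, `detect_language`, `decode`) are part of the openai-whisper package.
```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects
audio = whisper.load_audio("command.wav")  # placeholder filename
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram the encoder consumes
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the encoder features
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode, biasing the model toward our command vocabulary via the prompt
options = whisper.DecodingOptions(prompt="robot movement commands")
result = whisper.decode(model, mel, options)
print(result.text)
```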
### Whisper in Robotics Context
For robotics applications, Whisper can be used to:
- Convert voice commands to text that can be processed by NLP systems
- Handle background noise common in robot environments
- Support multiple languages for international applications
- Operate in real time with appropriate computational resources (see the latency check below)
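Model choice is the main lever on latency. A minimal sketch for comparing model sizes on your own hardware; the audio filename is a placeholder, and actual numbers vary widely between CPU and GPU:
```python
import time
import whisper

# Compare transcription latency across model sizes to pick one that
# meets your real-time budget; results depend heavily on hardware
for size in ["tiny", "base", "small"]:
    model = whisper.load_model(size)
    start = time.perf_counter()
    model.transcribe("command.wav")  # placeholder audio file
    print(f"{size}: {time.perf_counter() - start:.2f} s")
```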
## Implementing Voice-to-Action Pipeline
The complete voice-to-action pipeline consists of:
```
[Microphone] → [Audio Preprocessing] → [Whisper ASR] → [NLU] → [Action Mapping] → [Robot Execution]
```
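Each bracketed stage maps to a piece of code developed in the sections that follow. As a minimal sketch of the glue between them, with every stage stubbed out (all stub bodies are placeholders that the rest of this chapter fills in):
```python
def capture_audio() -> str:
    # Placeholder: record from the microphone, return a WAV file path
    return "command.wav"

def transcribe(path: str) -> str:
    # Placeholder: run Whisper on the recorded audio
    return "move forward"

def map_to_action(text: str):
    # Placeholder: keyword-based NLU (see extract_command below)
    return "move_forward" if "forward" in text.lower() else None

def execute(action) -> None:
    # Placeholder: hand the action to the robot controller
    print(f"Executing: {action}")

if __name__ == "__main__":
    execute(map_to_action(transcribe(capture_audio())))
```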
### Audio Preprocessing
Before sending audio to Whisper, preprocessing typically includes capturing mono audio at 16 kHz and using voice activity detection to keep only the frames that contain speech:
```python
import pyaudio
import webrtcvad

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio
FRAME_MS = 30        # webrtcvad accepts only 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame

# Initialize the audio input stream
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=FRAME_SAMPLES
)

# Voice activity detection to identify speech segments
vad = webrtcvad.Vad()
vad.set_mode(1)  # Aggressiveness: 0 (least) to 3 (most)

# Capture ~5 seconds of audio, keeping only frames that contain speech
frames = []
for _ in range(int(5000 / FRAME_MS)):
    data = stream.read(FRAME_SAMPLES)
    if vad.is_speech(data, SAMPLE_RATE):
        frames.append(data)
```
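The captured frames can then be written out for the next stage. A small sketch using the standard-library `wave` module, continuing from the variables above; the filename matches the one used in the Whisper example below:
```python
import wave

# Write the speech frames to a 16-bit mono WAV file for Whisper
with wave.open("audio_file.wav", "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
    wav_file.setframerate(SAMPLE_RATE)
    wav_file.writeframes(b"".join(frames))
```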
### Integrating Whisper
```python
import whisper
# Load model (use 'base' or 'small' for real-time applications)
model = whisper.load_model("base")
# Transcribe audio
result = model.transcribe("audio_file.wav")
command_text = result["text"]
print(f"Recognized command: {command_text}")
```
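Whisper's `transcribe()` also accepts an in-memory float32 waveform at 16 kHz, which avoids the temporary-file round trip. A sketch, assuming `frames` holds the int16 PCM chunks captured during preprocessing:
```python
import numpy as np

# Convert raw int16 PCM bytes into the float32 waveform Whisper expects
pcm = np.frombuffer(b"".join(frames), dtype=np.int16)
waveform = pcm.astype(np.float32) / 32768.0  # scale int16 into [-1.0, 1.0]

result = model.transcribe(waveform, language="en")
print(f"Recognized command: {result['text']}")
```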
## Natural Language Understanding
Once speech is converted to text, we need to understand the intent:
### Simple Command Recognition
```python
# Define command patterns
COMMAND_PATTERNS = {
    "move_forward": ["move forward", "go forward", "walk forward"],
    "turn_left": ["turn left", "left turn", "rotate left"],
    "turn_right": ["turn right", "right turn", "rotate right"],
    "stop": ["stop", "halt", "freeze"],
    "wave": ["wave", "waving", "wave hello"],
    "dance": ["dance", "dancing", "perform dance"]
}

def extract_command(text):
    # Return the first action whose keyword pattern appears in the text
    text_lower = text.lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            if pattern in text_lower:
                return action
    return None
```
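A quick check of the matcher on a few hypothetical transcripts:
```python
for text in ["Please move forward", "Could you turn left?", "Robot, STOP!"]:
    print(f"{text!r} -> {extract_command(text)}")
# 'Please move forward' -> move_forward
# 'Could you turn left?' -> turn_left
# 'Robot, STOP!' -> stop
```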
### Using LLMs for Understanding
For more complex commands, we can use large language models:
```python
from openai import OpenAI

client = OpenAI()  # OpenAI SDK v1+ client; reads OPENAI_API_KEY from the environment

def parse_complex_command(text):
    prompt = f"""
    Parse the following human command to a robot and return the appropriate action(s):

    Command: "{text}"

    Available actions: move_forward, turn_left, turn_right, stop, wave, dance, pickup_object, place_object, speak_text, navigate_to, follow_person

    Response format:
    - action: <action_name>
    - parameters: <dict with any needed parameters>

    If the command cannot be parsed, respond with:
    - action: "unknown"
    - parameters: {{"text": "<original command>"}}
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
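The model returns free text in the requested `- action:` / `- parameters:` format, which still has to be parsed before execution. A minimal sketch; real LLM output can deviate from the requested format, so production code should validate more defensively:
```python
import ast

def response_to_dict(response_text: str) -> dict:
    """Parse '- action: ...' / '- parameters: ...' lines into a dict."""
    parsed = {"action": "unknown", "parameters": {}}
    for line in response_text.splitlines():
        line = line.strip().lstrip("- ")
        if line.startswith("action:"):
            parsed["action"] = line.split(":", 1)[1].strip().strip('"')
        elif line.startswith("parameters:"):
            try:
                # literal_eval safely parses the dict-like parameter string
                parsed["parameters"] = ast.literal_eval(line.split(":", 1)[1].strip())
            except (ValueError, SyntaxError):
                pass
    return parsed
```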
## Voice Command System Architecture
### Complete System Implementation
```python
import threading
import queue
import whisper
from dataclasses import dataclass

@dataclass
class VoiceCommand:
    text: str
    timestamp: float
    confidence: float

class VoiceCommandSystem:
    def __init__(self, ros_node):
        self.ros_node = ros_node
        self.command_queue = queue.Queue()
        self.is_listening = False
        self.whisper_model = whisper.load_model("base")

    def start_listening(self):
        self.is_listening = True
        # Capture and processing run on daemon threads so they do not
        # block process shutdown
        threading.Thread(target=self._capture_audio, daemon=True).start()
        threading.Thread(target=self._process_commands, daemon=True).start()

    def _capture_audio(self):
        # Implementation for audio capture would go here
        # (see the preprocessing section above)
        pass

    def _process_commands(self):
        while self.is_listening:
            try:
                # Block briefly instead of busy-polling the queue
                command = self.command_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            self._execute_robot_command(command)

    def _execute_robot_command(self, command: VoiceCommand):
        # Map the recognized text to a robot action
        action = extract_command(command.text)
        if action == "move_forward":
            self.ros_node.move_robot_forward()
        elif action == "turn_left":
            self.ros_node.turn_robot_left()
        elif action == "wave":
            self.ros_node.perform_wave_action()
        # ... additional mappings
```
## Integration with ROS 2
To integrate with ROS 2, we need to connect the voice system to ROS 2 nodes:
```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class VoiceControlNode(Node):
    def __init__(self):
        super().__init__('voice_control_node')
        # Publisher for robot movement commands
        self.cmd_vel_publisher = self.create_publisher(Twist, 'cmd_vel', 10)
        # Publisher for voice feedback
        self.voice_feedback_publisher = self.create_publisher(String, 'voice_feedback', 10)
        # Initialize the voice command system defined above
        self.voice_system = VoiceCommandSystem(self)

    def move_robot_forward(self):
        twist = Twist()
        twist.linear.x = 0.5  # Move forward at 0.5 m/s
        self.cmd_vel_publisher.publish(twist)

    def turn_robot_left(self):
        twist = Twist()
        twist.angular.z = 0.5  # Turn left at 0.5 rad/s
        self.cmd_vel_publisher.publish(twist)

    def perform_wave_action(self):
        # A full implementation would call the robot's action server,
        # which depends on the specific robot; here we publish feedback only
        feedback_msg = String()
        feedback_msg.data = "Performing wave action"
        self.voice_feedback_publisher.publish(feedback_msg)
```
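A minimal entry point to run the node (a sketch; it assumes `VoiceCommandSystem` from the previous section is defined in, or importable by, the same module):
```python
def main(args=None):
    rclpy.init(args=args)
    node = VoiceControlNode()
    node.voice_system.start_listening()  # begin handling voice commands
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        rclpy.shutdown()

if __name__ == '__main__':
    main()
```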
## Challenges in Voice-to-Action Systems
### Noise and Environment
- Background noise can affect recognition accuracy
- Robot's own sounds may interfere with recognition
- Room acoustics affect audio quality
### Language and Command Complexity
- Natural language varies greatly in how commands are expressed
- Intent recognition requires robust NLU systems
- Ambiguous commands need clarification
### Real-time Requirements
- Processing delay affects user experience
- Robot response time should match human expectations
- System should handle interruptions gracefully
## Summary
Voice-to-action systems provide a natural interface for human-robot interaction, making robots more accessible and intuitive to control. Implementing these systems requires integrating speech recognition, natural language understanding, and robot action execution.
## Exercises
1. Set up a basic audio capture system in Python
2. Install and run Whisper for speech recognition
3. Create a simple command mapping system
## Next Steps
In the next chapter, we'll explore cognitive planning systems that use Large Language Models (LLMs) to decompose complex tasks into executable subtasks. |