---
sidebar_position: 2
---
# Chapter 1: Voice-to-Action with Whisper
## Learning Objectives
- Understand how speech recognition systems work in robotics
- Learn about OpenAI Whisper and its capabilities
- Implement a voice command pipeline for robot control
- Integrate speech recognition with action execution
- Create a complete voice-to-action system
## Introduction to Voice-to-Action Systems
Voice-to-action systems enable natural human-robot interaction by allowing users to control robots with spoken commands. They are particularly valuable for humanoid robots, where speech is often the most intuitive way for a person to direct the robot's behavior.
### Key Components of Voice-to-Action Systems
1. **Speech Recognition**: Convert spoken language to text
2. **Natural Language Understanding**: Interpret the meaning of the text
3. **Action Mapping**: Map understood commands to robot actions
4. **Execution**: Perform the requested robot actions
5. **Feedback**: Provide confirmation of actions to the user
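Before diving into each component, it helps to see how they chain together. The sketch below is a minimal, self-contained illustration; every function in it is a hypothetical placeholder that later sections replace with real implementations:
```python
# Minimal sketch of the five stages chained together. Every function here is a
# hypothetical placeholder; the rest of this chapter builds real versions.

def speech_to_text(audio_path):          # 1. Speech recognition
    return "move forward"                # stand-in for a Whisper transcription

def understand(text):                    # 2. Natural language understanding
    return "move_forward" if "forward" in text else "unknown"

def map_to_action(intent):               # 3. Action mapping
    return {"action": intent, "speed": 0.5}

def execute(action):                     # 4. Execution
    print(f"Robot executing: {action}")

def give_feedback(action):               # 5. Feedback
    print(f"Confirmed: {action['action']}")

action = map_to_action(understand(speech_to_text("command.wav")))
execute(action)
give_feedback(action)
```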
## OpenAI Whisper for Speech Recognition
OpenAI Whisper is a state-of-the-art speech recognition model that:
- Supports multiple languages
- Has robust performance across different accents and background noise
- Can be fine-tuned for specific applications
- Generalizes robustly to new domains and tasks without task-specific training data
### Whisper Architecture
Whisper is a transformer-based model that:
- Uses an encoder-decoder architecture
- Processes audio in 30-second chunks
- Outputs text in the detected language
- Can be prompted to focus on specific domains
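The open-source `whisper` package exposes these behaviors through options to `transcribe()`. A minimal sketch (the audio filename and the domain prompt text are illustrative only):
```python
import whisper

model = whisper.load_model("base")

# `initial_prompt` biases decoding toward domain vocabulary, and `language`
# pins the output language instead of relying on auto-detection.
result = model.transcribe(
    "command.wav",
    language="en",
    initial_prompt="Robot commands: move forward, turn left, wave, stop.",
)
print(result["text"])
```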
### Whisper in Robotics Context
For robotics applications, Whisper can be used to:
- Convert voice commands to text that can be processed by NLP systems
- Handle background noise common in robot environments
- Support multiple languages for international applications
- Operate in real-time with appropriate computational resources
## Implementing Voice-to-Action Pipeline
The complete voice-to-action pipeline consists of:
```
[Microphone] → [Audio Preprocessing] → [Whisper ASR] → [NLU] → [Action Mapping] → [Robot Execution]
```
### Audio Preprocessing
Before sending audio to Whisper, preprocessing may include:
```python
import pyaudio
import wave
import webrtcvad

RATE = 16000          # Whisper expects 16 kHz mono audio
FRAME_SAMPLES = 480   # 30 ms frames; webrtcvad only accepts 10/20/30 ms

# Initialize audio stream
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=RATE,
    input=True,
    frames_per_buffer=FRAME_SAMPLES
)

# Voice activity detection to identify speech segments
vad = webrtcvad.Vad(1)  # aggressiveness: 0 (least) to 3 (most)

# Capture 5 seconds of audio, keeping only the frames that contain speech
frames = []
for _ in range(int(RATE / FRAME_SAMPLES * 5)):
    data = stream.read(FRAME_SAMPLES)
    if vad.is_speech(data, RATE):
        frames.append(data)

stream.stop_stream()
stream.close()
audio.terminate()

# Save the speech frames as a WAV file for transcription
with wave.open("audio_file.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
```
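A note on the buffer size above: webrtcvad only accepts 10, 20, or 30 ms frames of 16-bit mono PCM, so the capture loop reads 480-sample (30 ms) chunks; an arbitrary size such as 1024 samples would raise an error.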
### Integrating Whisper
```python
import whisper
# Load model (use 'base' or 'small' for real-time applications)
model = whisper.load_model("base")
# Transcribe audio
result = model.transcribe("audio_file.wav")
command_text = result["text"]
print(f"Recognized command: {command_text}")
```
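If the audio is already in memory, Whisper can also transcribe a NumPy array directly, avoiding the intermediate file. A minimal sketch, assuming `frames` holds the 16-bit PCM chunks captured earlier and `model` is the loaded Whisper model:
```python
import numpy as np

# Convert 16-bit PCM bytes to the float32 waveform in [-1, 1] that Whisper expects
audio_np = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0
result = model.transcribe(audio_np)
print(result["text"])
```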
## Natural Language Understanding
Once speech is converted to text, we need to understand the intent:
### Simple Command Recognition
```python
# Define command patterns
COMMAND_PATTERNS = {
    "move_forward": ["move forward", "go forward", "walk forward"],
    "turn_left": ["turn left", "left turn", "rotate left"],
    "turn_right": ["turn right", "right turn", "rotate right"],
    "stop": ["stop", "halt", "freeze"],
    "wave": ["wave", "waving", "wave hello"],
    "dance": ["dance", "dancing", "perform dance"]
}

def extract_command(text):
    text_lower = text.lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            if pattern in text_lower:
                return action
    return None
```
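With this in place, `extract_command("Please walk forward, robot")` returns `"move_forward"`, while an unmatched sentence returns `None`; callers should treat `None` as an unrecognized command and ask the user to rephrase.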
### Using LLMs for Understanding
For more complex commands, we can use large language models:
```python
import openai

def parse_complex_command(text):
    prompt = f"""
    Parse the following human command to a robot and return the appropriate action(s):
    Command: "{text}"
    Available actions: move_forward, turn_left, turn_right, stop, wave, dance, pickup_object, place_object, speak_text, navigate_to, follow_person
    Response format:
    - action: <action_name>
    - parameters: <dict with any needed parameters>
    If the command cannot be parsed, respond with:
    - action: "unknown"
    - parameters: {{"text": "<original command>"}}
    """
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
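The model's reply comes back as plain text in the requested format, so the caller still has to parse it. A best-effort helper along these lines (hypothetical, and assuming the model followed the format) could recover the fields:
```python
import ast

def parse_llm_response(response_text):
    """Parse the '- action: ...' / '- parameters: ...' lines from the LLM reply."""
    action, parameters = "unknown", {}
    for line in response_text.splitlines():
        line = line.strip().lstrip("- ")
        if line.startswith("action:"):
            action = line.split(":", 1)[1].strip().strip('"')
        elif line.startswith("parameters:"):
            try:
                # The parameters are requested as a Python-style dict literal
                parameters = ast.literal_eval(line.split(":", 1)[1].strip())
            except (ValueError, SyntaxError):
                parameters = {}
    return action, parameters
```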
## Voice Command System Architecture
### Complete System Implementation
```python
import queue
import threading
from dataclasses import dataclass

import whisper

@dataclass
class VoiceCommand:
    text: str
    timestamp: float
    confidence: float

class VoiceCommandSystem:
    def __init__(self, ros_node):
        self.ros_node = ros_node
        self.command_queue = queue.Queue()
        self.is_listening = False
        self.whisper_model = whisper.load_model("base")

    def start_listening(self):
        self.is_listening = True
        # Capture thread: records audio, transcribes it, and enqueues commands
        audio_thread = threading.Thread(target=self._capture_audio, daemon=True)
        audio_thread.start()
        # Processing thread: drains the queue and executes commands
        processing_thread = threading.Thread(target=self._process_commands, daemon=True)
        processing_thread.start()

    def stop_listening(self):
        self.is_listening = False

    def _capture_audio(self):
        # Implementation for audio capture and transcription would go here
        pass

    def _process_commands(self):
        while self.is_listening:
            try:
                # Block briefly instead of spinning on an empty queue
                command = self.command_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            self._execute_robot_command(command)

    def _execute_robot_command(self, command: VoiceCommand):
        # Map the transcribed text to a robot action (extract_command from above)
        action = extract_command(command.text)
        if action == "move_forward":
            self.ros_node.move_robot_forward()
        elif action == "turn_left":
            self.ros_node.turn_robot_left()
        elif action == "wave":
            self.ros_node.perform_wave_action()
        # ... additional mappings
```
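The producer-consumer split here is deliberate: the capture thread only transcribes audio and enqueues `VoiceCommand` objects, while the processing thread drains the queue and drives the robot, so a slow or long-running action never blocks the microphone.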
## Integration with ROS 2
To integrate with ROS 2, we wrap the voice command system in a ROS 2 node that publishes movement commands and feedback:
```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist
class VoiceControlNode(Node):
    def __init__(self):
        super().__init__('voice_control_node')
        # Publisher for robot movement commands
        self.cmd_vel_publisher = self.create_publisher(Twist, 'cmd_vel', 10)
        # Publisher for voice feedback
        self.voice_feedback_publisher = self.create_publisher(String, 'voice_feedback', 10)
        # Initialize voice command system
        self.voice_system = VoiceCommandSystem(self)

    def move_robot_forward(self):
        twist = Twist()
        twist.linear.x = 0.5  # Move forward at 0.5 m/s
        self.cmd_vel_publisher.publish(twist)

    def turn_robot_left(self):
        twist = Twist()
        twist.angular.z = 0.5  # Turn left at 0.5 rad/s
        self.cmd_vel_publisher.publish(twist)

    def perform_wave_action(self):
        # Publish to robot's action server
        # Implementation would depend on specific robot capabilities
        feedback_msg = String()
        feedback_msg.data = "Performing wave action"
        self.voice_feedback_publisher.publish(feedback_msg)
```
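A standard rclpy entry point ties everything together. A minimal sketch, assuming the classes above live in the same module:
```python
def main(args=None):
    rclpy.init(args=args)
    node = VoiceControlNode()
    node.voice_system.start_listening()  # begin capturing voice commands
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.voice_system.stop_listening()  # let worker threads exit
        node.destroy_node()
        rclpy.shutdown()

if __name__ == '__main__':
    main()
```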
## Challenges in Voice-to-Action Systems
### Noise and Environment
- Background noise can affect recognition accuracy
- Robot's own sounds may interfere with recognition
- Room acoustics affect audio quality
### Language and Command Complexity
- Natural language varies greatly in how commands are expressed
- Intent recognition requires robust NLU systems
- Ambiguous commands need clarification
### Real-time Requirements
- Processing delay affects user experience
- Robot response time should match human expectations
- System should handle interruptions gracefully
## Summary
Voice-to-action systems provide a natural interface for human-robot interaction, making robots more accessible and intuitive to control. Implementing these systems requires integrating speech recognition, natural language understanding, and robot action execution.
## Exercises
1. Set up a basic audio capture system in Python
2. Install and run Whisper for speech recognition
3. Create a simple command mapping system
## Next Steps
In the next chapter, we'll explore cognitive planning systems that use Large Language Models (LLMs) to decompose complex tasks into executable subtasks.