---
sidebar_position: 2
---

# Chapter 1: Voice-to-Action with Whisper

## Learning Objectives

- Understand how speech recognition systems work in robotics
- Learn about OpenAI Whisper and its capabilities
- Implement a voice command pipeline for robot control
- Integrate speech recognition with action execution
- Create a complete voice-to-action system

## Introduction to Voice-to-Action Systems

Voice-to-action systems enable natural human-robot interaction by allowing users to control robots with spoken commands. These systems are particularly important for humanoid robots, where speech is the interface users naturally expect.

### Key Components of Voice-to-Action Systems

1. **Speech Recognition**: Convert spoken language to text
2. **Natural Language Understanding**: Interpret the meaning of the text
3. **Action Mapping**: Map understood commands to robot actions
4. **Execution**: Perform the requested robot actions
5. **Feedback**: Provide confirmation of actions to the user

## OpenAI Whisper for Speech Recognition

OpenAI Whisper is a state-of-the-art speech recognition model that:

- Supports multiple languages
- Is robust to different accents and background noise
- Can be fine-tuned for specific applications
- Performs well with limited training data

### Whisper Architecture

Whisper is a transformer-based model that:

- Uses an encoder-decoder architecture
- Processes audio in 30-second chunks
- Outputs text in the detected language
- Can be prompted to focus on specific domains

### Whisper in Robotics Context

For robotics applications, Whisper can be used to:

- Convert voice commands to text that downstream NLP systems can process
- Handle the background noise common in robot environments
- Support multiple languages for international deployments
- Operate in near real-time, given appropriate computational resources

## Implementing the Voice-to-Action Pipeline

The complete voice-to-action pipeline consists of:

```
[Microphone] → [Audio Preprocessing] → [Whisper ASR] → [NLU] → [Action Mapping] → [Robot Execution]
```

### Audio Preprocessing

Before sending audio to Whisper, preprocessing may include voice activity detection (VAD) to isolate speech segments:

```python
import numpy as np
import pyaudio
import webrtcvad
from scipy.io import wavfile

RATE = 16000                       # Whisper expects 16 kHz audio
FRAME_SAMPLES = RATE * 30 // 1000  # webrtcvad accepts 10, 20, or 30 ms frames

# Initialize the microphone stream
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=RATE,
    input=True,
    frames_per_buffer=FRAME_SAMPLES,
)

# Voice activity detection to identify speech segments
vad = webrtcvad.Vad()
vad.set_mode(1)  # Aggressiveness mode: 0 (least) to 3 (most aggressive)

# Capture 5 seconds of audio, keeping only frames that contain speech
frames = []
for _ in range(int(RATE / FRAME_SAMPLES * 5)):
    data = stream.read(FRAME_SAMPLES)
    if vad.is_speech(data, RATE):
        frames.append(data)

stream.stop_stream()
stream.close()
audio.terminate()

# Save the speech frames as a 16-bit WAV file for transcription
samples = np.frombuffer(b"".join(frames), dtype=np.int16)
wavfile.write("audio_file.wav", RATE, samples)
```

### Integrating Whisper

```python
import whisper

# Load a model; 'base' or 'small' balance accuracy and latency
# for near real-time applications
model = whisper.load_model("base")

# Transcribe the recorded audio
result = model.transcribe("audio_file.wav")
command_text = result["text"]

print(f"Recognized command: {command_text}")
```
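Writing a WAV file for every utterance adds latency. The `openai-whisper` package also accepts a NumPy array of float32 samples directly, so captured audio can be transcribed in memory. A minimal sketch, shown here loading the `audio_file.wav` recording produced by the preprocessing example (the same conversion applies to the raw `frames` bytes):

```python
import numpy as np
import whisper
from scipy.io import wavfile

model = whisper.load_model("base")

# Load the recording and rescale the 16-bit PCM samples to the
# float32 [-1.0, 1.0] range that transcribe() accepts as an array;
# Whisper expects arrays to be sampled at 16 kHz
rate, samples = wavfile.read("audio_file.wav")
assert rate == 16000
audio = samples.astype(np.float32) / 32768.0

# fp16=False avoids a half-precision warning when running on CPU
result = model.transcribe(audio, language="en", fp16=False)
print(f"Recognized command: {result['text']}")
```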
## Natural Language Understanding

Once speech is converted to text, we need to understand the intent.

### Simple Command Recognition

```python
# Define command patterns
COMMAND_PATTERNS = {
    "move_forward": ["move forward", "go forward", "walk forward"],
    "turn_left": ["turn left", "left turn", "rotate left"],
    "turn_right": ["turn right", "right turn", "rotate right"],
    "stop": ["stop", "halt", "freeze"],
    "wave": ["wave", "waving", "wave hello"],
    "dance": ["dance", "dancing", "perform dance"],
}

def extract_command(text):
    """Return the first action whose pattern appears in the text, or None."""
    text_lower = text.lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            if pattern in text_lower:
                return action
    return None
```

### Using LLMs for Understanding

For more complex commands, we can use large language models:

```python
from openai import OpenAI

# The openai>=1.0 client; reads OPENAI_API_KEY from the environment
client = OpenAI()

def parse_complex_command(text):
    prompt = f"""
    Parse the following human command to a robot and return the appropriate action(s):

    Command: "{text}"

    Available actions: move_forward, turn_left, turn_right, stop, wave, dance,
    pickup_object, place_object, speak_text, navigate_to, follow_person

    Response format:
    - action: <action_name>
    - parameters: <parameters>

    If the command cannot be parsed, respond with:
    - action: "unknown"
    - parameters: {{"text": "<original command>"}}
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

## Voice Command System Architecture

### Complete System Implementation

```python
import queue
import threading
import time
from dataclasses import dataclass

import whisper

@dataclass
class VoiceCommand:
    text: str
    timestamp: float
    confidence: float

class VoiceCommandSystem:
    def __init__(self, ros_node):
        self.ros_node = ros_node
        self.command_queue = queue.Queue()
        self.is_listening = False
        self.whisper_model = whisper.load_model("base")

    def start_listening(self):
        self.is_listening = True
        # Capture microphone audio on a background thread
        audio_thread = threading.Thread(target=self._capture_audio, daemon=True)
        audio_thread.start()
        # Execute recognized commands on a separate thread
        processing_thread = threading.Thread(target=self._process_commands, daemon=True)
        processing_thread.start()

    def _capture_audio(self):
        # Implementation for audio capture would go here: record speech
        # segments, transcribe them with self.whisper_model, and enqueue
        # VoiceCommand objects stamped with time.time()
        pass

    def _process_commands(self):
        while self.is_listening:
            try:
                command = self.command_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            self._execute_robot_command(command)

    def _execute_robot_command(self, command: VoiceCommand):
        # Map the recognized text to a robot action
        action = extract_command(command.text)
        if action == "move_forward":
            self.ros_node.move_robot_forward()
        elif action == "turn_left":
            self.ros_node.turn_robot_left()
        elif action == "wave":
            self.ros_node.perform_wave_action()
        # ... additional mappings
```
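The if/elif chain in `_execute_robot_command` grows linearly with the command vocabulary. One common alternative, sketched below as a drop-in replacement for that method, is a dispatch table mapping action names to handlers; `turn_robot_right` and `stop_robot` are hypothetical methods assumed to exist alongside the handlers shown in this chapter:

```python
def _execute_robot_command(self, command: VoiceCommand):
    # Dispatch table: action name -> handler method on the ROS node.
    # turn_robot_right and stop_robot are assumed additions to the
    # VoiceControlNode interface shown below.
    handlers = {
        "move_forward": self.ros_node.move_robot_forward,
        "turn_left": self.ros_node.turn_robot_left,
        "turn_right": self.ros_node.turn_robot_right,
        "stop": self.ros_node.stop_robot,
        "wave": self.ros_node.perform_wave_action,
    }
    action = extract_command(command.text)
    handler = handlers.get(action)
    if handler is not None:
        handler()
```

With this layout, supporting a new command takes only a new `COMMAND_PATTERNS` entry and one line in the table.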
## Integration with ROS 2

To integrate with ROS 2, we connect the voice command system to a ROS 2 node:

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class VoiceControlNode(Node):
    def __init__(self):
        super().__init__('voice_control_node')

        # Publisher for robot movement commands
        self.cmd_vel_publisher = self.create_publisher(Twist, 'cmd_vel', 10)

        # Publisher for voice feedback
        self.voice_feedback_publisher = self.create_publisher(String, 'voice_feedback', 10)

        # Initialize the voice command system
        self.voice_system = VoiceCommandSystem(self)

    def move_robot_forward(self):
        twist = Twist()
        twist.linear.x = 0.5  # Move forward at 0.5 m/s
        self.cmd_vel_publisher.publish(twist)

    def turn_robot_left(self):
        twist = Twist()
        twist.angular.z = 0.5  # Turn left at 0.5 rad/s
        self.cmd_vel_publisher.publish(twist)

    def perform_wave_action(self):
        # Publish to the robot's action server; the implementation
        # depends on the specific robot's capabilities
        feedback_msg = String()
        feedback_msg.data = "Performing wave action"
        self.voice_feedback_publisher.publish(feedback_msg)
```

A minimal entry point for running this node is given in the appendix at the end of this chapter.

## Challenges in Voice-to-Action Systems

### Noise and Environment

- Background noise can reduce recognition accuracy
- The robot's own motors and fans may interfere with recognition
- Room acoustics affect audio quality

### Language and Command Complexity

- Natural language varies greatly in how commands are expressed
- Intent recognition requires robust NLU systems
- Ambiguous commands need clarification

### Real-time Requirements

- Processing delay affects the user experience
- Robot response time should match human expectations
- The system should handle interruptions gracefully

## Summary

Voice-to-action systems provide a natural interface for human-robot interaction, making robots more accessible and intuitive to control. Implementing these systems requires integrating speech recognition, natural language understanding, and robot action execution.

## Exercises

1. Set up a basic audio capture system in Python
2. Install and run Whisper for speech recognition
3. Create a simple command mapping system

## Next Steps

In the next chapter, we'll explore cognitive planning systems that use Large Language Models (LLMs) to decompose complex tasks into executable subtasks.
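## Appendix: Running the Voice Control Node

For reference, a minimal entry point for the `VoiceControlNode` defined above, following standard rclpy conventions. This is a sketch that assumes the class is in scope (for example, defined in the same file); `start_listening` and `is_listening` are the members of the `VoiceCommandSystem` shown earlier:

```python
import rclpy

def main(args=None):
    rclpy.init(args=args)
    node = VoiceControlNode()

    # Begin capturing and processing voice commands in the background
    node.voice_system.start_listening()

    try:
        rclpy.spin(node)  # Process ROS callbacks until shutdown
    except KeyboardInterrupt:
        pass
    finally:
        node.voice_system.is_listening = False  # Stop the worker threads
        node.destroy_node()
        rclpy.shutdown()

if __name__ == '__main__':
    main()
```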