suhail
add: course book markdown files for RAG ingestion
cc303f4
metadata
sidebar_position: 2

Chapter 1: Voice-to-Action with Whisper

Learning Objectives

  • Understand how speech recognition systems work in robotics
  • Learn about OpenAI Whisper and its capabilities
  • Implement a voice command pipeline for robot control
  • Integrate speech recognition with action execution
  • Create a complete voice-to-action system

Introduction to Voice-to-Action Systems

Voice-to-action systems enable natural human-robot interaction by allowing users to control robots using spoken commands. These systems are particularly important for humanoid robots, as they enhance the natural interaction between humans and robotic systems.

Key Components of Voice-to-Action Systems

  1. Speech Recognition: Convert spoken language to text
  2. Natural Language Understanding: Interpret the meaning of the text
  3. Action Mapping: Map understood commands to robot actions
  4. Execution: Perform the requested robot actions
  5. Feedback: Provide confirmation of actions to the user

OpenAI Whisper for Speech Recognition

OpenAI Whisper is a state-of-the-art speech recognition model that:

  • Supports multiple languages
  • Has robust performance across different accents and background noise
  • Can be fine-tuned for specific applications
  • Performs well with limited training data

Whisper Architecture

Whisper is a transformer-based model that:

  • Uses an encoder-decoder architecture
  • Processes audio in 30-second chunks
  • Outputs text in the detected language
  • Can be prompted to focus on specific domains

Whisper in Robotics Context

For robotics applications, Whisper can be used to:

  • Convert voice commands to text that can be processed by NLP systems
  • Handle background noise common in robot environments
  • Support multiple languages for international applications
  • Operate in real-time with appropriate computational resources

Implementing Voice-to-Action Pipeline

The complete voice-to-action pipeline consists of:

[Microphone] → [Audio Preprocessing] → [Whisper ASR] → [NLU] → [Action Mapping] → [Robot Execution]

Audio Preprocessing

Before sending audio to Whisper, preprocessing may include:

import pyaudio
import numpy as np
import webrtcvad
from scipy.io import wavfile

# Initialize audio stream
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,  # Whisper expects 16kHz
    input=True,
    frames_per_buffer=1024
)

# Voice activity detection to identify speech segments
vad = webrtcvad.Vad()
vad.set_mode(1)  # Aggressiveness mode

# Process audio in chunks
frames = []
for i in range(0, int(16000 / 1024 * 5)):  # 5 seconds of audio
    data = stream.read(1024)
    frames.append(data)
    # Check for voice activity if needed

Integrating Whisper

import whisper

# Load model (use 'base' or 'small' for real-time applications)
model = whisper.load_model("base")

# Transcribe audio
result = model.transcribe("audio_file.wav")
command_text = result["text"]
print(f"Recognized command: {command_text}")

Natural Language Understanding

Once speech is converted to text, we need to understand the intent:

Simple Command Recognition

# Define command patterns
COMMAND_PATTERNS = {
    "move_forward": ["move forward", "go forward", "walk forward"],
    "turn_left": ["turn left", "left turn", "rotate left"],
    "turn_right": ["turn right", "right turn", "rotate right"],
    "stop": ["stop", "halt", "freeze"],
    "wave": ["wave", "waving", "wave hello"],
    "dance": ["dance", "dancing", "perform dance"]
}

def extract_command(text):
    text_lower = text.lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            if pattern in text_lower:
                return action
    return None

Using LLMs for Understanding

For more complex commands, we can use large language models:

import openai

def parse_complex_command(text):
    prompt = f"""
    Parse the following human command to a robot and return the appropriate action(s):
    
    Command: "{text}"
    
    Available actions: move_forward, turn_left, turn_right, stop, wave, dance, pickup_object, place_object, speak_text, navigate_to, follow_person
    
    Response format: 
    - action: <action_name>
    - parameters: <dict with any needed parameters>
    
    If the command cannot be parsed, respond with:
    - action: "unknown"
    - parameters: {{"text": "<original command>"}}
    """
    
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

Voice Command System Architecture

Complete System Implementation

import asyncio
import threading
import queue
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceCommand:
    text: str
    timestamp: float
    confidence: float

class VoiceCommandSystem:
    def __init__(self, ros_node):
        self.ros_node = ros_node
        self.command_queue = queue.Queue()
        self.is_listening = False
        self.whisper_model = whisper.load_model("base")
        
    def start_listening(self):
        self.is_listening = True
        # Start audio capture thread
        audio_thread = threading.Thread(target=self._capture_audio)
        audio_thread.start()
        
        # Start processing thread
        processing_thread = threading.Thread(target=self._process_commands)
        processing_thread.start()
        
    def _capture_audio(self):
        # Implementation for audio capture would go here
        pass
        
    def _process_commands(self):
        while self.is_listening:
            if not self.command_queue.empty():
                command = self.command_queue.get()
                self._execute_robot_command(command)
                
    def _execute_robot_command(self, command: VoiceCommand):
        # Map command to robot action
        action = extract_command(command.text)
        
        if action == "move_forward":
            self.ros_node.move_robot_forward()
        elif action == "turn_left":
            self.ros_node.turn_robot_left()
        elif action == "wave":
            self.ros_node.perform_wave_action()
        # ... additional mappings

Integration with ROS 2

To integrate with ROS 2, we need to connect the voice system to ROS 2 nodes:

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class VoiceControlNode(Node):
    def __init__(self):
        super().__init__('voice_control_node')
        
        # Publisher for robot movement commands
        self.cmd_vel_publisher = self.create_publisher(Twist, 'cmd_vel', 10)
        
        # Publisher for voice feedback
        self.voice_feedback_publisher = self.create_publisher(String, 'voice_feedback', 10)
        
        # Initialize voice command system
        self.voice_system = VoiceCommandSystem(self)
        
    def move_robot_forward(self):
        twist = Twist()
        twist.linear.x = 0.5  # Move forward at 0.5 m/s
        self.cmd_vel_publisher.publish(twist)
        
    def turn_robot_left(self):
        twist = Twist()
        twist.angular.z = 0.5  # Turn left at 0.5 rad/s
        self.cmd_vel_publisher.publish(twist)
        
    def perform_wave_action(self):
        # Publish to robot's action server
        # Implementation would depend on specific robot capabilities
        feedback_msg = String()
        feedback_msg.data = "Performing wave action"
        self.voice_feedback_publisher.publish(feedback_msg)

Challenges in Voice-to-Action Systems

Noise and Environment

  • Background noise can affect recognition accuracy
  • Robot's own sounds may interfere with recognition
  • Room acoustics affect audio quality

Language and Command Complexity

  • Natural language varies greatly in how commands are expressed
  • Intent recognition requires robust NLU systems
  • Ambiguous commands need clarification

Real-time Requirements

  • Processing delay affects user experience
  • Robot response time should match human expectations
  • System should handle interruptions gracefully

Summary

Voice-to-action systems provide a natural interface for human-robot interaction, making robots more accessible and intuitive to control. Implementing these systems requires integrating speech recognition, natural language understanding, and robot action execution.

Exercises

  1. Set up a basic audio capture system in Python
  2. Install and run Whisper for speech recognition
  3. Create a simple command mapping system

Next Steps

In the next chapter, we'll explore cognitive planning systems that use Large Language Models (LLMs) to decompose complex tasks into executable subtasks.