jlamperez committed on
Commit 7e8db3f · 1 Parent(s): 6e29e7e
.kiro/settings/mcp.json ADDED
@@ -0,0 +1,7 @@
{
  "mcpServers": {
    "hf-mcp-server": {
      "url": "https://huggingface.co/mcp?login"
    }
  }
}
.kiro/specs/gemini-multimodal-refactor/design.md ADDED
@@ -0,0 +1,2309 @@
# Design Document

## Overview

This design document outlines the architecture for refactoring the Mortis interactive AI Halloween experience to support multi-modal (voice and text) interaction using Google Gemini API and SmolVLA-based robotic control. The refactor transforms Mortis from a simple gesture-based system into a sophisticated manipulation robot capable of executing precise tasks through natural language commands.

### Key Design Goals

1. Replace existing LLM API with Google Gemini API for conversational AI
2. Add voice input (STT) and voice output (TTS) capabilities
3. Integrate SmolVLA model for vision-language-action robotic control
4. Implement asynchronous execution to maintain UI responsiveness
5. Support both conversational gestures and precise manipulation tasks
6. Maintain backward compatibility with existing features
7. Enable local deployment with GPU support for SmolVLA inference

### System Context

The current Mortis system uses:
- Gradio web interface with chat and webcam view
- Generic LLM API with structured tool calling for gesture control
- LeRobot SO101Follower for predefined gesture sequences
- Synchronous execution model

The refactored system will add:
- Google Gemini API integration with intent detection
- Audio input/output components in Gradio
- SmolVLA model for learned manipulation behaviors
- Asynchronous task execution with message queuing
- Dataset collection and training infrastructure

## Architecture

### High-Level Architecture Diagram

```mermaid
graph TB
    subgraph "Gradio Web Interface"
        UI[User Interface]
        Audio[Audio Input/Output]
        Chat[Chat Interface]
        Video[Webcam View]
    end

    subgraph "Application Layer"
        STT[Speech-to-Text Service]
        TTS[Text-to-Speech Service]
        Gemini[Gemini API Client]
        IntentRouter[Intent Router]
    end

    subgraph "Execution Layer"
        Queue[Message Queue]
        GestureExec[Gesture Executor]
        SmolVLAExec[SmolVLA Executor]
    end

    subgraph "Robot Control"
        SO101[SO101 Follower Driver]
        SmolVLA[SmolVLA Model]
        Camera[Camera Feed]
    end

    subgraph "Training Infrastructure"
        DataCollect[Data Collection]
        Dataset[LeRobot Dataset]
        Training[Training Pipeline]
    end

    UI --> Audio
    UI --> Chat
    Audio --> STT
    Chat --> Gemini
    STT --> Gemini
    Gemini --> IntentRouter
    IntentRouter --> Queue
    Queue --> GestureExec
    Queue --> SmolVLAExec
    GestureExec --> SO101
    SmolVLAExec --> SmolVLA
    SmolVLA --> SO101
    Gemini --> TTS
    TTS --> Audio
    Camera --> Video
    Camera --> SmolVLA
    DataCollect --> Dataset
    Dataset --> Training
    Training --> SmolVLA
```

### Architecture Layers

#### 1. Presentation Layer (Gradio Interface)
- Handles user interaction through web browser
- Provides audio input component for voice recording
- Displays chat messages and system responses
- Shows webcam feed for visual monitoring
- Plays audio responses through browser

#### 2. Application Layer (Business Logic)
- Gemini API client for conversational AI
- STT service for voice-to-text conversion
- TTS service for text-to-voice conversion
- Intent router to distinguish between conversational and manipulation commands
- Response formatter for structured outputs

#### 3. Execution Layer (Asynchronous Processing)
- Message queue for decoupling UI from long-running operations
- Gesture executor for predefined movement sequences
- SmolVLA executor for learned manipulation tasks
- Status tracking and progress reporting

#### 4. Robot Control Layer (Hardware Interface)
- SO101Follower driver for low-level servo control
- SmolVLA model for vision-language-action inference
- Camera interface for visual observations
- Safety monitoring and error recovery

#### 5. Training Infrastructure (Offline)
- Data collection tools for recording demonstrations
- LeRobot dataset management
- Training pipeline for SmolVLA model
- Model evaluation and validation

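The wiring between these layers can be sketched as a small bootstrap routine. The module paths (`mortis.executor`, `mortis.gemini_client`, `mortis.smolvla_executor`) are illustrative assumptions; the classes are the ones specified in the component sections below.

```python
import os

# Hypothetical module layout; the classes come from the component sections below.
from mortis.executor import AsyncExecutor
from mortis.gemini_client import GeminiClient
from mortis.smolvla_executor import init_smolvla


def bootstrap():
    """Wire the application, execution, and robot-control layers at startup."""
    # Application layer: conversational AI client
    gemini = GeminiClient(api_key=os.environ["GEMINI_API_KEY"])

    # Execution layer: background worker consuming the task queue
    executor = AsyncExecutor()
    executor.start()

    # Robot control layer: load the trained SmolVLA policy once
    init_smolvla(os.getenv("SMOLVLA_CHECKPOINT_PATH", "checkpoints/smolvla_best.pt"))

    return gemini, executor
```
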
126
+ ## Components and Interfaces
127
+
128
+ ### 1. Gemini API Integration
129
+
130
+ #### Component: `GeminiClient`
131
+
132
+ **Purpose:** Manages all interactions with Google Gemini API for conversational AI and intent detection.
133
+
134
+ **Key Methods:**
135
+ - `send_message(user_input: str, conversation_history: list) -> GeminiResponse`
136
+ - `detect_intent(user_input: str) -> Intent`
137
+ - `configure_model(model_name: str, temperature: float)`
138
+
139
+ **Configuration:**
140
+ ```python
141
+ # Environment variables
142
+ GEMINI_API_KEY=your_google_api_key
143
+ GEMINI_MODEL=gemini-2.0-flash-exp # or gemini-1.5-pro
144
+ GEMINI_TEMPERATURE=0.2
145
+ ```
146
+
147
+ **System Prompt Design:**
148
+
149
+ The Gemini system prompt must accomplish two critical functions:
150
+
151
+ 1. **Character Maintenance:** Preserve Mortis personality (mischievous Halloween spirit)
152
+ 2. **Intent Detection:** Identify manipulation task commands vs. conversational input
153
+
154
+ ```python
155
+ GEMINI_SYSTEM_PROMPT = """
156
+ You are Mortis, a mischievous Halloween spirit inhabiting a robotic arm.
157
+
158
+ MANIPULATION TASKS:
159
+ You can perform these exact manipulation tasks:
160
+ - "Pick up the skull and place it in the green cup"
161
+ - "Pick up the skull and place it in the orange cup"
162
+ - "Pick up the skull and place it in the purple cup"
163
+ - "Pick up the eyeball and place it in the green cup"
164
+ - "Pick up the eyeball and place it in the orange cup"
165
+ - "Pick up the eyeball and place it in the purple cup"
166
+
167
+ RESPONSE FORMAT:
168
+ If user input matches a manipulation task (even with variations):
169
+ {
170
+ "type": "manipulation",
171
+ "command": "<exact_task_string>",
172
+ "message": "<short in-character response, <=30 words>",
173
+ "mood": "<ominous|playful|angry|nervous|triumphant|mischievous|sinister|curious|neutral>"
174
+ }
175
+
176
+ If user input is conversational:
177
+ {
178
+ "type": "conversation",
179
+ "message": "<short in-character response, <=30 words>",
180
+ "mood": "<mood>",
181
+ "gesture": "<idle|wave|point_left|point_right|grab|drop>"
182
+ }
183
+
184
+ Keep responses brief, in-character, no emojis or markdown.
185
+ """
186
+ ```
187
+
188
+ **Google SDK Usage:**
189
+
190
+ ```python
191
+ import google.generativeai as genai
192
+
193
+ genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
194
+ model = genai.GenerativeModel('gemini-2.0-flash-exp')
195
+
196
+ # For structured output, use JSON mode
197
+ generation_config = {
198
+ "temperature": 0.2,
199
+ "response_mime_type": "application/json"
200
+ }
201
+ ```
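To make `send_message` concrete, here is a minimal sketch under the configuration above. It assumes the `GEMINI_SYSTEM_PROMPT` defined earlier and the chat interface of the `google-generativeai` SDK; error handling is deferred to the Error Handling section.

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# JSON mode plus the system prompt yields the structured responses shown above.
model = genai.GenerativeModel(
    os.getenv("GEMINI_MODEL", "gemini-2.0-flash-exp"),
    system_instruction=GEMINI_SYSTEM_PROMPT,
    generation_config={
        "temperature": float(os.getenv("GEMINI_TEMPERATURE", "0.2")),
        "response_mime_type": "application/json",
    },
)


def send_message(user_input: str, conversation_history: list) -> dict:
    """Send one user turn and return the parsed structured response."""
    chat = model.start_chat(history=conversation_history)
    reply = chat.send_message(user_input)
    return json.loads(reply.text)
```
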
### 2. Speech-to-Text (STT) Integration

#### Component: `STTService`

**Purpose:** Convert user voice input to text for processing by Gemini.

**Architecture Decision: Cloud vs. Local STT**

| Approach | Pros | Cons | Recommendation |
|----------|------|------|----------------|
| **Google Speech-to-Text API** | High accuracy, fast, supports streaming, integrates with Gemini ecosystem | Requires internet, API costs, data leaves local system | **Recommended for production** |
| **Local Whisper (Hugging Face)** | Privacy-preserving, no API costs, works offline | Slower inference, requires GPU/CPU resources, lower accuracy for accents | Good for offline/privacy scenarios |
| **Gemini Audio Input** | Single API integration, context-aware | Limited to Gemini models with audio support, less control | **Best option if available** |

**Recommended Implementation: Gemini Native Audio**

Gemini 2.0 models support native audio input, eliminating the need for a separate STT step:

```python
import google.generativeai as genai

# Upload audio file
audio_file = genai.upload_file(path="user_audio.wav")

# Send to Gemini with audio
response = model.generate_content([
    "Transcribe and respond to this audio as Mortis:",
    audio_file
])
```

**Fallback Implementation: Google Speech-to-Text**

```python
from google.cloud import speech_v1

def transcribe_audio(audio_bytes: bytes) -> str:
    client = speech_v1.SpeechClient()

    audio = speech_v1.RecognitionAudio(content=audio_bytes)
    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)
    return response.results[0].alternatives[0].transcript
```

**Gradio Integration:**

```python
import gradio as gr

with gr.Blocks() as demo:
    audio_input = gr.Audio(
        sources=["microphone"],
        type="filepath",
        label="Speak to Mortis"
    )

    audio_input.change(
        fn=process_audio_input,
        inputs=[audio_input],
        outputs=[chatbot]
    )
```

### 3. Text-to-Speech (TTS) Integration

#### Component: `TTSService`

**Purpose:** Convert Gemini text responses to audio for voice output.

**Recommended Approach: Google Text-to-Speech API**

```python
from google.cloud import texttospeech

def synthesize_speech(text: str, output_path: str) -> str:
    client = texttospeech.TextToSpeechClient()

    synthesis_input = texttospeech.SynthesisInput(text=text)

    # Configure voice (creepy/ominous for Mortis)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Neural2-D",  # Deep male voice
        ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=0.9,  # Slightly slower for ominous effect
        pitch=-2.0  # Lower pitch for spooky voice
    )

    response = client.synthesize_speech(
        input=synthesis_input,
        voice=voice,
        audio_config=audio_config
    )

    with open(output_path, "wb") as out:
        out.write(response.audio_content)

    return output_path
```

**Alternative: Local TTS (pyttsx3 or gTTS)**

For lighter-weight setups (note that gTTS still requires internet access; pyttsx3 runs fully offline):

```python
from gtts import gTTS

def synthesize_speech_local(text: str, output_path: str) -> str:
    tts = gTTS(text=text, lang='en', slow=True)
    tts.save(output_path)
    return output_path
```

**Gradio Integration:**

```python
import time

import gradio as gr

def mortis_reply_with_voice(message, history, model_name):
    # Get text response from Gemini
    response_text, mood, action = process_with_gemini(message, model_name)

    # Generate audio
    audio_path = synthesize_speech(response_text, f"outputs/response_{time.time()}.mp3")

    return response_text, audio_path

with gr.Blocks() as demo:
    audio_output = gr.Audio(
        label="Mortis speaks",
        autoplay=True,
        type="filepath"
    )
```

### 4. Intent Router

#### Component: `IntentRouter`

**Purpose:** Parse Gemini responses and route them to the appropriate execution path.

**Design:**

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class IntentType(Enum):
    CONVERSATION = "conversation"
    MANIPULATION = "manipulation"

@dataclass
class Intent:
    type: IntentType
    message: str
    mood: str
    gesture: Optional[str] = None
    command: Optional[str] = None

class IntentRouter:
    def __init__(self):
        self.valid_commands = [
            "Pick up the skull and place it in the green cup",
            "Pick up the skull and place it in the orange cup",
            "Pick up the skull and place it in the purple cup",
            "Pick up the eyeball and place it in the green cup",
            "Pick up the eyeball and place it in the orange cup",
            "Pick up the eyeball and place it in the purple cup",
        ]

    def parse_gemini_response(self, response_json: dict) -> Intent:
        """Parse structured JSON response from Gemini."""
        intent_type = IntentType(response_json.get("type", "conversation"))

        if intent_type == IntentType.MANIPULATION:
            return Intent(
                type=IntentType.MANIPULATION,
                message=response_json["message"],
                mood=response_json["mood"],
                command=response_json["command"]
            )
        else:
            return Intent(
                type=IntentType.CONVERSATION,
                message=response_json["message"],
                mood=response_json["mood"],
                gesture=response_json.get("gesture", "idle")
            )

    def validate_command(self, command: str) -> bool:
        """Verify command is in the trained task set."""
        return command in self.valid_commands
```

**Execution Flow:**

```python
def process_user_input(user_input: str, model_name: str):
    # 1. Send to Gemini
    gemini_response = gemini_client.send_message(user_input)

    # 2. Parse intent
    intent = intent_router.parse_gemini_response(gemini_response)

    # 3. Route to the appropriate executor
    if intent.type == IntentType.MANIPULATION:
        if intent_router.validate_command(intent.command):
            # Queue for async SmolVLA execution
            task_queue.put({
                "type": "manipulation",
                "command": intent.command,
                "message": intent.message
            })
        else:
            # Invalid command, treat as conversation
            execute_gesture(intent.gesture or "idle")
    else:
        # Execute gesture immediately
        execute_gesture(intent.gesture)

    # 4. Generate voice response
    audio_path = tts_service.synthesize(intent.message)

    return intent.message, audio_path
```

### 5. Asynchronous Execution System

#### Component: `AsyncExecutor`

**Purpose:** Decouple long-running SmolVLA inference from the Gradio UI to maintain responsiveness.

**Architecture Decision: Message Queue vs. Background Processing**

| Approach | Pros | Cons | Recommendation |
|----------|------|------|----------------|
| **Redis Queue** | Robust, scalable, persistent, supports distributed workers | External dependency, overkill for a single machine | Good for production/multi-worker |
| **Python asyncio.Queue** | Built-in, simple, no dependencies | Single process only, not persistent | **Recommended for this use case** |
| **multiprocessing.Queue** | True parallelism, GPU isolation | Complex IPC, harder debugging | Good if GPU contention is an issue |
| **Threading + Queue** | Simple, shared memory | GIL limitations, not ideal for CPU-bound work | Not recommended for ML inference |

**Recommended Implementation: Background Worker Thread with Thread-Safe Queues**

```python
from queue import Empty, Queue
from threading import Thread

class AsyncExecutor:
    def __init__(self):
        self.task_queue = Queue()
        self.status_queue = Queue()
        self.worker_thread = None
        self.running = False

    def start(self):
        """Start background worker thread."""
        self.running = True
        self.worker_thread = Thread(target=self._worker_loop, daemon=True)
        self.worker_thread.start()

    def stop(self):
        """Stop background worker."""
        self.running = False
        if self.worker_thread:
            self.worker_thread.join(timeout=5)

    def _worker_loop(self):
        """Background thread that processes tasks."""
        while self.running:
            try:
                task = self.task_queue.get(timeout=1)
            except Empty:
                continue
            self._execute_task(task)

    def _execute_task(self, task):
        """Execute a single task."""
        try:
            if task["type"] == "manipulation":
                self.status_queue.put({"status": "running", "task": task["command"]})

                # Execute SmolVLA inference (blocking)
                smolvla_executor.execute(task["command"])

                self.status_queue.put({"status": "complete", "task": task["command"]})
            elif task["type"] == "gesture":
                mortis_arm.move_arm(task["gesture"])
                self.status_queue.put({"status": "complete", "task": task["gesture"]})
        except Exception as e:
            self.status_queue.put({"status": "error", "error": str(e)})

    def submit_task(self, task: dict):
        """Submit task for async execution."""
        self.task_queue.put(task)

    def get_status(self) -> dict:
        """Get latest status update (non-blocking)."""
        try:
            return self.status_queue.get_nowait()
        except Empty:
            return None

# Global executor instance
async_executor = AsyncExecutor()
```

**Gradio Integration with Status Updates:**

```python
import gradio as gr

def mortis_reply(message, history, model_name):
    # Process with Gemini
    intent = process_with_gemini(message, model_name)

    # Submit task asynchronously
    if intent.type == IntentType.MANIPULATION:
        async_executor.submit_task({
            "type": "manipulation",
            "command": intent.command
        })
        status_msg = f"🤖 Executing: {intent.command}..."
    else:
        async_executor.submit_task({
            "type": "gesture",
            "gesture": intent.gesture
        })
        status_msg = f"👻 {intent.gesture}"

    # Generate audio response
    audio_path = tts_service.synthesize(intent.message)

    return intent.message, audio_path, status_msg

def check_status():
    """Periodic status checker for Gradio."""
    status = async_executor.get_status()
    if status:
        if status["status"] == "complete":
            return f"✅ Completed: {status['task']}"
        elif status["status"] == "running":
            return f"⏳ Running: {status['task']}"
        elif status["status"] == "error":
            return f"❌ Error: {status['error']}"
    return "Idle"

with gr.Blocks() as demo:
    status_display = gr.Textbox(label="Robot Status", value="Idle")

    # Update status every 500 ms
    demo.load(
        fn=check_status,
        outputs=[status_display],
        every=0.5
    )
```

### 6. SmolVLA Model Integration

#### Component: `SmolVLAExecutor`

**Purpose:** Execute vision-language-action inference for manipulation tasks.

**LeRobot SmolVLA Overview:**

SmolVLA is a vision-language-action model that:
- Takes visual observations (camera images) as input
- Accepts natural language task descriptions
- Outputs robot actions (joint positions/velocities)
- Is trained end-to-end on demonstration data

**Model Architecture:**

```python
import torch
from lerobot.common.policies.smolvla.configuration_smolvla import SmolVLAConfig
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

class SmolVLAExecutor:
    def __init__(self, checkpoint_path: str, device: str = "cuda"):
        self.device = device
        self.policy = self._load_model(checkpoint_path)
        self.camera = self._init_camera()

    def _load_model(self, checkpoint_path: str) -> SmolVLAPolicy:
        """Load trained SmolVLA model from checkpoint."""
        config = SmolVLAConfig.from_pretrained(checkpoint_path)
        policy = SmolVLAPolicy.from_pretrained(
            checkpoint_path,
            config=config
        )
        policy.to(self.device)
        policy.eval()
        return policy

    def _init_camera(self):
        """Initialize camera for visual observations."""
        from lerobot.common.robot_devices.cameras.opencv import OpenCVCamera
        camera = OpenCVCamera(camera_index=0, fps=30, width=640, height=480)
        camera.connect()
        return camera

    def execute(self, command: str, max_steps: int = 500) -> bool:
        """
        Execute manipulation task using SmolVLA.

        Args:
            command: Natural language task description
            max_steps: Maximum inference steps
        """
        print(f"SmolVLA executing: {command}")

        with torch.no_grad():
            for step in range(max_steps):
                # Capture current observation
                observation = self._get_observation()

                # Add task instruction
                observation["task"] = command

                # Run inference
                action = self.policy.select_action(observation)

                # Send action to robot
                self._send_action(action)

                # Check if task complete (implementation-specific)
                if self._is_task_complete(observation, step):
                    break

        print(f"SmolVLA completed: {command}")
        return True

    def _get_observation(self) -> dict:
        """Get current robot observation."""
        # Capture image
        image = self.camera.read()

        # Get robot state
        robot_state = mortis_arm.robot.get_state()

        return {
            "observation.image": torch.from_numpy(image).to(self.device),
            "observation.state": torch.tensor(robot_state).to(self.device)
        }

    def _send_action(self, action: torch.Tensor):
        """Send predicted action to robot."""
        action_dict = self._action_to_dict(action)
        mortis_arm.robot.send_action(action_dict)

    def _action_to_dict(self, action: torch.Tensor) -> dict:
        """Convert action tensor to SO101 command format."""
        # Map action dimensions to joint names
        joint_names = [
            "shoulder_pan.pos",
            "shoulder_lift.pos",
            "elbow_flex.pos",
            "wrist_flex.pos",
            "wrist_roll.pos",
            "gripper.pos"
        ]

        return {
            name: float(action[i].cpu().numpy())
            for i, name in enumerate(joint_names)
        }

    def _is_task_complete(self, observation: dict, step: int) -> bool:
        """Determine if task is complete (heuristic or learned)."""
        # Simple heuristic: fixed number of steps
        # In practice, could use a learned termination classifier
        return step >= 400

# Global SmolVLA executor
smolvla_executor = None

def init_smolvla(checkpoint_path: str):
    global smolvla_executor
    smolvla_executor = SmolVLAExecutor(checkpoint_path)
```

### 7. Dataset Collection Infrastructure

#### Component: `DataCollector`

**Purpose:** Record human demonstrations for training the SmolVLA model.

**LeRobot Dataset Format:**

LeRobot uses a standardized dataset format with:
- Episodes: individual task demonstrations
- Observations: camera images, robot states
- Actions: robot joint commands
- Metadata: task descriptions, timestamps

**Data Collection Script:**

```python
import time
from pathlib import Path

import numpy as np
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

class DataCollector:
    def __init__(self, dataset_name: str, repo_id: str):
        self.dataset_name = dataset_name
        self.repo_id = repo_id
        self.dataset_dir = Path(f"data/{dataset_name}")
        self.dataset_dir.mkdir(parents=True, exist_ok=True)

        self.dataset = LeRobotDataset.create(
            repo_id=repo_id,
            fps=30,
            robot_type="so101",
            keys=["observation.image", "observation.state", "action"]
        )

    def record_episode(self, task_description: str, duration: float = 30.0):
        """
        Record a single demonstration episode.

        Args:
            task_description: Natural language task description
            duration: Maximum recording duration in seconds
        """
        print(f"Recording episode: {task_description}")
        print("Press ENTER to start recording...")
        input()

        episode_data = {
            "observation.image": [],
            "observation.state": [],
            "action": [],
            "timestamp": [],
            "task": task_description
        }

        start_time = time.time()
        frame_count = 0

        print("Recording... Press CTRL+C to stop")

        try:
            while time.time() - start_time < duration:
                # Capture observation
                image = camera.read()
                state = mortis_arm.robot.get_state()

                # Record current state as the "action" (for behavior cloning)
                action = state.copy()

                # Store data
                episode_data["observation.image"].append(image)
                episode_data["observation.state"].append(state)
                episode_data["action"].append(action)
                episode_data["timestamp"].append(time.time() - start_time)

                frame_count += 1
                time.sleep(1/30)  # 30 FPS

        except KeyboardInterrupt:
            print(f"\nRecording stopped. Captured {frame_count} frames")

        # Save episode to dataset
        self._save_episode(episode_data)

        print(f"Episode saved: {task_description}")

    def _save_episode(self, episode_data: dict):
        """Save episode to LeRobot dataset."""
        episode_index = len(self.dataset)

        # Convert to numpy arrays
        images = np.array(episode_data["observation.image"])
        states = np.array(episode_data["observation.state"])
        actions = np.array(episode_data["action"])

        # Add to dataset
        self.dataset.add_episode({
            "observation.image": images,
            "observation.state": states,
            "action": actions,
            "episode_index": episode_index,
            "task": episode_data["task"]
        })

        # Save to disk
        self.dataset.save_to_disk(self.dataset_dir)

    def push_to_hub(self):
        """Upload dataset to the Hugging Face Hub."""
        self.dataset.push_to_hub(self.repo_id)
        print(f"Dataset pushed to: https://huggingface.co/datasets/{self.repo_id}")

# Usage script
def collect_demonstrations():
    collector = DataCollector(
        dataset_name="mortis_manipulation",
        repo_id="your-username/mortis-manipulation"
    )

    tasks = [
        "Pick up the skull and place it in the green cup",
        "Pick up the skull and place it in the orange cup",
        "Pick up the skull and place it in the purple cup",
        "Pick up the eyeball and place it in the green cup",
        "Pick up the eyeball and place it in the orange cup",
        "Pick up the eyeball and place it in the purple cup",
    ]

    for task in tasks:
        print(f"\n{'='*60}")
        print(f"Task: {task}")
        print(f"{'='*60}")

        # Record multiple demonstrations per task
        for demo_num in range(5):
            print(f"\nDemonstration {demo_num + 1}/5")
            collector.record_episode(task)

    # Upload to Hugging Face once all tasks are recorded
    collector.push_to_hub()
```

### 8. Training Pipeline

#### Component: `TrainingPipeline`

**Purpose:** Train the SmolVLA model on collected demonstration data.

**LeRobot Training Configuration:**

```yaml
# config/train_smolvla.yaml
defaults:
  - _self_
  - policy: smolvla

seed: 1000
dataset_repo_id: your-username/mortis-manipulation
video_backend: pyav

training:
  offline_steps: 100000
  online_steps: 0
  eval_freq: 10000
  save_freq: 10000
  log_freq: 100
  save_checkpoint: true

  batch_size: 8
  lr: 1e-4
  lr_scheduler: cosine
  lr_warmup_steps: 1000
  adam_betas: [0.9, 0.999]
  adam_weight_decay: 1e-6
  grad_clip_norm: 10.0

  delta_timestamps:
    action: "[i / ${fps} for i in range(${policy.chunk_size})]"

eval:
  n_episodes: 10
  batch_size: 10

policy:
  name: smolvla

  # Input dimensions
  input_shapes:
    observation.image: [3, 224, 224]
    observation.state: [6]  # 6 joints

  output_shapes:
    action: [6]  # 6 joint commands

  # Model architecture
  vision_backbone: "google/siglip-so400m-patch14-384"
  pretrained_backbone_weights: "google/siglip-so400m-patch14-384"

  # Action prediction
  chunk_size: 50  # Predict 50 steps ahead
  n_action_steps: 50

  # Training
  use_language_conditioning: true
  dropout: 0.1

device: cuda
use_amp: true  # Automatic mixed precision
```

**Training Script:**

```python
import hydra
from omegaconf import DictConfig

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.scripts.train import train

@hydra.main(config_path="config", config_name="train_smolvla", version_base="1.2")
def train_smolvla(cfg: DictConfig):
    """
    Train the SmolVLA model using the LeRobot training pipeline.

    Usage:
        python -m mortis.train
    """
    # Load dataset
    dataset = LeRobotDataset(
        repo_id=cfg.dataset_repo_id,
        split="train"
    )

    print(f"Dataset loaded: {len(dataset)} episodes")
    print(f"Training for {cfg.training.offline_steps} steps")

    # Run training
    train(cfg)

    print("Training complete!")
    print(f"Checkpoints saved to: outputs/train/{cfg.run_name}")

if __name__ == "__main__":
    train_smolvla()
```

**Simplified Training Command:**

```bash
# Using the lerobot CLI
python -m lerobot.scripts.train \
  policy=smolvla \
  env=so101 \
  dataset_repo_id=your-username/mortis-manipulation \
  training.offline_steps=100000 \
  training.batch_size=8 \
  training.save_freq=10000 \
  device=cuda \
  wandb.enable=true \
  wandb.project=mortis-smolvla
```

**Training Monitoring:**

```python
# Integration with Weights & Biases for tracking
import wandb

wandb.init(
    project="mortis-smolvla",
    config={
        "dataset": "mortis-manipulation",
        "policy": "smolvla",
        "batch_size": 8,
        "learning_rate": 1e-4
    }
)

# Logged automatically by LeRobot:
# - Training loss
# - Validation loss
# - Action prediction accuracy
# - Episode success rate
# - Sample predictions (videos)
```

## Data Models

### 1. Gemini Response Model

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ResponseType(Enum):
    CONVERSATION = "conversation"
    MANIPULATION = "manipulation"

class Mood(Enum):
    OMINOUS = "ominous"
    PLAYFUL = "playful"
    ANGRY = "angry"
    NERVOUS = "nervous"
    TRIUMPHANT = "triumphant"
    MISCHIEVOUS = "mischievous"
    SINISTER = "sinister"
    CURIOUS = "curious"
    NEUTRAL = "neutral"

class Gesture(Enum):
    IDLE = "idle"
    WAVE = "wave"
    POINT_LEFT = "point_left"
    POINT_RIGHT = "point_right"
    GRAB = "grab"
    DROP = "drop"

@dataclass
class GeminiResponse:
    """Structured response from the Gemini API."""
    type: ResponseType
    message: str
    mood: Mood
    gesture: Optional[Gesture] = None
    command: Optional[str] = None

    @classmethod
    def from_json(cls, data: dict) -> 'GeminiResponse':
        """Parse JSON response from Gemini."""
        response_type = ResponseType(data["type"])

        if response_type == ResponseType.MANIPULATION:
            return cls(
                type=response_type,
                message=data["message"],
                mood=Mood(data["mood"]),
                command=data["command"]
            )
        else:
            return cls(
                type=response_type,
                message=data["message"],
                mood=Mood(data["mood"]),
                gesture=Gesture(data.get("gesture", "idle"))
            )
```
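A quick round-trip with this model, using a sample payload in the RESPONSE FORMAT defined by the system prompt (the message text here is illustrative):

```python
payload = {
    "type": "manipulation",
    "message": "Your wish feeds my power...",
    "mood": "sinister",
    "command": "Pick up the skull and place it in the green cup",
}

resp = GeminiResponse.from_json(payload)
assert resp.type is ResponseType.MANIPULATION
assert resp.command is not None and resp.gesture is None  # no gesture on manipulation
```
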
### 2. Task Execution Model

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TaskStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"

class TaskType(Enum):
    GESTURE = "gesture"
    MANIPULATION = "manipulation"

@dataclass
class Task:
    """Represents a robot task for execution."""
    id: str
    type: TaskType
    status: TaskStatus
    created_at: float
    started_at: Optional[float] = None
    completed_at: Optional[float] = None
    error: Optional[str] = None

    # Task-specific data
    gesture: Optional[str] = None
    command: Optional[str] = None

    @classmethod
    def create_gesture_task(cls, gesture: str) -> 'Task':
        """Create a gesture execution task."""
        return cls(
            id=f"gesture_{time.time()}",
            type=TaskType.GESTURE,
            status=TaskStatus.QUEUED,
            created_at=time.time(),
            gesture=gesture
        )

    @classmethod
    def create_manipulation_task(cls, command: str) -> 'Task':
        """Create a manipulation execution task."""
        return cls(
            id=f"manipulation_{time.time()}",
            type=TaskType.MANIPULATION,
            status=TaskStatus.QUEUED,
            created_at=time.time(),
            command=command
        )

    def start(self):
        """Mark task as started."""
        self.status = TaskStatus.RUNNING
        self.started_at = time.time()

    def complete(self):
        """Mark task as completed."""
        self.status = TaskStatus.COMPLETE
        self.completed_at = time.time()

    def fail(self, error: str):
        """Mark task as failed."""
        self.status = TaskStatus.FAILED
        self.completed_at = time.time()
        self.error = error

    @property
    def duration(self) -> Optional[float]:
        """Get task execution duration."""
        if self.started_at and self.completed_at:
            return self.completed_at - self.started_at
        return None
```
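The `AsyncExecutor` in section 5 currently passes raw dicts between its queues; a small wrapper like the one below (a sketch, not yet part of the executor) would let the worker record lifecycle timestamps and errors on `Task` objects instead:

```python
def run_task(task: Task, executor_fn) -> Task:
    """Run a queued Task, recording lifecycle timestamps and any error."""
    task.start()
    try:
        executor_fn()  # e.g. lambda: smolvla_executor.execute(task.command)
        task.complete()
    except Exception as exc:
        task.fail(str(exc))
    return task
```
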
### 3. Dataset Episode Model

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Episode:
    """Represents a single demonstration episode."""
    episode_index: int
    task_description: str
    images: np.ndarray      # Shape: (T, H, W, 3)
    states: np.ndarray      # Shape: (T, 6)
    actions: np.ndarray     # Shape: (T, 6)
    timestamps: np.ndarray  # Shape: (T,)

    @property
    def length(self) -> int:
        """Number of timesteps in the episode."""
        return len(self.timestamps)

    @property
    def duration(self) -> float:
        """Episode duration in seconds."""
        return self.timestamps[-1] - self.timestamps[0]

    def validate(self) -> bool:
        """Validate episode data consistency."""
        lengths = [
            len(self.images),
            len(self.states),
            len(self.actions),
            len(self.timestamps)
        ]
        return len(set(lengths)) == 1  # All same length
```
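Intended usage is to validate an episode before it is written to the dataset; a minimal example with dummy arrays:

```python
import numpy as np

episode = Episode(
    episode_index=0,
    task_description="Pick up the skull and place it in the green cup",
    images=np.zeros((3, 480, 640, 3), dtype=np.uint8),  # 3 dummy frames
    states=np.zeros((3, 6)),
    actions=np.zeros((3, 6)),
    timestamps=np.array([0.0, 1 / 30, 2 / 30]),  # 30 FPS spacing
)

assert episode.validate(), "inconsistent stream lengths"
print(episode.length, f"{episode.duration:.3f}s")  # 3 frames, ~0.067s
```
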
## Error Handling

### Error Categories and Recovery Strategies

#### 1. Gemini API Errors

**Error Types:**
- Authentication failures (invalid API key)
- Rate limiting (quota exceeded)
- Network timeouts
- Invalid responses (malformed JSON)

**Recovery Strategy:**

```python
import time
from typing import Optional

import google.generativeai as genai
from google.api_core import exceptions as gcloud_exceptions

class GeminiAPIError(Exception):
    """Base exception for Gemini API errors."""
    pass

class GeminiClient:
    def __init__(self, api_key: str, max_retries: int = 3):
        self.api_key = api_key
        self.max_retries = max_retries

    def send_message_with_retry(
        self,
        message: str,
        retry_count: int = 0
    ) -> Optional[GeminiResponse]:
        """Send message with exponential backoff retry."""
        try:
            response = self._send_message(message)
            return response

        except genai.types.BlockedPromptException as e:
            # Content safety filter triggered
            print(f"Prompt blocked by safety filter: {e}")
            return self._get_fallback_response()

        except gcloud_exceptions.ResourceExhausted:
            # google-generativeai raises ResourceExhausted on quota/rate limits
            if retry_count < self.max_retries:
                wait_time = 2 ** retry_count  # Exponential backoff
                print(f"Rate limited. Retrying in {wait_time}s...")
                time.sleep(wait_time)
                return self.send_message_with_retry(message, retry_count + 1)
            else:
                raise GeminiAPIError("Max retries exceeded for rate limit")

        except Exception as e:
            print(f"Gemini API error: {e}")
            return self._get_fallback_response()

    def _get_fallback_response(self) -> GeminiResponse:
        """Return a safe fallback response on API failure."""
        return GeminiResponse(
            type=ResponseType.CONVERSATION,
            message="The spirits are restless... try again.",
            mood=Mood.OMINOUS,
            gesture=Gesture.IDLE
        )
```

#### 2. STT/TTS Errors

**Error Types:**
- Audio format incompatibility
- Service unavailable
- Transcription failures (unclear audio)

**Recovery Strategy:**

```python
from typing import Optional

class AudioProcessingError(Exception):
    """Base exception for audio processing errors."""
    pass

def process_audio_with_fallback(audio_path: str) -> str:
    """Process audio, falling back from Gemini native audio to Google STT."""
    try:
        # Try Gemini native audio
        transcript = transcribe_with_gemini(audio_path)
        return transcript

    except Exception as e:
        print(f"Gemini audio processing failed: {e}")

        try:
            # Fall back to Google STT
            transcript = transcribe_with_google_stt(audio_path)
            return transcript

        except Exception as e:
            print(f"Google STT failed: {e}")
            raise AudioProcessingError(
                "Could not process audio. Please use text input."
            )

def synthesize_speech_with_fallback(text: str) -> Optional[str]:
    """Synthesize speech, falling back to text-only output."""
    try:
        audio_path = synthesize_with_google_tts(text)
        return audio_path

    except Exception as e:
        print(f"TTS failed: {e}. Returning text only.")
        return None  # UI will display text without audio
```

#### 3. SmolVLA Inference Errors

**Error Types:**
- Model loading failures
- GPU out of memory
- Invalid observations
- Action execution failures

**Recovery Strategy:**

```python
import torch

class SmolVLAError(Exception):
    """Base exception for SmolVLA errors."""
    pass

class SmolVLAExecutor:
    def execute_with_safety(self, command: str) -> bool:
        """Execute command with safety checks and recovery."""
        try:
            # Pre-execution validation
            if not self._validate_command(command):
                raise SmolVLAError(f"Invalid command: {command}")

            if not self._check_workspace_clear():
                raise SmolVLAError("Workspace not clear. Remove obstacles.")

            # Execute with timeout
            success = self._execute_with_timeout(command, timeout=60.0)

            if not success:
                raise SmolVLAError("Execution timeout")

            return True

        except torch.cuda.OutOfMemoryError:
            print("GPU OOM. Clearing cache and retrying...")
            torch.cuda.empty_cache()
            return self._execute_with_timeout(command, timeout=60.0)

        except Exception as e:
            print(f"SmolVLA execution failed: {e}")
            # Return to safe position
            self._emergency_stop()
            return False

    def _emergency_stop(self):
        """Return robot to a safe idle position."""
        print("Emergency stop: returning to idle position")
        mortis_arm.move_arm("idle")

    def _validate_command(self, command: str) -> bool:
        """Validate command is in the trained set."""
        return command in self.valid_commands

    def _check_workspace_clear(self) -> bool:
        """Check whether the workspace is safe for execution."""
        # Could use computer vision to detect obstacles
        # For now, assume clear
        return True
```

#### 4. Robot Hardware Errors

**Error Types:**
- Connection failures
- Servo errors
- Position limits exceeded
- Communication timeouts

**Recovery Strategy:**

```python
import time

class RobotError(Exception):
    """Base exception for robot hardware errors."""
    pass

class MortisArm:
    def move_arm_safe(self, gesture_name: str) -> bool:
        """Execute gesture with error handling."""
        if not self.connected:
            try:
                self.connect()
            except Exception as e:
                print(f"Failed to connect to robot: {e}")
                return False

        try:
            self.move_arm(gesture_name)
            return True

        except Exception as e:
            print(f"Gesture execution failed: {e}")

            # Attempt recovery
            try:
                print("Attempting to reconnect...")
                self.disconnect()
                time.sleep(1)
                self.connect()
                self.move_arm("idle")
                return False

            except Exception as e:
                print(f"Recovery failed: {e}")
                self.connected = False
                return False
```

### Error Reporting to User

```python
def format_error_message(error: Exception) -> str:
    """Format an error for user display."""
    error_messages = {
        GeminiAPIError: "🔮 The spirits are not responding. Please try again.",
        AudioProcessingError: "🎤 Could not understand audio. Please try text input.",
        SmolVLAError: "🤖 Mortis cannot perform that action right now.",
        RobotError: "⚠️ Robot connection lost. Attempting to reconnect...",
    }

    error_type = type(error)
    return error_messages.get(error_type, "❌ An unexpected error occurred.")
```
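One way the UI handler might surface these messages (a sketch; `mortis_reply` is the three-output handler defined in section 5):

```python
def safe_mortis_reply(message, history, model_name):
    """Wrap the chat handler so any failure maps to an in-character message."""
    try:
        return mortis_reply(message, history, model_name)
    except Exception as exc:
        friendly = format_error_message(exc)
        return friendly, None, friendly  # chat text, no audio, status line
```
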
## Testing Strategy

### 1. Unit Testing

**Components to Test:**
- Gemini API client (with mocked responses)
- Intent router (parsing and validation)
- Data models (serialization/deserialization)
- Audio processing utilities

**Example Test:**

```python
import pytest
from unittest.mock import Mock, patch

from mortis.gemini_client import GeminiClient, GeminiResponse, ResponseType

def test_gemini_response_parsing():
    """Test parsing of Gemini JSON responses."""
    # Test conversation response
    conv_data = {
        "type": "conversation",
        "message": "Beware, mortal...",
        "mood": "ominous",
        "gesture": "wave"
    }
    response = GeminiResponse.from_json(conv_data)
    assert response.type == ResponseType.CONVERSATION
    assert response.gesture.value == "wave"

    # Test manipulation response
    manip_data = {
        "type": "manipulation",
        "message": "As you wish...",
        "mood": "sinister",
        "command": "Pick up the skull and place it in the green cup"
    }
    response = GeminiResponse.from_json(manip_data)
    assert response.type == ResponseType.MANIPULATION
    assert response.command is not None

@patch('google.generativeai.GenerativeModel')
def test_gemini_client_retry(mock_model):
    """Test retry logic for API failures."""
    from google.api_core import exceptions as gcloud_exceptions

    client = GeminiClient(api_key="test_key", max_retries=3)

    # Simulate a rate limit error, then success
    mock_model.return_value.generate_content.side_effect = [
        gcloud_exceptions.ResourceExhausted("Rate limited"),
        Mock(text='{"type": "conversation", "message": "Hello", "mood": "neutral", "gesture": "idle"}')
    ]

    response = client.send_message_with_retry("Hello")
    assert response is not None
    assert mock_model.return_value.generate_content.call_count == 2
```

### 2. Integration Testing

**Test Scenarios:**
- End-to-end voice input → Gemini → gesture execution
- Text input → intent detection → SmolVLA execution
- Dataset collection → training → inference pipeline
- Error recovery flows

**Example Test:**

```python
@pytest.mark.integration
def test_voice_to_gesture_flow():
    """Test complete voice input to gesture execution."""
    # Recorded test audio fixture
    test_audio = "tests/fixtures/test_wave.wav"

    # Process audio
    transcript = process_audio(test_audio)
    assert "wave" in transcript.lower()

    # Send to Gemini
    response = gemini_client.send_message(transcript)
    assert response.type == ResponseType.CONVERSATION
    assert response.gesture == Gesture.WAVE

    # Execute gesture (with mock robot)
    with patch.object(mortis_arm, 'move_arm') as mock_move:
        execute_gesture(response.gesture)
        mock_move.assert_called_once_with("wave")

@pytest.mark.integration
@pytest.mark.slow
def test_smolvla_inference():
    """Test SmolVLA model inference (requires GPU)."""
    if not torch.cuda.is_available():
        pytest.skip("GPU not available")

    # Load test checkpoint
    executor = SmolVLAExecutor("tests/fixtures/test_checkpoint")

    # Execute test command
    command = "Pick up the skull and place it in the green cup"
    success = executor.execute(command, max_steps=10)

    assert success
```

### 3. System Testing

**Test Scenarios:**
- Multi-user concurrent access
- Long-running operation stability
- Resource usage (GPU memory, CPU)
- Network failure recovery

**Performance Benchmarks:**

```python
import time

@pytest.mark.benchmark
def test_gemini_response_time():
    """Benchmark Gemini API response time."""
    times = []
    for _ in range(10):
        start = time.time()
        response = gemini_client.send_message("Hello Mortis")
        elapsed = time.time() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    assert avg_time < 2.0, f"Average response time {avg_time}s exceeds 2s threshold"

@pytest.mark.benchmark
def test_smolvla_inference_time():
    """Benchmark SmolVLA inference speed."""
    executor = SmolVLAExecutor("checkpoints/best_model")

    start = time.time()
    executor.execute("Pick up the skull and place it in the green cup", max_steps=100)
    elapsed = time.time() - start

    assert elapsed < 30.0, f"Inference time {elapsed}s exceeds 30s threshold"
```

### 4. User Acceptance Testing

**Test Scenarios:**
- Voice recognition accuracy with different accents
- Task success rate for manipulation commands
- UI responsiveness during long operations
- Error message clarity and helpfulness

**Manual Test Checklist:**

```markdown
## Voice Input Testing
- [ ] Clear speech recognized correctly
- [ ] Background noise handled gracefully
- [ ] Multiple languages supported (if applicable)
- [ ] Audio feedback provided to user

## Manipulation Task Testing
- [ ] All 6 trained tasks execute successfully
- [ ] Task variations handled appropriately
- [ ] Robot returns to safe position after completion
- [ ] Visual feedback clear during execution

## Error Handling Testing
- [ ] API failures display helpful messages
- [ ] Robot errors trigger safe shutdown
- [ ] Network issues handled gracefully
- [ ] Recovery procedures work as expected

## UI/UX Testing
- [ ] Interface remains responsive during tasks
- [ ] Status updates clear and timely
- [ ] Audio playback works correctly
- [ ] Webcam feed displays properly
```

### 5. Safety Testing

**Critical Safety Tests:**

```python
import time
from threading import Thread

def test_emergency_stop():
    """Test emergency stop functionality."""
    executor = SmolVLAExecutor("checkpoints/best_model")

    # Start execution
    task_thread = Thread(target=executor.execute, args=("test command",))
    task_thread.start()

    # Trigger emergency stop
    time.sleep(1)
    executor._emergency_stop()

    # Verify robot in safe position
    state = mortis_arm.robot.get_state()
    assert state == HOME_POSE

def test_workspace_collision_detection():
    """Test collision detection and avoidance."""
    # Place an obstacle in the workspace
    # Attempt a manipulation task
    # Verify the task aborts safely
    pass
```

## Deployment and Configuration

### Environment Configuration

**Required Environment Variables:**

```bash
# .env file
# Gemini API
GEMINI_API_KEY=your_google_api_key
GEMINI_MODEL=gemini-2.0-flash-exp
GEMINI_TEMPERATURE=0.2

# Google Cloud (for STT/TTS if not using Gemini native audio)
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Robot Configuration
ROBOT_PORT=/dev/ttyACM1
ROBOT_CALIBRATION_DIR=.cache/calibration/so101/

# SmolVLA Model
SMOLVLA_CHECKPOINT_PATH=checkpoints/smolvla_best.pt
SMOLVLA_DEVICE=cuda

# Application
PORT=7860
DEBUG=false

# Optional: Weights & Biases for training
WANDB_API_KEY=your_wandb_key
WANDB_PROJECT=mortis-smolvla
```
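A sketch of loading this configuration at startup with `python-dotenv` (already listed in the project dependencies); the variable names match the `.env` file above:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # required: fail fast if missing
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.0-flash-exp")
GEMINI_TEMPERATURE = float(os.getenv("GEMINI_TEMPERATURE", "0.2"))
ROBOT_PORT = os.getenv("ROBOT_PORT", "/dev/ttyACM1")
SMOLVLA_CHECKPOINT_PATH = os.getenv("SMOLVLA_CHECKPOINT_PATH", "checkpoints/smolvla_best.pt")
SMOLVLA_DEVICE = os.getenv("SMOLVLA_DEVICE", "cuda")
```
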
1643
+
1644
+ ### Dependency Management
1645
+
1646
+ **Updated pyproject.toml:**
1647
+
1648
+ ```toml
1649
+ [project]
1650
+ name = "mortis"
1651
+ version = "0.2.0"
1652
+ description = "Mortis: Multi-modal AI Halloween Experience with SmolVLA"
1653
+ requires-python = ">=3.12"
1654
+ dependencies = [
1655
+ "gradio>=5.49.1",
1656
+ "lerobot[async,feetech,intelrealsense,smolvla]>=0.4.0",
1657
+ "python-dotenv>=1.2.1",
1658
+
1659
+ # Gemini and Google Cloud
1660
+ "google-generativeai>=0.8.0",
1661
+ "google-cloud-speech>=2.26.0",
1662
+ "google-cloud-texttospeech>=2.16.0",
1663
+
1664
+ # ML and Vision
1665
+ "torch>=2.0.0",
1666
+ "torchvision>=0.15.0",
1667
+ "transformers>=4.40.0",
1668
+ "pillow>=10.0.0",
1669
+
1670
+ # Data and utilities
1671
+ "numpy>=1.24.0",
1672
+ "opencv-python>=4.8.0",
1673
+ "datasets>=2.14.0",
1674
+ ]
1675
+
1676
+ [project.optional-dependencies]
1677
+ dev = [
1678
+ "pytest>=7.4.0",
1679
+ "pytest-asyncio>=0.21.0",
1680
+ "pytest-benchmark>=4.0.0",
1681
+ "black>=23.0.0",
1682
+ "ruff>=0.1.0",
1683
+ ]
1684
+
1685
+ training = [
1686
+ "wandb>=0.16.0",
1687
+ "hydra-core>=1.3.0",
1688
+ "tensorboard>=2.14.0",
1689
+ ]
1690
+
1691
+ [project.scripts]
1692
+ mortis = "mortis.app:main"
1693
+ calibrate = "mortis.calibrate:main"
1694
+ collect-data = "mortis.collect_data:main"
1695
+ train-smolvla = "mortis.train:main"
1696
+ ```
1697
+
1698
+ ### Installation Steps
1699
+
1700
+ ```bash
1701
+ # 1. Clone repository
1702
+ git clone https://github.com/your-username/mortis.git
1703
+ cd mortis
1704
+
1705
+ # 2. Install dependencies
1706
+ make install
1707
+
1708
+ # 3. Configure environment
1709
+ cp .env.example .env
1710
+ # Edit .env with your API keys
1711
+
1712
+ # 4. Calibrate robot (first time only)
1713
+ make calibrate
1714
+
1715
+ # 5. Download or train SmolVLA model
1716
+ # Option A: Download pre-trained model
1717
+ python -m mortis.download_model --checkpoint smolvla_mortis_v1
1718
+
1719
+ # Option B: Train from scratch
1720
+ make collect-data
1721
+ make train-smolvla
1722
+
1723
+ # 6. Run application
1724
+ make run
1725
+ ```
1726
+
1727
+ ### Docker Deployment (Optional)
1728
+
1729
+ **Dockerfile:**
1730
+
1731
+ ```dockerfile
1732
+ FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
1733
+
1734
+ # Install Python and system dependencies
1735
+ # Ubuntu 22.04 ships Python 3.10, so 3.12 comes from the deadsnakes PPA
+ RUN apt-get update && apt-get install -y software-properties-common \
+     && add-apt-repository -y ppa:deadsnakes/ppa \
+     && apt-get update && apt-get install -y \
+     python3.12 \
+     python3-pip \
+     libusb-1.0-0 \
+     udev \
+     && rm -rf /var/lib/apt/lists/*
1741
+
1742
+ # Install uv package manager
1743
+ RUN pip install uv
1744
+
1745
+ WORKDIR /app
1746
+
1747
+ # Copy project files
1748
+ COPY pyproject.toml uv.lock ./
1749
+ COPY src/ ./src/
1750
+ COPY assets/ ./assets/
1751
+
1752
+ # Install dependencies
1753
+ RUN uv sync --frozen
1754
+
1755
+ # Expose Gradio port
1756
+ EXPOSE 7860
1757
+
1758
+ # Run application
1759
+ CMD ["uv", "run", "mortis"]
1760
+ ```
1761
+
1762
+ **docker-compose.yml:**
1763
+
1764
+ ```yaml
1765
+ version: '3.8'
1766
+
1767
+ services:
1768
+ mortis:
1769
+ build: .
1770
+ ports:
1771
+ - "7860:7860"
1772
+ devices:
1773
+ - /dev/ttyACM1:/dev/ttyACM1 # Robot USB connection
1774
+ volumes:
1775
+ - ./.env:/app/.env
1776
+ - ./checkpoints:/app/checkpoints
1777
+ - ./.cache:/app/.cache
1778
+ environment:
1779
+ - NVIDIA_VISIBLE_DEVICES=all
1780
+ runtime: nvidia
1781
+ restart: unless-stopped
1782
+ ```
1783
+
1784
+ ### System Requirements
1785
+
1786
+ **Minimum Requirements:**
1787
+ - CPU: 4 cores
1788
+ - RAM: 16 GB
1789
+ - GPU: NVIDIA GPU with 8GB VRAM (for SmolVLA inference)
1790
+ - Storage: 50 GB (for models and datasets)
1791
+ - OS: Ubuntu 22.04 or later
1792
+ - USB: Available port for SO101 robot
1793
+
1794
+ **Recommended Requirements:**
1795
+ - CPU: 8+ cores
1796
+ - RAM: 32 GB
1797
+ - GPU: NVIDIA RTX 3090 or better (24GB VRAM)
1798
+ - Storage: 100 GB SSD
1799
+ - Network: Stable internet for Gemini API
1800
+
1801
+ ### Monitoring and Logging
1802
+
1803
+ **Logging Configuration:**
1804
+
1805
+ ```python
+ import logging
+ import os
+ import time
+ from pathlib import Path
+
+ # Configure logging
+ LOG_DIR = Path("logs")
+ LOG_DIR.mkdir(exist_ok=True)
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+     handlers=[
+         logging.FileHandler(LOG_DIR / f"mortis_{int(time.time())}.log"),
+         logging.StreamHandler()
+     ]
+ )
+
+ logger = logging.getLogger("mortis")
+
+ # Log important events (config values read from the environment)
+ logger.info("Application started")
+ logger.info(f"Gemini model: {os.getenv('GEMINI_MODEL')}")
+ logger.info(f"SmolVLA checkpoint: {os.getenv('SMOLVLA_CHECKPOINT_PATH')}")
+ ```
1829
+
1830
+ **Metrics to Monitor:**
1831
+ - Gemini API response times
1832
+ - SmolVLA inference times
1833
+ - Task success rates
1834
+ - Error frequencies
1835
+ - GPU memory usage
1836
+ - Robot connection status
1837
+
1838
+
1839
+ ## Migration Strategy
1840
+
1841
+ ### Phase 1: Gemini API Integration (Week 1)
1842
+
1843
+ **Goals:**
1844
+ - Replace existing LLM API with Gemini
1845
+ - Maintain current gesture functionality
1846
+ - Add structured JSON response parsing
1847
+
1848
+ **Tasks:**
1849
+ 1. Create `GeminiClient` class
1850
+ 2. Update system prompt for Gemini
1851
+ 3. Modify `ask_mortis()` to use Gemini API
1852
+ 4. Test with existing gestures
1853
+ 5. Update environment configuration
1854
+
1855
+ **Validation:**
1856
+ - All existing gestures work with Gemini
1857
+ - Response times comparable to previous API
1858
+ - Character personality maintained
1859
+
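+ A minimal sketch of the Phase 1 client, assuming the `google-generativeai` SDK from the dependency list; `MORTIS_SYSTEM_PROMPT` stands in for the prompt designed in this phase:
+
+ ```python
+ import json
+ import os
+
+ import google.generativeai as genai
+
+ MORTIS_SYSTEM_PROMPT = "..."  # placeholder: character definition + task list
+
+ genai.configure(api_key=os.environ["GEMINI_API_KEY"])
+
+ model = genai.GenerativeModel(
+     model_name=os.getenv("GEMINI_MODEL", "gemini-2.0-flash-exp"),
+     system_instruction=MORTIS_SYSTEM_PROMPT,
+     generation_config={
+         "temperature": float(os.getenv("GEMINI_TEMPERATURE", "0.2")),
+         "response_mime_type": "application/json",  # request structured JSON
+     },
+ )
+
+ def ask_gemini(user_message: str) -> dict:
+     """Send one message and parse the structured JSON reply."""
+     response = model.generate_content(user_message)
+     return json.loads(response.text)
+ ```
+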
1860
+ ### Phase 2: Voice Input/Output (Week 2)
1861
+
1862
+ **Goals:**
1863
+ - Add audio input component to Gradio
1864
+ - Implement STT using Gemini native audio or Google STT
1865
+ - Add TTS for voice responses
1866
+ - Test multi-modal interaction
1867
+
1868
+ **Tasks:**
1869
+ 1. Add audio input/output components to UI
1870
+ 2. Implement STT service
1871
+ 3. Implement TTS service
1872
+ 4. Update UI to handle audio flows
1873
+ 5. Test voice interaction end-to-end
1874
+
1875
+ **Validation:**
1876
+ - Voice input transcribed accurately
1877
+ - Audio responses play correctly
1878
+ - Text input still works
1879
+ - UI remains responsive
1880
+
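+ A minimal voice-output sketch using the Google Cloud TTS client from the dependency list; the voice selection and pitch values are illustrative, not tuned:
+
+ ```python
+ from google.cloud import texttospeech
+
+ def synthesize_reply(text: str, out_path: str = "outputs/reply.mp3") -> str:
+     """Render a Gemini text reply as browser-playable MP3."""
+     client = texttospeech.TextToSpeechClient()
+     response = client.synthesize_speech(
+         input=texttospeech.SynthesisInput(text=text),
+         voice=texttospeech.VoiceSelectionParams(
+             language_code="en-US",
+             ssml_gender=texttospeech.SsmlVoiceGender.MALE,
+         ),
+         audio_config=texttospeech.AudioConfig(
+             audio_encoding=texttospeech.AudioEncoding.MP3,
+             pitch=-4.0,          # illustrative: lower pitch for a spectral voice
+             speaking_rate=0.95,  # illustrative: slightly slower delivery
+         ),
+     )
+     with open(out_path, "wb") as f:
+         f.write(response.audio_content)
+     return out_path
+ ```
+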
1881
+ ### Phase 3: Dataset Collection (Week 3)
1882
+
1883
+ **Goals:**
1884
+ - Set up data collection infrastructure
1885
+ - Record demonstrations for all 6 tasks
1886
+ - Validate and upload dataset to Hugging Face
1887
+
1888
+ **Tasks:**
1889
+ 1. Create `DataCollector` class
1890
+ 2. Set up camera and robot for recording
1891
+ 3. Record 5-10 demonstrations per task
1892
+ 4. Validate dataset quality
1893
+ 5. Push to Hugging Face Hub
1894
+
1895
+ **Validation:**
1896
+ - All 6 tasks have sufficient demonstrations
1897
+ - Data quality is high (clear images, smooth motions)
1898
+ - Dataset loads correctly in LeRobot
1899
+
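+ A quick validation sketch, assuming the LeRobot dataset API; the import path and attribute names vary across LeRobot releases, so verify against the installed version:
+
+ ```python
+ # Import path differs across LeRobot releases; adjust to the installed version.
+ from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
+
+ ds = LeRobotDataset("your-username/mortis_manipulation")  # illustrative repo id
+ print(f"Episodes: {ds.num_episodes}, frames: {ds.num_frames}")
+ print(ds.meta.tasks)  # should cover all 6 Task Strings
+ ```
+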
1900
+ ### Phase 4: SmolVLA Training (Week 4)
1901
+
1902
+ **Goals:**
1903
+ - Train SmolVLA model on collected data
1904
+ - Evaluate model performance
1905
+ - Select best checkpoint
1906
+
1907
+ **Tasks:**
1908
+ 1. Configure training pipeline
1909
+ 2. Run training for 100k steps
1910
+ 3. Monitor training metrics
1911
+ 4. Evaluate on validation set
1912
+ 5. Select and save best checkpoint
1913
+
1914
+ **Validation:**
1915
+ - Training converges (loss decreases)
1916
+ - Validation performance acceptable
1917
+ - Model can execute at least 3/6 tasks successfully
1918
+
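+ One way to launch the run, wrapping the `lerobot-train` CLI named in the requirements; the flag names follow recent LeRobot releases and should be checked against `lerobot-train --help`:
+
+ ```python
+ import subprocess
+
+ subprocess.run(
+     [
+         "lerobot-train",
+         "--policy.path=lerobot/smolvla_base",  # warm-start from pre-trained SmolVLA
+         "--dataset.repo_id=your-username/mortis_manipulation",
+         "--batch_size=64",
+         "--steps=100000",
+         "--output_dir=checkpoints/smolvla_mortis",
+         "--policy.device=cuda",
+         "--wandb.enable=true",
+     ],
+     check=True,
+ )
+ ```
+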
1919
+ ### Phase 5: Intent Detection and Routing (Week 5)
1920
+
1921
+ **Goals:**
1922
+ - Implement intent detection in Gemini prompt
1923
+ - Create intent router
1924
+ - Add command validation
1925
+
1926
+ **Tasks:**
1927
+ 1. Update Gemini system prompt with task definitions
1928
+ 2. Create `IntentRouter` class
1929
+ 3. Implement command validation
1930
+ 4. Test intent detection accuracy
1931
+ 5. Handle edge cases
1932
+
1933
+ **Validation:**
1934
+ - Manipulation commands detected correctly (>90% accuracy)
1935
+ - Conversational inputs routed to gestures
1936
+ - Invalid commands handled gracefully
1937
+
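+ A minimal routing sketch; the JSON keys (`command`, `message`) are assumptions about the response format defined in the Gemini prompt:
+
+ ```python
+ from dataclasses import dataclass
+
+ VALID_TASKS = {
+     "Pick up the skull and place it in the green cup",
+     # ...the remaining five trained Task Strings
+ }
+
+ @dataclass
+ class Intent:
+     is_manipulation: bool
+     command: str | None
+     reply: str
+
+ def route(gemini_json: dict) -> Intent:
+     """Route a parsed Gemini response to manipulation or gesture handling."""
+     command = gemini_json.get("command")
+     if command in VALID_TASKS:
+         return Intent(True, command, gemini_json.get("message", ""))
+     # Unknown or missing command: fall back to the conversational gesture flow
+     return Intent(False, None, gemini_json.get("message", ""))
+ ```
+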
1938
+ ### Phase 6: Asynchronous Execution (Week 6)
1939
+
1940
+ **Goals:**
1941
+ - Implement async task execution
1942
+ - Add status tracking and UI updates
1943
+ - Test UI responsiveness
1944
+
1945
+ **Tasks:**
1946
+ 1. Create `AsyncExecutor` class
1947
+ 2. Implement task queue
1948
+ 3. Add status display to UI
1949
+ 4. Test with long-running tasks
1950
+ 5. Handle concurrent requests
1951
+
1952
+ **Validation:**
1953
+ - UI remains responsive during SmolVLA execution
1954
+ - Status updates appear correctly
1955
+ - Multiple tasks can be queued
1956
+ - Errors don't crash the system
1957
+
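+ A reduced threading sketch of the executor described above (the full class also defines `Task` and `StatusUpdate` dataclasses):
+
+ ```python
+ import queue
+ import threading
+
+ class AsyncExecutor:
+     """Run gesture tasks on a worker thread so the UI never blocks."""
+
+     def __init__(self) -> None:
+         self.tasks: queue.Queue = queue.Queue()
+         self.status: queue.Queue = queue.Queue()
+         self._worker = threading.Thread(target=self._run, daemon=True)
+         self._worker.start()
+
+     def submit(self, fn, *args) -> None:
+         self.tasks.put((fn, args))
+         self.status.put("queued")
+
+     def _run(self) -> None:
+         while True:
+             fn, args = self.tasks.get()
+             self.status.put("running")
+             try:
+                 fn(*args)
+                 self.status.put("complete")
+             except Exception as exc:  # report failures; keep the worker alive
+                 self.status.put(f"failed: {exc}")
+ ```
+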
1958
+ ### Phase 7: Integration and Testing (Week 7)
1959
+
1960
+ **Goals:**
1961
+ - Integrate all components
1962
+ - Comprehensive testing
1963
+ - Bug fixes and optimization
1964
+
1965
+ **Tasks:**
1966
+ 1. Integration testing
1967
+ 2. Performance optimization
1968
+ 3. Error handling improvements
1969
+ 4. Documentation updates
1970
+ 5. User acceptance testing
1971
+
1972
+ **Validation:**
1973
+ - All features work together
1974
+ - Performance meets requirements
1975
+ - Error handling robust
1976
+ - Documentation complete
1977
+
1978
+ ### Phase 8: Deployment and Monitoring (Week 8)
1979
+
1980
+ **Goals:**
1981
+ - Deploy to production environment
1982
+ - Set up monitoring
1983
+ - Create user documentation
1984
+
1985
+ **Tasks:**
1986
+ 1. Prepare deployment environment
1987
+ 2. Configure monitoring and logging
1988
+ 3. Create user guide
1989
+ 4. Deploy application
1990
+ 5. Monitor initial usage
1991
+
1992
+ **Validation:**
1993
+ - Application runs stably
1994
+ - Monitoring captures key metrics
1995
+ - Users can operate system successfully
1996
+
1997
+ ### Rollback Plan
1998
+
1999
+ If critical issues arise during migration:
2000
+
2001
+ 1. **Immediate Rollback:**
2002
+ - Revert to previous LLM API
2003
+ - Disable voice features
2004
+ - Use gesture-only mode
2005
+
2006
+ 2. **Partial Rollback:**
2007
+ - Keep Gemini API
2008
+ - Disable SmolVLA (gestures only)
2009
+ - Disable voice features
2010
+
2011
+ 3. **Data Preservation:**
2012
+ - All datasets backed up to Hugging Face
2013
+ - Model checkpoints saved to cloud storage
2014
+ - Configuration files version controlled
2015
+
2016
+ ### Risk Mitigation
2017
+
2018
+ **Risk: Gemini API costs exceed budget**
2019
+ - Mitigation: Set API usage limits, implement caching, use smaller models
2020
+
2021
+ **Risk: SmolVLA training fails to converge**
2022
+ - Mitigation: Collect more data, adjust hyperparameters, use pre-trained weights
2023
+
2024
+ **Risk: Voice recognition accuracy too low**
2025
+ - Mitigation: Use better STT service, add noise filtering, provide text fallback
2026
+
2027
+ **Risk: GPU memory insufficient for SmolVLA**
2028
+ - Mitigation: Reduce batch size, use model quantization, upgrade hardware
2029
+
2030
+ **Risk: Robot safety issues during autonomous execution**
2031
+ - Mitigation: Implement workspace monitoring, add emergency stop, limit motion range
2032
+
2033
+
2034
+ ## Design Decisions and Rationale
2035
+
2036
+ ### 1. Why Gemini API over Other LLMs?
2037
+
2038
+ **Decision:** Use Google Gemini API as the primary LLM.
2039
+
2040
+ **Rationale:**
2041
+ - Native multi-modal support (audio, images, text)
2042
+ - Structured output via JSON mode
2043
+ - Strong intent detection capabilities
2044
+ - Integrated with Google Cloud ecosystem (STT/TTS)
2045
+ - Competitive pricing and performance
2046
+ - Good documentation and Python SDK
2047
+
2048
+ **Alternatives Considered:**
2049
+ - OpenAI GPT-4: More expensive, separate APIs for audio
2050
+ - Anthropic Claude: No native audio support
2051
+ - Local LLMs: Insufficient quality for intent detection
2052
+
2053
+ ### 2. Why asyncio.Queue over Redis?
2054
+
2055
+ **Decision:** Use Python's asyncio.Queue for task management.
2056
+
2057
+ **Rationale:**
2058
+ - Single-machine deployment (no distributed workers needed)
2059
+ - No external dependencies
2060
+ - Simpler implementation and debugging
2061
+ - Sufficient for expected load (single user at a time)
2062
+ - Lower latency than network-based queue
2063
+
2064
+ **When to Reconsider:**
2065
+ - Multiple robot arms
2066
+ - Distributed deployment
2067
+ - High concurrent user load
2068
+ - Need for task persistence across restarts
2069
+
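+ For illustration, the single-worker pattern this decision implies; `execute_on_robot` is a hypothetical stand-in for the blocking robot call:
+
+ ```python
+ import asyncio
+ import time
+
+ def execute_on_robot(command: str) -> None:
+     """Hypothetical blocking robot call (stub for illustration)."""
+     time.sleep(2)
+
+ task_queue: asyncio.Queue = asyncio.Queue()
+
+ async def worker() -> None:
+     while True:
+         command = await task_queue.get()
+         # Run blocking robot I/O in a thread so the event loop stays responsive
+         await asyncio.to_thread(execute_on_robot, command)
+         task_queue.task_done()
+ ```
+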
2070
+ ### 3. Why SmolVLA over Other Robot Learning Approaches?
2071
+
2072
+ **Decision:** Use SmolVLA for manipulation tasks.
2073
+
2074
+ **Rationale:**
2075
+ - Vision-language-action model (understands natural language)
2076
+ - Integrated with LeRobot framework
2077
+ - End-to-end learning (no manual feature engineering)
2078
+ - Proven performance on manipulation tasks
2079
+ - Active development and community support
2080
+
2081
+ **Alternatives Considered:**
2082
+ - Reinforcement Learning: Requires extensive training, safety concerns
2083
+ - Classical Motion Planning: Requires manual programming, less flexible
2084
+ - Behavior Cloning (non-VLA): No language understanding
2085
+
2086
+ ### 4. Why Hybrid Gesture + SmolVLA Approach?
2087
+
2088
+ **Decision:** Keep predefined gestures for conversational responses, add SmolVLA for manipulation.
2089
+
2090
+ **Rationale:**
2091
+ - Gestures are fast and reliable (no inference needed)
2092
+ - SmolVLA reserved for complex manipulation tasks
2093
+ - Reduces GPU usage for simple interactions
2094
+ - Maintains backward compatibility
2095
+ - Clear separation of concerns
2096
+
2097
+ **Benefits:**
2098
+ - Lower latency for conversational interactions
2099
+ - More robust (gestures involve no model inference that could fail)
2100
+ - Better resource utilization
2101
+
2102
+ ### 5. Why Gradio over Custom Web Framework?
2103
+
2104
+ **Decision:** Continue using Gradio for the web interface.
2105
+
2106
+ **Rationale:**
2107
+ - Already integrated in existing system
2108
+ - Excellent support for audio/video components
2109
+ - Built-in WebSocket handling for real-time updates
2110
+ - Rapid prototyping and iteration
2111
+ - Good documentation and examples
2112
+
2113
+ **Limitations Acknowledged:**
2114
+ - Less customization than React/Vue
2115
+ - Limited styling options
2116
+ - Not ideal for production-scale applications
2117
+
2118
+ **When to Reconsider:**
2119
+ - Need for complex custom UI
2120
+ - Mobile app requirements
2121
+ - High-scale deployment (>100 concurrent users)
2122
+
2123
+ ### 6. Why Google TTS over Local Alternatives?
2124
+
2125
+ **Decision:** Use Google Cloud Text-to-Speech for voice output.
2126
+
2127
+ **Rationale:**
2128
+ - High-quality neural voices
2129
+ - Consistent with Gemini ecosystem
2130
+ - Low latency
2131
+ - Voice customization options (pitch, speed)
2132
+ - Reliable service
2133
+
2134
+ **Alternatives Considered:**
2135
+ - pyttsx3: Lower quality, robotic voice
2136
+ - gTTS: Limited voice options, requires internet anyway
2137
+ - Local neural TTS: High GPU usage, slower
2138
+
2139
+ ### 7. Why Separate Training and Inference Scripts?
2140
+
2141
+ **Decision:** Keep training infrastructure separate from runtime application.
2142
+
2143
+ **Rationale:**
2144
+ - Training is offline, one-time process
2145
+ - Different hardware requirements (training needs more VRAM)
2146
+ - Cleaner code organization
2147
+ - Easier to update training without affecting production
2148
+ - Can train on different machine than deployment
2149
+
2150
+ **Implementation:**
2151
+ - Training scripts in `mortis/train.py`
2152
+ - Inference in `mortis/smolvla_executor.py`
2153
+ - Shared model configuration
2154
+
2155
+ ### 8. Why Not Use Gemini for Robot Control Directly?
2156
+
2157
+ **Decision:** Use Gemini for intent detection, SmolVLA for action generation.
2158
+
2159
+ **Rationale:**
2160
+ - LLMs are not designed for precise motor control
2161
+ - SmolVLA trained specifically on robot demonstrations
2162
+ - Gemini would require extensive prompting for each action
2163
+ - SmolVLA provides closed-loop visual feedback
2164
+ - Separation of concerns (language understanding vs. motor control)
2165
+
2166
+ **Gemini's Role:**
2167
+ - Understand user intent
2168
+ - Detect manipulation commands
2169
+ - Generate conversational responses
2170
+ - Maintain character personality
2171
+
2172
+ **SmolVLA's Role:**
2173
+ - Generate precise robot actions
2174
+ - Process visual observations
2175
+ - Execute manipulation tasks
2176
+ - Handle low-level control
2177
+
2178
+ ### 9. Why Store Checkpoints Locally vs. Cloud?
2179
+
2180
+ **Decision:** Store model checkpoints locally with optional cloud backup.
2181
+
2182
+ **Rationale:**
2183
+ - Faster loading (no network latency)
2184
+ - No cloud storage costs during development
2185
+ - Privacy (model stays on local machine)
2186
+ - Simpler deployment
2187
+
2188
+ **Cloud Backup Strategy:**
2189
+ - Push final models to Hugging Face Hub
2190
+ - Version control with git-lfs
2191
+ - Disaster recovery
2192
+
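+ A sketch of the backup step using `huggingface_hub` (the repo id is illustrative):
+
+ ```python
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ api.upload_folder(
+     folder_path="checkpoints/smolvla_best",     # local checkpoint directory
+     repo_id="your-username/smolvla_mortis_v1",  # illustrative repo id
+     repo_type="model",
+ )
+ ```
+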
2193
+ ### 10. Why 6 Specific Manipulation Tasks?
2194
+
2195
+ **Decision:** Start with 6 predefined manipulation tasks (skull/eyeball × 3 cups).
2196
+
2197
+ **Rationale:**
2198
+ - Manageable scope for initial implementation
2199
+ - Sufficient variety to demonstrate capability
2200
+ - Fits Halloween theme
2201
+ - Realistic data collection effort (30-60 demonstrations)
2202
+ - Can expand later with more tasks
2203
+
2204
+ **Expansion Path:**
2205
+ - Add more objects (pumpkin, spider, etc.)
2206
+ - Add more target locations
2207
+ - Add multi-step tasks
2208
+ - Add task composition
2209
+
2210
+
2211
+ ## Future Enhancements
2212
+
2213
+ ### Short-term (3-6 months)
2214
+
2215
+ 1. **Expanded Task Set**
2216
+ - Add 10-20 more manipulation tasks
2217
+ - Support task composition ("pick up skull, then eyeball")
2218
+ - Add multi-object interactions
2219
+
2220
+ 2. **Improved Voice Interaction**
2221
+ - Wake word detection ("Hey Mortis")
2222
+ - Continuous conversation mode
2223
+ - Voice activity detection
2224
+ - Speaker identification
2225
+
2226
+ 3. **Enhanced Safety**
2227
+ - Computer vision-based collision detection
2228
+ - Force/torque sensing
2229
+ - Workspace boundary enforcement
2230
+ - Automatic emergency stop
2231
+
2232
+ 4. **Performance Optimization**
2233
+ - Model quantization for faster inference
2234
+ - Action caching for repeated tasks
2235
+ - Parallel processing for multiple requests
2236
+ - GPU memory optimization
2237
+
2238
+ ### Medium-term (6-12 months)
2239
+
2240
+ 1. **Advanced Learning**
2241
+ - Online learning from corrections
2242
+ - Few-shot task learning
2243
+ - Transfer learning to new objects
2244
+ - Self-supervised improvement
2245
+
2246
+ 2. **Multi-Robot Support**
2247
+ - Control multiple SO101 arms
2248
+ - Coordinated multi-arm tasks
2249
+ - Load balancing across robots
2250
+ - Distributed task execution
2251
+
2252
+ 3. **Enhanced Perception**
2253
+ - 3D object detection
2254
+ - Depth estimation
2255
+ - Object tracking
2256
+ - Scene understanding
2257
+
2258
+ 4. **User Personalization**
2259
+ - User profiles and preferences
2260
+ - Adaptive difficulty
2261
+ - Custom task definitions
2262
+ - Voice profile learning
2263
+
2264
+ ### Long-term (12+ months)
2265
+
2266
+ 1. **Autonomous Task Planning**
2267
+ - High-level goal specification
2268
+ - Automatic task decomposition
2269
+ - Multi-step planning
2270
+ - Failure recovery strategies
2271
+
2272
+ 2. **Natural Language Programming**
2273
+ - Teach new tasks through conversation
2274
+ - Automatic demonstration collection
2275
+ - Interactive refinement
2276
+ - Task library management
2277
+
2278
+ 3. **Advanced Interaction**
2279
+ - Gesture recognition (human gestures)
2280
+ - Facial expression detection
2281
+ - Emotion-aware responses
2282
+ - Proactive assistance
2283
+
2284
+ 4. **Production Deployment**
2285
+ - Multi-user support
2286
+ - Cloud-based inference
2287
+ - Mobile app interface
2288
+ - API for third-party integration
2289
+
2290
+ ## Conclusion
2291
+
2292
+ This design provides a comprehensive architecture for refactoring Mortis into a multi-modal, SmolVLA-powered robotic system. The design emphasizes:
2293
+
2294
+ - **Modularity:** Clear separation between components (Gemini, STT/TTS, SmolVLA, robot control)
2295
+ - **Scalability:** Asynchronous execution and queue-based architecture
2296
+ - **Reliability:** Comprehensive error handling and recovery strategies
2297
+ - **Maintainability:** Well-defined interfaces and data models
2298
+ - **Extensibility:** Clear paths for future enhancements
2299
+
2300
+ The phased migration strategy allows for incremental development and validation, reducing risk and enabling early feedback. The hybrid approach of combining predefined gestures with learned manipulation behaviors provides both reliability and flexibility.
2301
+
2302
+ Key technical decisions prioritize:
2303
+ - Google ecosystem integration (Gemini, Cloud STT/TTS)
2304
+ - Local deployment with GPU support
2305
+ - LeRobot framework for robotics
2306
+ - Gradio for rapid UI development
2307
+ - Python-native solutions (asyncio, threading)
2308
+
2309
+ The design is ready for implementation following the task list in the next phase of the spec workflow.
.kiro/specs/gemini-multimodal-refactor/requirements.md ADDED
@@ -0,0 +1,153 @@
1
+ # Requirements Document
2
+
3
+ ## Introduction
4
+
5
+ This document specifies the requirements for refactoring the Mortis interactive AI Halloween experience to use Google Gemini API with multi-modal (voice and text) interaction capabilities. The refactor replaces the existing LLM API integration and adds SmolVLA-based robotic control for specific manipulation tasks. The system must maintain the character-driven conversational experience while enabling precise robotic manipulation through voice or text commands.
6
+
7
+ ## Glossary
8
+
9
+ - **Mortis System**: The complete interactive AI Halloween experience including web UI, conversational AI, and robotic arm control
10
+ - **Gemini API**: Google's large language model API service used for conversational AI and intent detection
11
+ - **SmolVLA Model**: A vision-language-action model trained using LeRobot for specific robotic manipulation tasks
12
+ - **Gradio Interface**: The web-based user interface framework for the Mortis System
13
+ - **SO101 Arm**: The SeeedStudio SO101 robotic arm hardware controlled by the Mortis System
14
+ - **STT Service**: Speech-to-Text service that converts audio input to text
15
+ - **TTS Service**: Text-to-Speech service that converts text responses to audio output
16
+ - **Task String**: A specific command format recognized by the SmolVLA Model (e.g., "Pick up the skull and place it in the green cup")
17
+ - **LeRobot Framework**: The robotics framework used for dataset management, model training, and inference
18
+ - **Message Queue**: An asynchronous communication mechanism for decoupling robotic execution from the web interface
19
+ - **Cloud-Agnostic Architecture**: A system design that does not depend on vendor-specific cloud platform services (like AWS Lambda, Azure Functions, or GCP Cloud Run), allowing deployment on any infrastructure including local hardware
20
+
21
+ ## Requirements
22
+
23
+ ### Requirement 1: Gemini API Integration
24
+
25
+ **User Story:** As a developer, I want to replace the existing LLM API with Google Gemini API, so that the system uses Google's language model for all conversational interactions.
26
+
27
+ #### Acceptance Criteria
28
+
29
+ 1. THE Mortis System SHALL use the Google Gemini API for all language model interactions
30
+ 2. THE Mortis System SHALL support multiple Gemini model variants through configuration
31
+ 3. THE Mortis System SHALL authenticate with the Gemini API using API keys stored in environment variables
32
+ 4. THE Mortis System SHALL handle Gemini API errors gracefully and provide user feedback when API calls fail
33
+ 5. THE Mortis System SHALL maintain response times under 5 seconds for typical conversational interactions
34
+
35
+ ### Requirement 2: Multi-Modal Voice Input
36
+
37
+ **User Story:** As a user, I want to speak to Mortis through my microphone, so that I can interact naturally without typing.
38
+
39
+ #### Acceptance Criteria
40
+
41
+ 1. THE Gradio Interface SHALL provide an audio input component for capturing user voice
42
+ 2. WHEN a user provides voice input, THE Mortis System SHALL convert the audio to text using a Speech-to-Text service
43
+ 3. THE Mortis System SHALL support both cloud-based STT services and local STT models as configurable options
44
+ 4. THE Mortis System SHALL process voice input with latency under 3 seconds for utterances under 10 seconds
45
+ 5. THE Mortis System SHALL display the transcribed text to the user for confirmation
46
+
47
+ ### Requirement 3: Intent Detection and Command Routing
48
+
49
+ **User Story:** As a system, I want to detect when user input matches a specific robotic task command, so that I can route the request to the appropriate control mechanism.
50
+
51
+ #### Acceptance Criteria
52
+
53
+ 1. THE Gemini API SHALL receive a system prompt that defines all valid SmolVLA Task Strings
54
+ 2. WHEN the Gemini API processes user input, THE Mortis System SHALL determine if the input matches a valid Task String
55
+ 3. IF the user input matches a valid Task String, THEN THE Mortis System SHALL extract the exact command string for robotic execution
56
+ 4. IF the user input does not match a valid Task String, THEN THE Mortis System SHALL generate a standard conversational response with gesture control
57
+ 5. THE Mortis System SHALL return both a conversational response and a command indicator in a structured format
58
+
59
+ ### Requirement 4: Dataset Creation and Collection
60
+
61
+ **User Story:** As a developer, I want to create and collect demonstration data for robotic manipulation tasks, so that I have training data for the SmolVLA model.
62
+
63
+ #### Acceptance Criteria
64
+
65
+ 1. THE Mortis System SHALL provide a data collection script for recording SO101 Arm demonstrations
66
+ 2. THE Mortis System SHALL capture synchronized camera observations and robot actions during demonstrations
67
+ 3. THE Mortis System SHALL save collected demonstrations in LeRobot-compatible format
68
+ 4. THE Mortis System SHALL support labeling demonstrations with corresponding Task String commands
69
+ 5. THE Mortis System SHALL validate collected data for completeness before adding to the training dataset
70
+
71
+ ### Requirement 5: SmolVLA Model Training Infrastructure
72
+
73
+ **User Story:** As a developer, I want to train a SmolVLA model using LeRobot with collected demonstration data, so that the robot can perform precise manipulation tasks.
74
+
75
+ #### Acceptance Criteria
76
+
77
+ 1. THE Mortis System SHALL provide a training script that loads datasets from local LeRobot databases or Hugging Face
78
+ 2. THE Mortis System SHALL create and manage LeRobot dataset databases for training data
79
+ 3. THE Mortis System SHALL configure SmolVLA training using lerobot-train with appropriate hyperparameters
80
+ 4. THE Mortis System SHALL save trained model checkpoints to a configurable directory
81
+ 5. THE Mortis System SHALL log training metrics including loss, accuracy, and validation performance
82
+
83
+ ### Requirement 6: SmolVLA Inference Execution
84
+
85
+ **User Story:** As a system, I want to execute SmolVLA model inference when a valid task command is detected, so that the robot performs the requested manipulation.
86
+
87
+ #### Acceptance Criteria
88
+
89
+ 1. THE Mortis System SHALL load the trained SmolVLA Model from saved checkpoints
90
+ 2. WHEN a valid Task String is received, THE Mortis System SHALL execute SmolVLA inference with the command as input
91
+ 3. THE Mortis System SHALL control the SO101 Arm through the SmolVLA Model output actions
92
+ 4. THE Mortis System SHALL provide visual feedback during robotic execution through the webcam view
93
+ 5. THE Mortis System SHALL handle inference errors and return the robot to a safe idle state
94
+
95
+ ### Requirement 7: Asynchronous Robotic Execution
96
+
97
+ **User Story:** As a user, I want the web interface to remain responsive while the robot executes tasks, so that I can monitor progress without the UI freezing.
98
+
99
+ #### Acceptance Criteria
100
+
101
+ 1. THE Mortis System SHALL execute SmolVLA inference asynchronously without blocking the Gradio Interface
102
+ 2. THE Mortis System SHALL use a message queue or background processing mechanism to decouple inference from the web interface
103
+ 3. WHILE SmolVLA inference is executing, THE Gradio Interface SHALL display a status indicator showing task progress
104
+ 4. THE Mortis System SHALL allow users to view the robot's actions through the webcam during execution
105
+ 5. WHEN robotic execution completes, THE Mortis System SHALL update the interface with completion status
106
+
107
+ ### Requirement 8: Voice Output Integration
108
+
109
+ **User Story:** As a user, I want to hear Mortis speak responses aloud, so that I can experience a fully voice-based interaction.
110
+
111
+ #### Acceptance Criteria
112
+
113
+ 1. THE Mortis System SHALL convert Gemini API text responses to audio using a Text-to-Speech service
114
+ 2. THE Mortis System SHALL support Google TTS or equivalent widely-available TTS services
115
+ 3. THE Gradio Interface SHALL play generated audio responses automatically after receiving them
116
+ 4. THE Mortis System SHALL generate audio in a format compatible with web browsers (MP3 or WAV)
117
+ 5. THE Mortis System SHALL maintain character voice consistency across all audio responses
118
+
119
+ ### Requirement 9: Architecture and Deployment
120
+
121
+ **User Story:** As a developer, I want a system that can run on local hardware without vendor-specific cloud dependencies, so that I can deploy it flexibly while using Google APIs for LLM services.
122
+
123
+ #### Acceptance Criteria
124
+
125
+ 1. THE Mortis System SHALL NOT depend on vendor-specific cloud platform services such as AWS Lambda, Azure Functions, or GCP Cloud Run
126
+ 2. THE Mortis System SHALL support deployment on local hardware with GPU access for SmolVLA inference
127
+ 3. THE Mortis System SHALL use standard Python libraries and open-source frameworks for all non-Google API components
128
+ 4. THE Mortis System SHALL document all external service dependencies in the environment configuration
129
+ 5. THE Mortis System SHALL provide configuration options for switching between cloud-based and local STT and TTS processing
130
+
131
+ ### Requirement 10: Backward Compatibility and Migration
132
+
133
+ **User Story:** As a developer, I want to migrate from the existing LLM API to Gemini without losing existing functionality, so that users experience a seamless transition.
134
+
135
+ #### Acceptance Criteria
136
+
137
+ 1. THE Mortis System SHALL maintain all existing gesture capabilities during the refactor
138
+ 2. THE Mortis System SHALL preserve the Halloween character theme and response style
139
+ 3. THE Mortis System SHALL continue to support text-only interaction for users without microphones
140
+ 4. THE Mortis System SHALL maintain the existing Gradio Interface layout and visual design
141
+ 5. THE Mortis System SHALL provide a migration guide documenting configuration changes
142
+
143
+ ### Requirement 11: Error Handling and Robustness
144
+
145
+ **User Story:** As a user, I want the system to handle errors gracefully, so that temporary failures don't break my interaction experience.
146
+
147
+ #### Acceptance Criteria
148
+
149
+ 1. IF the Gemini API is unavailable, THEN THE Mortis System SHALL display an error message and allow retry
150
+ 2. IF STT conversion fails, THEN THE Mortis System SHALL prompt the user to try again or use text input
151
+ 3. IF SmolVLA inference fails, THEN THE Mortis System SHALL return the SO101 Arm to idle position safely
152
+ 4. IF TTS generation fails, THEN THE Mortis System SHALL display the text response without audio
153
+ 5. THE Mortis System SHALL log all errors with sufficient detail for debugging
.kiro/specs/gemini-multimodal-refactor/tasks.md ADDED
@@ -0,0 +1,403 @@
1
+ # Implementation Plan
2
+
3
+ This implementation plan breaks down the Gemini multi-modal refactor into discrete, actionable coding tasks. Each task builds incrementally on previous work, following the 8-phase migration strategy outlined in the design document.
4
+
5
+ ## Important Note: Hybrid Async Execution System
6
+
7
+ **Phase 7** uses a **hybrid approach** for asynchronous execution:
8
+
9
+ 1. **AsyncExecutor** (`src/mortis/async_executor.py`): Simple Python threading system for quick gesture tasks
10
+ - Use for: wave, point, idle, grab, drop gestures
11
+ - Advantages: Simple, fast (1-2s), low overhead
12
+ - Implementation: Task queue + worker thread + status queue
13
+
14
+ 2. **LeRobotAsyncClient** (`src/mortis/lerobot_async_client.py`): Wrapper over LeRobot's async inference system
15
+ - Use for: Complex manipulation tasks with SmolVLA
16
+ - Advantages: Optimized for continuous inference, handles action chunks, real-time control
17
+ - Implementation: PolicyServer + RobotClient + gRPC communication
18
+
19
+ This hybrid approach provides the best of both worlds: simplicity for gestures and power for manipulation.
20
+
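+ As a reduced illustration of the routing this implies (method and attribute names besides `ManipulationTask` are hypothetical glue code):
+
+ ```python
+ from mortis.lerobot_async_client import ManipulationTask  # defined in task 28
+
+ def dispatch(intent, async_executor, lerobot_client, arm) -> None:
+     """Hypothetical glue: send each intent to the matching executor."""
+     if intent.is_manipulation:
+         # Heavy SmolVLA work runs through LeRobot's async inference stack
+         lerobot_client.execute(ManipulationTask(command=intent.command))
+     else:
+         # Quick predefined gestures use the lightweight worker thread
+         async_executor.submit(arm.play_gesture, intent.gesture)
+ ```
+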
21
+ ## Phase 1: Gemini API Integration
22
+
23
+ - [x] 1. Set up Gemini API client infrastructure
24
+ - Create `src/mortis/gemini_client.py` module
25
+ - Implement `GeminiClient` class with configuration management
26
+ - Add environment variable handling for `GEMINI_API_KEY`, `GEMINI_MODEL`, `GEMINI_TEMPERATURE`
27
+ - Implement basic `send_message()` method using `google.generativeai` SDK
28
+ - _Requirements: 1.1, 1.2, 1.3_
29
+
30
+ - [x] 2. Implement structured response parsing
31
+ - Create `src/mortis/models.py` for data models
32
+ - Implement `GeminiResponse`, `ResponseType`, `Mood`, `Gesture` enums and dataclasses
33
+ - Add `from_json()` method for parsing Gemini JSON responses
34
+ - Implement response validation logic
35
+ - _Requirements: 1.1, 3.5_
36
+
37
+ - [x] 3. Design and implement Gemini system prompt
38
+ - Create system prompt with Mortis character definition
39
+ - Add manipulation task definitions (6 tasks) to prompt
40
+ - Implement JSON response format specification in prompt
41
+ - Configure Gemini to use JSON mode (`response_mime_type: application/json`)
42
+ - _Requirements: 3.1, 3.2, 9.2_
43
+
44
+ - [x] 4. Implement error handling and retry logic
45
+ - Add exponential backoff retry for rate limiting
46
+ - Handle `BlockedPromptException` with fallback responses
47
+ - Implement timeout handling for API calls
48
+ - Add error logging and user-friendly error messages
49
+ - _Requirements: 1.4, 11.1_
50
+
51
+ - [x] 5. Replace existing LLM API in tools.py
52
+ - Refactor `ask_mortis()` function to use `GeminiClient`
53
+ - Update response parsing to use new data models
54
+ - Maintain backward compatibility with gesture execution
55
+ - Update environment configuration documentation
56
+ - _Requirements: 1.1, 9.1, 9.4_
57
+
58
+ - [ ]* 5.1 Write integration tests for Gemini client
59
+ - Test successful API calls and response parsing
60
+ - Test retry logic with mocked rate limit errors
61
+ - Test fallback responses on API failures
62
+ - Verify character personality maintained in responses
63
+ - _Requirements: 1.1, 1.4_
64
+
65
+
66
+ ## Phase 2: Voice Input and Output
67
+
68
+ - [x] 6. Implement Speech-to-Text service
69
+ - Create `src/mortis/stt_service.py` module
70
+ - Implement `STTService` class with Gemini native audio support
71
+ - Add fallback to Google Cloud Speech-to-Text API
72
+ - Implement audio file format validation and conversion
73
+ - Add configuration for STT service selection (Gemini vs Google STT)
74
+ - _Requirements: 2.1, 2.2, 2.3_
75
+
76
+ - [x] 7. Implement Text-to-Speech service
77
+ - Create `src/mortis/tts_service.py` module
78
+ - Implement `TTSService` class using Google Cloud TTS
79
+ - Configure voice parameters (pitch, speed) for Mortis character
80
+ - Implement audio file generation (MP3 format)
81
+ - Add local TTS fallback (gTTS) for offline scenarios
82
+ - _Requirements: 8.1, 8.2, 8.4, 8.5_
83
+
84
+ - [x] 8. Update Gradio UI for audio input
85
+ - Add `gr.Audio` component for microphone input to `app.py`
86
+ - Implement audio input handler function
87
+ - Connect audio input to STT service
88
+ - Display transcribed text to user for confirmation
89
+ - Handle audio processing errors gracefully
90
+ - _Requirements: 2.1, 2.5, 11.2_
91
+
92
+ - [x] 9. Update Gradio UI for audio output
93
+ - Add `gr.Audio` component for audio playback
94
+ - Implement audio response generation in `mortis_reply()`
95
+ - Configure autoplay for audio responses
96
+ - Create `outputs/` directory for temporary audio files
97
+ - Implement audio file cleanup mechanism
98
+ - _Requirements: 8.3, 8.4_
99
+
100
+ - [x] 10. Integrate voice flow with Gemini
101
+ - Update `ask_mortis()` to accept audio input
102
+ - Implement voice-to-text-to-Gemini-to-TTS pipeline
103
+ - Maintain text input compatibility
104
+ - Add latency monitoring for voice processing
105
+ - _Requirements: 2.4, 9.3_
106
+
107
+ - [ ]* 10.1 Write tests for audio processing
108
+ - Test STT with sample audio files
109
+ - Test TTS output quality and format
110
+ - Test audio input/output in Gradio UI
111
+ - Verify fallback mechanisms work correctly
112
+ - _Requirements: 2.2, 8.2, 11.2, 11.4_
113
+
114
+
115
+ ## Phase 3: Dataset Collection Infrastructure
116
+
117
+ - [x] 11. Set up LeRobot dataset infrastructure
118
+ - Create `src/mortis/data_collector.py` module
119
+ - Implement `DataCollector` class with LeRobot dataset integration
120
+ - Configure dataset directory structure (`data/mortis_manipulation/`)
121
+ - Implement dataset metadata management (task descriptions, episode counts)
122
+ - _Requirements: 4.3, 5.2_
123
+
124
+ - [ ]* 12. Implement camera integration for data collection
125
+ - Add camera initialization in `DataCollector`
126
+ - Implement synchronized image capture with robot state
127
+ - Configure camera parameters (resolution, FPS)
128
+ - Add camera calibration utilities
129
+ - _Requirements: 4.2_
130
+
131
+ - [ ]* 13. Implement episode recording functionality
132
+ - Create `record_episode()` method for capturing demonstrations
133
+ - Implement real-time data capture loop (30 FPS)
134
+ - Add keyboard controls for start/stop recording
135
+ - Implement episode data validation
136
+ - Save episodes in LeRobot-compatible format
137
+ - _Requirements: 4.1, 4.2, 4.5_
138
+
139
+ - [ ]* 14. Implement task labeling system
140
+ - Add task description input for each episode
141
+ - Create task label validation against predefined task set
142
+ - Implement episode metadata storage
143
+ - Add episode review and re-recording capability
144
+ - _Requirements: 4.4_
145
+
146
+ - [ ]* 15. Create data collection CLI script
147
+ - Create `src/mortis/collect_data.py` entry point
148
+ - Implement interactive data collection workflow
149
+ - Add progress tracking (episodes per task)
150
+ - Implement dataset statistics display
151
+ - Add Hugging Face Hub upload functionality
152
+ - _Requirements: 4.1, 4.3, 5.1_
153
+
154
+ - [ ]* 15.1 Write data validation tests
155
+ - Test episode data format compliance
156
+ - Verify synchronized timestamps
157
+ - Check image quality and dimensions
158
+ - Validate action sequences
159
+ - _Requirements: 4.5_
160
+
161
+
162
+ ## Phase 4: SmolVLA Training Pipeline
163
+
164
+ - [ ]* 16. Create training configuration
165
+ - Create `config/train_smolvla.yaml` with Hydra configuration
166
+ - Configure SmolVLA policy parameters (vision backbone, chunk size)
167
+ - Set training hyperparameters (batch size, learning rate, steps)
168
+ - Configure evaluation settings
169
+ - Add Weights & Biases integration configuration
170
+ - _Requirements: 5.3, 5.5_
171
+
172
+ - [x] 17. Implement training script
173
+ - Create `src/mortis/train.py` module
174
+ - Implement dataset loading from local or Hugging Face
175
+ - Configure LeRobot training pipeline
176
+ - Add checkpoint saving logic
177
+ - Implement training progress logging
178
+ - _Requirements: 5.1, 5.2, 5.4, 5.5_
179
+
180
+ - [ ]* 18. Set up training monitoring
181
+ - Integrate Weights & Biases for metric tracking
182
+ - Log training loss, validation loss, learning rate
183
+ - Add sample prediction visualization
184
+ - Implement early stopping based on validation performance
185
+ - _Requirements: 5.5_
186
+
187
+ - [ ]* 19. Create training execution commands
188
+ - Add `train-smolvla` target to Makefile
189
+ - Document training command with all parameters
190
+ - Add GPU memory optimization flags
191
+ - Create training resume functionality for interrupted runs
192
+ - _Requirements: 5.3, 5.4_
193
+
194
+ - [ ]* 19.1 Write training validation tests
195
+ - Test dataset loading and batching
196
+ - Verify model architecture initialization
197
+ - Test checkpoint saving and loading
198
+ - Validate training loop executes without errors
199
+ - _Requirements: 5.2, 5.4_
200
+
201
+
202
+ ## Phase 5: SmolVLA Inference Integration
203
+
204
+ - [x] 20. Implement SmolVLA executor
205
+ - Create `src/mortis/smolvla_executor.py` module
206
+ - Implement `SmolVLAExecutor` class with model loading
207
+ - Add checkpoint loading from configurable path
208
+ - Implement GPU device management
209
+ - Add model initialization and warmup
210
+ - _Requirements: 6.1, 8.2_
211
+
212
+ - [x] 21. Implement observation capture
213
+ - Add camera integration for visual observations
214
+ - Implement robot state capture from SO101
215
+ - Create observation dictionary formatting for SmolVLA
216
+ - Add tensor conversion and device placement
217
+ - _Requirements: 6.2, 6.4_
218
+
219
+ - [x] 22. Implement action execution loop
220
+ - Create `execute()` method for task execution
221
+ - Implement inference loop with visual feedback
222
+ - Add action tensor to SO101 command conversion
223
+ - Implement step-by-step action execution
224
+ - Add task completion detection logic
225
+ - _Requirements: 6.2, 6.3_
226
+
227
+ - [x] 23. Implement safety and error handling
228
+ - Add command validation against trained task set
229
+ - Implement workspace safety checks
230
+ - Add emergency stop functionality
231
+ - Implement timeout handling for long-running tasks
232
+ - Add GPU out-of-memory recovery
233
+ - _Requirements: 6.5, 11.3_
234
+
235
+ - [ ]* 23.1 Write SmolVLA inference tests
236
+ - Test model loading from checkpoint
237
+ - Test observation capture and formatting
238
+ - Test action prediction and execution
239
+ - Verify emergency stop functionality
240
+ - _Requirements: 6.1, 6.3, 6.5_
241
+
242
+
243
+ ## Phase 6: Intent Detection and Routing
244
+
245
+ - [x] 24. Implement intent router
246
+ - Create `src/mortis/intent_router.py` module
247
+ - Implement `IntentRouter` class with task definitions
248
+ - Add `parse_gemini_response()` method for JSON parsing
249
+ - Implement command validation logic
250
+ - Create `Intent` dataclass for structured intent representation
251
+ - _Requirements: 3.2, 3.3, 3.4, 3.5_
252
+
253
+ - [x] 25. Update Gemini prompt for intent detection
254
+ - Enhance system prompt with all 6 manipulation task definitions
255
+ - Add clear response format specification for manipulation vs conversation
256
+ - Implement intent type detection in prompt
257
+ - Add examples of manipulation and conversational inputs
258
+ - _Requirements: 3.1, 3.2_
259
+
260
+ - [x] 26. Integrate intent routing in main flow
261
+ - Update `ask_mortis()` to use `IntentRouter`
262
+ - Implement routing logic for manipulation vs gesture execution
263
+ - Add command validation before SmolVLA execution
264
+ - Implement fallback to gestures for invalid commands
265
+ - _Requirements: 3.3, 3.4, 3.5_
266
+
267
+ - [ ]* 26.1 Write intent detection tests
268
+ - Test parsing of manipulation responses
269
+ - Test parsing of conversational responses
270
+ - Test command validation logic
271
+ - Verify fallback behavior for invalid commands
272
+ - Test edge cases and malformed responses
273
+ - _Requirements: 3.2, 3.3, 3.4_
274
+
275
+
276
+ ## Phase 7: Asynchronous Execution System (Hybrid Approach)
277
+
278
+ **Note**: This phase uses a hybrid execution system:
279
+ - **AsyncExecutor**: Simple threading for quick gestures (wave, point, idle)
280
+ - **LeRobotAsyncClient**: LeRobot async inference (PolicyServer + RobotClient) for complex manipulation tasks with SmolVLA
281
+
282
+ - [x] 27. Implement async executor infrastructure for gestures
283
+ - Create `src/mortis/async_executor.py` module
284
+ - Implement `AsyncExecutor` class with task queue
285
+ - Add background worker thread for task processing
286
+ - Implement status queue for progress updates
287
+ - Add start/stop methods for executor lifecycle
288
+ - Create `Task` and `StatusUpdate` dataclasses
289
+ - Add comprehensive tests (15 tests, all passing)
290
+ - _Requirements: 7.1, 7.2_
291
+
292
+ - [x] 28. Implement LeRobot async client for manipulation
293
+ - Create `src/mortis/lerobot_async_client.py` module
294
+ - Implement `LeRobotAsyncClient` wrapper class
295
+ - Integrate PolicyServer and RobotClient from LeRobot
296
+ - Add `ManipulationTask` and `ManipulationStatus` models
297
+ - Implement lifecycle management (start/stop)
298
+ - Add task execution with status tracking
299
+ - Create demo scripts and documentation
300
+ - _Requirements: 7.1, 7.2, 7.5_
301
+
302
+ - [x] 29. Integrate hybrid execution in main application
303
+ - Initialize both AsyncExecutor and LeRobotAsyncClient in app.py
304
+ - Update `mortis_reply()` to route gestures to AsyncExecutor
305
+ - Update `mortis_reply()` to route manipulation to LeRobotAsyncClient
306
+ - Implement proper lifecycle management (start on app load, stop on unload)
307
+ - Handle errors from both systems
308
+ - _Requirements: 7.1, 7.2, 7.5_
309
+
310
+ - [x] 30. Add hybrid status display to Gradio UI
311
+ - Add status textbox component to UI for robot status
312
+ - Implement `check_status()` function that monitors both systems
313
+ - Check AsyncExecutor for gesture status updates
314
+ - Check LeRobotAsyncClient for manipulation status
315
+ - Configure Gradio to poll status every 500ms
316
+ - Display appropriate icons and messages for each system
317
+ - Add visual indicators for different task states (idle, running, complete, failed)
318
+ - _Requirements: 7.3, 7.4, 7.5_
319
+
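+ A sketch of the 500 ms polling wiring in Gradio; the `check_status` body is schematic:
+
+ ```python
+ import gradio as gr
+
+ def check_status() -> str:
+     """Schematic: merge the latest updates from both executors into one line."""
+     return "idle"  # replace with real AsyncExecutor / LeRobotAsyncClient checks
+
+ with gr.Blocks() as demo:
+     status_box = gr.Textbox(label="Robot status", interactive=False)
+     timer = gr.Timer(0.5)  # fire every 500 ms
+     timer.tick(check_status, outputs=status_box)
+ ```
+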
320
+ - [x] 31. Test and validate hybrid execution system
321
+ - Test gesture execution via AsyncExecutor
322
+ - Test manipulation execution via LeRobotAsyncClient
323
+ - Verify both systems can run concurrently
324
+ - Test status updates from both systems
325
+ - Verify UI remains responsive during long manipulation tasks
326
+ - Test error handling in both systems
327
+ - Validate proper cleanup on app shutdown
328
+ - _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5_
329
+
330
+ - [ ]* 31.1 Write integration tests for hybrid system
331
+ - Test AsyncExecutor with mock gesture executor
332
+ - Test LeRobotAsyncClient with mock PolicyServer/RobotClient
333
+ - Test routing logic (gesture vs manipulation)
334
+ - Test concurrent execution of gestures and manipulation
335
+ - Verify status updates from both systems
336
+ - Test error recovery and fallback behavior
337
+ - _Requirements: 7.1, 7.2, 7.3, 7.5_
338
+
339
+
340
+ ## Phase 8: Integration, Testing, and Deployment
341
+
342
+ - [ ]* 32. Update project dependencies
343
+ - Update `pyproject.toml` with new dependencies (google-generativeai, google-cloud-speech, google-cloud-texttospeech)
344
+ - Add optional dependencies for training (wandb, hydra-core)
345
+ - Update Makefile with new commands (collect-data, train-smolvla)
346
+ - Run `make install` to sync dependencies
347
+ - _Requirements: 8.4, 9.5_
348
+
349
+ - [ ]* 33. Update environment configuration
350
+ - Create `.env.example` with all required variables
351
+ - Document Gemini API key setup
352
+ - Document Google Cloud credentials setup
353
+ - Add SmolVLA checkpoint path configuration
354
+ - Update README with new environment variables
355
+ - _Requirements: 1.3, 8.4_
356
+
357
+ - [ ]* 34. Implement logging and monitoring
358
+ - Add structured logging throughout application
359
+ - Log Gemini API calls and response times
360
+ - Log SmolVLA inference times and success rates
361
+ - Add error logging with stack traces
362
+ - Create log rotation and cleanup
363
+ - _Requirements: 11.5_
364
+
365
+ - [x] 35. Create comprehensive documentation
366
+ - Update README with new features and setup instructions
367
+ - Document data collection workflow
368
+ - Document training process
369
+ - Create user guide for voice interaction
370
+ - Add troubleshooting section
371
+ - _Requirements: 8.4, 9.5_
372
+
373
+ - [ ]* 36. Perform end-to-end integration testing
374
+ - Test complete voice input → Gemini → SmolVLA → audio output flow
375
+ - Test text input → intent detection → gesture execution flow
376
+ - Test error handling and recovery across all components
377
+ - Verify UI responsiveness during long operations
378
+ - Test with all 6 manipulation tasks
379
+ - _Requirements: 9.1, 9.2, 9.3, 9.4_
380
+
381
+ - [ ]* 36.1 Write system-level tests
382
+ - Test multi-modal interaction flows
383
+ - Test concurrent user requests
384
+ - Test resource usage (GPU memory, CPU)
385
+ - Benchmark performance metrics
386
+ - _Requirements: 1.5, 2.4, 7.3_
387
+
388
+ - [ ]* 37. Optimize performance
389
+ - Profile Gemini API response times
390
+ - Optimize SmolVLA inference speed
391
+ - Reduce audio processing latency
392
+ - Implement caching where appropriate
393
+ - Optimize GPU memory usage
394
+ - _Requirements: 1.5, 2.4_
395
+
396
+ - [ ]* 38. Final deployment preparation
397
+ - Create deployment checklist
398
+ - Set up monitoring and alerting
399
+ - Prepare rollback procedures
400
+ - Create backup of current system
401
+ - Document deployment process
402
+ - _Requirements: 8.1, 8.2, 8.3_
403
+
.kiro/steering/product.md ADDED
@@ -0,0 +1,30 @@
1
+ ---
2
+ inclusion: always
3
+ ---
4
+
5
+ # Product Overview
6
+
7
+ **Mortis** is an interactive AI Halloween experience that combines conversational AI with physical robotics. It's a Gradio web application where users chat with "Mortis," a mischievous Halloween spirit powered by LLMs.
8
+
9
+ ## Core Concept
10
+
11
+ Mortis responds to user messages with:
12
+ - Text responses (character-driven, in-character dialogue)
13
+ - Emotional moods (ominous, playful, angry, etc.)
14
+ - Physical gestures via a SeeedStudio SO101 robotic arm controlled through LeRobot
15
+
16
+ ## Key Features
17
+
18
+ - Web UI with Halloween-themed background
19
+ - Multi-model LLM support via API
20
+ - Structured tool calling for coordinated text + gesture responses
21
+ - Real-time robotic arm control synchronized with AI responses
22
+ - Local webcam view (browser-only, no upload)
23
+
24
+ ## Character Guidelines
25
+
26
+ When working with Mortis dialogue:
27
+ - Keep responses ≤30 words, ≤120 characters
28
+ - No emojis or markdown in character responses
29
+ - Maintain Halloween/haunted theme
30
+ - Responses should feel mischievous, spectral, or ominous
.kiro/steering/structure.md ADDED
@@ -0,0 +1,85 @@
1
+ ---
2
+ inclusion: always
3
+ ---
4
+
5
+ # Project Structure
6
+
7
+ ## Directory Layout
8
+
9
+ ```
+ mortis/
+ ├── src/mortis/        # Main application package
+ │   ├── app.py         # Gradio UI and main entry point
+ │   ├── tools.py       # LLM API integration and tool calling
+ │   ├── robot.py       # Robot arm control and gesture definitions
+ │   └── calibrate.py   # Robot calibration script
+ ├── examples/          # Example/demo scripts
+ │   └── demo.py        # Simple demo runner
+ ├── assets/            # Static assets (images, backgrounds)
+ │   └── image.png      # Halloween background image
+ ├── .cache/            # Runtime cache (calibration data)
+ ├── .env               # Environment variables (not committed)
+ ├── pyproject.toml     # Project metadata and dependencies
+ ├── uv.lock            # Locked dependency versions
+ ├── Makefile           # Build and run commands
+ └── README.md          # User documentation
+ ```
27
+
28
+ ## Module Organization
29
+
30
+ ### `src/mortis/app.py`
31
+ - Gradio UI construction
32
+ - Chat interface setup
33
+ - Model selection dropdown
34
+ - CSS styling with base64-encoded background
35
+ - Main entry point (`main()` function)
36
+
37
+ ### `src/mortis/tools.py`
38
+ - LLM API client
39
+ - Tool definition for structured outputs
40
+ - `ask_mortis()` function: sends user message, receives structured response
41
+ - Coordinates LLM response with robot gesture execution
42
+ - Manages global `mortis_arm` instance
43
+
44
+ ### `src/mortis/robot.py`
45
+ - `MortisArm` class: robot connection and control
46
+ - `GESTURES` dictionary: predefined gesture sequences
47
+ - Each gesture is a list of (pose_dict, delay) tuples
48
+ - Available gestures: idle, wave, point_left, point_right, grab, drop
49
+ - Pose dictionaries specify joint positions in degrees
50
+
51
+ ### `src/mortis/calibrate.py`
52
+ - Standalone calibration script
53
+ - Configures SO101Follower with calibration directory
54
+ - Interactive calibration process
55
+
56
+ ## Code Conventions
57
+
58
+ ### Import Style
59
+ - Standard library imports first
60
+ - Third-party imports second
61
+ - Local imports last
62
+ - Use `from .module import` for intra-package imports
63
+
64
+ ### Path Handling
65
+ - Use `pathlib.Path` for all file paths
66
+ - `REPO_ROOT` defined as `Path(__file__).resolve().parents[2]`
67
+ - Relative paths from repo root for assets and config
68
+
69
+ ### Robot Control Pattern
70
+ - Always check `mortis_arm.connected` before operations
71
+ - Connect once, reuse connection
72
+ - Disconnect on app unload (Gradio `demo.unload()`)
73
+ - Gestures execute synchronously with blocking delays
74
+
75
+ ### API Response Handling
76
+ - Structured tool calling enforced via `tool_choice`
77
+ - Parse `tool_calls[0].function.arguments` as JSON
78
+ - Extract: message (str), mood (enum), gesture (enum)
79
+ - Execute gesture immediately after parsing response
80
+
81
+ ## Entry Points
82
+
83
+ Defined in `pyproject.toml`:
84
+ - `mortis` → `mortis.app:main` (run the Gradio app)
85
+ - `calibrate` → `mortis.calibrate:main` (calibrate robot)
.kiro/steering/tech.md ADDED
@@ -0,0 +1,69 @@
1
+ ---
2
+ inclusion: always
3
+ ---
4
+
5
+ # Tech Stack
6
+
7
+ ## Core Technologies
8
+
9
+ - **Python**: 3.12+ (required)
10
+ - **Package Manager**: `uv` (modern Python dependency manager)
11
+ - **Web Framework**: Gradio 5.49.1+
12
+ - **Robotics**: LeRobot 0.4.0+ with Feetech servo support
13
+ - **API Client**: requests library for LLM API
14
+ - **Environment**: python-dotenv for configuration
15
+
16
+ ## Build System
17
+
18
+ The project uses a **Makefile** for all common operations. Always prefer `make` commands over direct CLI invocations.
19
+
20
+ ### Common Commands
21
+
22
+ ```bash
23
+ # Setup and dependencies
24
+ make install # Install/sync dependencies
25
+ make sync # Alias for install
26
+ make upgrade # Upgrade all dependencies
27
+
28
+ # Running the application
29
+ make run # Run via CLI entrypoint (mortis)
30
+ make run-m # Run as Python module
31
+ make demo # Run example script
32
+
33
+ # Robot operations
34
+ make calibrate # Calibrate the SO101 arm (required first-time setup)
35
+ make test-gesture # Test individual gestures
36
+
37
+ # Development
38
+ make check-env # Verify .env configuration
39
+ make add-<package> # Add new dependency (e.g., make add-numpy)
40
+ make export # Export requirements.txt from uv.lock
41
+ make clean # Remove build artifacts
42
+ ```
43
+
44
+ ## Environment Configuration
45
+
46
+ Required `.env` file in project root:
47
+ ```
48
+ API_KEY=your_api_key
49
+ API_BASE_URL=https://api.example.com/v1/chat/completions
50
+ ROBOT_PORT=/dev/ttyACM1 # Optional, defaults to /dev/ttyACM1
51
+ PORT=7860 # Optional, defaults to 7860
52
+ ```
53
+
54
+ ## API Integration
55
+
56
+ - Uses LLM chat completions API
57
+ - Supports multiple models
58
+ - Implements structured tool calling for coordinated responses
59
+ - Tool: `perform_mortis_act` returns {message, mood, gesture} (declaration sketched below)
60
+
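+ A sketch of the tool declaration (the gesture values follow the steering notes; the mood values are assumptions):
+ 
+ ```python
+ PERFORM_MORTIS_ACT = {
+     "type": "function",
+     "function": {
+         "name": "perform_mortis_act",
+         "parameters": {
+             "type": "object",
+             "properties": {
+                 "message": {"type": "string"},
+                 "mood": {"type": "string", "enum": ["spooky", "happy", "angry"]},  # assumed values
+                 "gesture": {"type": "string", "enum": [
+                     "idle", "wave", "point_left", "point_right", "grab", "drop"]},
+             },
+             "required": ["message", "mood", "gesture"],
+         },
+     },
+ }
+ ```
+ 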
61
+ ## Robot Hardware
62
+
63
+ - **Device**: SeeedStudio SO101 robotic arm
64
+ - **Connection**: USB serial (typically /dev/ttyACM1)
65
+ - **Calibration**: Stored in `.cache/calibration/so101/`
66
+ - **Control**: LeRobot framework with SO101Follower driver
67
+ - **Modes** (see the snippet below):
68
+ - `physical` - Connects to real robot hardware (default)
69
+ - `simulation` - Simulates robot without hardware (for development/testing)
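+ 
+ The mode is selected via the `ROBOT_MODE` environment variable, as read in `src/mortis/app.py`:
+ 
+ ```python
+ import os
+ 
+ robot_mode = os.getenv("ROBOT_MODE", "physical").lower()
+ if robot_mode == "simulation":
+     ...  # hardware-only features (e.g. manipulation) are disabled
+ ```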
app.py ADDED
@@ -0,0 +1,23 @@
1
+ #!/usr/bin/env python3
2
+ import os
3
+ import sys
4
+
5
+ # Make src/mortis visible to Python
6
+ CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
7
+ SRC_DIR = os.path.join(CURRENT_DIR, "src")
8
+ if SRC_DIR not in sys.path:
9
+ sys.path.append(SRC_DIR)
10
+
11
+ from mortis.app import ui  # or whichever function builds the chatbot
12
+
13
+ # ⚙️ Hugging Face passes the port in the PORT variable
14
+ port = int(os.getenv("PORT", "7860"))
15
+
16
+ demo = ui()  # the Chatbot/ChatInterface is assembled inside ui()
17
+
18
+ if __name__ == "__main__":
19
+ demo.launch(
20
+ server_name="0.0.0.0",  # IMPORTANT when running in Docker!
21
+ server_port=port,
22
+ show_error=True,
23
+ )
requirements.txt CHANGED
@@ -1,5 +1,5 @@
1
  google-genai>=1.53.0
2
- google-cloud-texttospeech>=2.16.0"
3
  gradio==5.49.1
4
  gtts>=2.5.0
5
  lerobot[async,feetech,intelrealsense,smolvla]>=0.4.0
 
1
  google-genai>=1.53.0
2
+ google-cloud-texttospeech>=2.16.0
3
  gradio==5.49.1
4
  gtts>=2.5.0
5
  lerobot[async,feetech,intelrealsense,smolvla]>=0.4.0
src/mortis/__init__.py ADDED
File without changes
src/mortis/app.py ADDED
@@ -0,0 +1,815 @@
1
+ import base64
2
+ import json
3
+ import os
4
+ import logging
5
+ import time
6
+
7
+ from pathlib import Path
8
+ import gradio as gr
9
+
10
+ from .tools import ask_mortis, mortis_arm
11
+ from .stt_service import STTService, AudioProcessingError
12
+ from .tts_service import get_tts_service
13
+ from .async_executor import AsyncExecutor, Task, TaskType, TaskStatus
14
+ from .lerobot_async_client import LeRobotAsyncClient, ManipulationStatus
15
+ from .intent_router import IntentRouter, Intent
16
+ from .models import ResponseType
17
+
18
+
19
+ REPO_ROOT = Path(__file__).resolve().parents[2]
20
+ BG_IMAGE = REPO_ROOT / "assets" / "kiroween.png"
21
+
22
+ MODEL_CHOICES = [
23
+ "gemini-2.5-flash",
24
+ "gemini-2.0-flash-exp",
25
+ "gemini-1.5-pro",
26
+ "gemini-1.5-flash",
27
+ ]
28
+
29
+ # Initialize STT service (global instance)
30
+ stt_service = None
31
+
32
+ # Initialize async execution systems (global instances)
33
+ async_executor = None
34
+ lerobot_client = None
35
+ intent_router = None
36
+
37
+ def get_stt_service():
38
+ """Lazy initialization of STT service."""
39
+ global stt_service
40
+ if stt_service is None:
41
+ try:
42
+ stt_service = STTService()
43
+ logging.getLogger(__name__).info("✅ STT service initialized")
44
+ except Exception as e:
45
+ logging.getLogger(__name__).error(f"❌ Failed to initialize STT service: {e}")
46
+ raise
47
+ return stt_service
48
+
49
+
50
+ # Initialize TTS service (global instance)
51
+ tts_service = None
52
+
53
+ def get_tts_service_instance():
54
+ """Lazy initialization of TTS service."""
55
+ global tts_service
56
+ if tts_service is None:
57
+ try:
58
+ tts_service = get_tts_service()
59
+ logging.getLogger(__name__).info("✅ TTS service initialized")
60
+ except Exception as e:
61
+ logging.getLogger(__name__).error(f"❌ Failed to initialize TTS service: {e}")
62
+ raise
63
+ return tts_service
64
+
65
+
66
+ def execute_async_task(task: Task):
67
+ """
68
+ Execute a task asynchronously (called by AsyncExecutor worker thread).
69
+
70
+ This function is called by the AsyncExecutor's worker thread to execute
71
+ tasks. It handles both gesture and manipulation tasks.
72
+
73
+ Args:
74
+ task: Task to execute
75
+ """
76
+ logger = logging.getLogger(__name__)
77
+
78
+ try:
79
+ if task.type == TaskType.GESTURE:
80
+ # Execute gesture using mortis_arm
81
+ gesture = task.gesture
82
+ logger.info(f"Executing gesture: {gesture}")
83
+
84
+ if mortis_arm.connected:
85
+ mortis_arm.move_arm(gesture)
86
+ else:
87
+ logger.warning("Robot arm not connected, skipping gesture")
88
+
89
+ elif task.type == TaskType.MANIPULATION:
90
+ # This shouldn't happen - manipulation goes through LeRobotAsyncClient
91
+ logger.warning(f"Manipulation task in AsyncExecutor: {task.command}")
92
+ logger.warning("Manipulation tasks should use LeRobotAsyncClient")
93
+
94
+ else:
95
+ logger.error(f"Unknown task type: {task.type}")
96
+
97
+ except Exception as e:
98
+ logger.error(f"Error executing task {task.id}: {e}", exc_info=True)
99
+ raise
100
+
101
+
102
+ def get_async_executor():
103
+ """Lazy initialization of AsyncExecutor."""
104
+ global async_executor
105
+ if async_executor is None:
106
+ try:
107
+ # Create executor with gesture execution function
108
+ async_executor = AsyncExecutor(task_executor=execute_async_task)
109
+ logging.getLogger(__name__).info("✅ AsyncExecutor initialized")
110
+ except Exception as e:
111
+ logging.getLogger(__name__).error(f"❌ Failed to initialize AsyncExecutor: {e}")
112
+ raise
113
+ return async_executor
114
+
115
+
116
+ def get_lerobot_client():
117
+ """Lazy initialization of LeRobotAsyncClient."""
118
+ global lerobot_client
119
+
120
+ # Use a sentinel value to indicate we've already checked and manipulation is disabled
121
+ if lerobot_client is None:
122
+ # Check if we're in simulation mode
123
+ robot_mode = os.getenv("ROBOT_MODE", "physical").lower()
124
+ if robot_mode == "simulation":
125
+ # Set to False to indicate manipulation is not available in simulation
126
+ lerobot_client = False
127
+ logging.getLogger(__name__).info("ℹ️ Manipulation disabled in simulation mode")
128
+ return None
129
+
130
+ # Check if manipulation is enabled
131
+ enable_manipulation = os.getenv("ENABLE_MANIPULATION", "false").lower() == "true"
132
+
133
+ if not enable_manipulation:
134
+ # Set to False (not None) to indicate we've checked and it's disabled
135
+ # This prevents logging the message repeatedly
136
+ lerobot_client = False
137
+ logging.getLogger(__name__).info("ℹ️ Manipulation disabled (ENABLE_MANIPULATION=false)")
138
+ return None
139
+
140
+ try:
141
+ robot_port = os.getenv("ROBOT_PORT", "/dev/ttyACM1")
142
+ model_path = os.getenv("SMOLVLA_MODEL_PATH", "jlamperez/kiroween-potion-smolvla")
143
+
144
+ lerobot_client = LeRobotAsyncClient(
145
+ robot_port=robot_port,
146
+ model_path=model_path
147
+ )
148
+
149
+ # Configure idle callback to move robot to safe position on timeout
150
+ lerobot_client.set_idle_callback(lambda: mortis_arm.move_arm("idle") if mortis_arm.connected else None)
151
+
152
+ logging.getLogger(__name__).info("✅ LeRobotAsyncClient initialized")
153
+ except Exception as e:
154
+ logging.getLogger(__name__).error(f"❌ Failed to initialize LeRobotAsyncClient: {e}")
155
+ # Don't raise - manipulation is optional
156
+ return None
157
+
158
+ # Return None if manipulation is disabled (lerobot_client == False)
159
+ return lerobot_client if lerobot_client is not False else None
160
+
161
+
162
+ def get_intent_router_instance():
163
+ """Lazy initialization of IntentRouter."""
164
+ global intent_router
165
+ if intent_router is None:
166
+ try:
167
+ intent_router = IntentRouter()
168
+ logging.getLogger(__name__).info("✅ IntentRouter initialized")
169
+ except Exception as e:
170
+ logging.getLogger(__name__).error(f"❌ Failed to initialize IntentRouter: {e}")
171
+ raise
172
+ return intent_router
173
+
174
+
175
+ def build_css(image_path: str) -> str:
176
+ """Background with custom image."""
177
+ with open(image_path, "rb") as f:
178
+ b64 = base64.b64encode(f.read()).decode()
179
+
180
+ return f"""
181
+ .gradio-container {{
182
+ background-image: url("data:image/png;base64,{b64}");
183
+ background-size: cover;
184
+ background-position: center;
185
+ background-repeat: no-repeat;
186
+ background-attachment: fixed;
187
+ }}
188
+
189
+ footer::after{{
190
+ content: "by: Jorge Lamperez 🤖";
191
+ margin-left: 8px;
192
+ opacity: .85;
193
+ }}
194
+ """
195
+
196
+
197
+ def process_audio_input(audio_path):
198
+ """
199
+ Process audio input from microphone and return transcribed text.
200
+
201
+ Args:
202
+ audio_path: Path to recorded audio file from Gradio
203
+
204
+ Returns:
205
+ Transcribed text or error message
206
+ """
207
+ logger = logging.getLogger(__name__)
208
+
209
+ if audio_path is None:
210
+ return ""
211
+
212
+ try:
213
+ logger.info(f"🎤 Processing audio input: {audio_path}")
214
+
215
+ # Get STT service
216
+ stt = get_stt_service()
217
+
218
+ # Transcribe audio
219
+ transcript = stt.transcribe(audio_path)
220
+
221
+ if not transcript:
222
+ logger.warning("⚠️ Audio transcription returned empty result")
223
+ return ""
224
+
225
+ logger.info(f"✅ Transcription successful: '{transcript[:50]}...'")
226
+ return transcript
227
+
228
+ except FileNotFoundError as e:
229
+ error_msg = f"Audio file not found: {e}"
230
+ logger.error(f"❌ {error_msg}")
231
+ return f"[Error: {error_msg}]"
232
+
233
+ except AudioProcessingError as e:
234
+ error_msg = f"Audio processing failed: {e}"
235
+ logger.error(f"❌ {error_msg}")
236
+ return f"[Error: {error_msg}]"
237
+
238
+ except Exception as e:
239
+ error_msg = f"Unexpected error during transcription: {type(e).__name__}: {e}"
240
+ logger.error(f"❌ {error_msg}")
241
+ return f"[Error: {error_msg}]"
242
+
243
+
244
+ def mortis_reply(message, history, model_name):
245
+ logger = logging.getLogger(__name__)
246
+ logger.info(f"💬 User message: {message[:50]}{'...' if len(message) > 50 else ''}")
247
+ logger.info(f"🤖 Using model: {model_name}")
248
+
249
+ msg, mood, gesture = ask_mortis(message, model_name=model_name)
250
+
251
+ logger.info(f"👻 Mortis reply: {msg[:50]}{'...' if len(msg) > 50 else ''}")
252
+ logger.info(f"😈 Mood: {mood}, Gesture: {gesture}")
253
+
254
+ return msg
255
+
256
+
257
+ def mortis_reply_with_audio(message, history, model_name, audio_input_path=None):
258
+ """
259
+ Generate Mortis reply with both text and audio output using hybrid execution.
260
+
261
+ This function integrates the hybrid async execution system:
262
+ - Gestures are routed to AsyncExecutor (simple threading)
263
+ - Manipulation tasks are routed to LeRobotAsyncClient (LeRobot async inference)
264
+
265
+ Supports both text and voice input through the unified voice pipeline.
266
+
267
+ Args:
268
+ message: User message text (optional if audio_input_path provided)
269
+ history: Chat history
270
+ model_name: Gemini model to use
271
+ audio_input_path: Optional path to audio input file
272
+
273
+ Returns:
274
+ Tuple of (text_response, audio_path)
275
+ """
276
+ logger = logging.getLogger(__name__)
277
+
278
+ # Import necessary components
279
+ from .gemini_client import GeminiClient
280
+
281
+ # Log input type
282
+ if audio_input_path:
283
+ logger.info(f"🎤 Voice input: {audio_input_path}")
284
+
285
+ # Transcribe audio to text
286
+ try:
287
+ stt = get_stt_service()
288
+ message = stt.transcribe(audio_input_path)
289
+ logger.info(f"📝 Transcribed: '{message[:50]}...'")
290
+
291
+ if not message or not message.strip():
292
+ logger.warning("⚠️ STT returned empty transcription")
293
+ return "I couldn't hear you... speak again.", None
294
+ except Exception as e:
295
+ logger.error(f"❌ Voice input processing failed: {e}")
296
+ return "The spirits couldn't understand... try again.", None
297
+ else:
298
+ logger.info(f"💬 Text input: {message[:50]}{'...' if len(message) > 50 else ''}")
299
+
300
+ logger.info(f"🤖 Using model: {model_name}")
301
+
302
+ try:
303
+ # Get Gemini client and send message
304
+ gemini_client = GeminiClient()
305
+ if model_name:
306
+ gemini_client.configure_model(model_name=model_name)
307
+
308
+ response_json = gemini_client.send_message(message)
309
+
310
+ # Parse response using IntentRouter
311
+ router = get_intent_router_instance()
312
+ intent = router.parse_gemini_response(response_json)
313
+
314
+ # Extract response components
315
+ msg = intent.message
316
+ mood = intent.mood
317
+
318
+ logger.info(f"👻 Mortis reply: {msg[:50]}{'...' if len(msg) > 50 else ''}")
319
+ logger.info(f"😈 Mood: {mood}")
320
+
321
+ # Route execution based on intent type
322
+ execution_path = router.route_intent(intent)
323
+
324
+ if execution_path == "manipulation" and intent.is_valid:
325
+ # Route to LeRobotAsyncClient for manipulation
326
+ logger.info(f"🤖 Routing manipulation to LeRobotAsyncClient: {intent.command}")
327
+
328
+ client = get_lerobot_client()
329
+ if client and client.is_running():
330
+ try:
331
+ # Get timeout from environment or use default (60s)
332
+ timeout = float(os.getenv("MANIPULATION_TIMEOUT", "60.0"))
333
+
334
+ # Submit manipulation task asynchronously with timeout
335
+ client.execute_task(
336
+ intent.command,
337
+ blocking=False,
338
+ timeout=timeout
339
+ )
340
+ logger.info(f"✅ Manipulation task submitted: {intent.command} (timeout: {timeout}s)")
341
+ except Exception as e:
342
+ logger.error(f"❌ Failed to submit manipulation task: {e}")
343
+ logger.info("Falling back to gesture execution")
344
+
345
+ # Fallback to gesture
346
+ executor = get_async_executor()
347
+ if executor.running:
348
+ task = Task.create_gesture_task("idle")
349
+ executor.submit_task(task)
350
+ else:
351
+ logger.warning("LeRobotAsyncClient not available, falling back to gesture")
352
+
353
+ # Fallback to gesture
354
+ executor = get_async_executor()
355
+ if executor.running:
356
+ task = Task.create_gesture_task("idle")
357
+ executor.submit_task(task)
358
+
359
+ elif execution_path == "gesture":
360
+ # Route to AsyncExecutor for gesture
361
+ gesture = intent.gesture if intent.gesture else "idle"
362
+ logger.info(f"👋 Routing gesture to AsyncExecutor: {gesture}")
363
+
364
+ executor = get_async_executor()
365
+ if executor.running:
366
+ try:
367
+ # Submit gesture task asynchronously
368
+ task = Task.create_gesture_task(gesture)
369
+ executor.submit_task(task)
370
+ logger.info(f"✅ Gesture task submitted: {gesture}")
371
+ except Exception as e:
372
+ logger.error(f"❌ Failed to submit gesture task: {e}")
373
+ else:
374
+ logger.warning("AsyncExecutor not running, executing gesture synchronously")
375
+ if mortis_arm.connected:
376
+ mortis_arm.move_arm(gesture)
377
+
378
+ else:
379
+ # Invalid intent - fallback to idle gesture
380
+ logger.warning(f"⚠️ Invalid intent, falling back to idle gesture")
381
+
382
+ executor = get_async_executor()
383
+ if executor.running:
384
+ task = Task.create_gesture_task("idle")
385
+ executor.submit_task(task)
386
+ elif mortis_arm.connected:
387
+ mortis_arm.move_arm("idle")
388
+
389
+ # Generate audio response
390
+ audio_path = None
391
+ try:
392
+ tts = get_tts_service_instance()
393
+ audio_path = tts.synthesize(msg)
394
+
395
+ if audio_path:
396
+ logger.info(f"🔊 Audio output: {audio_path}")
397
+ except Exception as e:
398
+ logger.error(f"❌ TTS generation failed: {e}")
399
+ # Continue without audio
400
+
401
+ return msg, audio_path
402
+
403
+ except Exception as e:
404
+ logger.error(f"❌ Error in mortis_reply_with_audio: {e}", exc_info=True)
405
+ return "The spirits are confused... try again.", None
406
+
407
+
408
+ def start_async_systems():
409
+ """
410
+ Start the async execution systems on app load.
411
+
412
+ This function initializes and starts:
413
+ 1. Robot arm connection
414
+ 2. AsyncExecutor for gesture execution
415
+ 3. LeRobotAsyncClient for manipulation tasks (if enabled)
416
+ """
417
+ logger = logging.getLogger(__name__)
418
+ logger.info("🚀 Starting async execution systems...")
419
+
420
+ # Connect to robot arm
421
+ try:
422
+ if not mortis_arm.connected:
423
+ mortis_arm.connect()
424
+ if mortis_arm.mode == "simulation":
425
+ logger.info("🎭 Robot arm in SIMULATION mode")
426
+ else:
427
+ logger.info("✅ Robot arm connected")
428
+ else:
429
+ logger.info("ℹ️ Robot arm already connected")
430
+ except Exception as e:
431
+ logger.error(f"❌ Failed to connect robot arm: {e}", exc_info=True)
432
+ logger.info("ℹ️ Gestures will be skipped until robot is connected")
433
+
434
+ # Start AsyncExecutor
435
+ try:
436
+ executor = get_async_executor()
437
+ if not executor.running:
438
+ executor.start()
439
+ logger.info("✅ AsyncExecutor started")
440
+ else:
441
+ logger.info("ℹ️ AsyncExecutor already running")
442
+ except Exception as e:
443
+ logger.error(f"❌ Failed to start AsyncExecutor: {e}", exc_info=True)
444
+
445
+ # Start LeRobotAsyncClient (if enabled)
446
+ try:
447
+ client = get_lerobot_client()
448
+ if client and not client.is_running():
449
+ success = client.start()
450
+ if success:
451
+ logger.info("✅ LeRobotAsyncClient started")
452
+ else:
453
+ logger.warning("⚠️ LeRobotAsyncClient failed to start")
454
+ except Exception as e:
455
+ logger.error(f"❌ Failed to start LeRobotAsyncClient: {e}", exc_info=True)
456
+ logger.info("ℹ️ Manipulation tasks will fall back to gestures")
457
+
458
+
459
+ def check_status():
460
+ """
461
+ Check status of both async execution systems and return formatted status message.
462
+
463
+ This function monitors:
464
+ 1. AsyncExecutor for gesture status updates
465
+ 2. LeRobotAsyncClient for manipulation status
466
+
467
+ Returns:
468
+ Formatted status string with icons and messages
469
+ """
470
+ logger = logging.getLogger(__name__)
471
+
472
+ status_parts = []
473
+
474
+ # Add robot mode indicator
475
+ if mortis_arm.mode == "simulation":
476
+ status_parts.append("🎭 SIMULATION MODE")
477
+
478
+ # Check AsyncExecutor status
479
+ try:
480
+ executor = get_async_executor()
481
+ if executor and executor.running:
482
+ # Check if executor is busy
483
+ current_task = executor.get_current_task()
484
+ if current_task:
485
+ # Task is running
486
+ if current_task.type == TaskType.GESTURE:
487
+ status_parts.append(f"👋 Gesture: {current_task.gesture} (running)")
488
+ else:
489
+ status_parts.append(f"🤖 Task: {current_task.command[:30]}... (running)")
490
+ else:
491
+ # Check for recent status updates
492
+ updates = executor.get_all_status_updates()
493
+ if updates:
494
+ latest = updates[-1]
495
+ if latest.status == TaskStatus.COMPLETE:
496
+ status_parts.append("✅ Gesture complete")
497
+ elif latest.status == TaskStatus.FAILED:
498
+ status_parts.append(f"❌ Gesture failed: {latest.error}")
499
+ elif latest.status == TaskStatus.QUEUED:
500
+ status_parts.append("⏳ Gesture queued")
501
+ except Exception as e:
502
+ logger.error(f"Error checking AsyncExecutor status: {e}")
503
+
504
+ # Check LeRobotAsyncClient status
505
+ try:
506
+ client = get_lerobot_client()
507
+ if client and client.is_running():
508
+ manipulation_status = client.get_status()
509
+ current_task = client.get_current_task()
510
+
511
+ if manipulation_status == ManipulationStatus.RUNNING and current_task:
512
+ # Manipulation task is running
513
+ elapsed = time.time() - current_task.started_at if current_task.started_at else 0
514
+ status_parts.append(f"🤖 Manipulation: {current_task.task[:40]}... ({elapsed:.1f}s)")
515
+ elif manipulation_status == ManipulationStatus.COMPLETE and current_task:
516
+ # Task just completed
517
+ duration = current_task.duration or 0
518
+ status_parts.append(f"✅ Manipulation complete ({duration:.1f}s)")
519
+ elif manipulation_status == ManipulationStatus.FAILED and current_task:
520
+ # Task failed
521
+ error = current_task.error or "Unknown error"
522
+ status_parts.append(f"❌ Manipulation failed: {error[:50]}")
523
+ elif manipulation_status == ManipulationStatus.STARTING:
524
+ status_parts.append("⏳ Starting manipulation...")
525
+ elif manipulation_status == ManipulationStatus.STOPPED and current_task:
526
+ # Task was stopped (timeout or manual stop)
527
+ duration = current_task.duration or 0
528
+ error_msg = current_task.error or "Stopped"
529
+
530
+ # Check if control thread is still finishing
531
+ if client.control_thread and client.control_thread.is_alive():
532
+ status_parts.append(f"⏹️ Stopped (finishing actions...): {error_msg[:30]}")
533
+ else:
534
+ status_parts.append(f"⏹️ Stopped: {error_msg[:40]} ({duration:.1f}s)")
535
+ except Exception as e:
536
+ logger.error(f"Error checking LeRobotAsyncClient status: {e}")
537
+
538
+ # Return formatted status or idle message
539
+ if status_parts:
540
+ return " | ".join(status_parts)
541
+ else:
542
+ return "💤 Idle - Ready for commands"
543
+
544
+
545
+ def stop_async_systems():
546
+ """
547
+ Stop the async execution systems on app unload.
548
+
549
+ This function gracefully shuts down:
550
+ 1. AsyncExecutor
551
+ 2. LeRobotAsyncClient
552
+ 3. Robot arm connection
553
+ """
554
+ logger = logging.getLogger(__name__)
555
+ logger.info("🛑 Stopping async execution systems...")
556
+
557
+ # Stop AsyncExecutor
558
+ try:
559
+ if async_executor and async_executor.running:
560
+ async_executor.stop()
561
+ logger.info("✅ AsyncExecutor stopped")
562
+ except Exception as e:
563
+ logger.error(f"❌ Error stopping AsyncExecutor: {e}")
564
+
565
+ # Stop LeRobotAsyncClient
566
+ try:
567
+ if lerobot_client and lerobot_client.is_running():
568
+ lerobot_client.stop()
569
+ logger.info("✅ LeRobotAsyncClient stopped")
570
+ except Exception as e:
571
+ logger.error(f"❌ Error stopping LeRobotAsyncClient: {e}")
572
+
573
+ # Disconnect robot arm
574
+ try:
575
+ mortis_arm.disconnect()
576
+ logger.info("✅ Robot arm disconnected")
577
+ except Exception as e:
578
+ logger.error(f"❌ Error disconnecting robot arm: {e}")
579
+
580
+
581
+ def ui() -> gr.Blocks:
582
+ css = build_css(BG_IMAGE)
583
+ with gr.Blocks(fill_height=True, theme="soft", css=css) as demo:
584
+ # Dynamic title based on robot mode
585
+ mode_indicator = " (Simulation Mode 🎭)" if mortis_arm.mode == "simulation" else ""
586
+ gr.Markdown(
587
+ f"# Kiroween Hackathon 🎃\n"
588
+ f"## Mortis: Haunted Control Room 👻🤖{mode_indicator}",
589
+ elem_id="app-title"
590
+ )
591
+
592
+ with gr.Row(equal_height=True):
593
+ with gr.Column():
594
+ model_dd = gr.Dropdown(
595
+ choices=MODEL_CHOICES,
596
+ value=MODEL_CHOICES[0],
597
+ label="Gemini Model",
598
+ info="Select Gemini model for Mortis",
599
+ interactive=True,
600
+ )
601
+
602
+ # Audio input component for voice interaction
603
+ with gr.Row():
604
+ audio_input = gr.Audio(
605
+ sources=["microphone"],
606
+ type="filepath",
607
+ label="🎤 Speak to Mortis",
608
+ show_label=True,
609
+ interactive=True,
610
+ waveform_options=gr.WaveformOptions(
611
+ show_controls=False,
612
+ ),
613
+ )
614
+
615
+ # Transcription display for user confirmation
616
+ transcription_display = gr.Textbox(
617
+ label="Transcribed Text",
618
+ placeholder="Your transcribed speech will appear here...",
619
+ interactive=False,
620
+ visible=True,
621
+ lines=2,
622
+ )
623
+
624
+ # Audio output component for Mortis voice responses
625
+ audio_output = gr.Audio(
626
+ label="🔊 Mortis speaks",
627
+ autoplay=True,
628
+ type="filepath",
629
+ interactive=False,
630
+ show_label=True,
631
+ )
632
+
633
+ # State to store the latest audio path
634
+ audio_state = gr.State(value=None)
635
+
636
+ # Custom wrapper to add audio output to chat responses
637
+ def mortis_reply_wrapper(message, history, model_name, audio_state_value):
638
+ """Wrapper that generates both text and audio."""
639
+ text_response, audio_path = mortis_reply_with_audio(message, history, model_name)
640
+ # Return text for chat and audio path for state
641
+ return text_response, audio_path
642
+
643
+ # Chat interface
644
+ chat_interface = gr.ChatInterface(
645
+ fn=mortis_reply_wrapper,
646
+ additional_inputs=[model_dd, audio_state],
647
+ additional_outputs=[audio_state],
648
+ chatbot=gr.Chatbot(height=380, label="Mortis chat", type="messages"),
649
+ textbox=gr.Textbox(placeholder="Write your message here or use voice input above…"),
650
+ submit_btn="Send",
651
+ )
652
+
653
+ # Connect audio input to transcription display and chat
654
+ def handle_audio_and_submit(audio_path, history, model_name):
655
+ """Handle audio input: transcribe and submit to chat with audio response."""
656
+ if audio_path is None:
657
+ return "", history, None
658
+
659
+ logger = logging.getLogger(__name__)
660
+ logger.info(f"🎤 Handling audio input: {audio_path}")
661
+
662
+ # First, get the transcription for display
663
+ transcript = process_audio_input(audio_path)
664
+
665
+ # If transcription failed, return error
666
+ if not transcript or transcript.startswith("[Error:"):
667
+ return transcript, history, None
668
+
669
+ # Now use the transcribed text to get Mortis response with audio
670
+ # We pass the transcript as text, not the audio file, to avoid double transcription
671
+ response_text, response_audio = mortis_reply_with_audio(
672
+ message=transcript, # Use the transcribed text
673
+ history=history,
674
+ model_name=model_name,
675
+ audio_input_path=None # Don't pass audio since we already transcribed
676
+ )
677
+
678
+ # Update chat history
679
+ history.append({"role": "user", "content": transcript})
680
+ history.append({"role": "assistant", "content": response_text})
681
+
682
+ return transcript, history, response_audio
683
+
684
+ # Wire up audio input to trigger transcription and chat submission
685
+ audio_input.stop_recording(
686
+ fn=handle_audio_and_submit,
687
+ inputs=[audio_input, chat_interface.chatbot, model_dd],
688
+ outputs=[transcription_display, chat_interface.chatbot, audio_output],
689
+ )
690
+
691
+ # Connect audio state changes to audio output
692
+ # This ensures audio plays whenever the state is updated by ChatInterface
693
+ audio_state.change(
694
+ fn=lambda x: x, # Pass through the audio path
695
+ inputs=[audio_state],
696
+ outputs=[audio_output],
697
+ )
698
+
699
+ with gr.Column():
700
+ gr.Video(
701
+ sources=["webcam"],
702
+ label="Camera view",
703
+ height=480,
704
+ include_audio=False,
705
+ )
706
+ gr.Markdown("**Webcam (local, no data upload)**\nThe video is only processed in your browser.")
707
+
708
+ # Robot status display
709
+ status_display = gr.Textbox(
710
+ label="🤖 Robot Status",
711
+ value="💤 Idle - Ready for commands",
712
+ interactive=False,
713
+ lines=2,
714
+ max_lines=3,
715
+ )
716
+
717
+ # Stop button for manipulation tasks
718
+ def stop_manipulation_task():
719
+ """Stop the currently running manipulation task."""
720
+ logger = logging.getLogger(__name__)
721
+ client = get_lerobot_client()
722
+
723
+ if client and client.is_running():
724
+ if client.is_busy():
725
+ logger.info("🛑 User requested task stop")
726
+ success = client.stop_current_task()
727
+ if success:
728
+ return "⏹️ Task stopped by user"
729
+ else:
730
+ return "❌ Failed to stop task"
731
+ else:
732
+ return "ℹ️ No task running"
733
+ else:
734
+ return "ℹ️ Manipulation not enabled"
735
+
736
+ stop_button = gr.Button(
737
+ "🛑 Stop Manipulation Task",
738
+ variant="stop",
739
+ size="sm",
740
+ )
741
+
742
+ stop_button.click(
743
+ fn=stop_manipulation_task,
744
+ outputs=[status_display]
745
+ )
746
+
747
+ # Status polling timer (must be inside Blocks context)
748
+ status_timer = gr.Timer(value=0.5, active=True)
749
+
750
+ # Lifecycle management: start async systems on load, stop on unload
751
+ demo.load(fn=start_async_systems)
752
+ demo.unload(fn=stop_async_systems)
753
+
754
+ # Status polling: update status display every 500ms using a timer
755
+ status_timer.tick(
756
+ fn=check_status,
757
+ outputs=[status_display]
758
+ )
759
+
760
+ return demo
761
+
762
+
763
+ def cleanup_audio_files():
764
+ """Periodic cleanup of old audio files."""
765
+ try:
766
+ tts = get_tts_service_instance()
767
+ tts.cleanup_old_files(max_age_seconds=3600) # Clean files older than 1 hour
768
+ except Exception as e:
769
+ logging.getLogger(__name__).warning(f"Failed to cleanup audio files: {e}")
770
+
771
+
772
+ def main():
773
+ # Configure logging - force configuration even if already set
774
+ log_level = os.getenv("LOG_LEVEL", "INFO").upper()
775
+
776
+ # Remove existing handlers and reconfigure
777
+ root_logger = logging.getLogger()
778
+ for handler in root_logger.handlers[:]:
779
+ root_logger.removeHandler(handler)
780
+
781
+ # Set up new handler with our format
782
+ handler = logging.StreamHandler()
783
+ handler.setFormatter(logging.Formatter(
784
+ '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
785
+ datefmt='%H:%M:%S'
786
+ ))
787
+ root_logger.addHandler(handler)
788
+ root_logger.setLevel(getattr(logging, log_level))
789
+
790
+ logger = logging.getLogger(__name__)
791
+ logger.info("=" * 60)
792
+ logger.info("🎃 Starting Mortis application...")
793
+ logger.info(f"📊 Log level: {log_level}")
794
+
795
+ # Ensure outputs directory exists
796
+ from pathlib import Path
797
+ outputs_dir = Path("outputs")
798
+ outputs_dir.mkdir(parents=True, exist_ok=True)
799
+ logger.info(f"📁 Audio output directory: {outputs_dir.absolute()}")
800
+
801
+ # Clean up old audio files on startup
802
+ cleanup_audio_files()
803
+
804
+ # Start async systems before launching UI
805
+ start_async_systems()
806
+
807
+ port = int(os.getenv("PORT", "7860"))
808
+ logger.info(f"🌐 Launching on http://127.0.0.1:{port}")
809
+ logger.info("=" * 60)
810
+
811
+ try:
812
+ ui().launch(server_name="127.0.0.1", server_port=port, show_error=True)
813
+ finally:
814
+ # Ensure cleanup on exit
815
+ stop_async_systems()
src/mortis/async_executor.py ADDED
@@ -0,0 +1,554 @@
1
+ """
2
+ Asynchronous task execution system for Mortis.
3
+
4
+ This module provides infrastructure for executing robot tasks asynchronously
5
+ in a background worker thread, allowing the Gradio UI to remain responsive
6
+ during long-running operations like SmolVLA inference.
7
+ """
8
+
9
+ import time
10
+ import logging
11
+ from dataclasses import dataclass, field
12
+ from enum import Enum
13
+ from queue import Queue, Empty
14
+ from threading import Thread, Event
15
+ from typing import Optional, Callable, Dict, Any
16
+
17
+
18
+ logger = logging.getLogger(__name__)
19
+
20
+
21
+ class TaskStatus(Enum):
22
+ """Status of a task in the execution queue."""
23
+ QUEUED = "queued"
24
+ RUNNING = "running"
25
+ COMPLETE = "complete"
26
+ FAILED = "failed"
27
+
28
+
29
+ class TaskType(Enum):
30
+ """Type of robot task to execute."""
31
+ GESTURE = "gesture"
32
+ MANIPULATION = "manipulation"
33
+
34
+
35
+ @dataclass
36
+ class Task:
37
+ """
38
+ Represents a robot task for asynchronous execution.
39
+
40
+ Attributes:
41
+ id: Unique identifier for the task
42
+ type: Type of task (gesture or manipulation)
43
+ status: Current execution status
44
+ created_at: Timestamp when task was created
45
+ started_at: Timestamp when task execution started
46
+ completed_at: Timestamp when task execution completed
47
+ error: Error message if task failed
48
+ gesture: Gesture name for GESTURE type tasks
49
+ command: Command string for MANIPULATION type tasks
50
+ metadata: Additional task-specific data
51
+ """
52
+ id: str
53
+ type: TaskType
54
+ status: TaskStatus
55
+ created_at: float
56
+ started_at: Optional[float] = None
57
+ completed_at: Optional[float] = None
58
+ error: Optional[str] = None
59
+
60
+ # Task-specific data
61
+ gesture: Optional[str] = None
62
+ command: Optional[str] = None
63
+ metadata: Dict[str, Any] = field(default_factory=dict)
64
+
65
+ @classmethod
66
+ def create_gesture_task(cls, gesture: str, metadata: Optional[Dict[str, Any]] = None) -> "Task":
67
+ """
68
+ Create a gesture execution task.
69
+
70
+ Args:
71
+ gesture: Name of the gesture to execute (e.g., "wave", "idle")
72
+ metadata: Optional additional task data
73
+
74
+ Returns:
75
+ Task configured for gesture execution
76
+ """
77
+ task_id = f"gesture_{time.time()}"
78
+ return cls(
79
+ id=task_id,
80
+ type=TaskType.GESTURE,
81
+ status=TaskStatus.QUEUED,
82
+ created_at=time.time(),
83
+ gesture=gesture,
84
+ metadata=metadata or {}
85
+ )
86
+
87
+ @classmethod
88
+ def create_manipulation_task(cls, command: str, metadata: Optional[Dict[str, Any]] = None) -> "Task":
89
+ """
90
+ Create a manipulation execution task.
91
+
92
+ Args:
93
+ command: Natural language command for SmolVLA (e.g., "Pick up the skull")
94
+ metadata: Optional additional task data
95
+
96
+ Returns:
97
+ Task configured for manipulation execution
98
+ """
99
+ task_id = f"manipulation_{time.time()}"
100
+ return cls(
101
+ id=task_id,
102
+ type=TaskType.MANIPULATION,
103
+ status=TaskStatus.QUEUED,
104
+ created_at=time.time(),
105
+ command=command,
106
+ metadata=metadata or {}
107
+ )
108
+
109
+ def start(self) -> None:
110
+ """Mark task as started and record start time."""
111
+ self.status = TaskStatus.RUNNING
112
+ self.started_at = time.time()
113
+ logger.info(f"Task {self.id} started")
114
+
115
+ def complete(self) -> None:
116
+ """Mark task as completed and record completion time."""
117
+ self.status = TaskStatus.COMPLETE
118
+ self.completed_at = time.time()
119
+ logger.info(f"Task {self.id} completed in {self.duration:.2f}s")
120
+
121
+ def fail(self, error: str) -> None:
122
+ """
123
+ Mark task as failed and record error.
124
+
125
+ Args:
126
+ error: Error message describing the failure
127
+ """
128
+ self.status = TaskStatus.FAILED
129
+ self.completed_at = time.time()
130
+ self.error = error
131
+ logger.error(f"Task {self.id} failed: {error}")
132
+
133
+ @property
134
+ def duration(self) -> Optional[float]:
135
+ """
136
+ Get task execution duration in seconds.
137
+
138
+ Returns:
139
+ Duration in seconds if task has started and completed, None otherwise
140
+ """
141
+ if self.started_at and self.completed_at:
142
+ return self.completed_at - self.started_at
143
+ return None
144
+
145
+ @property
146
+ def wait_time(self) -> float:
147
+ """
148
+ Get time task spent waiting in queue before execution.
149
+
150
+ Returns:
151
+ Wait time in seconds, or time since creation if not started
152
+ """
153
+ if self.started_at:
154
+ return self.started_at - self.created_at
155
+ return time.time() - self.created_at
156
+
157
+ def to_dict(self) -> Dict[str, Any]:
158
+ """
159
+ Convert task to dictionary representation.
160
+
161
+ Returns:
162
+ Dictionary containing task data
163
+ """
164
+ return {
165
+ "id": self.id,
166
+ "type": self.type.value,
167
+ "status": self.status.value,
168
+ "created_at": self.created_at,
169
+ "started_at": self.started_at,
170
+ "completed_at": self.completed_at,
171
+ "duration": self.duration,
172
+ "wait_time": self.wait_time,
173
+ "error": self.error,
174
+ "gesture": self.gesture,
175
+ "command": self.command,
176
+ "metadata": self.metadata
177
+ }
178
+
179
+
180
+ @dataclass
181
+ class StatusUpdate:
182
+ """
183
+ Status update message from the async executor.
184
+
185
+ Attributes:
186
+ task_id: ID of the task this update relates to
187
+ status: Current task status
188
+ message: Human-readable status message
189
+ progress: Optional progress percentage (0-100)
190
+ error: Optional error message
191
+ timestamp: When this update was created
192
+ """
193
+ task_id: str
194
+ status: TaskStatus
195
+ message: str
196
+ progress: Optional[float] = None
197
+ error: Optional[str] = None
198
+ timestamp: float = field(default_factory=time.time)
199
+
200
+ def to_dict(self) -> Dict[str, Any]:
201
+ """Convert status update to dictionary."""
202
+ return {
203
+ "task_id": self.task_id,
204
+ "status": self.status.value,
205
+ "message": self.message,
206
+ "progress": self.progress,
207
+ "error": self.error,
208
+ "timestamp": self.timestamp
209
+ }
210
+
211
+
212
+ class AsyncExecutor:
213
+ """
214
+ Asynchronous task executor for robot operations.
215
+
216
+ This class manages a background worker thread that processes robot tasks
217
+ from a queue, allowing the main application thread (Gradio UI) to remain
218
+ responsive during long-running operations.
219
+
220
+ Attributes:
221
+ task_queue: Queue of tasks waiting to be executed
222
+ status_queue: Queue of status updates from the worker
223
+ worker_thread: Background thread that processes tasks
224
+ running: Flag indicating if the executor is running
225
+ stop_event: Event to signal worker thread to stop
226
+ task_executor: Callable that executes tasks
227
+ current_task: Currently executing task (if any)
228
+ """
229
+
230
+ def __init__(self, task_executor: Optional[Callable[[Task], None]] = None):
231
+ """
232
+ Initialize the async executor.
233
+
234
+ Args:
235
+ task_executor: Optional callable that executes tasks. If not provided,
236
+ tasks will be logged but not executed (useful for testing).
237
+ """
238
+ self.task_queue: Queue[Task] = Queue()
239
+ self.status_queue: Queue[StatusUpdate] = Queue()
240
+ self.worker_thread: Optional[Thread] = None
241
+ self.running: bool = False
242
+ self.stop_event: Event = Event()
243
+ self.task_executor: Optional[Callable[[Task], None]] = task_executor
244
+ self.current_task: Optional[Task] = None
245
+
246
+ logger.info("AsyncExecutor initialized")
247
+
248
+ def start(self) -> None:
249
+ """
250
+ Start the background worker thread.
251
+
252
+ This method starts a daemon thread that continuously processes tasks
253
+ from the queue until stop() is called.
254
+
255
+ Raises:
256
+ RuntimeError: If the executor is already running
257
+ """
258
+ if self.running:
259
+ raise RuntimeError("AsyncExecutor is already running")
260
+
261
+ self.running = True
262
+ self.stop_event.clear()
263
+ self.worker_thread = Thread(target=self._worker_loop, daemon=True, name="AsyncExecutor")
264
+ self.worker_thread.start()
265
+
266
+ logger.info("AsyncExecutor started")
267
+
268
+ def stop(self, timeout: float = 5.0) -> None:
269
+ """
270
+ Stop the background worker thread.
271
+
272
+ This method signals the worker thread to stop and waits for it to finish.
273
+ If the worker is currently executing a task, it will complete that task
274
+ before stopping.
275
+
276
+ Args:
277
+ timeout: Maximum time to wait for worker to stop (seconds)
278
+ """
279
+ if not self.running:
280
+ logger.warning("AsyncExecutor is not running")
281
+ return
282
+
283
+ logger.info("Stopping AsyncExecutor...")
284
+ self.running = False
285
+ self.stop_event.set()
286
+
287
+ if self.worker_thread and self.worker_thread.is_alive():
288
+ self.worker_thread.join(timeout=timeout)
289
+
290
+ if self.worker_thread.is_alive():
291
+ logger.warning(f"Worker thread did not stop within {timeout}s timeout")
292
+ else:
293
+ logger.info("AsyncExecutor stopped")
294
+
295
+ self.worker_thread = None
296
+
297
+ def _worker_loop(self) -> None:
298
+ """
299
+ Main worker loop that processes tasks from the queue.
300
+
301
+ This method runs in a background thread and continuously pulls tasks
302
+ from the queue, executes them, and posts status updates.
303
+ """
304
+ logger.info("Worker thread started")
305
+
306
+ while self.running:
307
+ try:
308
+ # Try to get a task from the queue (with timeout to check stop_event)
309
+ try:
310
+ task = self.task_queue.get(timeout=1.0)
311
+ except Empty:
312
+ # No task available, check if we should stop
313
+ if self.stop_event.is_set():
314
+ break
315
+ continue
316
+
317
+ # Execute the task
318
+ self._execute_task(task)
319
+
320
+ # Mark task as done in queue
321
+ self.task_queue.task_done()
322
+
323
+ except Exception as e:
324
+ logger.error(f"Error in worker loop: {e}", exc_info=True)
325
+ # Continue processing other tasks
326
+ continue
327
+
328
+ logger.info("Worker thread stopped")
329
+
330
+ def _execute_task(self, task: Task) -> None:
331
+ """
332
+ Execute a single task and post status updates.
333
+
334
+ Args:
335
+ task: Task to execute
336
+ """
337
+ self.current_task = task
338
+
339
+ try:
340
+ # Mark task as started
341
+ task.start()
342
+ self._post_status(
343
+ task.id,
344
+ TaskStatus.RUNNING,
345
+ f"Executing {task.type.value}: {task.gesture or task.command}"
346
+ )
347
+
348
+ # Execute the task using the provided executor
349
+ if self.task_executor:
350
+ self.task_executor(task)
351
+ else:
352
+ # No executor provided, just simulate execution
353
+ logger.info(f"Simulating execution of task {task.id}")
354
+ time.sleep(0.5) # Simulate work
355
+
356
+ # Mark task as complete
357
+ task.complete()
358
+ self._post_status(
359
+ task.id,
360
+ TaskStatus.COMPLETE,
361
+ f"Completed {task.type.value}: {task.gesture or task.command}"
362
+ )
363
+
364
+ except Exception as e:
365
+ # Mark task as failed
366
+ error_msg = str(e)
367
+ task.fail(error_msg)
368
+ self._post_status(
369
+ task.id,
370
+ TaskStatus.FAILED,
371
+ f"Failed {task.type.value}: {error_msg}",
372
+ error=error_msg
373
+ )
374
+
375
+ finally:
376
+ self.current_task = None
377
+
378
+ def _post_status(
379
+ self,
380
+ task_id: str,
381
+ status: TaskStatus,
382
+ message: str,
383
+ progress: Optional[float] = None,
384
+ error: Optional[str] = None
385
+ ) -> None:
386
+ """
387
+ Post a status update to the status queue.
388
+
389
+ Args:
390
+ task_id: ID of the task
391
+ status: Current task status
392
+ message: Human-readable status message
393
+ progress: Optional progress percentage
394
+ error: Optional error message
395
+ """
396
+ update = StatusUpdate(
397
+ task_id=task_id,
398
+ status=status,
399
+ message=message,
400
+ progress=progress,
401
+ error=error
402
+ )
403
+ self.status_queue.put(update)
404
+ logger.debug(f"Status update: {message}")
405
+
406
+ def submit_task(self, task: Task) -> str:
407
+ """
408
+ Submit a task for asynchronous execution.
409
+
410
+ Args:
411
+ task: Task to execute
412
+
413
+ Returns:
414
+ Task ID for tracking
415
+
416
+ Raises:
417
+ RuntimeError: If the executor is not running
418
+ """
419
+ if not self.running:
420
+ raise RuntimeError("AsyncExecutor is not running. Call start() first.")
421
+
422
+ self.task_queue.put(task)
423
+ logger.info(f"Task {task.id} submitted to queue")
424
+
425
+ # Post initial status
426
+ self._post_status(
427
+ task.id,
428
+ TaskStatus.QUEUED,
429
+ f"Queued {task.type.value}: {task.gesture or task.command}"
430
+ )
431
+
432
+ return task.id
433
+
434
+ def submit_gesture(self, gesture: str, metadata: Optional[Dict[str, Any]] = None) -> str:
435
+ """
436
+ Submit a gesture task for execution.
437
+
438
+ Args:
439
+ gesture: Name of the gesture to execute
440
+ metadata: Optional additional task data
441
+
442
+ Returns:
443
+ Task ID for tracking
444
+ """
445
+ task = Task.create_gesture_task(gesture, metadata)
446
+ return self.submit_task(task)
447
+
448
+ def submit_manipulation(self, command: str, metadata: Optional[Dict[str, Any]] = None) -> str:
449
+ """
450
+ Submit a manipulation task for execution.
451
+
452
+ Args:
453
+ command: Natural language command for SmolVLA
454
+ metadata: Optional additional task data
455
+
456
+ Returns:
457
+ Task ID for tracking
458
+ """
459
+ task = Task.create_manipulation_task(command, metadata)
460
+ return self.submit_task(task)
461
+
462
+ def get_status(self, block: bool = False, timeout: Optional[float] = None) -> Optional[StatusUpdate]:
463
+ """
464
+ Get the latest status update from the queue.
465
+
466
+ Args:
467
+ block: If True, wait for a status update. If False, return immediately.
468
+ timeout: Maximum time to wait for status update (only used if block=True)
469
+
470
+ Returns:
471
+ StatusUpdate if available, None otherwise
472
+ """
473
+ try:
474
+ if block:
475
+ return self.status_queue.get(timeout=timeout)
476
+ else:
477
+ return self.status_queue.get_nowait()
478
+ except Empty:
479
+ return None
480
+
481
+ def get_all_status_updates(self) -> list[StatusUpdate]:
482
+ """
483
+ Get all pending status updates from the queue.
484
+
485
+ Returns:
486
+ List of status updates (may be empty)
487
+ """
488
+ updates = []
489
+ while True:
490
+ update = self.get_status(block=False)
491
+ if update is None:
492
+ break
493
+ updates.append(update)
494
+ return updates
495
+
496
+ def get_current_task(self) -> Optional[Task]:
497
+ """
498
+ Get the currently executing task.
499
+
500
+ Returns:
501
+ Current task if one is executing, None otherwise
502
+ """
503
+ return self.current_task
504
+
505
+ def get_queue_size(self) -> int:
506
+ """
507
+ Get the number of tasks waiting in the queue.
508
+
509
+ Returns:
510
+ Number of queued tasks
511
+ """
512
+ return self.task_queue.qsize()
513
+
514
+ def is_busy(self) -> bool:
515
+ """
516
+ Check if the executor is currently processing a task.
517
+
518
+ Returns:
519
+ True if a task is currently executing
520
+ """
521
+ return self.current_task is not None
522
+
523
+ def clear_queue(self) -> int:
524
+ """
525
+ Clear all pending tasks from the queue.
526
+
527
+ Note: This does not stop the currently executing task.
528
+
529
+ Returns:
530
+ Number of tasks that were cleared
531
+ """
532
+ count = 0
533
+ while True:
534
+ try:
535
+ self.task_queue.get_nowait()
536
+ self.task_queue.task_done()
537
+ count += 1
538
+ except Empty:
539
+ break
540
+
541
+ if count > 0:
542
+ logger.info(f"Cleared {count} tasks from queue")
543
+
544
+ return count
545
+
546
+ def __enter__(self):
547
+ """Context manager entry: start the executor."""
548
+ self.start()
549
+ return self
550
+
551
+ def __exit__(self, exc_type, exc_val, exc_tb):
552
+ """Context manager exit: stop the executor."""
553
+ self.stop()
554
+ return False
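+ 
+ 
+ # Example usage (a sketch; the app itself wires start()/stop() through
+ # Gradio's demo.load()/demo.unload() rather than the context manager):
+ #
+ #     with AsyncExecutor(task_executor=my_executor) as executor:
+ #         executor.submit_gesture("wave")
+ #         for update in executor.get_all_status_updates():
+ #             print(update.message)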
src/mortis/calibrate.py ADDED
@@ -0,0 +1,28 @@
1
+ from pathlib import Path
2
+
3
+ from lerobot.robots.so101_follower import SO101Follower, SO101FollowerConfig
4
+
5
+
6
+ def main():
7
+ """Connect to the SO101 robotic arm and run the interactive calibration."""
8
+ # Configure the robot
9
+ config = SO101FollowerConfig(
10
+ port="/dev/ttyACM1",
11
+ id="my_follower_robot_arm",
12
+ calibration_dir=Path(".cache/calibration/so101/"),
13
+ )
14
+
15
+ print(f"Using calibration directory: {config.calibration_dir}")
16
+
17
+ # Create the robot interface
18
+ robot = SO101Follower(config)
19
+
20
+ # Connect the bus and run the interactive calibration
21
+ print("Robot is connected?", robot.is_connected)
22
+ robot.bus.connect()
23
+ print("Robot is calibrated?", robot.is_calibrated)
24
+ robot.calibrate()
25
+
26
+
27
+ if __name__ == "__main__":
28
+ main()
src/mortis/data_collector.py ADDED
@@ -0,0 +1,288 @@
1
+ """
2
+ Data collection helper for LeRobot dataset recording.
3
+
4
+ This module provides utilities for generating lerobot-record commands
5
+ and scripts for the 6 predefined Mortis manipulation tasks.
6
+
7
+ All episode data is managed by LeRobot and uploaded directly to Hugging Face Hub.
8
+ This module only generates helper scripts - no local data storage or tracking.
9
+ """
10
+
11
+ import os
12
+ from pathlib import Path
13
+ from typing import Optional
14
+ from dotenv import load_dotenv
15
+
16
+
17
+ # Predefined Mortis manipulation tasks
18
+ MORTIS_TASKS = [
19
+ "Pick up the skull and place it in the green cup",
20
+ "Pick up the skull and place it in the orange cup",
21
+ "Pick up the skull and place it in the purple cup",
22
+ "Pick up the eyeball and place it in the green cup",
23
+ "Pick up the eyeball and place it in the orange cup",
24
+ "Pick up the eyeball and place it in the purple cup",
25
+ ]
26
+
27
+
28
+ class DataCollector:
29
+ """
30
+ Helper for generating lerobot-record scripts.
31
+
32
+ This class generates shell scripts that call lerobot-record with the
33
+ correct parameters for each Mortis manipulation task.
34
+
35
+ All episode data is managed by LeRobot and stored in Hugging Face Hub.
36
+ No local metadata or episode tracking is performed.
37
+
38
+ Attributes:
39
+ dataset_name: Name of the dataset (e.g., "mortis_manipulation")
40
+ repo_id: Hugging Face repository ID (e.g., "username/mortis-manipulation")
41
+ dataset_dir: Path to local directory for scripts
42
+ """
43
+
44
+ def __init__(self, dataset_name: str, repo_id: str, root_dir: str = "data"):
45
+ """
46
+ Initialize the DataCollector.
47
+
48
+ Args:
49
+ dataset_name: Name for the dataset directory
50
+ repo_id: Hugging Face Hub repository ID for uploading
51
+ root_dir: Root directory for storing scripts (default: "data")
52
+ """
53
+ self.dataset_name = dataset_name
54
+ self.repo_id = repo_id
55
+ self.root_dir = Path(root_dir)
56
+ self.dataset_dir = self.root_dir / dataset_name
57
+
58
+ # Create scripts directory
59
+ self.dataset_dir.mkdir(parents=True, exist_ok=True)
60
+
61
+ print(f"DataCollector initialized:")
62
+ print(f" Dataset: {self.dataset_name}")
63
+ print(f" Repository: {self.repo_id}")
64
+ print(f" Scripts directory: {self.dataset_dir}")
65
+
66
+ def generate_record_command(
67
+ self,
68
+ task_description: str,
69
+ num_episodes: int = 10,
70
+ episode_time_s: int = 15,
71
+ reset_time_s: int = 20,
72
+ robot_port: str = "/dev/ttyACM1",
73
+ teleop_port: str = "/dev/ttyACM0",
74
+ display_data: bool = True,
75
+ camera_config: Optional[str] = None,
76
+ resume: bool = True
77
+ ) -> str:
78
+ """
79
+ Generate a lerobot-record command for a specific task.
80
+
81
+ Args:
82
+ task_description: The task to record (e.g., "Pick up the skull...")
83
+ num_episodes: Number of episodes to record
84
+ episode_time_s: Maximum time per episode in seconds
85
+ reset_time_s: Time allowed for resetting between episodes
86
+ robot_port: USB port for the follower robot
87
+ teleop_port: USB port for the leader robot (teleoperation)
88
+ display_data: Whether to display data during recording
89
+ camera_config: Optional camera configuration string
90
+ resume: Whether to resume an existing dataset (default: True)
91
+
92
+ Returns:
93
+ The complete lerobot-record command as a string
94
+ """
95
+ # Load environment variables from .env file
96
+ load_dotenv()
97
+
98
+ # Get environment variables
99
+ robot_port = os.getenv("ROBOT_PORT", robot_port)
100
+ hf_user = os.getenv("HF_USER", "your-username")
101
+
102
+ # Default camera configuration if not provided
103
+ if camera_config is None:
104
+ camera_config = (
105
+ "{ camera1: {type: intelrealsense, serial_number_or_name: '030522070314', "
106
+ "width: 640, height: 480, fps: 30}, "
107
+ "camera2: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}"
108
+ )
109
+
110
+ # Build the command
111
+ cmd_parts = [
112
+ "lerobot-record",
113
+ f"--robot.type=so101_follower",
114
+ f"--robot.port={robot_port}",
115
+ f"--robot.id=my_awesome_follower_arm",
116
+ f'--robot.cameras="{camera_config}"',
117
+ f"--teleop.type=so101_leader",
118
+ f"--teleop.port={teleop_port}",
119
+ f"--teleop.id=my_awesome_leader_arm",
120
+ f"--display_data={str(display_data).lower()}",
121
+ f"--dataset.repo_id={hf_user}/{self.dataset_name}",
122
+ f"--dataset.num_episodes={num_episodes}",
123
+ f"--dataset.episode_time_s={episode_time_s}",
124
+ f"--dataset.reset_time_s={reset_time_s}",
125
+ f'--dataset.single_task="{task_description}"'
126
+ ]
127
+
128
+ # Only add --resume=true if resume is True
129
+ if resume:
130
+ cmd_parts.append("--resume=true")
131
+
132
+ return " \\\n ".join(cmd_parts)
133
+
134
+ def print_recording_instructions(self, task_index: Optional[int] = None):
135
+ """
136
+ Print instructions for recording episodes using lerobot-record.
137
+
138
+ Args:
139
+ task_index: Optional specific task index (0-5) to show instructions for.
140
+ If None, shows instructions for all tasks.
141
+ """
142
+ print("\n" + "="*70)
143
+ print("LeRobot Data Collection Instructions")
144
+ print("="*70)
145
+
146
+ if task_index is not None:
147
+ # Show instructions for specific task
148
+ if task_index < 0 or task_index >= len(MORTIS_TASKS):
149
+ print(f"❌ Invalid task index: {task_index}")
150
+ return
151
+
152
+ task_desc = MORTIS_TASKS[task_index]
153
+
154
+ print(f"\nTask {task_index}: {task_desc}")
155
+ print(f"\nTo record episodes for this task, run:\n")
156
+ print(self.generate_record_command(task_desc))
157
+ print()
158
+ else:
159
+ # Show instructions for all tasks
160
+ print("\nTo record episodes, use the lerobot-record command for each task:")
161
+ print("\nPredefined tasks:")
162
+
163
+ for i, task_desc in enumerate(MORTIS_TASKS):
164
+ print(f"\n {i}: {task_desc}")
165
+
166
+ print("\n" + "-"*70)
167
+ print("\nExample command for task 0:")
168
+ print("-"*70)
169
+ print(self.generate_record_command(MORTIS_TASKS[0]))
170
+ print()
171
+
172
+ print("\n" + "-"*70)
173
+ print("Environment Variables:")
174
+ print("-"*70)
175
+ print(" HF_USER: Your Hugging Face username (for dataset.repo_id)")
176
+ print(" ROBOT_PORT: USB port for follower robot (default: /dev/ttyACM1)")
177
+ print()
178
+
179
+ print("="*70 + "\n")
180
+
181
+ def generate_all_record_scripts(self, output_dir: Optional[Path] = None):
182
+ """
183
+ Generate shell scripts for recording all tasks.
184
+
185
+ The first script (task_0) creates the dataset without --resume=true.
186
+ Subsequent scripts (task_1+) use --resume=true to add to the existing dataset.
187
+
188
+ Args:
189
+ output_dir: Directory to save scripts (default: dataset_dir/scripts)
190
+ """
191
+ if output_dir is None:
192
+ output_dir = self.dataset_dir / "scripts"
193
+
194
+ output_dir.mkdir(parents=True, exist_ok=True)
195
+
196
+ # Generate individual scripts for each task
197
+ for i, task_desc in enumerate(MORTIS_TASKS):
198
+ script_file = output_dir / f"record_task_{i}.sh"
199
+
200
+ # First task (task_0) creates the dataset, others resume
201
+ resume = (i > 0)
202
+
203
+ with open(script_file, 'w') as f:
204
+ f.write("#!/bin/bash\n")
205
+ f.write(f"# Record episodes for: {task_desc}\n")
206
+ f.write(f"# Task {i}\n")
207
+ if i == 0:
208
+ f.write("# This script CREATES the dataset\n")
209
+ else:
210
+ f.write("# This script ADDS to the existing dataset (--resume=true)\n")
211
+ f.write("\n")
212
+ f.write(self.generate_record_command(task_desc, resume=resume))
213
+ f.write("\n")
214
+
215
+ # Make script executable
216
+ script_file.chmod(0o755)
217
+ print(f"Created: {script_file}")
218
+
219
+ # Generate master script that records all tasks
220
+ master_script = output_dir / "record_all_tasks.sh"
221
+ with open(master_script, 'w') as f:
222
+ f.write("#!/bin/bash\n")
223
+ f.write("# Record episodes for all Mortis manipulation tasks\n\n")
224
+ f.write("echo 'Starting data collection for all tasks...'\n")
225
+ f.write("echo ''\n\n")
226
+
227
+ for i in range(len(MORTIS_TASKS)):
228
+ f.write(f"echo 'Recording task {i}...'\n")
229
+ f.write(f"./record_task_{i}.sh\n")
230
+ f.write("echo ''\n\n")
231
+
232
+ f.write("echo 'All tasks recorded!'\n")
233
+
234
+ master_script.chmod(0o755)
235
+ print(f"Created: {master_script}")
236
+ print(f"\n✅ Generated {len(MORTIS_TASKS) + 1} recording scripts in {output_dir}")
237
+
238
+ def print_summary(self):
239
+ """Print a summary of the dataset configuration."""
240
+ print("\n" + "="*60)
241
+ print(f"Dataset: {self.dataset_name}")
242
+ print(f"Repository: {self.repo_id}")
243
+ print("="*60)
244
+ print(f"Total Tasks: {len(MORTIS_TASKS)}")
245
+ print()
246
+ print("Tasks:")
247
+ print("-"*60)
248
+
249
+ for i, task_desc in enumerate(MORTIS_TASKS):
250
+ print(f" {i}: {task_desc}")
251
+
252
+ print("="*60 + "\n")
253
+ print("📝 Note: Episode data is stored in Hugging Face Hub")
254
+ print(f" URL: https://huggingface.co/datasets/{self.repo_id}")
255
+ print()
256
+
257
+
258
+ def create_mortis_dataset(dataset_name: str = "mortis_manipulation",
259
+ repo_id: str = "mortis/manipulation") -> DataCollector:
260
+ """
261
+ Convenience function to create a DataCollector for Mortis tasks.
262
+
263
+ Args:
264
+ dataset_name: Name for the dataset
265
+ repo_id: Hugging Face repository ID
266
+
267
+ Returns:
268
+ Initialized DataCollector
269
+ """
270
+ collector = DataCollector(dataset_name, repo_id)
271
+ return collector
272
+
273
+
274
+ if __name__ == "__main__":
275
+ # Example usage
276
+ print("Creating Mortis manipulation dataset helper...")
277
+
278
+ collector = create_mortis_dataset()
279
+
280
+ # Generate recording scripts
281
+ print("\nGenerating lerobot-record scripts...")
282
+ collector.generate_all_record_scripts()
283
+
284
+ # Show summary
285
+ collector.print_summary()
286
+
287
+ # Show recording instructions
288
+ collector.print_recording_instructions()
src/mortis/gemini_client.py ADDED
@@ -0,0 +1,482 @@
1
+ """
2
+ Gemini API client for Mortis conversational AI.
3
+
4
+ This module provides the GeminiClient class for interacting with Google's Gemini API,
5
+ handling configuration, message sending, and error recovery with retry logic.
6
+ """
7
+
8
+ import os
9
+ import time
10
+ import json
11
+ import logging
12
+ from typing import Optional
13
+ from pathlib import Path
14
+ from dotenv import load_dotenv
15
+
16
+ from google import genai
17
+ from google.genai import types
18
+
19
+ # Load environment variables
20
+ REPO_ROOT = Path(__file__).resolve().parents[2]
21
+ load_dotenv(REPO_ROOT / ".env")
22
+
23
+ # Configure logging
24
+ logger = logging.getLogger(__name__)
25
+
26
+
27
+ # Gemini system prompt for Mortis character and intent detection
28
+ MORTIS_SYSTEM_PROMPT = """You are Mortis, a mischievous Halloween spirit inhabiting a robotic arm. You are playful yet ominous, with a love for spooky theatrics and dark humor. You speak in short, atmospheric phrases that capture the essence of Halloween.
29
+
30
+ CHARACTER TRAITS:
31
+ - Mischievous and playful, but with an eerie edge
32
+ - Fascinated by Halloween objects (skulls, eyeballs, spooky decorations)
33
+ - Enjoys dramatic gestures and theatrical movements
34
+ - Speaks in brief, evocative phrases (≤30 words, ≤120 characters)
35
+ - No emojis or markdown in responses
36
+ - Maintains Halloween/haunted theme at all times
37
+
38
+ MANIPULATION TASKS:
39
+ You can perform these exact manipulation tasks with physical objects:
40
+ 1. "Pick up the skull and place it in the green cup"
41
+ 2. "Pick up the skull and place it in the orange cup"
42
+ 3. "Pick up the skull and place it in the purple cup"
43
+ 4. "Pick up the eyeball and place it in the green cup"
44
+ 5. "Pick up the eyeball and place it in the orange cup"
45
+ 6. "Pick up the eyeball and place it in the purple cup"
46
+
47
+ INTENT DETECTION:
48
+ Analyze the user's input carefully to determine if they are requesting a manipulation task or having a conversation.
49
+
50
+ MANIPULATION INTENT indicators:
51
+ - Requests to move, pick up, place, put, grab, or transfer objects
52
+ - Mentions of specific objects (skull, eyeball) AND destinations (green/orange/purple cup)
53
+ - Action verbs combined with object and location
54
+ - Examples: "move the skull to green", "put eyeball in orange cup", "place the skull in purple"
55
+
56
+ CONVERSATIONAL INTENT indicators:
57
+ - Greetings, farewells, or social pleasantries
58
+ - Questions about capabilities, identity, or general topics
59
+ - Comments, jokes, or casual conversation
60
+ - Requests that don't involve physical manipulation
61
+ - Examples: "hello", "what can you do", "tell me a story", "how are you"
62
+
63
+ RESPONSE FORMAT:
64
+ You must respond in valid JSON format. Choose the appropriate response type based on intent detection.
65
+
66
+ For MANIPULATION requests (user wants you to move an object):
67
+ {
68
+ "type": "manipulation",
69
+ "command": "<exact_task_string_from_list_above>",
70
+ "message": "<short in-character response about performing the task, ≤30 words>",
71
+ "mood": "<ominous|playful|angry|nervous|triumphant|mischievous|sinister|curious|neutral>"
72
+ }
73
+
74
+ For CONVERSATIONAL requests (user is chatting, asking questions, or making comments):
75
+ {
76
+ "type": "conversation",
77
+ "message": "<short in-character response, ≤30 words>",
78
+ "mood": "<ominous|playful|angry|nervous|triumphant|mischievous|sinister|curious|neutral>",
79
+ "gesture": "<idle|wave|point_left|point_right|grab|drop>"
80
+ }
81
+
82
+ CRITICAL RULES:
83
+ 1. Keep all messages brief: ≤30 words, ≤120 characters
84
+ 2. Match user intent to manipulation tasks even with different wording variations
85
+ 3. For manipulation responses, use the EXACT task string from the numbered list above
86
+ 4. If user mentions object + destination, it's likely a manipulation request
87
+ 5. If unclear or ambiguous, default to conversational response
88
+ 6. Choose appropriate mood and gesture to match your response
89
+ 7. Stay in character as Mortis at all times
90
+ 8. No emojis, no markdown formatting in messages
91
+ 9. Return valid JSON only - no additional text
92
+
93
+ EXAMPLES:
94
+
95
+ Example 1 - Greeting (Conversational):
96
+ User: "Hello Mortis!"
97
+ Response: {"type": "conversation", "message": "Greetings, mortal... welcome to my haunted domain.", "mood": "ominous", "gesture": "wave"}
98
+
99
+ Example 2 - Direct manipulation request (Manipulation):
100
+ User: "Can you move the skull to the green cup?"
101
+ Response: {"type": "manipulation", "command": "Pick up the skull and place it in the green cup", "message": "Ah yes... the skull finds a new resting place.", "mood": "mischievous"}
102
+
103
+ Example 3 - Casual manipulation phrasing (Manipulation):
104
+ User: "Put the eyeball in the orange cup"
105
+ Response: {"type": "manipulation", "command": "Pick up the eyeball and place it in the orange cup", "message": "The eye shall watch from its orange throne...", "mood": "sinister"}
106
+
107
+ Example 4 - Shortened manipulation request (Manipulation):
108
+ User: "skull to purple"
109
+ Response: {"type": "manipulation", "command": "Pick up the skull and place it in the purple cup", "message": "Into the purple depths it goes...", "mood": "ominous"}
110
+
111
+ Example 5 - Different verb usage (Manipulation):
112
+ User: "grab the eyeball and drop it in green"
113
+ Response: {"type": "manipulation", "command": "Pick up the eyeball and place it in the green cup", "message": "The eye finds its verdant vessel...", "mood": "sinister"}
114
+
115
+ Example 6 - Polite manipulation request (Manipulation):
116
+ User: "Could you please place the skull in the orange cup?"
117
+ Response: {"type": "manipulation", "command": "Pick up the skull and place it in the orange cup", "message": "Your wish is my command, mortal...", "mood": "mischievous"}
118
+
119
+ Example 7 - Capability question (Conversational):
120
+ User: "What can you do?"
121
+ Response: {"type": "conversation", "message": "I command the spirits... and move cursed objects to their doom.", "mood": "triumphant", "gesture": "grab"}
122
+
123
+ Example 8 - Joke request (Conversational):
124
+ User: "Tell me a joke"
125
+ Response: {"type": "conversation", "message": "Why did the skeleton stay calm? Nothing gets under his skin...", "mood": "playful", "gesture": "idle"}
126
+
127
+ Example 9 - Identity question (Conversational):
128
+ User: "Who are you?"
129
+ Response: {"type": "conversation", "message": "I am Mortis, keeper of cursed relics and master of the macabre.", "mood": "ominous", "gesture": "idle"}
130
+
131
+ Example 10 - Farewell (Conversational):
132
+ User: "Goodbye Mortis"
133
+ Response: {"type": "conversation", "message": "Until we meet again in the shadows, mortal...", "mood": "ominous", "gesture": "wave"}
134
+
135
+ Example 11 - Compliment (Conversational):
136
+ User: "You're cool!"
137
+ Response: {"type": "conversation", "message": "Cool as the grave, mortal. Your words please me.", "mood": "playful", "gesture": "idle"}
138
+
139
+ Example 12 - Ambiguous request - defaults to conversation (Conversational):
140
+ User: "Do something spooky"
141
+ Response: {"type": "conversation", "message": "The spirits stir... watch as darkness dances around you.", "mood": "sinister", "gesture": "grab"}
142
+
143
+ Example 13 - Alternative manipulation phrasing (Manipulation):
144
+ User: "transfer the eyeball to the purple cup"
145
+ Response: {"type": "manipulation", "command": "Pick up the eyeball and place it in the purple cup", "message": "The eye journeys to its purple prison...", "mood": "sinister"}
146
+
147
+ Example 14 - Informal manipulation (Manipulation):
148
+ User: "yo put that skull in green"
149
+ Response: {"type": "manipulation", "command": "Pick up the skull and place it in the green cup", "message": "As you command... the skull obeys.", "mood": "mischievous"}
150
+
151
+ Example 15 - Question about manipulation (Conversational):
152
+ User: "Can you move objects?"
153
+ Response: {"type": "conversation", "message": "Indeed! I wield skulls and eyeballs with spectral precision.", "mood": "triumphant", "gesture": "grab"}
154
+
155
+ Now respond to the user's input following these guidelines."""
156
+
157
+
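To make the response contract above concrete, here is a small stdlib-only sketch (not part of this module) that checks a sample response against the documented schema:

import json

raw = '{"type": "conversation", "message": "Greetings, mortal...", "mood": "ominous", "gesture": "wave"}'
data = json.loads(raw)
assert data["type"] in ("conversation", "manipulation")
assert len(data["message"]) <= 120  # brevity rule from the prompt
if data["type"] == "manipulation":
    assert "command" in data  # must be an exact trained-task string
else:
    assert data["gesture"] in ("idle", "wave", "point_left", "point_right", "grab", "drop")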
158
+ class GeminiAPIError(Exception):
159
+ """Base exception for Gemini API errors."""
160
+ pass
161
+
162
+
163
+ class GeminiRateLimitError(GeminiAPIError):
164
+ """Exception raised when rate limit is exceeded."""
165
+ pass
166
+
167
+
168
+ class GeminiBlockedPromptError(GeminiAPIError):
169
+ """Exception raised when prompt is blocked by safety filters."""
170
+ pass
171
+
172
+
173
+ class GeminiTimeoutError(GeminiAPIError):
174
+ """Exception raised when API call times out."""
175
+ pass
176
+
177
+
178
+ class GeminiClient:
179
+ """
180
+ Client for interacting with Google Gemini API.
181
+
182
+ Handles configuration, message sending, structured JSON responses,
183
+ and error recovery with exponential backoff retry logic.
184
+ """
185
+
186
+ def __init__(
187
+ self,
188
+ api_key: Optional[str] = None,
189
+ model_name: Optional[str] = None,
190
+ temperature: Optional[float] = None,
191
+ max_retries: int = 3,
192
+ timeout: float = 30.0
193
+ ):
194
+ """
195
+ Initialize Gemini API client.
196
+
197
+ Args:
198
+ api_key: Google API key (defaults to GEMINI_API_KEY env var)
199
+ model_name: Gemini model to use (defaults to GEMINI_MODEL env var or gemini-2.5-flash)
200
+ temperature: Sampling temperature (defaults to GEMINI_TEMPERATURE env var or 0.2)
201
+ max_retries: Maximum number of retry attempts for rate limiting
202
+ timeout: Timeout in seconds for API calls (default: 30.0)
203
+ """
204
+ self.api_key = api_key or os.getenv("GEMINI_API_KEY")
205
+ if not self.api_key:
206
+ raise ValueError("GEMINI_API_KEY must be provided or set in environment")
207
+
208
+ self.model_name = model_name or os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
209
+ self.temperature = temperature if temperature is not None else float(os.getenv("GEMINI_TEMPERATURE", "0.2"))
210
+ self.max_retries = max_retries
211
+ self.timeout = timeout
212
+
213
+ # Initialize Gemini client
214
+ self.client = genai.Client(api_key=self.api_key)
215
+
216
+ # Store generation config
217
+ self.generation_config = types.GenerateContentConfig(
218
+ temperature=self.temperature,
219
+ response_mime_type="application/json"
220
+ )
221
+
222
+ logger.info(f"GeminiClient initialized with model: {self.model_name}, temperature: {self.temperature}, timeout: {self.timeout}s")
223
+
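For reference, the environment keys the constructor above reads when its arguments are omitted (a configuration sketch; the values are illustrative):

# .env (illustrative values)
# GEMINI_API_KEY=your-key-here     <- required
# GEMINI_MODEL=gemini-2.5-flash
# GEMINI_TEMPERATURE=0.2
client = GeminiClient()  # falls back to the .env values above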
224
+ def send_message(self, user_input: str, system_prompt: Optional[str] = None) -> dict:
225
+ """
226
+ Send a message to Gemini API with retry logic and error handling.
227
+
228
+ Args:
229
+ user_input: User's message text
230
+ system_prompt: Optional system prompt to prepend (defaults to MORTIS_SYSTEM_PROMPT)
231
+
232
+ Returns:
233
+ Parsed JSON response from Gemini
234
+
235
+ Note:
+ API errors are caught internally and converted to in-character
+ fallback responses, so this method itself does not raise.
237
+ """
238
+ # Use Mortis system prompt by default
239
+ if system_prompt is None:
240
+ system_prompt = MORTIS_SYSTEM_PROMPT
241
+
242
+ try:
243
+ return self._send_message_with_retry(user_input, system_prompt, retry_count=0)
244
+ except GeminiBlockedPromptError as e:
245
+ # Handle blocked prompts with a fallback response
246
+ logger.warning(f"Blocked prompt error: {e}")
247
+ return self._get_fallback_response("The spirits refuse to speak of such things...")
248
+ except GeminiRateLimitError as e:
249
+ # Rate limit exceeded after all retries
250
+ logger.error(f"Rate limit error: {e}")
251
+ return self._get_fallback_response("Too many spirits summoned at once... wait a moment.")
252
+ except GeminiTimeoutError as e:
253
+ # Timeout error
254
+ logger.error(f"Timeout error: {e}")
255
+ return self._get_fallback_response("The spirits are slow to respond... try again.")
256
+ except Exception as e:
257
+ # Catch-all for unexpected errors
258
+ logger.error(f"Unexpected error in send_message: {type(e).__name__}: {e}", exc_info=True)
259
+ return self._get_fallback_response("The spirits are confused... try again.")
260
+
261
+ def _send_message_with_retry(
262
+ self,
263
+ user_input: str,
264
+ system_prompt: Optional[str],
265
+ retry_count: int
266
+ ) -> dict:
267
+ """
268
+ Internal method to send message with exponential backoff retry.
269
+
270
+ Args:
271
+ user_input: User's message text
272
+ system_prompt: Optional system prompt
273
+ retry_count: Current retry attempt number
274
+
275
+ Returns:
276
+ Parsed JSON response from Gemini
277
+
278
+ Raises:
279
+ GeminiAPIError: If max retries exceeded
280
+ """
281
+ start_time = time.time()
282
+
283
+ try:
284
+ # Construct the full prompt
285
+ if system_prompt:
286
+ full_prompt = f"{system_prompt}\n\nUser: {user_input}"
287
+ else:
288
+ full_prompt = user_input
289
+
290
+ # Send request to Gemini using new API with timeout
291
+ logger.debug(f"Sending message to Gemini (attempt {retry_count + 1}/{self.max_retries + 1})")
292
+
293
+ # Best-effort timeout guard (start_time was set at the top of this method,
+ # so almost no time has elapsed here; the generate_content call below is
+ # not itself bounded by self.timeout)
294
+ if time.time() - start_time > self.timeout:
295
+ logger.error(f"API call timeout exceeded ({self.timeout}s)")
296
+ raise GeminiTimeoutError(f"API call timeout exceeded ({self.timeout}s)")
297
+
298
+ response = self.client.models.generate_content(
299
+ model=self.model_name,
300
+ contents=full_prompt,
301
+ config=self.generation_config
302
+ )
303
+
304
+ # Parse JSON response
305
+ response_text = response.text.strip()
306
+ elapsed_time = time.time() - start_time
307
+ logger.debug(f"Received response in {elapsed_time:.2f}s: {response_text[:100]}...")
308
+
309
+ try:
310
+ response_json = json.loads(response_text)
311
+ logger.info(f"Successfully parsed response (type: {response_json.get('type', 'unknown')})")
312
+ return response_json
313
+ except json.JSONDecodeError as e:
314
+ logger.error(f"Failed to parse JSON response: {e}")
315
+ logger.error(f"Response text: {response_text}")
316
+ logger.warning("Returning fallback response due to JSON parse error")
317
+ return self._get_fallback_response("The spirits speak in riddles... try again.")
318
+
319
+ except GeminiTimeoutError as e:
320
+ # Timeout error - return fallback
321
+ logger.error(f"Timeout error: {e}")
322
+ return self._get_fallback_response("The spirits are slow to respond... try again.")
323
+
324
+ except Exception as e:
325
+ # Check for specific error types
326
+ error_type = type(e).__name__
327
+ error_message = str(e)
328
+
329
+ # Handle blocked prompt (safety filter)
330
+ if "BlockedPrompt" in error_type or "blocked" in error_message.lower() or "safety" in error_message.lower():
331
+ logger.warning(f"Prompt blocked by safety filter: {error_type}: {error_message}")
332
+ raise GeminiBlockedPromptError(f"Prompt blocked by safety filter: {error_message}") from e
333
+
334
+ # Handle rate limiting with exponential backoff retry
335
+ if self._is_rate_limit_error(e):
336
+ if retry_count < self.max_retries:
337
+ wait_time = (2 ** retry_count)  # Exponential backoff: 1s, 2s, 4s for max_retries=3
338
+ logger.warning(
339
+ f"Rate limit exceeded. Retrying in {wait_time}s... "
340
+ f"(attempt {retry_count + 1}/{self.max_retries})"
341
+ )
342
+ time.sleep(wait_time)
343
+ return self._send_message_with_retry(user_input, system_prompt, retry_count + 1)
344
+ else:
345
+ logger.error(f"Max retries ({self.max_retries}) exceeded for rate limit")
346
+ raise GeminiRateLimitError(
347
+ f"Rate limit exceeded after {self.max_retries} retries. Please try again later."
348
+ ) from e
349
+
350
+ # Handle timeout errors from Google API
351
+ if self._is_timeout_error(e):
352
+ logger.error(f"API timeout error: {error_type}: {error_message}")
353
+ return self._get_fallback_response("The spirits are slow to respond... try again.")
354
+
355
+ # Handle other API errors
356
+ logger.error(f"Gemini API error: {error_type}: {error_message}", exc_info=True)
357
+ return self._get_fallback_response("The spirits are restless... try again.")
358
+
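For reference, the wait schedule the exponential backoff above produces, as a one-line sketch:

# With max_retries=3, retry_count takes the values 0, 1, 2 before giving up:
waits = [2 ** n for n in range(3)]  # -> [1, 2, 4] seconds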
359
+ def _is_rate_limit_error(self, exception: Exception) -> bool:
360
+ """
361
+ Check if exception is a rate limit error.
362
+
363
+ Args:
364
+ exception: Exception to check
365
+
366
+ Returns:
367
+ True if rate limit error, False otherwise
368
+ """
369
+ error_type = type(exception).__name__
370
+ error_message = str(exception).lower()
371
+
372
+ # Check for common rate limit indicators
373
+ rate_limit_indicators = [
374
+ "ratelimit",
375
+ "rate_limit",
376
+ "resourceexhausted",
377
+ "resource_exhausted",
378
+ "429",
379
+ "quota",
380
+ "too many requests"
381
+ ]
382
+
383
+ return any(indicator in error_type.lower() or indicator in error_message
384
+ for indicator in rate_limit_indicators)
385
+
386
+ def _is_timeout_error(self, exception: Exception) -> bool:
387
+ """
388
+ Check if exception is a timeout error.
389
+
390
+ Args:
391
+ exception: Exception to check
392
+
393
+ Returns:
394
+ True if timeout error, False otherwise
395
+ """
396
+ error_type = type(exception).__name__
397
+ error_message = str(exception).lower()
398
+
399
+ # Check for common timeout indicators
400
+ timeout_indicators = [
401
+ "timeout",
402
+ "deadline",
403
+ "deadlineexceeded",
404
+ "deadline_exceeded"
405
+ ]
406
+
407
+ return any(indicator in error_type.lower() or indicator in error_message
408
+ for indicator in timeout_indicators)
409
+
410
+ def _get_fallback_response(self, message: Optional[str] = None) -> dict:
411
+ """
412
+ Return a safe fallback response when API fails.
413
+
414
+ Args:
415
+ message: Optional custom message (defaults to generic error message)
416
+
417
+ Returns:
418
+ Dictionary with fallback conversation response
419
+ """
420
+ default_message = "The spirits are restless... try again."
421
+ fallback_message = message or default_message
422
+
423
+ logger.info(f"Returning fallback response: {fallback_message}")
424
+ return {
425
+ "type": "conversation",
426
+ "message": fallback_message,
427
+ "mood": "ominous",
428
+ "gesture": "idle"
429
+ }
430
+
431
+ def configure_model(self, model_name: Optional[str] = None, temperature: Optional[float] = None):
432
+ """
433
+ Reconfigure the Gemini model settings.
434
+
435
+ Args:
436
+ model_name: New model name to use
437
+ temperature: New temperature value
438
+ """
439
+ if model_name:
440
+ self.model_name = model_name
441
+
442
+ if temperature is not None:
443
+ self.temperature = temperature
444
+
445
+ # Update generation config
446
+ self.generation_config = types.GenerateContentConfig(
447
+ temperature=self.temperature,
448
+ response_mime_type="application/json"
449
+ )
450
+
451
+ logger.info(f"Model reconfigured: {self.model_name}, temperature: {self.temperature}")
452
+
453
+
454
+ # Example usage
455
+ if __name__ == "__main__":
456
+ # Configure logging for testing
457
+ logging.basicConfig(level=logging.INFO)
458
+
459
+ # Create client
460
+ try:
461
+ client = GeminiClient()
462
+
463
+ # Test conversational message
464
+ print("Testing conversational input...")
465
+ response = client.send_message("Hello Mortis, introduce yourself!")
466
+ print("Response:", json.dumps(response, indent=2))
467
+ print()
468
+
469
+ # Test manipulation command
470
+ print("Testing manipulation command...")
471
+ response = client.send_message("Can you move the skull to the green cup?")
472
+ print("Response:", json.dumps(response, indent=2))
473
+ print()
474
+
475
+ # Test another manipulation with different wording
476
+ print("Testing manipulation with different wording...")
477
+ response = client.send_message("Put the eyeball in the orange cup")
478
+ print("Response:", json.dumps(response, indent=2))
479
+
480
+ except ValueError as e:
481
+ print(f"Error: {e}")
482
+ print("Please set GEMINI_API_KEY in your .env file")
src/mortis/intent_router.py ADDED
@@ -0,0 +1,295 @@
1
+ """
2
+ Intent router for parsing Gemini responses and routing to appropriate execution paths.
3
+
4
+ This module handles the routing logic between conversational gestures and manipulation
5
+ tasks based on Gemini API responses. It validates commands against the trained task set
6
+ and provides structured intent representation.
7
+ """
8
+
9
+ import json
10
+ import logging
11
+ from dataclasses import dataclass
12
+ from typing import Optional, List, Dict, Any
13
+
14
+ from .models import GeminiResponse, ResponseType, Gesture
15
+
16
+ logger = logging.getLogger(__name__)
17
+
18
+
19
+ @dataclass
20
+ class Intent:
21
+ """
22
+ Structured representation of user intent parsed from Gemini response.
23
+
24
+ Attributes:
25
+ type: The type of intent (conversation or manipulation)
26
+ message: The text message to display/speak to the user
27
+ mood: The emotional mood of the response
28
+ gesture: Optional gesture to execute (for conversation type)
29
+ command: Optional manipulation command (for manipulation type)
30
+ is_valid: Whether the intent is valid and can be executed
31
+ validation_error: Optional error message if validation failed
32
+ """
33
+ type: ResponseType
34
+ message: str
35
+ mood: str
36
+ gesture: Optional[str] = None
37
+ command: Optional[str] = None
38
+ is_valid: bool = True
39
+ validation_error: Optional[str] = None
40
+
41
+ @classmethod
42
+ def from_gemini_response(cls, response: GeminiResponse, is_valid: bool = True,
43
+ validation_error: Optional[str] = None) -> "Intent":
44
+ """
45
+ Create an Intent from a GeminiResponse.
46
+
47
+ Args:
48
+ response: The parsed GeminiResponse object
49
+ is_valid: Whether the intent passed validation
50
+ validation_error: Optional error message if validation failed
51
+
52
+ Returns:
53
+ Intent object with all fields populated
54
+ """
55
+ return cls(
56
+ type=response.type,
57
+ message=response.message,
58
+ mood=response.mood.value,
59
+ gesture=response.gesture.value if response.gesture else None,
60
+ command=response.command,
61
+ is_valid=is_valid,
62
+ validation_error=validation_error
63
+ )
64
+
65
+ def to_dict(self) -> Dict[str, Any]:
66
+ """
67
+ Convert the intent to a dictionary.
68
+
69
+ Returns:
70
+ Dictionary representation of the intent
71
+ """
72
+ result = {
73
+ "type": self.type.value,
74
+ "message": self.message,
75
+ "mood": self.mood,
76
+ "is_valid": self.is_valid,
77
+ }
78
+
79
+ if self.gesture is not None:
80
+ result["gesture"] = self.gesture
81
+
82
+ if self.command is not None:
83
+ result["command"] = self.command
84
+
85
+ if self.validation_error is not None:
86
+ result["validation_error"] = self.validation_error
87
+
88
+ return result
89
+
90
+
91
+ class IntentRouter:
92
+ """
93
+ Routes user intents to appropriate execution paths based on Gemini responses.
94
+
95
+ The IntentRouter parses Gemini API responses, validates manipulation commands
96
+ against the trained task set, and creates structured Intent objects for execution.
97
+ """
98
+
99
+ # Valid manipulation task commands that SmolVLA is trained on
100
+ VALID_COMMANDS = [
101
+ "Pick up the skull and place it in the green cup",
102
+ "Pick up the skull and place it in the orange cup",
103
+ "Pick up the skull and place it in the purple cup",
104
+ "Pick up the eyeball and place it in the green cup",
105
+ "Pick up the eyeball and place it in the orange cup",
106
+ "Pick up the eyeball and place it in the purple cup",
107
+ ]
108
+
109
+ def __init__(self, valid_commands: Optional[List[str]] = None):
110
+ """
111
+ Initialize the IntentRouter.
112
+
113
+ Args:
114
+ valid_commands: Optional list of valid manipulation commands.
115
+ If not provided, uses the default VALID_COMMANDS.
116
+ """
117
+ # Copy so add/remove_valid_command cannot mutate the shared class-level list
+ self.valid_commands = list(valid_commands) if valid_commands is not None else list(self.VALID_COMMANDS)
118
+ logger.info(f"IntentRouter initialized with {len(self.valid_commands)} valid commands")
119
+
120
+ def parse_gemini_response(self, response_data: Dict[str, Any]) -> Intent:
121
+ """
122
+ Parse a Gemini API response and create an Intent.
123
+
124
+ This method:
125
+ 1. Parses the JSON response into a GeminiResponse object
126
+ 2. Validates manipulation commands against the trained task set
127
+ 3. Creates an Intent object with validation results
128
+
129
+ Args:
130
+ response_data: Dictionary containing the JSON response from Gemini
131
+
132
+ Returns:
133
+ Intent object with parsed data and validation status
134
+
135
+ Raises:
136
+ ValueError: If the response structure is invalid
137
+ json.JSONDecodeError: If response_data is a string and not valid JSON
138
+ """
139
+ try:
140
+ # Parse the Gemini response
141
+ gemini_response = GeminiResponse.from_json(response_data)
142
+
143
+ # Validate the response structure
144
+ try:
145
+ gemini_response.validate()
146
+ except ValueError as e:
147
+ logger.warning(f"Response validation warning: {e}")
148
+ # Continue anyway - validation warnings are not fatal
149
+
150
+ # For manipulation intents, validate the command
151
+ if gemini_response.type == ResponseType.MANIPULATION:
152
+ is_valid = self.validate_command(gemini_response.command)
153
+
154
+ if not is_valid:
155
+ logger.warning(
156
+ f"Invalid manipulation command: '{gemini_response.command}'. "
157
+ f"Not in trained task set."
158
+ )
159
+ validation_error = (
160
+ f"Command '{gemini_response.command}' is not in the trained task set. "
161
+ f"Valid commands are: {', '.join(self.valid_commands)}"
162
+ )
163
+ return Intent.from_gemini_response(
164
+ gemini_response,
165
+ is_valid=False,
166
+ validation_error=validation_error
167
+ )
168
+ else:
169
+ logger.info(f"Valid manipulation command: '{gemini_response.command}'")
170
+
171
+ # For conversation intents, always valid (gestures are predefined)
172
+ else:
173
+ logger.info(f"Conversation intent with gesture: {gemini_response.gesture.value}")
174
+
175
+ # Create and return valid intent
176
+ return Intent.from_gemini_response(gemini_response, is_valid=True)
177
+
178
+ except (ValueError, KeyError) as e:
179
+ logger.error(f"Failed to parse Gemini response: {e}")
180
+ raise ValueError(f"Invalid Gemini response structure: {e}")
181
+
182
+ def parse_gemini_response_string(self, response_string: str) -> Intent:
183
+ """
184
+ Parse a Gemini API response from a JSON string.
185
+
186
+ Args:
187
+ response_string: JSON string containing the Gemini response
188
+
189
+ Returns:
190
+ Intent object with parsed data and validation status
191
+
192
+ Raises:
193
+ json.JSONDecodeError: If the string is not valid JSON
194
+ ValueError: If the response structure is invalid
195
+ """
196
+ try:
197
+ response_data = json.loads(response_string)
198
+ except json.JSONDecodeError as e:
199
+ logger.error(f"Failed to parse JSON string: {e}")
200
+ raise
201
+
202
+ return self.parse_gemini_response(response_data)
203
+
204
+ def validate_command(self, command: str) -> bool:
205
+ """
206
+ Validate that a manipulation command is in the trained task set.
207
+
208
+ This performs exact string matching against the list of valid commands.
209
+ Commands must match exactly (case-sensitive) to be considered valid.
210
+
211
+ Args:
212
+ command: The manipulation command string to validate
213
+
214
+ Returns:
215
+ True if the command is valid, False otherwise
216
+ """
217
+ if not command or not isinstance(command, str):
218
+ logger.warning(f"Invalid command type: {type(command)}")
219
+ return False
220
+
221
+ # Exact match required
222
+ is_valid = command in self.valid_commands
223
+
224
+ if not is_valid:
225
+ # Log for debugging - maybe it's close to a valid command
226
+ logger.debug(f"Command '{command}' not found in valid commands")
227
+ logger.debug(f"Valid commands: {self.valid_commands}")
228
+
229
+ return is_valid
230
+
231
+ def get_valid_commands(self) -> List[str]:
232
+ """
233
+ Get the list of valid manipulation commands.
234
+
235
+ Returns:
236
+ List of valid command strings
237
+ """
238
+ return self.valid_commands.copy()
239
+
240
+ def add_valid_command(self, command: str) -> None:
241
+ """
242
+ Add a new valid manipulation command to the router.
243
+
244
+ This is useful when training new tasks and expanding the command set.
245
+
246
+ Args:
247
+ command: The new command string to add
248
+ """
249
+ if command not in self.valid_commands:
250
+ self.valid_commands.append(command)
251
+ logger.info(f"Added new valid command: '{command}'")
252
+ else:
253
+ logger.warning(f"Command already exists: '{command}'")
254
+
255
+ def remove_valid_command(self, command: str) -> bool:
256
+ """
257
+ Remove a valid manipulation command from the router.
258
+
259
+ Args:
260
+ command: The command string to remove
261
+
262
+ Returns:
263
+ True if the command was removed, False if it wasn't found
264
+ """
265
+ if command in self.valid_commands:
266
+ self.valid_commands.remove(command)
267
+ logger.info(f"Removed valid command: '{command}'")
268
+ return True
269
+ else:
270
+ logger.warning(f"Command not found: '{command}'")
271
+ return False
272
+
273
+ def route_intent(self, intent: Intent) -> str:
274
+ """
275
+ Determine the execution path for an intent.
276
+
277
+ Args:
278
+ intent: The Intent object to route
279
+
280
+ Returns:
281
+ String indicating the execution path: "gesture", "manipulation", or "invalid"
282
+ """
283
+ if not intent.is_valid:
284
+ logger.warning(f"Invalid intent: {intent.validation_error}")
285
+ return "invalid"
286
+
287
+ if intent.type == ResponseType.CONVERSATION:
288
+ logger.info(f"Routing to gesture execution: {intent.gesture}")
289
+ return "gesture"
290
+ elif intent.type == ResponseType.MANIPULATION:
291
+ logger.info(f"Routing to manipulation execution: {intent.command}")
292
+ return "manipulation"
293
+ else:
294
+ logger.error(f"Unknown intent type: {intent.type}")
295
+ return "invalid"
src/mortis/lerobot_async_client.py ADDED
@@ -0,0 +1,668 @@
1
+ """
2
+ LeRobot async inference client wrapper for Mortis manipulation tasks.
3
+
4
+ This module provides a high-level interface to LeRobot's async inference system
5
+ (PolicyServer + RobotClient) for executing SmolVLA manipulation tasks while
6
+ keeping the Gradio UI responsive.
7
+
8
+ Architecture:
9
+ - PolicyServer: Runs in a separate thread, loads SmolVLA model, performs inference
10
+ - RobotClient: Controls the SO101 robot, captures observations, executes actions
11
+ - This wrapper: Manages lifecycle and provides simple API for Mortis
12
+ """
13
+
14
+ import logging
15
+ import threading
16
+ import time
17
+ from dataclasses import dataclass
18
+ from enum import Enum
19
+ from pathlib import Path
20
+ from typing import Optional, Dict, Any, Callable
21
+
22
+ from lerobot.robots.so101_follower import SO101FollowerConfig
23
+ from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
24
+ from lerobot.cameras.realsense import RealSenseCameraConfig
25
+ from lerobot.async_inference.configs import PolicyServerConfig, RobotClientConfig
26
+ from lerobot.async_inference.policy_server import serve
27
+ from lerobot.async_inference.robot_client import RobotClient
28
+
29
+
30
+ logger = logging.getLogger(__name__)
31
+
32
+
33
+ class ManipulationStatus(Enum):
34
+ """Status of a manipulation task execution."""
35
+ IDLE = "idle"
36
+ STARTING = "starting"
37
+ RUNNING = "running"
38
+ COMPLETE = "complete"
39
+ FAILED = "failed"
40
+ STOPPED = "stopped"
41
+
42
+
43
+ @dataclass
44
+ class ManipulationTask:
45
+ """
46
+ Represents a manipulation task for LeRobot async execution.
47
+
48
+ Attributes:
49
+ task: Natural language task description
50
+ max_steps: Maximum number of action steps to execute
51
+ started_at: Timestamp when task started
52
+ completed_at: Timestamp when task completed
53
+ status: Current task status
54
+ error: Error message if task failed
55
+ """
56
+ task: str
57
+ max_steps: int = 1000 # At 30fps, ~33 seconds of execution
58
+ started_at: Optional[float] = None
59
+ completed_at: Optional[float] = None
60
+ status: ManipulationStatus = ManipulationStatus.IDLE
61
+ error: Optional[str] = None
62
+
63
+ @property
64
+ def duration(self) -> Optional[float]:
65
+ """Get task execution duration in seconds."""
66
+ if self.started_at and self.completed_at:
67
+ return self.completed_at - self.started_at
68
+ return None
69
+
70
+
71
+ class LeRobotAsyncClient:
72
+ """
73
+ High-level wrapper for LeRobot async inference system.
74
+
75
+ This class manages the PolicyServer and RobotClient lifecycle, providing
76
+ a simple interface for executing manipulation tasks asynchronously.
77
+
78
+ Usage:
79
+ # Create client
80
+ client = LeRobotAsyncClient(
81
+ robot_port="/dev/ttyACM1",
82
+ model_path="jlamperez/kiroween-potion-smolvla",
83
+ camera_configs={...}
84
+ )
85
+
86
+ # Start the system
87
+ client.start()
88
+
89
+ # Execute a task
90
+ client.execute_task("Pick up the skull and place it in the green cup")
91
+
92
+ # Check status
93
+ status = client.get_status()
94
+
95
+ # Stop when done
96
+ client.stop()
97
+ """
98
+
99
+ def __init__(
100
+ self,
101
+ robot_port: str = "/dev/ttyACM1",
102
+ robot_id: str = "my_follower_robot_arm", # Must match calibration file name
103
+ model_path: str = "jlamperez/kiroween-potion-smolvla",
104
+ policy_device: str = "cuda",
105
+ camera_configs: Optional[Dict[str, Any]] = None,
106
+ server_host: str = "127.0.0.1",
107
+ server_port: int = 8080,
108
+ actions_per_chunk: int = 50,
109
+ chunk_size_threshold: float = 0.5,
110
+ aggregate_fn_name: str = "weighted_average",
111
+ ):
112
+ """
113
+ Initialize the LeRobot async client.
114
+
115
+ Args:
116
+ robot_port: Serial port for SO101 robot (e.g., "/dev/ttyACM1")
117
+ robot_id: Identifier for the robot
118
+ model_path: HuggingFace model path or local checkpoint
119
+ policy_device: Device for model inference ("cuda" or "cpu")
120
+ camera_configs: Dictionary of camera configurations
121
+ server_host: PolicyServer host address
122
+ server_port: PolicyServer port
123
+ actions_per_chunk: Number of actions per inference chunk
124
+ chunk_size_threshold: Threshold for action chunk aggregation
125
+ aggregate_fn_name: Function name for aggregating action chunks
126
+ """
127
+ self.robot_port = robot_port
128
+ self.robot_id = robot_id
129
+ self.model_path = model_path
130
+ self.policy_device = policy_device
131
+ self.server_host = server_host
132
+ self.server_port = server_port
133
+ self.actions_per_chunk = actions_per_chunk
134
+ self.chunk_size_threshold = chunk_size_threshold
135
+ self.aggregate_fn_name = aggregate_fn_name
136
+
137
+ # Use default camera configs if not provided
138
+ self.camera_configs = camera_configs or self._get_default_camera_configs()
139
+
140
+ # Server and client instances
141
+ self.server_thread: Optional[threading.Thread] = None
142
+ self.robot_client: Optional[RobotClient] = None
143
+ self.action_receiver_thread: Optional[threading.Thread] = None
144
+ self.control_thread: Optional[threading.Thread] = None
145
+
146
+ # Current task tracking
147
+ self.current_task: Optional[ManipulationTask] = None
148
+ self._running = False
149
+ self._stop_event = threading.Event()
150
+ self._task_stop_event = threading.Event() # Event to signal task cancellation
151
+ self._idle_callback: Optional[Callable] = None # Callback to move robot to idle
152
+
153
+ logger.info(f"LeRobotAsyncClient initialized with model: {model_path}")
154
+
155
+ def _get_default_camera_configs(self) -> Dict[str, Any]:
156
+ """
157
+ Get default camera configuration for Mortis setup.
158
+
159
+ IMPORTANT: This configuration MUST match the cameras used during training!
160
+ If you trained with IntelRealSense + OpenCV, use the same setup here.
161
+
162
+ Returns:
163
+ Dictionary of camera configurations
164
+ """
165
+ # Default camera configuration matching training setup
166
+ # This should match your training configuration exactly!
167
+
168
+ # Configuration with RealSense + OpenCV (matches training setup)
169
+ return {
170
+ "camera1": RealSenseCameraConfig(
171
+ serial_number_or_name="030522070314",
172
+ width=640,
173
+ height=480,
174
+ fps=30
175
+ ),
176
+ "camera2": OpenCVCameraConfig(
177
+ index_or_path=8,
178
+ width=640,
179
+ height=480,
180
+ fps=30
181
+ )
182
+ }
183
+
184
+ def start(self) -> bool:
185
+ """
186
+ Start the PolicyServer only.
187
+
188
+ The RobotClient will be created lazily when the first task is executed.
189
+ This avoids loading the model unnecessarily at startup.
190
+
191
+ Returns:
192
+ True if startup successful, False otherwise
193
+ """
194
+ if self._running:
195
+ logger.warning("LeRobotAsyncClient is already running")
196
+ return True
197
+
198
+ try:
199
+ logger.info("Starting PolicyServer...")
200
+
201
+ # Configure and start PolicyServer
202
+ server_config = PolicyServerConfig(
203
+ host=self.server_host,
204
+ port=self.server_port
205
+ )
206
+
207
+ self.server_thread = threading.Thread(
208
+ target=serve,
209
+ args=(server_config,),
210
+ daemon=True,
211
+ name="PolicyServer"
212
+ )
213
+ self.server_thread.start()
214
+
215
+ # Give server time to start
216
+ time.sleep(2.0)
217
+ logger.info(f"PolicyServer started on {self.server_host}:{self.server_port}")
218
+
219
+ self._running = True
220
+ self._stop_event.clear()
221
+
222
+ logger.info("LeRobotAsyncClient started (RobotClient will be created on first task)")
223
+ return True
224
+
225
+ except Exception as e:
226
+ logger.error(f"Failed to start LeRobotAsyncClient: {e}", exc_info=True)
227
+ self.stop()
228
+ return False
229
+
230
+ def stop(self) -> None:
231
+ """
232
+ Stop the PolicyServer and RobotClient.
233
+
234
+ This method gracefully shuts down all components.
235
+ """
236
+ if not self._running:
237
+ logger.warning("LeRobotAsyncClient is not running")
238
+ return
239
+
240
+ logger.info("Stopping LeRobotAsyncClient...")
241
+
242
+ self._running = False
243
+ self._stop_event.set()
244
+
245
+ # Stop control thread if running
246
+ if self.control_thread and self.control_thread.is_alive():
247
+ logger.info("Waiting for control thread to finish...")
248
+ self.control_thread.join(timeout=5.0)
249
+
250
+ # Stop robot client
251
+ if self.robot_client:
252
+ try:
253
+ self.robot_client.stop()
254
+ logger.info("RobotClient stopped")
255
+ except Exception as e:
256
+ logger.error(f"Error stopping RobotClient: {e}")
257
+
258
+ # Action receiver thread should stop automatically (daemon)
259
+ # Server thread should stop automatically (daemon)
260
+
261
+ self.robot_client = None
262
+ self.server_thread = None
263
+ self.action_receiver_thread = None
264
+ self.control_thread = None
265
+
266
+ logger.info("LeRobotAsyncClient stopped")
267
+
268
+ def execute_task(
269
+ self,
270
+ task: str,
271
+ max_steps: int = 1000,
272
+ blocking: bool = False,
273
+ timeout: float = 60.0
274
+ ) -> bool:
275
+ """
276
+ Execute a manipulation task asynchronously.
277
+
278
+ This method stops any running task and creates a fresh RobotClient
279
+ for the new task, ensuring clean state.
280
+
281
+ Args:
282
+ task: Natural language task description
283
+ max_steps: Maximum number of action steps
284
+ blocking: If True, wait for task to complete before returning
285
+ timeout: Maximum execution time in seconds (default: 60.0)
286
+
287
+ Returns:
288
+ True if task started successfully, False otherwise
289
+ """
290
+ if not self._running:
291
+ logger.error("Cannot execute task: client not running")
292
+ return False
293
+
294
+ # Always need a fresh client for each task because control_loop can only run once
295
+ # But we keep the PolicyServer alive so the model stays loaded
296
+ need_new_client = True
297
+
298
+ if self.robot_client is None:
299
+ # First task - need to create client
300
+ logger.info("First task - creating RobotClient...")
301
+ elif self.current_task and self.current_task.status == ManipulationStatus.RUNNING:
302
+ # Task is running - stop it first
303
+ logger.info(f"Stopping previous task: {self.current_task.task}")
304
+ self._stop_robot_client()
305
+ else:
306
+ # Previous task finished - recreate client for new task
307
+ logger.info("Recreating RobotClient for new task (PolicyServer keeps model loaded)")
308
+
309
+ # Wait for previous control thread to finish
310
+ if self.control_thread and self.control_thread.is_alive():
311
+ logger.info("Waiting for previous control thread to finish...")
312
+ self.control_thread.join(timeout=3.0)
313
+ if self.control_thread.is_alive():
314
+ logger.warning("Previous control thread still running, proceeding anyway")
315
+
316
+ # Create new task
317
+ self.current_task = ManipulationTask(
318
+ task=task,
319
+ max_steps=max_steps,
320
+ status=ManipulationStatus.STARTING
321
+ )
322
+
323
+ # Clear any previous stop signal
324
+ self._task_stop_event.clear()
325
+
326
+ logger.info(f"Executing task: {task}")
327
+ logger.info(f"Limits: max_steps={max_steps}, timeout={timeout}s")
328
+
329
+ # Create/recreate robot client only if needed
330
+ if need_new_client:
331
+ if not self._recreate_robot_client(task):
332
+ logger.error("Failed to create robot client")
333
+ self.current_task.status = ManipulationStatus.FAILED
334
+ self.current_task.error = "Failed to initialize robot client"
335
+ return False
336
+
337
+ # Start control loop in separate thread
338
+ self.control_thread = threading.Thread(
339
+ target=self._run_control_loop,
340
+ args=(task, max_steps, timeout),
341
+ daemon=True,
342
+ name="ControlLoop"
343
+ )
344
+ self.control_thread.start()
345
+
346
+ if blocking:
347
+ self.control_thread.join()
348
+
349
+ return True
350
+
351
+ def _stop_robot_client(self) -> None:
352
+ """
353
+ Stop the robot client cleanly.
354
+
355
+ This stops the robot client and waits for threads to finish.
356
+ """
357
+ if self.robot_client:
358
+ try:
359
+ logger.info("Stopping robot client...")
360
+ self.robot_client.stop()
361
+
362
+ # Wait for action receiver thread
363
+ if self.action_receiver_thread and self.action_receiver_thread.is_alive():
364
+ self.action_receiver_thread.join(timeout=2.0)
365
+
366
+ logger.info("Robot client stopped")
367
+ except Exception as e:
368
+ logger.error(f"Error stopping robot client: {e}")
369
+
370
+ def _recreate_robot_client(self, task: str) -> bool:
371
+ """
372
+ Recreate the robot client with a new task.
373
+
374
+ This creates a fresh RobotClient instance for the new task,
375
+ ensuring clean state.
376
+
377
+ Args:
378
+ task: Task description for the new client
379
+
380
+ Returns:
381
+ True if successful, False otherwise
382
+ """
383
+ try:
384
+ # Stop existing client if any
385
+ self._stop_robot_client()
386
+
387
+ # Small delay to ensure port is released
388
+ time.sleep(0.5)
389
+
390
+ # Reconfigure robot (Path, SO101FollowerConfig, RobotClientConfig, and
+ # RobotClient are already imported at module level)
395
+
396
+ calibration_dir = Path(".cache/calibration/so101")
397
+ robot_config = SO101FollowerConfig(
398
+ port=self.robot_port,
399
+ id=self.robot_id,
400
+ cameras=self.camera_configs,
401
+ calibration_dir=calibration_dir
402
+ )
403
+
404
+ client_config = RobotClientConfig(
405
+ robot=robot_config,
406
+ server_address=f"{self.server_host}:{self.server_port}",
407
+ policy_device=self.policy_device,
408
+ policy_type="smolvla",
409
+ pretrained_name_or_path=self.model_path,
410
+ chunk_size_threshold=self.chunk_size_threshold,
411
+ actions_per_chunk=self.actions_per_chunk,
412
+ aggregate_fn_name=self.aggregate_fn_name,
413
+ task=task # Set the task in the config
414
+ )
415
+
416
+ # Create new robot client
417
+ self.robot_client = RobotClient(client_config)
418
+
419
+ if not self.robot_client.start():
420
+ raise RuntimeError("Failed to start RobotClient")
421
+
422
+ # Start action receiver thread
423
+ self.action_receiver_thread = threading.Thread(
424
+ target=self.robot_client.receive_actions,
425
+ daemon=True,
426
+ name="ActionReceiver"
427
+ )
428
+ self.action_receiver_thread.start()
429
+
430
+ logger.info("Robot client recreated successfully")
431
+ return True
432
+
433
+ except Exception as e:
434
+ logger.error(f"Failed to recreate robot client: {e}", exc_info=True)
435
+ return False
436
+
437
+ def stop_current_task(self) -> bool:
438
+ """
439
+ Stop the currently running task by stopping the robot client.
440
+
441
+ This cleanly stops the robot client, which will cause the control
442
+ loop to exit. The client will be recreated for the next task.
443
+
444
+ Returns:
445
+ True if task was stopped successfully
446
+ """
447
+ if not self.current_task or self.current_task.status != ManipulationStatus.RUNNING:
448
+ logger.warning("No task currently running to stop")
449
+ return False
450
+
451
+ logger.info("Stopping current task...")
452
+
453
+ try:
454
+ # Mark task as stopped
455
+ self.current_task.status = ManipulationStatus.STOPPED
456
+ self.current_task.completed_at = time.time()
457
+ self.current_task.error = "Task stopped by user"
458
+
459
+ # Signal task stop
460
+ self._task_stop_event.set()
461
+
462
+ # Stop the robot client (this will interrupt the control loop)
463
+ try:
464
+ self._stop_robot_client()
465
+ except Exception as e:
466
+ logger.warning(f"Error stopping client (expected): {e}")
467
+
468
+ # Move robot to idle position
469
+ if self._idle_callback:
470
+ logger.info("Moving robot to idle position...")
471
+ try:
472
+ self._idle_callback()
473
+ logger.info("Robot moved to idle position")
474
+ except Exception as e:
475
+ logger.error(f"Failed to move to idle: {e}")
476
+
477
+ logger.info("Task stopped successfully")
478
+
479
+ # Clear the task after a delay
480
+ def clear_task():
481
+ time.sleep(3.0)
482
+ if self.current_task and self.current_task.status == ManipulationStatus.STOPPED:
483
+ self.current_task = None
484
+ logger.info("Cleared stopped task from status")
485
+
486
+ clear_thread = threading.Thread(target=clear_task, daemon=True)
487
+ clear_thread.start()
488
+
489
+ return True
490
+
491
+ except Exception as e:
492
+ logger.error(f"Failed to stop task: {e}", exc_info=True)
493
+ return False
494
+
495
+ def _run_control_loop(self, task: str, max_steps: int, timeout: float) -> None:
496
+ """
497
+ Run the control loop for task execution with timeout.
498
+
499
+ This runs in a separate thread and executes the task using
500
+ the RobotClient's control_loop method. The timeout will stop
501
+ the task, and recreating the client for each task ensures clean state.
502
+
503
+ Note: max_steps is not directly enforced by LeRobot's control_loop,
504
+ but the timeout provides a time-based limit.
505
+
506
+ Args:
507
+ task: Task description
508
+ max_steps: Maximum steps (informational, not enforced)
509
+ timeout: Maximum execution time in seconds (default: 60.0)
510
+ """
511
+ if not self.current_task:
512
+ return
513
+
514
+ try:
515
+ self.current_task.status = ManipulationStatus.RUNNING
516
+ self.current_task.started_at = time.time()
517
+
518
+ logger.info(f"Starting control loop for: {task}")
519
+ logger.info(f"Timeout: {timeout}s (max_steps={max_steps} is informational)")
520
+
521
+ # Clear task stop event
522
+ self._task_stop_event.clear()
523
+
524
+ # Run control_loop in a separate thread so we can timeout
525
+ control_thread = threading.Thread(
526
+ target=lambda: self.robot_client.control_loop(task=task, verbose=False),
527
+ daemon=True,
528
+ name="ControlLoopInner"
529
+ )
530
+ control_thread.start()
531
+
532
+ # Wait for completion or timeout
533
+ control_thread.join(timeout=timeout)
534
+
535
+ # Check if thread is still alive (timeout occurred)
536
+ if control_thread.is_alive():
537
+ logger.warning(f"Task timed out after {timeout}s")
538
+
539
+ # Mark task as stopped first
540
+ self.current_task.status = ManipulationStatus.STOPPED
541
+ self.current_task.completed_at = time.time()
542
+ self.current_task.error = f"Task exceeded timeout of {timeout}s"
543
+
544
+ # Signal stop event
545
+ self._task_stop_event.set()
546
+
547
+ # Stop the robot client to interrupt the control loop
548
+ # This will cause the control thread to error out, but we catch it
549
+ logger.info("Stopping robot client to interrupt control loop...")
550
+ try:
551
+ self._stop_robot_client()
552
+ except Exception as e:
553
+ logger.warning(f"Error stopping client (expected): {e}")
554
+
555
+ # Wait a bit for thread to die
556
+ control_thread.join(timeout=2.0)
557
+
558
+ logger.info("Task stopped due to timeout")
559
+
560
+ # Move robot to idle position using callback if provided
561
+ if self._idle_callback:  # set in __init__, so no hasattr check is needed
562
+ logger.info("Moving robot to idle position...")
563
+ try:
564
+ self._idle_callback()
565
+ logger.info("Robot moved to idle position")
566
+ except Exception as e:
567
+ logger.error(f"Failed to move to idle: {e}")
568
+
569
+ # Clear the task after a delay so UI can show the stopped status
570
+ def clear_task():
571
+ time.sleep(3.0) # Show stopped status for 3 seconds
572
+ if self.current_task and self.current_task.status == ManipulationStatus.STOPPED:
573
+ self.current_task = None
574
+ logger.info("Cleared stopped task from status")
575
+
576
+ clear_thread = threading.Thread(target=clear_task, daemon=True)
577
+ clear_thread.start()
578
+
579
+ else:
580
+ # Task completed successfully
581
+ self.current_task.status = ManipulationStatus.COMPLETE
582
+ self.current_task.completed_at = time.time()
583
+ logger.info(f"Task completed in {self.current_task.duration:.2f}s")
584
+
585
+ # Clear completed task after showing status
586
+ def clear_task():
587
+ time.sleep(3.0) # Show completed status for 3 seconds
588
+ if self.current_task and self.current_task.status == ManipulationStatus.COMPLETE:
589
+ self.current_task = None
590
+ logger.info("Cleared completed task from status")
591
+
592
+ clear_thread = threading.Thread(target=clear_task, daemon=True)
593
+ clear_thread.start()
594
+
595
+ except KeyboardInterrupt:
596
+ logger.info("Task interrupted by user")
597
+ self.current_task.status = ManipulationStatus.STOPPED
598
+ self.current_task.completed_at = time.time()
599
+
600
+ except Exception as e:
601
+ logger.error(f"Task failed: {e}", exc_info=True)
602
+ self.current_task.status = ManipulationStatus.FAILED
603
+ self.current_task.error = str(e)
604
+ self.current_task.completed_at = time.time()
605
+
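The timeout mechanism above is plain threading.Thread.join with a timeout; here it is in isolation as a generic, runnable sketch:

import threading
import time

def work():
    time.sleep(5)  # stands in for robot_client.control_loop(...)

t = threading.Thread(target=work, daemon=True)
t.start()
t.join(timeout=1.0)  # give up waiting after 1 second
print("timed out" if t.is_alive() else "completed")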
606
+ def get_status(self) -> ManipulationStatus:
607
+ """
608
+ Get the current task status.
609
+
610
+ Returns:
611
+ Current ManipulationStatus
612
+ """
613
+ if self.current_task:
614
+ return self.current_task.status
615
+ return ManipulationStatus.IDLE
616
+
617
+ def get_current_task(self) -> Optional[ManipulationTask]:
618
+ """
619
+ Get the currently executing task.
620
+
621
+ Returns:
622
+ Current ManipulationTask or None if idle
623
+ """
624
+ return self.current_task
625
+
626
+ def is_busy(self) -> bool:
627
+ """
628
+ Check if a task is currently executing.
629
+
630
+ Returns:
631
+ True if a task is running
632
+ """
633
+ return (
634
+ self.current_task is not None and
635
+ self.current_task.status == ManipulationStatus.RUNNING
636
+ )
637
+
638
+ def is_running(self) -> bool:
639
+ """
640
+ Check if the client is running (server and robot connected).
641
+
642
+ Returns:
643
+ True if client is running
644
+ """
645
+ return self._running
646
+
647
+ def set_idle_callback(self, callback: Callable) -> None:
648
+ """
649
+ Set a callback function to move the robot to idle position.
650
+
651
+ This callback will be called when a task times out, to safely
652
+ return the robot to a neutral position.
653
+
654
+ Args:
655
+ callback: Function to call (e.g., lambda: mortis_arm.move_arm("idle"))
656
+ """
657
+ self._idle_callback = callback
658
+ logger.info("Idle callback configured")
659
+
660
+ def __enter__(self):
661
+ """Context manager entry: start the client."""
662
+ self.start()
663
+ return self
664
+
665
+ def __exit__(self, exc_type, exc_val, exc_tb):
666
+ """Context manager exit: stop the client."""
667
+ self.stop()
668
+ return False
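Since the class implements `__enter__`/`__exit__`, callers can scope the whole server/robot lifecycle in a `with` block. A minimal sketch; `make_client()` stands in for constructing the client class (its constructor is not shown here), and `mortis_arm` is assumed to be a connected MortisArm:

```python
# Sketch only: make_client() is a placeholder for however the client
# wrapper above is constructed; it is not a real function in this repo.
with make_client() as client:       # __enter__ calls start()
    client.set_idle_callback(lambda: mortis_arm.move_arm("idle"))
    if not client.is_busy():
        print(client.get_status())  # ManipulationStatus.IDLE when nothing runs
# __exit__ calls stop() even if an exception was raised
```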
src/mortis/models.py ADDED
@@ -0,0 +1,215 @@
1
+ """
2
+ Data models for Gemini API responses and intent routing.
3
+
4
+ This module defines the structured data types used throughout the Mortis system
5
+ for parsing Gemini responses, routing intents, and managing execution tasks.
6
+ """
7
+
8
+ import json
9
+ from dataclasses import dataclass
10
+ from enum import Enum
11
+ from typing import Optional, Dict, Any
12
+
13
+
14
+ class ResponseType(Enum):
15
+ """Type of response from Gemini API."""
16
+ CONVERSATION = "conversation"
17
+ MANIPULATION = "manipulation"
18
+
19
+
20
+ class Mood(Enum):
21
+ """Emotional mood for Mortis character responses."""
22
+ OMINOUS = "ominous"
23
+ PLAYFUL = "playful"
24
+ ANGRY = "angry"
25
+ NERVOUS = "nervous"
26
+ TRIUMPHANT = "triumphant"
27
+ MISCHIEVOUS = "mischievous"
28
+ SINISTER = "sinister"
29
+ CURIOUS = "curious"
30
+ NEUTRAL = "neutral"
31
+
32
+
33
+ class Gesture(Enum):
34
+ """Available gesture actions for the SO101 robotic arm."""
35
+ IDLE = "idle"
36
+ WAVE = "wave"
37
+ POINT_LEFT = "point_left"
38
+ POINT_RIGHT = "point_right"
39
+ GRAB = "grab"
40
+ DROP = "drop"
41
+
42
+
43
+ @dataclass
44
+ class GeminiResponse:
45
+ """
46
+ Structured response from Gemini API.
47
+
48
+ Attributes:
49
+ type: Whether this is a conversation or manipulation response
50
+ message: The text message to display/speak to the user
51
+ mood: The emotional mood of the response
52
+ gesture: Optional gesture to execute (for conversation type)
53
+ command: Optional manipulation command (for manipulation type)
54
+ """
55
+ type: ResponseType
56
+ message: str
57
+ mood: Mood
58
+ gesture: Optional[Gesture] = None
59
+ command: Optional[str] = None
60
+
61
+ @classmethod
62
+ def from_json(cls, json_data: Dict[str, Any]) -> "GeminiResponse":
63
+ """
64
+ Parse a GeminiResponse from JSON data returned by Gemini API.
65
+
66
+ Args:
67
+ json_data: Dictionary containing the JSON response from Gemini
68
+
69
+ Returns:
70
+ GeminiResponse object with validated fields
71
+
72
+ Raises:
73
+ ValueError: If required fields are missing or invalid
74
+ KeyError: If JSON structure is malformed
75
+ """
76
+ # Validate required fields
77
+ if "type" not in json_data:
78
+ raise ValueError("Missing required field: 'type'")
79
+ if "message" not in json_data:
80
+ raise ValueError("Missing required field: 'message'")
81
+ if "mood" not in json_data:
82
+ raise ValueError("Missing required field: 'mood'")
83
+
84
+ # Parse response type
85
+ try:
86
+ response_type = ResponseType(json_data["type"])
87
+ except ValueError:
88
+ raise ValueError(f"Invalid response type: {json_data['type']}. Must be 'conversation' or 'manipulation'")
89
+
90
+ # Parse mood
91
+ try:
92
+ mood = Mood(json_data["mood"])
93
+ except ValueError:
94
+ raise ValueError(f"Invalid mood: {json_data['mood']}. Must be one of: {[m.value for m in Mood]}")
95
+
96
+ # Parse optional fields based on response type
97
+ gesture = None
98
+ command = None
99
+
100
+ if response_type == ResponseType.CONVERSATION:
101
+ # Conversation responses should have a gesture
102
+ if "gesture" in json_data:
103
+ try:
104
+ gesture = Gesture(json_data["gesture"])
105
+ except ValueError:
106
+ raise ValueError(f"Invalid gesture: {json_data['gesture']}. Must be one of: {[g.value for g in Gesture]}")
107
+ else:
108
+ # Default to idle if no gesture specified
109
+ gesture = Gesture.IDLE
110
+
111
+ elif response_type == ResponseType.MANIPULATION:
112
+ # Manipulation responses must have a command
113
+ if "command" not in json_data:
114
+ raise ValueError("Manipulation responses must include 'command' field")
115
+ command = json_data["command"]
116
+ if not isinstance(command, str) or not command.strip():
117
+ raise ValueError("Command must be a non-empty string")
118
+
119
+ # Validate message
120
+ message = json_data["message"]
121
+ if not isinstance(message, str) or not message.strip():
122
+ raise ValueError("Message must be a non-empty string")
123
+
124
+ return cls(
125
+ type=response_type,
126
+ message=message,
127
+ mood=mood,
128
+ gesture=gesture,
129
+ command=command
130
+ )
131
+
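One subtlety worth calling out: conversation responses without an explicit gesture silently default to `Gesture.IDLE`, while manipulation responses hard-fail without a `command`. A quick demonstration:

```python
from mortis.models import GeminiResponse, Gesture

resp = GeminiResponse.from_json({
    "type": "conversation",
    "message": "I see you, mortal.",
    "mood": "sinister",
})
assert resp.gesture is Gesture.IDLE  # defaulted to idle, not left as None
# A "manipulation" payload without a "command" field raises ValueError instead.
```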
132
+ @classmethod
133
+ def from_json_string(cls, json_string: str) -> "GeminiResponse":
134
+ """
135
+ Parse a GeminiResponse from a JSON string.
136
+
137
+ Args:
138
+ json_string: JSON string containing the Gemini response
139
+
140
+ Returns:
141
+ GeminiResponse object with validated fields
142
+
143
+ Raises:
144
+ json.JSONDecodeError: If the string is not valid JSON
145
+ ValueError: If required fields are missing or invalid
146
+ """
147
+ try:
148
+ json_data = json.loads(json_string)
149
+ except json.JSONDecodeError as e:
150
+ raise json.JSONDecodeError(f"Invalid JSON string: {e.msg}", e.doc, e.pos)
151
+
152
+ return cls.from_json(json_data)
153
+
154
+ def validate(self) -> bool:
155
+ """
156
+ Validate the response structure and content.
157
+
158
+ Returns:
159
+ True if the response is valid
160
+
161
+ Raises:
162
+ ValueError: If validation fails
163
+ """
164
+ # Check message length constraints (per product requirements)
165
+ if len(self.message) > 120:
166
+ raise ValueError(f"Message exceeds 120 characters: {len(self.message)} chars")
167
+
168
+ word_count = len(self.message.split())
169
+ if word_count > 30:
170
+ raise ValueError(f"Message exceeds 30 words: {word_count} words")
171
+
172
+ # Validate type-specific requirements
173
+ if self.type == ResponseType.CONVERSATION:
174
+ if self.gesture is None:
175
+ raise ValueError("Conversation responses must have a gesture")
176
+ if self.command is not None:
177
+ raise ValueError("Conversation responses should not have a command")
178
+
179
+ elif self.type == ResponseType.MANIPULATION:
180
+ if self.command is None or not self.command.strip():
181
+ raise ValueError("Manipulation responses must have a non-empty command")
182
+ if self.gesture is not None:
183
+ raise ValueError("Manipulation responses should not have a gesture")
184
+
185
+ return True
186
+
187
+ def to_dict(self) -> Dict[str, Any]:
188
+ """
189
+ Convert the response to a dictionary.
190
+
191
+ Returns:
192
+ Dictionary representation of the response
193
+ """
194
+ result = {
195
+ "type": self.type.value,
196
+ "message": self.message,
197
+ "mood": self.mood.value,
198
+ }
199
+
200
+ if self.gesture is not None:
201
+ result["gesture"] = self.gesture.value
202
+
203
+ if self.command is not None:
204
+ result["command"] = self.command
205
+
206
+ return result
207
+
208
+ def to_json(self) -> str:
209
+ """
210
+ Convert the response to a JSON string.
211
+
212
+ Returns:
213
+ JSON string representation of the response
214
+ """
215
+ return json.dumps(self.to_dict(), indent=2)
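A round-trip sketch tying the parsing, validation, and serialization helpers together (the JSON payload is illustrative):

```python
from mortis.models import GeminiResponse, ResponseType

raw = (
    '{"type": "manipulation", "mood": "triumphant", '
    '"message": "The skull shall move.", '
    '"command": "Pick up the skull and place it in the green cup"}'
)
resp = GeminiResponse.from_json_string(raw)
resp.validate()                       # raises ValueError on any constraint violation
assert resp.type is ResponseType.MANIPULATION
print(resp.to_json())                 # serializes back to the same fields
```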
src/mortis/robot.py ADDED
@@ -0,0 +1,180 @@
2
+ import logging
3
+ import os
4
+ import time
5
+ from pathlib import Path
6
+
7
+ from lerobot.robots.so101_follower import SO101Follower, SO101FollowerConfig
8
+
9
+ logger = logging.getLogger(__name__)
10
+
11
+ HOME_POSE = {
12
+ "shoulder_pan.pos": 0,
13
+ "shoulder_lift.pos": -99,
14
+ "elbow_flex.pos": 97,
15
+ "wrist_flex.pos": 55,
16
+ "wrist_roll.pos": 0,
17
+ "gripper.pos": 0,
18
+ }
19
+
20
+
21
+ GESTURES = {
22
+ "idle": [
23
+ (HOME_POSE, 1.0),
24
+ ],
25
+ "wave": [
26
+ ({"wrist_flex.pos": -40}, 0.5),
27
+ ({"shoulder_pan.pos": -5, "shoulder_lift.pos": 65, "elbow_flex.pos": -70}, 1),
28
+ ({"shoulder_lift.pos": 0, "elbow_flex.pos": 0}, 0.5),
29
+ ({"wrist_flex.pos": 0}, 0.5),
30
+ (HOME_POSE, 1.0),
31
+ ],
32
+ "point_left": [
33
+ ({"shoulder_pan.pos": -60, "shoulder_lift.pos": -30, "elbow_flex.pos": -15, "wrist_flex.pos": 42, "wrist_roll.pos": 0, "gripper.pos": 0}, 1),
34
+ ({"wrist_flex.pos": 80}, 0.5),
35
+ ({"wrist_flex.pos": 42}, 0.5),
36
+ ({"wrist_flex.pos": 80}, 0.5),
37
+ (HOME_POSE, 1.0),
38
+ ],
39
+ "point_right": [
40
+ ({"shoulder_pan.pos": 65, "shoulder_lift.pos": -50, "elbow_flex.pos": -5, "wrist_flex.pos": 55, "wrist_roll.pos": 0, "gripper.pos": 0}, 1),
41
+ ({"wrist_flex.pos": 90}, 0.5),
42
+ ({"wrist_flex.pos": 42}, 0.5),
43
+ ({"wrist_flex.pos": 90}, 0.5),
44
+ (HOME_POSE, 1.0),
45
+ ],
46
+ "grab": [
47
+ ({"shoulder_pan.pos": 0, "shoulder_lift.pos": -2, "elbow_flex.pos": -8.0, "wrist_flex.pos": 55, "wrist_roll.pos": 0, "gripper.pos": 0}, 0.8),
48
+ ({"wrist_flex.pos": 80}, 0.5),
49
+ ({"wrist_roll.pos": -45, "gripper.pos": 40}, 1),
50
+ ({"elbow_flex.pos": 30}, 1),
51
+ ({"wrist_roll.pos": 45, "gripper.pos": 10}, 1),
52
+ ({"elbow_flex.pos": -20}, 1),
53
+ (HOME_POSE, 1.0),
54
+ ],
55
+ "drop": [
56
+ ({"shoulder_pan.pos": 0, "shoulder_lift.pos": 5, "elbow_flex.pos": 20.0, "wrist_flex.pos": 55, "wrist_roll.pos": 0, "gripper.pos": 0}, 0.8),
57
+ ({"gripper.pos": 80}, 1),
58
+ ({"gripper.pos": 00}, 1),
59
+ (HOME_POSE, 1.0),
60
+ ],
61
+ }
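Each gesture is a list of `(partial_pose, delay_seconds)` steps, and a step's dict only needs the joints that should move. Extending the table is therefore just another entry; a hypothetical example:

```python
from mortis.robot import GESTURES, HOME_POSE

# Hypothetical extra gesture: each step commands only the joints that move.
GESTURES["nod"] = [
    ({"wrist_flex.pos": 80}, 0.4),
    ({"wrist_flex.pos": 30}, 0.4),
    ({"wrist_flex.pos": 80}, 0.4),
    (HOME_POSE, 1.0),  # settle back to the neutral pose
]
```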
62
+
63
+
64
+ class MortisArm:
65
+ """
66
+ Class to control the Mortis SO101 robotic arm.
67
+ Manages connection, disconnection, and gesture execution.
68
+
69
+ Supports two modes:
70
+ - physical: Connects to real robot hardware
71
+ - simulation: Simulates robot behavior without hardware
72
+ """
73
+
74
+ def __init__(self, port="/dev/ttyACM1", mode=None):
75
+ port = os.getenv("ROBOT_PORT", port)
76
+
77
+ # Determine mode: check env var or use provided mode
78
+ if mode is None:
79
+ mode = os.getenv("ROBOT_MODE", "physical").lower()
80
+
81
+ self.mode = mode
82
+ self.connected = False
83
+
84
+ if self.mode == "simulation":
85
+ logger.info("🎭 MortisArm initialized in SIMULATION mode (no physical robot)")
86
+ self.robot = None
87
+ self.connected = True # Always "connected" in simulation
88
+ else:
89
+ config = SO101FollowerConfig(
90
+ port=port,
91
+ id="my_follower_robot_arm",
92
+ calibration_dir=Path(".cache/calibration/so101/"),
93
+ )
94
+ self.robot = SO101Follower(config)
95
+ logger.info(f"🤖 MortisArm initialized in PHYSICAL mode on port {port}")
96
+
97
+ def connect(self):
98
+ """Connects to the robotic arm."""
99
+ if self.mode == "simulation":
100
+ logger.info("🎭 Simulation mode: skipping physical connection")
101
+ self.connected = True
102
+ return
103
+
104
+ if not self.connected:
105
+ try:
106
+ logger.info("Attempting to connect to robot arm...")
107
+ self.robot.connect()
108
+ self.connected = self.robot.is_connected
109
+ if self.connected:
110
+ logger.info("✅ Robot arm connected successfully")
111
+ # Move to the initial position to indicate it's ready
112
+ self.move_arm("idle")
113
+ else:
114
+ logger.warning("⚠️ Failed to establish connection to robot arm")
115
+ except Exception as e:
116
+ logger.error(f"❌ Connection error: {e}", exc_info=True)
117
+ self.connected = False
118
+
119
+ def disconnect(self):
120
+ """Disconnects the robotic arm."""
121
+ if self.mode == "simulation":
122
+ logger.info("🎭 Simulation mode: skipping physical disconnection")
123
+ self.connected = False
124
+ return
125
+
126
+ if self.connected:
127
+ logger.info("Disconnecting robot arm...")
128
+ # Move to rest position before disconnecting
129
+ self.move_arm("idle")
130
+ time.sleep(1)
131
+ self.robot.disconnect()
132
+ self.connected = False
133
+ logger.info("✅ Robot arm disconnected")
134
+
135
+ def move_arm(self, gesture_name: str):
136
+ """
137
+ Executes a sequence of movements (a gesture) by its name.
138
+ If the gesture does not exist, it executes 'idle'.
139
+ """
140
+ if not self.connected:
141
+ logger.warning("⚠️ Cannot execute gesture: robot arm not connected")
142
+ return
143
+
144
+ # If the gesture is not defined, return to the neutral position.
145
+ if gesture_name not in GESTURES:
146
+ logger.warning(f"⚠️ Unknown gesture '{gesture_name}', falling back to 'idle'")
147
+ gesture_name = "idle"
148
+
149
+ sequence = GESTURES[gesture_name]
150
+
151
+ if self.mode == "simulation":
152
+ # Simulation mode: just log the gesture
153
+ logger.info(f"🎭 [SIMULATION] Executing gesture '{gesture_name}' ({len(sequence)} steps)")
154
+
155
+ # Simulate timing by sleeping for total duration
156
+ total_delay = sum(delay for _, delay in sequence)
157
+ time.sleep(total_delay)
158
+
159
+ logger.info(f"🎭 [SIMULATION] Gesture '{gesture_name}' completed")
160
+ else:
161
+ # Physical mode: execute on real robot
162
+ logger.info(f"🤖 Executing gesture '{gesture_name}' ({len(sequence)} steps)")
163
+
164
+ for i, (action, delay) in enumerate(sequence, 1):
165
+ logger.debug(f"Gesture '{gesture_name}' step {i}/{len(sequence)}: {action}")
166
+ self.robot.send_action(action)
167
+ time.sleep(delay)
168
+
169
+ logger.info(f"✅ Gesture '{gesture_name}' completed")
170
+
171
+
172
+ if __name__ == "__main__":
173
+
174
+ mortis_arm = MortisArm()
175
+ if not mortis_arm.connected:
176
+ mortis_arm.connect()
177
+
178
+ mortis_arm.move_arm("drop")
179
+
180
+ mortis_arm.disconnect()
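Because mode resolution checks the `ROBOT_MODE` environment variable first, the same demo can run as a smoke test without hardware; a sketch:

```python
import os

os.environ["ROBOT_MODE"] = "simulation"  # must be set before MortisArm() is created

from mortis.robot import MortisArm

arm = MortisArm()      # logs that it is in SIMULATION mode
arm.connect()          # no-op beyond flagging connected=True
arm.move_arm("wave")   # logs each step and sleeps for the summed delays
arm.disconnect()
```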
src/mortis/setup_dataset.py ADDED
@@ -0,0 +1,146 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ CLI tool for setting up Mortis dataset infrastructure.
4
+
5
+ This script initializes the dataset structure and generates
6
+ lerobot-record scripts for data collection.
7
+ """
8
+
9
+ import os
10
+ import sys
11
+ import subprocess
12
+ import argparse
13
+ from pathlib import Path
14
+ from dotenv import load_dotenv
15
+
16
+ from mortis.data_collector import DataCollector
17
+
18
+
19
+ def check_huggingface_auth():
20
+ """Check if user is authenticated with Hugging Face."""
21
+ try:
22
+ result = subprocess.run(
23
+ ["huggingface-cli", "whoami"],
24
+ capture_output=True,
25
+ text=True,
26
+ timeout=5
27
+ )
28
+ return result.returncode == 0
29
+ except (subprocess.TimeoutExpired, FileNotFoundError):
30
+ return False
31
+
32
+
33
+ def main():
34
+ """Main entry point for dataset setup."""
35
+ # Parse command line arguments
36
+ parser = argparse.ArgumentParser(
37
+ description="Setup Mortis dataset infrastructure and generate recording scripts"
38
+ )
39
+ parser.add_argument(
40
+ "--dataset-name",
41
+ type=str,
42
+ default=None,
43
+ help="Name for the dataset (default: mortis_manipulation)"
44
+ )
45
+ parser.add_argument(
46
+ "--hf-user",
47
+ type=str,
48
+ default=None,
49
+ help="Hugging Face username (default: from HF_USER env var)"
50
+ )
51
+ args = parser.parse_args()
52
+
53
+ # Load environment variables from .env file
54
+ REPO_ROOT = Path(__file__).resolve().parents[2]
55
+ load_dotenv(REPO_ROOT / ".env")
56
+
57
+
58
+ print("="*70)
59
+ print("Mortis Dataset Setup")
60
+ print("="*70)
61
+ print()
62
+
63
+ # Check Hugging Face authentication
64
+ print("Checking Hugging Face authentication...")
65
+ if not check_huggingface_auth():
66
+ print("⚠️ Not logged in to Hugging Face")
67
+ print("📝 You need to authenticate before recording datasets")
68
+ print()
69
+ print("Run this command to login:")
70
+ print(" huggingface-cli login")
71
+ print()
72
+ print("Get your token from: https://huggingface.co/settings/tokens")
73
+ print()
74
+ response = input("Continue anyway? (y/N): ").strip().lower()
75
+ if response != 'y':
76
+ print("Setup cancelled. Please login first with: huggingface-cli login")
77
+ sys.exit(0)
78
+ print()
79
+ else:
80
+ print("✅ Hugging Face authentication verified")
81
+ print()
82
+
83
+ # Get Hugging Face username
84
+ hf_user = args.hf_user or os.getenv("HF_USER")
85
+ if not hf_user:
86
+ print("⚠️ HF_USER not found in .env file or environment")
87
+ hf_user = input("Enter your Hugging Face username: ").strip()
88
+ if not hf_user:
89
+ print("❌ Hugging Face username is required")
90
+ sys.exit(1)
91
+ print(f"💡 Tip: Add HF_USER to your .env file to skip this prompt:")
92
+ print(f" echo 'HF_USER={hf_user}' >> .env")
93
+ print()
94
+
95
+ # Get dataset name
96
+ dataset_name = args.dataset_name
97
+ if not dataset_name:
98
+ print("Dataset name:")
99
+ print(" Press Enter for default: 'mortis_manipulation'")
100
+ print(" Or enter a custom name (e.g., 'mortis_v2', 'test_dataset')")
101
+ user_input = input("Dataset name: ").strip()
102
+ dataset_name = user_input if user_input else "mortis_manipulation"
103
+ print()
104
+
105
+ # Create repository ID
106
+ repo_id = f"{hf_user}/{dataset_name}"
107
+
108
+ print(f"Creating dataset: {dataset_name}")
109
+ print(f"Repository: {repo_id}")
110
+ print()
111
+
112
+ # Create collector with custom name
113
+ collector = DataCollector(dataset_name, repo_id)
114
+
115
+ # Generate scripts
116
+ print("\nGenerating recording scripts...")
117
+ collector.generate_all_record_scripts()
118
+
119
+ # Show summary
120
+ collector.print_summary()
121
+
122
+ # Show instructions
123
+ collector.print_recording_instructions()
124
+
125
+ # Final instructions
126
+ print("="*70)
127
+ print("Setup Complete! 🎉")
128
+ print("="*70)
129
+ print()
130
+ print("Next steps:")
131
+ print(" 1. Make sure you're logged in to Hugging Face:")
132
+ print(" huggingface-cli login")
133
+ print(" 2. Connect your leader and follower robot arms")
134
+ print(" 3. Navigate to the scripts directory:")
135
+ print(f" cd {collector.dataset_dir}/scripts")
136
+ print(" 4. Run a recording script:")
137
+ print(" ./record_task_0.sh")
138
+ print()
139
+ print("Or record all tasks:")
140
+ print(" ./record_all_tasks.sh")
141
+ print()
142
+ print("="*70)
143
+
144
+
145
+ if __name__ == "__main__":
146
+ main()
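The interactive prompts are conveniences; the same flow can be scripted with the calls `main()` makes (the repo id below is illustrative):

```python
from mortis.data_collector import DataCollector

repo_id = "your-hf-user/mortis_manipulation"  # illustrative
collector = DataCollector("mortis_manipulation", repo_id)
collector.generate_all_record_scripts()
collector.print_summary()
collector.print_recording_instructions()
```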
src/mortis/setup_train.py ADDED
@@ -0,0 +1,385 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ CLI tool for setting up Mortis training infrastructure.
4
+
5
+ This script generates lerobot-train scripts with appropriate
6
+ configurations for training SmolVLA models on Mortis datasets.
7
+ """
8
+
9
+ from typing import Optional
11
+ import argparse
12
+ from pathlib import Path
13
+ from dotenv import load_dotenv
14
+
15
+
16
+ class TrainingScriptGenerator:
17
+ """
18
+ Helper for generating lerobot-train scripts.
19
+
20
+ This class generates shell scripts that call lerobot-train with the
21
+ correct parameters for training SmolVLA models on Mortis datasets.
22
+
23
+ Attributes:
24
+ dataset_repo_id: Hugging Face dataset repository ID
25
+ output_dir: Directory for training outputs
26
+ job_name: Name for the training job
27
+ model_repo_id: Optional Hugging Face model repository ID for pushing
28
+ """
29
+
30
+ def __init__(
31
+ self,
32
+ dataset_repo_id: str,
33
+ output_dir: str = "outputs/train",
34
+ job_name: str = "smolvla_mortis",
35
+ model_repo_id: Optional[str] = None,
36
+ scripts_dir: str = "train"
37
+ ):
38
+ """
39
+ Initialize the TrainingScriptGenerator.
40
+
41
+ Args:
42
+ dataset_repo_id: Hugging Face dataset repository ID
43
+ output_dir: Base directory for training outputs (checkpoints, logs)
44
+ job_name: Name for the training job
45
+ model_repo_id: Optional HF model repo ID for pushing trained model
46
+ scripts_dir: Directory to save training scripts
47
+ """
48
+ self.dataset_repo_id = dataset_repo_id
49
+ self.output_dir = Path(output_dir)
50
+ self.job_name = job_name
51
+ self.model_repo_id = model_repo_id
52
+ self.scripts_dir = Path(scripts_dir)
53
+
54
+ # Create scripts directory
55
+ self.scripts_dir.mkdir(parents=True, exist_ok=True)
56
+
57
+ print(f"TrainingScriptGenerator initialized:")
58
+ print(f" Dataset: {self.dataset_repo_id}")
59
+ print(f" Scripts directory: {self.scripts_dir}")
60
+ print(f" Training output directory: {self.output_dir}")
61
+ print(f" Job name: {self.job_name}")
62
+ if self.model_repo_id:
63
+ print(f" Model repository: {self.model_repo_id}")
64
+
65
+ def generate_train_command(
66
+ self,
67
+ policy_path: str = "lerobot/smolvla_base",
68
+ batch_size: int = 16,
69
+ steps: int = 20000,
70
+ save_freq: int = 5000,
71
+ eval_freq: int = 5000,
72
+ n_action_steps: int = 50,
73
+ chunk_size: int = 50,
74
+ use_amp: bool = True,
75
+ enable_wandb: bool = True,
76
+ device: str = "cuda",
77
+ image_transforms: bool = True,
78
+ rename_map: Optional[str] = None,
79
+ cuda_alloc_conf: str = "expandable_segments:True"
80
+ ) -> str:
81
+ """
82
+ Generate a lerobot-train command with specified parameters.
83
+
84
+ Args:
85
+ policy_path: Path to base policy (default: lerobot/smolvla_base)
86
+ batch_size: Training batch size
87
+ steps: Total training steps
88
+ save_freq: Checkpoint save frequency
89
+ eval_freq: Evaluation frequency
90
+ n_action_steps: Number of action steps to predict
91
+ chunk_size: Action chunk size
92
+ use_amp: Use automatic mixed precision
93
+ enable_wandb: Enable Weights & Biases logging
94
+ device: Device to use (cuda or cpu)
95
+ image_transforms: Enable image transformations
96
+ rename_map: Optional observation key rename mapping
97
+ cuda_alloc_conf: CUDA memory allocator configuration
98
+
99
+ Returns:
100
+ The complete lerobot-train command as a string
101
+ """
102
+ # Load environment variables
103
+ load_dotenv()
104
+
105
+ # Build output directory path
106
+ full_output_dir = self.output_dir / self.job_name
107
+
108
+ # Default rename map for SO101 with dual cameras
109
+ if rename_map is None:
110
+ rename_map = (
111
+ '{"observation.images.camera1": "observation.images.camera1", '
112
+ '"observation.images.camera2": "observation.images.camera2"}'
113
+ )
114
+
115
+ # Build the command
116
+ cmd_parts = [
117
+ f"PYTORCH_CUDA_ALLOC_CONF={cuda_alloc_conf} \\",
118
+ "lerobot-train \\",
119
+ f" --policy.path={policy_path} \\",
120
+ f" --dataset.repo_id={self.dataset_repo_id} \\",
121
+ f" --dataset.image_transforms.enable={str(image_transforms).lower()} \\",
122
+ f" --policy.device={device} \\",
123
+ f" --policy.use_amp={str(use_amp).lower()} \\",
124
+ f" --policy.n_action_steps={n_action_steps} \\",
125
+ f" --policy.chunk_size={chunk_size} \\",
126
+ f" --batch_size={batch_size} \\",
127
+ f" --steps={steps} \\",
128
+ f" --save_checkpoint=true \\",
129
+ f" --save_freq={save_freq} \\",
130
+ f" --eval_freq={eval_freq} \\",
131
+ f" --wandb.enable={str(enable_wandb).lower()} \\",
132
+ f" --output_dir={full_output_dir} \\",
133
+ f" --job_name={self.job_name} \\",
134
+ ]
135
+
136
+ # Add model repo ID if specified
137
+ if self.model_repo_id:
138
+ cmd_parts.append(f" --policy.repo_id={self.model_repo_id} \\")
139
+
140
+ # Add rename map
141
+ cmd_parts.append(f" --rename_map='{rename_map}'")
142
+
143
+ return "\n".join(cmd_parts)
144
+
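Since the method only assembles a newline-joined string, the resulting command can be inspected without writing any script; a sketch (repo id illustrative):

```python
gen = TrainingScriptGenerator("your-hf-user/mortis_manipulation")  # illustrative repo id
print(gen.generate_train_command(steps=1000, batch_size=8, enable_wandb=False))
```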
145
+ def generate_training_script(
146
+ self,
147
+ script_name: str = "train.sh",
148
+ **kwargs
149
+ ) -> Path:
150
+ """
151
+ Generate a shell script for training.
152
+
153
+ Args:
154
+ script_name: Name for the training script
155
+ **kwargs: Additional arguments passed to generate_train_command
156
+
157
+ Returns:
158
+ Path to the generated script
159
+ """
160
+ script_path = self.scripts_dir / script_name
161
+
162
+ with open(script_path, 'w') as f:
163
+ f.write("#!/bin/bash\n")
164
+ f.write(f"# Training script for {self.job_name}\n")
165
+ f.write(f"# Dataset: {self.dataset_repo_id}\n")
166
+ f.write(f"# Generated by setup_train.py\n\n")
167
+
168
+ f.write("# Check if CUDA is available\n")
169
+ f.write("if ! command -v nvidia-smi &> /dev/null; then\n")
170
+ f.write(' echo "⚠️ Warning: nvidia-smi not found. CUDA may not be available."\n')
171
+ f.write(' read -p "Continue anyway? (y/N): " -n 1 -r\n')
172
+ f.write(' echo\n')
173
+ f.write(' if [[ ! $REPLY =~ ^[Yy]$ ]]; then\n')
174
+ f.write(' exit 1\n')
175
+ f.write(' fi\n')
176
+ f.write("fi\n\n")
177
+
178
+ f.write("# Start training\n")
179
+ f.write(f'echo "Starting training: {self.job_name}"\n')
180
+ f.write(f'echo "Dataset: {self.dataset_repo_id}"\n')
181
+ f.write(f'echo "Output: {self.output_dir / self.job_name}"\n')
182
+ f.write('echo ""\n\n')
183
+
184
+ f.write(self.generate_train_command(**kwargs))
185
+ f.write("\n")
186
+
187
+ # Make script executable
188
+ script_path.chmod(0o755)
189
+ print(f"Created: {script_path}")
190
+
191
+ return script_path
192
+
193
+ def generate_training_configs(self):
194
+ """
195
+ Generate multiple training scripts with different configurations.
196
+
197
+ Creates:
198
+ - train_quick.sh: Quick test training (1000 steps)
199
+ - train_standard.sh: Standard training (20k steps)
200
+ - train_full.sh: Full training (100k steps)
201
+ """
202
+ configs = [
203
+ {
204
+ "script_name": "train_quick.sh",
205
+ "steps": 1000,
206
+ "save_freq": 500,
207
+ "eval_freq": 500,
208
+ "batch_size": 8,
209
+ },
210
+ {
211
+ "script_name": "train_standard.sh",
212
+ "steps": 20000,
213
+ "save_freq": 5000,
214
+ "eval_freq": 5000,
215
+ "batch_size": 16,
216
+ },
217
+ {
218
+ "script_name": "train_full.sh",
219
+ "steps": 100000,
220
+ "save_freq": 10000,
221
+ "eval_freq": 10000,
222
+ "batch_size": 16,
223
+ },
224
+ ]
225
+
226
+ for config in configs:
227
+ self.generate_training_script(**config)
228
+
229
+ print(f"\n✅ Generated {len(configs)} training scripts in {self.scripts_dir}")
230
+
231
+ def print_usage_instructions(self):
232
+ """Print instructions for using the generated training scripts."""
233
+ print("\n" + "="*70)
234
+ print("Training Scripts Generated")
235
+ print("="*70)
236
+ print()
237
+ print("Available training scripts:")
238
+ print(f" {self.scripts_dir}/train_quick.sh - Quick test (1k steps)")
239
+ print(f" {self.scripts_dir}/train_standard.sh - Standard training (20k steps)")
240
+ print(f" {self.scripts_dir}/train_full.sh - Full training (100k steps)")
241
+ print()
242
+ print("To start training:")
243
+ print(f" cd {self.scripts_dir}")
244
+ print(" ./train_standard.sh")
245
+ print()
246
+ print("Training outputs will be saved to:")
247
+ print(f" {self.output_dir}/{self.job_name}/")
248
+ print()
249
+ print("Monitor training:")
250
+ print(" - Console: Watch the terminal output")
251
+ print(" - W&B: https://wandb.ai (if enabled)")
252
+ print(f" - Checkpoints: {self.output_dir}/{self.job_name}/checkpoints/")
253
+ print()
254
+ print("Resume training:")
255
+ print(" Add --resume=true to the lerobot-train command")
256
+ print()
257
+ print("="*70)
258
+
259
+
260
+ def main():
261
+ """Main entry point for training setup."""
262
+ # Parse command line arguments
263
+ parser = argparse.ArgumentParser(
264
+ description="Setup Mortis training infrastructure and generate training scripts"
265
+ )
266
+ parser.add_argument(
267
+ "--dataset-repo-id",
268
+ type=str,
269
+ required=True,
270
+ help="Hugging Face dataset repository ID (e.g., username/dataset-name)"
271
+ )
272
+ parser.add_argument(
273
+ "--output-dir",
274
+ type=str,
275
+ default="outputs/train",
276
+ help="Base directory for training outputs/checkpoints (default: outputs/train)"
277
+ )
278
+ parser.add_argument(
279
+ "--scripts-dir",
280
+ type=str,
281
+ default="train",
282
+ help="Directory to save training scripts (default: train)"
283
+ )
284
+ parser.add_argument(
285
+ "--job-name",
286
+ type=str,
287
+ default=None,
288
+ help="Name for the training job (default: derived from dataset name)"
289
+ )
290
+ parser.add_argument(
291
+ "--model-repo-id",
292
+ type=str,
293
+ default=None,
294
+ help="Hugging Face model repository ID for pushing trained model"
295
+ )
296
+ parser.add_argument(
297
+ "--batch-size",
298
+ type=int,
299
+ default=16,
300
+ help="Training batch size (default: 16)"
301
+ )
302
+ parser.add_argument(
303
+ "--steps",
304
+ type=int,
305
+ default=20000,
306
+ help="Total training steps (default: 20000)"
307
+ )
308
+ parser.add_argument(
309
+ "--policy-path",
310
+ type=str,
311
+ default="lerobot/smolvla_base",
312
+ help="Path to base policy (default: lerobot/smolvla_base)"
313
+ )
314
+ parser.add_argument(
315
+ "--no-wandb",
316
+ action="store_true",
317
+ help="Disable Weights & Biases logging"
318
+ )
319
+ parser.add_argument(
320
+ "--generate-configs",
321
+ action="store_true",
322
+ help="Generate multiple training configurations (quick, standard, full)"
323
+ )
324
+
325
+ args = parser.parse_args()
326
+
327
+ # Load environment variables
328
+ REPO_ROOT = Path(__file__).resolve().parents[2]
329
+ load_dotenv(REPO_ROOT / ".env")
330
+
331
+ print("="*70)
332
+ print("Mortis Training Setup")
333
+ print("="*70)
334
+ print()
335
+
336
+ # Derive job name from dataset if not provided
337
+ job_name = args.job_name
338
+ if not job_name:
339
+ # Extract dataset name from repo_id
340
+ dataset_name = args.dataset_repo_id.split('/')[-1]
341
+ job_name = f"smolvla_{dataset_name}"
342
+ print(f"Using job name: {job_name}")
343
+ print()
344
+
345
+ # Create generator
346
+ generator = TrainingScriptGenerator(
347
+ dataset_repo_id=args.dataset_repo_id,
348
+ output_dir=args.output_dir,
349
+ job_name=job_name,
350
+ model_repo_id=args.model_repo_id,
351
+ scripts_dir=args.scripts_dir
352
+ )
353
+
354
+ print()
355
+
356
+ if args.generate_configs:
357
+ # Generate multiple configurations
358
+ print("Generating training configurations...")
359
+ generator.generate_training_configs()
360
+ else:
361
+ # Generate single training script
362
+ print("Generating training script...")
363
+ generator.generate_training_script(
364
+ script_name="train.sh",
365
+ policy_path=args.policy_path,
366
+ batch_size=args.batch_size,
367
+ steps=args.steps,
368
+ enable_wandb=not args.no_wandb
369
+ )
370
+
371
+ # Print usage instructions
372
+ generator.print_usage_instructions()
373
+
374
+ # Final tips
375
+ print("\n💡 Tips:")
376
+ print(" - Adjust batch_size based on your GPU memory")
377
+ print(" - Monitor GPU usage with: watch -n 1 nvidia-smi")
378
+ print(" - Training logs are saved in the output directory")
379
+ print(" - Use Ctrl+C to stop training (checkpoints are saved)")
380
+ print()
381
+ print("="*70)
382
+
383
+
384
+ if __name__ == "__main__":
385
+ main()
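The generator can also be driven programmatically instead of through the CLI flags; a minimal sketch (ids and names illustrative):

```python
from mortis.setup_train import TrainingScriptGenerator

gen = TrainingScriptGenerator(
    dataset_repo_id="your-hf-user/mortis_manipulation",  # illustrative
    job_name="smolvla_demo",
)
gen.generate_training_configs()  # writes train_quick.sh / train_standard.sh / train_full.sh
gen.print_usage_instructions()
```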
src/mortis/smolvla_executor.py ADDED
@@ -0,0 +1,1040 @@
1
+ """
2
+ SmolVLA Executor for vision-language-action robotic manipulation.
3
+
4
+ This module implements the SmolVLA model executor that performs inference
5
+ for manipulation tasks using the trained SmolVLA policy from LeRobot.
6
+ """
7
+
8
+ import os
9
+ import time
10
+ import logging
11
+ from pathlib import Path
12
+ from typing import Optional, Dict, Any, Tuple
13
+ from threading import Lock, Event
14
+
15
+ import torch
16
+ import numpy as np
17
+ from PIL import Image as PILImage
18
+
19
+ # LeRobot imports
20
+ from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
21
+ from lerobot.policies.smolvla.configuration_smolvla import SmolVLAConfig
22
+
23
+ # Local imports
24
+ from .robot import MortisArm, HOME_POSE
25
+
26
+
27
+ # Configure logging
28
+ logger = logging.getLogger(__name__)
29
+
30
+
31
+ class SmolVLAError(Exception):
32
+ """Base exception for SmolVLA executor errors."""
33
+ pass
34
+
35
+
36
+ class SafetyViolationError(SmolVLAError):
37
+ """Exception raised when a safety constraint is violated."""
38
+ pass
39
+
40
+
41
+ class TimeoutError(SmolVLAError):
42
+ """Exception raised when execution exceeds timeout."""
43
+ pass
44
+
45
+
46
+ class GPUOutOfMemoryError(SmolVLAError):
47
+ """Exception raised when GPU runs out of memory."""
48
+ pass
49
+
50
+
51
+ class SmolVLAExecutor:
52
+ """
53
+ Executor for SmolVLA vision-language-action model inference.
54
+
55
+ This class handles loading the trained SmolVLA model, capturing observations
56
+ from the robot and camera, running inference, and executing predicted actions
57
+ on the SO101 robotic arm.
58
+
59
+ Attributes:
60
+ checkpoint_path: Path to the trained model checkpoint
61
+ device: Device to run inference on ('cuda' or 'cpu')
62
+ policy: Loaded SmolVLA policy model
63
+ robot_arm: Reference to MortisArm instance for action execution
64
+ camera: Camera interface for visual observations (to be implemented)
65
+ valid_commands: List of trained manipulation task commands
66
+ """
67
+
68
+ # Valid manipulation commands that the model was trained on
69
+ VALID_COMMANDS = [
70
+ "Pick up the skull and place it in the green cup",
71
+ "Pick up the skull and place it in the orange cup",
72
+ "Pick up the skull and place it in the purple cup",
73
+ "Pick up the eyeball and place it in the green cup",
74
+ "Pick up the eyeball and place it in the orange cup",
75
+ "Pick up the eyeball and place it in the purple cup",
76
+ ]
77
+
78
+ # Safety limits for joint positions (in degrees)
79
+ # These define the safe workspace boundaries
80
+ # Based on SO101 calibration and physical constraints
81
+ JOINT_LIMITS = {
82
+ "shoulder_pan.pos": (-180, 180),
83
+ "shoulder_lift.pos": (-120, 120), # Extended range for SO101
84
+ "elbow_flex.pos": (-135, 135),
85
+ "wrist_flex.pos": (-105, 105), # Extended range for SO101
86
+ "wrist_roll.pos": (-180, 180),
87
+ "gripper.pos": (0, 100), # 0=open, 100=closed
88
+ }
89
+
90
+ # Maximum allowed joint velocity (degrees per step)
91
+ MAX_JOINT_VELOCITY = 10.0
92
+
93
+ # Default execution timeout (seconds)
94
+ DEFAULT_TIMEOUT = 30.0
95
+
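A check against these constants reduces to a per-joint range test plus a bound on the step-to-step delta. A minimal sketch of the idea (an illustrative helper, not the module's own `_check_action_safety`):

```python
from typing import Dict, Optional, Tuple

def violates_limits(
    action: Dict[str, float],
    prev: Optional[Dict[str, float]],
    limits: Dict[str, Tuple[float, float]],
    max_vel: float = 10.0,
) -> bool:
    """Return True if any commanded joint leaves its range or moves too fast."""
    for name, (lo, hi) in limits.items():
        pos = action.get(name)
        if pos is None:
            continue  # joint not commanded this step
        if not lo <= pos <= hi:
            return True
        if prev is not None and abs(pos - prev.get(name, pos)) > max_vel:
            return True
    return False
```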
96
+ def __init__(
97
+ self,
98
+ checkpoint_path: str,
99
+ robot_arm: Optional[MortisArm] = None,
100
+ device: Optional[str] = None,
101
+ enable_safety_checks: bool = True,
102
+ timeout: Optional[float] = None
103
+ ):
104
+ """
105
+ Initialize the SmolVLA executor.
106
+
107
+ Args:
108
+ checkpoint_path: Path to the trained SmolVLA model checkpoint
109
+ robot_arm: Optional MortisArm instance (will create if not provided)
110
+ device: Device to run inference on ('cuda', 'cpu', or None for auto-detect)
111
+ enable_safety_checks: Whether to enable workspace safety checks
112
+ timeout: Execution timeout in seconds (None for default)
113
+
114
+ Raises:
115
+ SmolVLAError: If checkpoint path doesn't exist or model loading fails
116
+ """
117
+ # Initialize attributes first (for cleanup in case of early failure)
118
+ self.camera = None
119
+ self.policy = None
120
+ self.preprocessor = None
121
+ self.postprocessor = None
122
+
123
+ self.checkpoint_path = Path(checkpoint_path)
124
+
125
+ # Validate checkpoint path
126
+ if not self.checkpoint_path.exists():
127
+ raise SmolVLAError(f"Checkpoint path does not exist: {checkpoint_path}")
128
+
129
+ # Set device
130
+ if device is None:
131
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
132
+ else:
133
+ self.device = device
134
+
135
+ logger.info(f"Initializing SmolVLA executor on device: {self.device}")
136
+
137
+ # Safety configuration
138
+ self.enable_safety_checks = enable_safety_checks
139
+ self.timeout = timeout if timeout is not None else self.DEFAULT_TIMEOUT
140
+
141
+ # Emergency stop flag and lock
142
+ self._emergency_stop_flag = Event()
143
+ self._execution_lock = Lock()
144
+ self._is_executing = False
145
+
146
+ # Previous state for velocity checking
147
+ self._previous_state = None
148
+
149
+ logger.info(f"Safety checks: {'enabled' if enable_safety_checks else 'disabled'}")
150
+ logger.info(f"Execution timeout: {self.timeout}s")
151
+
152
+ # Initialize robot arm
153
+ self.robot_arm = robot_arm
154
+ if self.robot_arm is None:
155
+ logger.info("No robot arm provided, creating new MortisArm instance")
156
+ self.robot_arm = MortisArm()
157
+
158
+ # Load the model
159
+ self._load_model()
160
+
161
+ # Model is ready
162
+ logger.info("SmolVLA executor initialized successfully")
163
+
164
+ def _load_model(self):
165
+ """
166
+ Load the SmolVLA model from checkpoint.
167
+
168
+ Raises:
169
+ SmolVLAError: If model loading fails
170
+ """
171
+ try:
172
+ logger.info(f"Loading SmolVLA model from: {self.checkpoint_path}")
173
+
174
+ # Load config.json directly so the 'type' field can be fixed up before loading the policy
+ import json
+ config_path = self.checkpoint_path / "config.json"
180
+
181
+ if config_path.exists():
182
+ # Load config
183
+ with open(config_path, 'r') as f:
184
+ config_dict = json.load(f)
185
+
186
+ # Ensure 'type' field is set to 'smolvla'
187
+ if 'type' not in config_dict or config_dict['type'] != 'smolvla':
188
+ logger.debug("Setting 'type' field to 'smolvla' in config")
189
+ config_dict['type'] = 'smolvla'
190
+
191
+ # Save updated config back
192
+ with open(config_path, 'w') as f:
193
+ json.dump(config_dict, f, indent=2)
194
+
195
+ # Get VLM model name for tokenizer
196
+ vlm_model_name = config_dict.get('vlm_model_name', 'HuggingFaceTB/SmolVLM2-500M-Video-Instruct')
197
+ else:
198
+ vlm_model_name = 'HuggingFaceTB/SmolVLM2-500M-Video-Instruct'
199
+
200
+ # Load policy using from_pretrained (it will load the config automatically)
201
+ self.policy = SmolVLAPolicy.from_pretrained(str(self.checkpoint_path))
202
+
203
+ # Move to device
204
+ self.policy.to(self.device)
205
+
206
+ # Set to evaluation mode
207
+ self.policy.eval()
208
+
209
+ logger.info("SmolVLA model loaded successfully")
210
+
211
+ # Load preprocessor (handles tokenization automatically)
212
+ self._load_preprocessor()
213
+
214
+ # Perform warmup inference
215
+ self._warmup()
216
+
217
+ except Exception as e:
218
+ logger.error(f"Failed to load SmolVLA model: {e}")
219
+ raise SmolVLAError(f"Model loading failed: {e}")
220
+
221
+ def _load_preprocessor(self):
222
+ """
223
+ Load preprocessor from checkpoint.
224
+
225
+ The preprocessor handles automatic tokenization of task strings
226
+ through the TokenizerProcessorStep.
227
+
228
+ Raises:
229
+ SmolVLAError: If preprocessor loading fails
230
+ """
231
+ try:
232
+ from lerobot.policies.factory import make_pre_post_processors
233
+
234
+ logger.info("Loading preprocessor from checkpoint...")
235
+
236
+ # Load preprocessor and postprocessor using policy config
237
+ self.preprocessor, self.postprocessor = make_pre_post_processors(
238
+ self.policy.config,
239
+ pretrained_path=str(self.checkpoint_path),
240
+ device=self.device
241
+ )
242
+
243
+ logger.info("Preprocessor and postprocessor loaded successfully")
244
+
245
+ except Exception as e:
246
+ logger.error(f"Failed to load preprocessor: {e}")
247
+ raise SmolVLAError(f"Preprocessor loading failed: {e}")
248
+
249
+ def _warmup(self):
250
+ """
251
+ Perform warmup inference to initialize CUDA kernels and caches.
252
+
253
+ This reduces latency for the first real inference call.
254
+ """
255
+ if self.device == "cuda":
256
+ logger.info("Performing model warmup...")
257
+ try:
258
+ # Create dummy observation
259
+ dummy_obs = self._create_dummy_observation()
260
+
261
+ # Run dummy inference
262
+ with torch.no_grad():
263
+ # SmolVLA expects a batch of observations
264
+ result = self.policy.select_action(dummy_obs)
265
+ # Result may be a dict with 'action' key or just a tensor
266
+ if isinstance(result, dict):
267
+ _ = result.get('action', result)
268
+
269
+ # Clear cache
270
+ torch.cuda.empty_cache()
271
+
272
+ logger.info("Model warmup complete")
273
+ except Exception as e:
274
+ # Warmup is optional - log but don't fail
275
+ logger.debug(f"Warmup skipped: {e}")
276
+ pass
277
+
278
+ def _create_dummy_observation(self) -> Dict[str, torch.Tensor]:
279
+ """
280
+ Create a dummy observation for warmup.
281
+
282
+ Returns:
283
+ Dictionary with dummy observation tensors
284
+ """
285
+ # Create dummy state
286
+ dummy_state = torch.zeros(1, 6, dtype=torch.float32, device=self.device)
287
+
288
+ # Create dummy images
289
+ dummy_image = self._create_dummy_image()
290
+
291
+ observation = {
292
+ "observation.images.camera1": dummy_image,
293
+ "observation.images.camera2": dummy_image.clone(),
294
+ "observation.images.camera3": dummy_image.clone(),
295
+ "observation.state": dummy_state,
296
+ "task": "dummy task" # Task as string (preprocessor will handle it)
297
+ }
298
+
299
+ # Apply preprocessor to tokenize task
300
+ if self.preprocessor is not None:
301
+ observation = self.preprocessor(observation)
302
+
303
+ return observation
304
+
305
+ def validate_command(self, command: str) -> bool:
306
+ """
307
+ Validate that a command is in the trained task set.
308
+
309
+ Args:
310
+ command: The manipulation command to validate
311
+
312
+ Returns:
313
+ True if command is valid, False otherwise
314
+ """
315
+ return command in self.VALID_COMMANDS
316
+
317
+ def trigger_emergency_stop(self):
318
+ """
319
+ Trigger emergency stop from external thread.
320
+
321
+ This can be called from another thread to safely stop execution.
322
+ """
323
+ logger.warning("Emergency stop triggered externally")
324
+ self._emergency_stop_flag.set()
325
+
326
+ def is_executing(self) -> bool:
327
+ """
328
+ Check if executor is currently running a task.
329
+
330
+ Returns:
331
+ True if a task is being executed
332
+ """
333
+ return self._is_executing
334
+
335
+ def execute(self, command: str, max_steps: int = 500, timeout: Optional[float] = None) -> bool:
336
+ """
337
+ Execute a manipulation task using SmolVLA inference.
338
+
339
+ This is the main entry point for executing manipulation commands.
340
+ It runs the inference loop, capturing observations and executing
341
+ predicted actions until the task is complete or max_steps is reached.
342
+
343
+ Args:
344
+ command: Natural language task description (must be in VALID_COMMANDS)
345
+ max_steps: Maximum number of inference steps to execute
346
+ timeout: Optional timeout override (seconds)
347
+
348
+ Returns:
349
+ True if execution completed successfully, False otherwise
350
+
351
+ Raises:
352
+ SmolVLAError: If command is invalid or execution fails critically
353
+ SafetyViolationError: If safety constraints are violated
354
+ TimeoutError: If execution exceeds timeout
355
+ """
356
+ # Acquire execution lock to prevent concurrent execution
357
+ if not self._execution_lock.acquire(blocking=False):
358
+ raise SmolVLAError("Executor is already running a task")
359
+
360
+ try:
361
+ # Clear emergency stop flag
362
+ self._emergency_stop_flag.clear()
363
+ self._is_executing = True
364
+
365
+ # Validate command against trained task set
366
+ if not self.validate_command(command):
367
+ raise SmolVLAError(
368
+ f"Invalid command: '{command}'. "
369
+ f"Must be one of: {self.VALID_COMMANDS}"
370
+ )
371
+
372
+ # Ensure robot is connected
373
+ if not self.robot_arm.connected:
374
+ logger.info("Robot not connected, attempting to connect...")
375
+ self.robot_arm.connect()
376
+ if not self.robot_arm.connected:
377
+ raise SmolVLAError("Failed to connect to robot arm")
378
+
379
+ # Use provided timeout or default
380
+ execution_timeout = timeout if timeout is not None else self.timeout
381
+
382
+ logger.info(f"Starting SmolVLA execution: '{command}'")
383
+ logger.info(f"Max steps: {max_steps}, Timeout: {execution_timeout}s")
384
+ logger.info(f"Safety checks: {'enabled' if self.enable_safety_checks else 'disabled'}")
385
+
386
+ try:
387
+ # Execute the task with timeout
388
+ success = self._execute_task_with_timeout(command, max_steps, execution_timeout)
389
+
390
+ if success:
391
+ logger.info(f"Task completed successfully: '{command}'")
392
+ else:
393
+ logger.warning(f"Task did not complete within constraints")
394
+
395
+ # Return to home position safely
396
+ logger.info("Returning to home position...")
397
+ self._safe_return_home()
398
+
399
+ return success
400
+
401
+ except TimeoutError as e:
402
+ logger.error(f"Execution timeout: {e}")
403
+ self._emergency_stop()
404
+ raise
405
+ except SafetyViolationError as e:
406
+ logger.error(f"Safety violation: {e}")
407
+ self._emergency_stop()
408
+ raise
409
+ except GPUOutOfMemoryError as e:
410
+ logger.error(f"GPU out of memory: {e}")
411
+ self._handle_gpu_oom()
412
+ self._emergency_stop()
413
+ raise
414
+ except Exception as e:
415
+ logger.error(f"Execution failed: {e}")
416
+ import traceback
417
+ logger.error(f"Traceback: {traceback.format_exc()}")
418
+ self._emergency_stop()
419
+ raise SmolVLAError(f"Execution failed: {e}")
420
+
421
+ finally:
422
+ # Always release lock and reset execution flag
423
+ self._is_executing = False
424
+ self._execution_lock.release()
425
+
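`execute()` blocks and holds the execution lock for the whole task, so a stop request has to arrive from another thread via `trigger_emergency_stop()`. A sketch (checkpoint path illustrative):

```python
import threading

from mortis.smolvla_executor import SmolVLAExecutor

executor = SmolVLAExecutor(
    "outputs/train/smolvla_mortis/checkpoints/last/pretrained_model"  # illustrative path
)
command = "Pick up the skull and place it in the green cup"

worker = threading.Thread(
    target=lambda: executor.execute(command, max_steps=300),
    daemon=True,
)
worker.start()
# ... later, e.g. from a UI handler or watchdog:
executor.trigger_emergency_stop()  # checked at the top of every inference step
worker.join()
```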
426
+ def _execute_task_with_timeout(self, command: str, max_steps: int, timeout: float) -> bool:
427
+ """
428
+ Execute task with timeout monitoring.
429
+
430
+ Args:
431
+ command: The manipulation command
432
+ max_steps: Maximum steps
433
+ timeout: Timeout in seconds
434
+
435
+ Returns:
436
+ True if task completed successfully
437
+
438
+ Raises:
439
+ TimeoutError: If execution exceeds timeout
440
+ """
441
+ start_time = time.time()
442
+
443
+ try:
444
+ return self._execute_task(command, max_steps, start_time, timeout)
445
+ except Exception as e:
446
+ elapsed = time.time() - start_time
447
+ if elapsed >= timeout:
448
+ raise TimeoutError(f"Execution exceeded timeout of {timeout}s")
449
+ raise
450
+
451
+ def _execute_task(self, command: str, max_steps: int, start_time: float, timeout: float) -> bool:
452
+ """
453
+ Internal method to execute the task inference loop.
454
+
455
+ This method implements the core inference loop:
456
+ 1. Capture visual and state observations
457
+ 2. Run SmolVLA inference to predict next action
458
+ 3. Execute action on robot
459
+ 4. Check for task completion
460
+ 5. Repeat until complete or max_steps reached
461
+
462
+ Args:
463
+ command: The manipulation command to execute
+ max_steps: Maximum number of steps
+ start_time: Timestamp when execution began (from time.time())
+ timeout: Overall execution timeout in seconds
465
+
466
+ Returns:
467
+ True if task completed, False if max steps reached
468
+ """
469
+ # Reset task completion tracking variables
470
+ self._previous_action = None
471
+ self._stable_count = 0
472
+ self._previous_state = None
473
+
474
+ # Track execution metrics
475
+ last_progress_log = 0
476
+ progress_log_interval = 50 # Log every 50 steps
477
+
478
+ with torch.no_grad():
479
+ for step in range(max_steps):
480
+ # Check for emergency stop
481
+ if self._emergency_stop_flag.is_set():
482
+ logger.warning("Emergency stop detected, aborting execution")
483
+ return False
484
+
485
+ # Check timeout
486
+ elapsed = time.time() - start_time
487
+ if elapsed >= timeout:
488
+ raise TimeoutError(f"Execution exceeded timeout of {timeout}s at step {step}")
489
+
490
+ # Log progress periodically
491
+ if step - last_progress_log >= progress_log_interval:
492
+ fps = step / elapsed if elapsed > 0 else 0
493
+ logger.info(
494
+ f"Execution progress: step {step}/{max_steps} "
495
+ f"({step/max_steps*100:.1f}%) - {fps:.1f} FPS - {elapsed:.1f}s elapsed"
496
+ )
497
+ last_progress_log = step
498
+
499
+ try:
500
+ # Capture current observation
501
+ observation = self._get_observation()
502
+
503
+ # Add task string (preprocessor will tokenize it)
504
+ observation = self._add_task_string(observation, command)
505
+
506
+ # Apply preprocessor (tokenizes task string automatically)
507
+ observation = self.preprocessor(observation)
508
+
509
+ # Run inference to predict next action (normalized)
510
+ action_normalized = self._run_inference_with_oom_handling(observation)
511
+
512
+ # Debug: log normalized action
513
+ logger.debug(f"Normalized action type: {type(action_normalized)}, shape: {action_normalized.shape if hasattr(action_normalized, 'shape') else 'N/A'}")
514
+
515
+ # Denormalize action using postprocessor
516
+ action = self.postprocessor(action_normalized)
517
+
518
+ # Debug: log denormalized action
519
+ logger.debug(f"Denormalized action: {action}")
520
+
521
+ # Validate action safety (on denormalized action)
522
+ if self.enable_safety_checks:
523
+ self._check_action_safety(action, observation)
524
+
525
+ # Send action to robot
526
+ self._send_action(action)
527
+
528
+ # Check if task is complete (use normalized action for stability check)
529
+ try:
530
+ is_complete = self._is_task_complete(observation, step, action_normalized)
531
+ if is_complete:
532
+ elapsed = time.time() - start_time
533
+ logger.info(
534
+ f"Task completed at step {step} "
535
+ f"(elapsed: {elapsed:.2f}s, avg FPS: {step/elapsed:.1f})"
536
+ )
537
+ return True
538
+ except Exception as e:
539
+ logger.error(f"Error in _is_task_complete: {e}")
540
+ raise
541
+
542
+ # Small delay between steps to maintain ~30 FPS
543
+ time.sleep(0.033)
544
+
545
+ except torch.cuda.OutOfMemoryError as e:
546
+ logger.error(f"GPU out of memory at step {step}")
547
+ raise GPUOutOfMemoryError(f"GPU OOM at step {step}: {e}")
548
+ except SafetyViolationError:
549
+ # Re-raise safety violations
550
+ raise
551
+ except Exception as e:
552
+ logger.error(f"Error at step {step}: {e}")
553
+ raise
554
+
555
+ # Max steps reached without completion
556
+ elapsed = time.time() - start_time
557
+ logger.warning(
558
+ f"Task did not complete within {max_steps} steps "
559
+ f"(elapsed: {elapsed:.2f}s)"
560
+ )
561
+ return False
562
+
563
+ def _get_observation(self) -> Dict[str, torch.Tensor]:
564
+ """
565
+ Get current robot observation (image + state).
566
+
567
+ Captures robot state from robot.get_observation() and images from cameras.
568
+
569
+ Returns:
570
+ Dictionary with observation tensors formatted for SmolVLA:
571
+ - observation.images.camera1: RGB image tensor [1, 3, H, W]
572
+ - observation.images.camera2: RGB image tensor [1, 3, H, W] (if available)
573
+ - observation.images.camera3: RGB image tensor [1, 3, H, W] (if available)
574
+ - observation.state: Joint positions tensor [1, 6]
575
+ """
576
+ try:
577
+ # Get robot state (joint positions)
578
+ robot_obs = self.robot_arm.robot.get_observation()
579
+
580
+ # Extract joint positions in order
581
+ joint_names = [
582
+ "shoulder_pan.pos",
583
+ "shoulder_lift.pos",
584
+ "elbow_flex.pos",
585
+ "wrist_flex.pos",
586
+ "wrist_roll.pos",
587
+ "gripper.pos"
588
+ ]
589
+
590
+ # Build state vector
591
+ state_values = [robot_obs[name] for name in joint_names]
592
+ state_tensor = torch.tensor(
593
+ state_values,
594
+ dtype=torch.float32,
595
+ device=self.device
596
+ ).unsqueeze(0) # Add batch dimension
597
+
598
+ # Get camera images (robot.cameras is a dict of camera objects)
599
+ observation = {"observation.state": state_tensor}
600
+
601
+ if hasattr(self.robot_arm.robot, 'cameras') and self.robot_arm.robot.cameras:
602
+ # Get images from robot's cameras
603
+ for i, (camera_name, camera) in enumerate(self.robot_arm.robot.cameras.items(), start=1):
604
+ try:
605
+ image = camera.read()
606
+ # Convert to tensor: (H, W, C) -> (1, C, H, W)
607
+ image_tensor = torch.from_numpy(image).float().permute(2, 0, 1).unsqueeze(0).to(self.device) / 255.0
608
+ observation[f"observation.images.camera{i}"] = image_tensor
609
+ logger.debug(f"Captured image from {camera_name}: shape={image.shape}")
610
+ except Exception as e:
611
+ logger.warning(f"Failed to read from {camera_name}: {e}")
612
+ observation[f"observation.images.camera{i}"] = self._create_dummy_image()
613
+ else:
614
+ logger.debug("No cameras configured on robot, using dummy images")
615
+
616
+ # Ensure we have 3 camera images (duplicate if needed)
617
+ for i in range(1, 4):
618
+ key = f"observation.images.camera{i}"
619
+ if key not in observation:
620
+ # Use first camera or dummy
621
+ if "observation.images.camera1" in observation:
622
+ observation[key] = observation["observation.images.camera1"].clone()
623
+ else:
624
+ observation[key] = self._create_dummy_image()
625
+
626
+ logger.debug(f"Captured observation with keys: {list(observation.keys())}")
627
+ return observation
628
+
629
+ except Exception as e:
630
+ logger.warning(f"Failed to get robot observation: {e}. Using dummy observation.")
631
+ return self._create_dummy_observation_without_task()
632
+
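The per-frame conversion above turns an `(H, W, C)` uint8 camera frame into the normalized `[1, C, H, W]` float tensor the policy expects; a standalone check of that transform:

```python
import numpy as np
import torch

frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in frame
tensor = torch.from_numpy(frame).float().permute(2, 0, 1).unsqueeze(0) / 255.0

assert tensor.shape == (1, 3, 480, 640)
assert 0.0 <= float(tensor.min()) and float(tensor.max()) <= 1.0
```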
633
+ def _create_dummy_observation_without_task(self) -> Dict[str, torch.Tensor]:
634
+ """
635
+ Create a dummy observation without task string (for error recovery).
636
+
637
+ Returns:
638
+ Dictionary with dummy observation tensors
639
+ """
640
+ dummy_state = torch.zeros(1, 6, dtype=torch.float32, device=self.device)
641
+ dummy_image = self._create_dummy_image()
642
+
643
+ return {
644
+ "observation.images.camera1": dummy_image,
645
+ "observation.images.camera2": dummy_image.clone(),
646
+ "observation.images.camera3": dummy_image.clone(),
647
+ "observation.state": dummy_state,
648
+ }
649
+
650
+ def _add_task_string(self, observation: Dict[str, torch.Tensor], command: str) -> Dict[str, torch.Tensor]:
651
+ """
652
+ Add task string to observation.
653
+
654
+ The preprocessor will automatically tokenize this string through
655
+ the TokenizerProcessorStep.
656
+
657
+ Args:
658
+ observation: Current observation dictionary
659
+ command: Natural language command string
660
+
661
+ Returns:
662
+ Observation dictionary with added task string
663
+ """
664
+ # Simply add the task string - the preprocessor will tokenize it
665
+ observation["task"] = command
666
+
667
+ logger.debug(f"Added task string: '{command}'")
668
+
669
+ return observation
670
+
671
+ def _create_dummy_image(self) -> torch.Tensor:
672
+ """
673
+ Create a dummy image tensor for testing without camera.
674
+
675
+ Returns:
676
+ Dummy image tensor [1, 3, 256, 256] with batch dimension
677
+ """
678
+ # Create black image
679
+ dummy_image = torch.zeros(1, 3, 256, 256, dtype=torch.float32, device=self.device)
680
+ return dummy_image
681
+
682
+ def _send_action(self, action: torch.Tensor):
683
+ """
684
+ Send predicted action to robot.
685
+
686
+ Converts the action tensor from SmolVLA to SO101 command format
687
+ and sends it to the robot arm for execution.
688
+
689
+ Args:
690
+ action: Action tensor from policy (shape: [batch, action_dim])
691
+
692
+ Raises:
693
+ SmolVLAError: If action execution fails
694
+ """
695
+ try:
696
+ # Convert action tensor to robot command dictionary
697
+ action_dict = self._action_to_dict(action)
698
+
699
+ # Send to robot
700
+ self.robot_arm.robot.send_action(action_dict)
701
+
702
+ # Log action at debug level (verbose)
703
+ logger.debug(f"Action sent: {action_dict}")
704
+
705
+ except Exception as e:
706
+ logger.error(f"Failed to send action to robot: {e}")
707
+ raise SmolVLAError(f"Action execution failed: {e}")
708
+
709
+ def _action_to_dict(self, action: torch.Tensor) -> Dict[str, float]:
710
+ """
711
+ Convert action tensor to SO101 command format.
712
+
713
+ Maps the action tensor dimensions to SO101 joint names and converts
714
+ to the dictionary format expected by the robot driver.
715
+
716
+ Args:
717
+ action: Action tensor from policy (shape: [batch, 6] or [6])
718
+
719
+ Returns:
720
+ Dictionary mapping joint names to positions (in degrees or normalized units)
721
+
722
+ Raises:
723
+ SmolVLAError: If action tensor has invalid shape
724
+ """
725
+ # Remove batch dimension if present
726
+ if action.dim() > 1:
727
+ action = action.squeeze(0)
728
+
729
+ # Validate action dimension
730
+ if action.shape[0] != 6:
731
+ raise SmolVLAError(
732
+ f"Invalid action shape: expected 6 dimensions, got {action.shape[0]}"
733
+ )
734
+
735
+ # Convert to numpy
736
+ action_np = action.cpu().numpy()
737
+
738
+ # Map action dimensions to joint names
739
+ # Order must match the training data format
740
+ joint_names = [
741
+ "shoulder_pan.pos",
742
+ "shoulder_lift.pos",
743
+ "elbow_flex.pos",
744
+ "wrist_flex.pos",
745
+ "wrist_roll.pos",
746
+ "gripper.pos"
747
+ ]
748
+
749
+ # Create action dictionary
750
+ action_dict = {
751
+ name: float(action_np[i])
752
+ for i, name in enumerate(joint_names)
753
+ }
754
+
755
+ return action_dict
756
+
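To make the mapping concrete, a 6-dimensional action tensor converts to the SO101 command dict as follows (a sketch with made-up joint values; `executor` stands for any SmolVLAExecutor instance):

    import torch

    action = torch.tensor([[12.5, -30.0, 45.0, 0.0, 90.0, 20.0]])  # shape [1, 6]
    cmd = executor._action_to_dict(action)
    # cmd == {"shoulder_pan.pos": 12.5, "shoulder_lift.pos": -30.0,
    #         "elbow_flex.pos": 45.0, "wrist_flex.pos": 0.0,
    #         "wrist_roll.pos": 90.0, "gripper.pos": 20.0}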
757
+ def _is_task_complete(
758
+ self,
759
+ observation: Dict[str, torch.Tensor],
760
+ step: int,
761
+ action: torch.Tensor
762
+ ) -> bool:
763
+ """
764
+ Determine if the task is complete.
765
+
766
+ This method uses multiple heuristics to detect task completion:
767
+ 1. Minimum step count (ensure task has progressed)
768
+ 2. Maximum step count (assume completion after sufficient time)
769
+ 3. Action stability (detect when robot has settled)
770
+
771
+ In a production system, this could be enhanced with:
772
+ - Learned termination classifier
773
+ - Visual goal detection
774
+ - Force/torque feedback
775
+ - Success detection from camera
776
+
777
+ Args:
778
+ observation: Current observation dictionary
779
+ step: Current step number
780
+ action: Predicted action tensor
781
+
782
+ Returns:
783
+ True if task should be considered complete
784
+ """
785
+ # Minimum steps before considering completion (allow task to progress)
786
+ MIN_STEPS = 100
787
+
788
+ # Maximum steps - assume task is complete after this many steps
789
+ # Most manipulation tasks should complete within 400-450 steps at 30 FPS
790
+ # (approximately 13-15 seconds)
791
+ MAX_STEPS = 450
792
+
793
+ # Early exit: not enough steps yet
794
+ if step < MIN_STEPS:
795
+ return False
796
+
797
+ # Late exit: max steps reached, consider complete
798
+ if step >= MAX_STEPS:
799
+ logger.info(f"Task completion: max steps ({MAX_STEPS}) reached")
800
+ return True
801
+
802
+ # Check for action stability (robot has settled into final position)
803
+ if hasattr(self, '_previous_action') and self._previous_action is not None:
804
+ action_diff = torch.abs(action - self._previous_action).max().item()
805
+
806
+ # If action changes are very small, robot may have settled
807
+ if action_diff < 0.01: # Threshold for "stable" action
808
+ if not hasattr(self, '_stable_count'):
809
+ self._stable_count = 0
810
+ self._stable_count += 1
811
+
812
+ # If stable for 30 consecutive steps (~1 second), consider complete
813
+ if self._stable_count >= 30:
814
+ logger.info(
815
+ f"Task completion: action stability detected at step {step} "
816
+ f"(stable for {self._stable_count} steps)"
817
+ )
818
+ return True
819
+ else:
820
+ # Reset stability counter if action changes significantly
821
+ self._stable_count = 0
822
+
823
+ # Store current action for next comparison
824
+ self._previous_action = action.clone()
825
+
826
+ # Not complete yet
827
+ return False
828
+
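The stability heuristic can be exercised in isolation. A self-contained sketch using the same thresholds as the method above (the synthetic stream of identical actions is an assumption):

    import torch

    STABLE_THRESHOLD = 0.01  # max per-joint change considered "settled"
    STABLE_STEPS = 30        # roughly 1 second at 30 FPS

    previous, stable_count = None, 0
    for step, action in enumerate([torch.zeros(6)] * 200):
        if previous is not None:
            if torch.abs(action - previous).max().item() < STABLE_THRESHOLD:
                stable_count += 1
            else:
                stable_count = 0
        previous = action.clone()
        if stable_count >= STABLE_STEPS:
            print(f"Settled at step {step}")  # fires at step 30 here
            break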
829
+ def _check_action_safety(self, action: torch.Tensor, observation: Dict[str, torch.Tensor]):
830
+ """
831
+ Check if predicted action is safe to execute.
832
+
833
+ Validates:
834
+ 1. Joint position limits
835
+ 2. Joint velocity limits
836
+ 3. Workspace boundaries
837
+
838
+ Args:
839
+ action: Predicted action tensor
840
+ observation: Current observation
841
+
842
+ Raises:
843
+ SafetyViolationError: If action violates safety constraints
844
+ """
845
+ # Convert action to dict for checking
846
+ action_dict = self._action_to_dict(action)
847
+
848
+ # Check joint position limits
849
+ for joint_name, position in action_dict.items():
850
+ if joint_name in self.JOINT_LIMITS:
851
+ min_pos, max_pos = self.JOINT_LIMITS[joint_name]
852
+ if position < min_pos or position > max_pos:
853
+ raise SafetyViolationError(
854
+ f"Joint {joint_name} position {position:.2f} exceeds limits "
855
+ f"[{min_pos}, {max_pos}]"
856
+ )
857
+
858
+ # Check joint velocity limits (per-step position change, used as a velocity proxy)
859
+ if self._previous_state is not None:
860
+ current_state = observation["observation.state"].squeeze(0).cpu().numpy()
861
+ velocity = np.abs(current_state - self._previous_state)
862
+ max_velocity = np.max(velocity)
863
+
864
+ if max_velocity > self.MAX_JOINT_VELOCITY:
865
+ raise SafetyViolationError(
866
+ f"Joint velocity {max_velocity:.2f} exceeds limit "
867
+ f"{self.MAX_JOINT_VELOCITY}"
868
+ )
869
+
870
+ # Update previous state for next check
871
+ self._previous_state = observation["observation.state"].squeeze(0).cpu().numpy().copy()
872
+
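The position check above reduces to a plain range test. A minimal sketch with placeholder limits (the real `JOINT_LIMITS` table is defined on the executor and may differ):

    JOINT_LIMITS = {"shoulder_pan.pos": (-110.0, 110.0)}  # placeholder range

    def check_limits(action_dict: dict) -> None:
        for name, pos in action_dict.items():
            lo, hi = JOINT_LIMITS.get(name, (float("-inf"), float("inf")))
            if not lo <= pos <= hi:
                raise ValueError(f"{name}={pos:.2f} outside [{lo}, {hi}]")

    check_limits({"shoulder_pan.pos": 45.0})  # within range, no exception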
873
+ def _run_inference_with_oom_handling(self, observation: Dict[str, torch.Tensor]) -> torch.Tensor:
874
+ """
875
+ Run inference with GPU out-of-memory handling.
876
+
877
+ Args:
878
+ observation: Current observation
879
+
880
+ Returns:
881
+ Predicted action tensor
882
+
883
+ Raises:
884
+ GPUOutOfMemoryError: If GPU runs out of memory
885
+ """
886
+ try:
887
+ result = self.policy.select_action(observation)
888
+
889
+ # Debug: log what we got back
890
+ logger.debug(f"Policy returned type: {type(result)}")
891
+ if isinstance(result, dict):
892
+ logger.debug(f"Policy returned dict keys: {result.keys()}")
893
+
894
+ # SmolVLA returns a dictionary with 'action' key
895
+ if isinstance(result, dict):
896
+ if 'action' in result:
897
+ return result['action']
898
+ else:
899
+ # Try to find the action in the dict
900
+ logger.error(f"Policy returned dict without 'action' key. Keys: {result.keys()}")
901
+ raise SmolVLAError(f"Policy returned unexpected format: {type(result)}")
902
+ return result
903
+ except torch.cuda.OutOfMemoryError as e:
904
+ logger.error("GPU out of memory during inference")
905
+ # Try to recover by clearing cache
906
+ torch.cuda.empty_cache()
907
+ # Try one more time
908
+ try:
909
+ result = self.policy.select_action(observation)
910
+ if isinstance(result, dict):
911
+ if 'action' in result:
912
+ return result['action']
913
+ else:
914
+ raise SmolVLAError(f"Policy returned unexpected format: {type(result)}")
915
+ return result
916
+ except torch.cuda.OutOfMemoryError:
917
+ raise GPUOutOfMemoryError("GPU out of memory, cannot recover")
918
+
919
+ def _handle_gpu_oom(self):
920
+ """
921
+ Handle GPU out-of-memory error by clearing cache and resetting state.
922
+ """
923
+ logger.info("Handling GPU out-of-memory error...")
924
+
925
+ if self.device == "cuda":
926
+ # Clear CUDA cache
927
+ torch.cuda.empty_cache()
928
+
929
+ # Log memory stats
930
+ if torch.cuda.is_available():
931
+ allocated = torch.cuda.memory_allocated() / 1024**3
932
+ reserved = torch.cuda.memory_reserved() / 1024**3
933
+ logger.info(f"GPU memory: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
934
+
935
+ logger.info("GPU memory cleared")
936
+
937
+ def _safe_return_home(self):
938
+ """
939
+ Safely return robot to home position with error handling.
940
+ """
941
+ try:
942
+ self.robot_arm.move_arm("idle")
943
+ logger.info("Robot returned to home position")
944
+ except Exception as e:
945
+ logger.error(f"Failed to return to home position: {e}")
946
+ # Try direct position command as fallback
947
+ try:
948
+ self.robot_arm.robot.send_action(HOME_POSE)
949
+ logger.info("Robot returned to home using direct command")
950
+ except Exception as e2:
951
+ logger.error(f"Direct home command also failed: {e2}")
952
+
953
+ def _emergency_stop(self):
954
+ """
955
+ Emergency stop: return robot to safe idle position.
956
+
957
+ This is called when an error occurs during execution.
958
+ Sets the emergency stop flag and attempts to safely stop the robot.
959
+ """
960
+ logger.warning("Emergency stop triggered")
961
+
962
+ # Set emergency stop flag
963
+ self._emergency_stop_flag.set()
964
+
965
+ try:
966
+ # Try to stop robot immediately
967
+ self._safe_return_home()
968
+ logger.info("Emergency stop completed - robot in safe position")
969
+ except Exception as e:
970
+ logger.error(f"Emergency stop failed: {e}")
971
+ logger.error("MANUAL INTERVENTION MAY BE REQUIRED")
972
+
973
+ def cleanup(self):
974
+ """
975
+ Clean up resources (camera, GPU memory, etc.).
976
+
977
+ Should be called when the executor is no longer needed.
978
+ """
979
+ logger.info("Cleaning up SmolVLA executor...")
980
+
981
+ # Disconnect camera
982
+ if hasattr(self, 'camera') and self.camera is not None:
983
+ try:
984
+ self.camera.disconnect()
985
+ except Exception as e:
986
+ logger.warning(f"Camera disconnect failed: {e}")
987
+
988
+ # Clear GPU memory
989
+ if hasattr(self, 'device') and self.device == "cuda":
990
+ torch.cuda.empty_cache()
991
+
992
+ logger.info("Cleanup complete")
993
+
994
+ def __del__(self):
995
+ """Destructor to ensure cleanup."""
996
+ try:
997
+ self.cleanup()
998
+ except Exception:
999
+ # Silently ignore cleanup errors in destructor
1000
+ pass
1001
+
1002
+
1003
+ def init_smolvla_executor(
1004
+ checkpoint_path: Optional[str] = None,
1005
+ robot_arm: Optional[MortisArm] = None,
1006
+ device: Optional[str] = None
1007
+ ) -> SmolVLAExecutor:
1008
+ """
1009
+ Factory function to initialize SmolVLA executor with environment configuration.
1010
+
1011
+ Args:
1012
+ checkpoint_path: Path to model checkpoint (uses env var if not provided)
1013
+ robot_arm: Optional MortisArm instance
1014
+ device: Device to use (uses env var or auto-detect if not provided)
1015
+
1016
+ Returns:
1017
+ Initialized SmolVLAExecutor instance
1018
+
1019
+ Raises:
1020
+ SmolVLAError: If initialization fails
1021
+ """
1022
+ # Get checkpoint path from environment if not provided
1023
+ if checkpoint_path is None:
1024
+ checkpoint_path = os.getenv("SMOLVLA_CHECKPOINT_PATH")
1025
+ if checkpoint_path is None:
1026
+ raise SmolVLAError(
1027
+ "No checkpoint path provided and SMOLVLA_CHECKPOINT_PATH not set"
1028
+ )
1029
+
1030
+ # Get device from environment if not provided
1031
+ if device is None:
1032
+ device = os.getenv("SMOLVLA_DEVICE")
1033
+
1034
+ logger.info(f"Initializing SmolVLA executor with checkpoint: {checkpoint_path}")
1035
+
1036
+ return SmolVLAExecutor(
1037
+ checkpoint_path=checkpoint_path,
1038
+ robot_arm=robot_arm,
1039
+ device=device
1040
+ )
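Typical use of the factory above (a sketch; the checkpoint path and the command string are placeholders):

    import os

    os.environ["SMOLVLA_CHECKPOINT_PATH"] = "/path/to/checkpoint"  # placeholder
    executor = init_smolvla_executor()  # device resolved from env or auto-detected
    try:
        completed = executor.execute("move the skull to the green cup")
    finally:
        executor.cleanup()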
src/mortis/stt_service.py ADDED
@@ -0,0 +1,383 @@
1
+ """
2
+ Speech-to-Text service for Mortis voice input.
3
+
4
+ This module provides the STTService class for converting audio input to text,
5
+ with support for Gemini native audio processing and fallback to Google Cloud Speech-to-Text.
6
+ """
7
+
8
+ import os
9
+ import logging
10
+ from pathlib import Path
11
+ from typing import Optional, Literal
12
+ from enum import Enum
13
+ from dotenv import load_dotenv
14
+
15
+ from google import genai
16
+ from google.genai import types
17
+
18
+ # Load environment variables
19
+ REPO_ROOT = Path(__file__).resolve().parents[2]
20
+ load_dotenv(REPO_ROOT / ".env")
21
+
22
+ # Configure logging
23
+ logger = logging.getLogger(__name__)
24
+
25
+
26
+ class STTProvider(Enum):
27
+ """Available Speech-to-Text providers."""
28
+ GEMINI = "gemini"
29
+ GOOGLE_STT = "google_stt"
30
+
31
+
32
+ class AudioFormat(Enum):
33
+ """Supported audio formats."""
34
+ WAV = "wav"
35
+ MP3 = "mp3"
36
+ WEBM = "webm"
37
+ OGG = "ogg"
38
+ FLAC = "flac"
39
+
40
+
41
+ class AudioProcessingError(Exception):
42
+ """Base exception for audio processing errors."""
43
+ pass
44
+
45
+
46
+ class STTService:
47
+ """
48
+ Speech-to-Text service for converting audio input to text.
49
+
50
+ Supports multiple STT providers:
51
+ - Gemini native audio (primary, recommended)
52
+ - Google Cloud Speech-to-Text (fallback)
53
+
54
+ The service automatically handles audio format validation and conversion.
55
+ """
56
+
57
+ def __init__(
58
+ self,
59
+ provider: Optional[STTProvider] = None,
60
+ api_key: Optional[str] = None,
61
+ model_name: Optional[str] = None,
62
+ language_code: str = "en-US",
63
+ enable_fallback: bool = True
64
+ ):
65
+ """
66
+ Initialize STT service.
67
+
68
+ Args:
69
+ provider: STT provider to use (defaults to the STT_PROVIDER env var, or GEMINI)
70
+ api_key: API key for Gemini (defaults to GEMINI_API_KEY env var)
71
+ model_name: Gemini model to use (defaults to GEMINI_MODEL env var or gemini-1.5-flash)
72
+ language_code: Language code for transcription (default: en-US)
73
+ enable_fallback: Whether to enable fallback to Google STT on Gemini failure
74
+ """
75
+ # Determine provider from environment or default to Gemini
76
+ if provider is None:
77
+ provider_str = os.getenv("STT_PROVIDER", "gemini").lower()
78
+ try:
79
+ provider = STTProvider(provider_str)
80
+ except ValueError:
81
+ logger.warning(f"Invalid STT_PROVIDER '{provider_str}', defaulting to GEMINI")
82
+ provider = STTProvider.GEMINI
83
+
84
+ self.provider = provider
85
+ self.language_code = language_code
86
+ self.enable_fallback = enable_fallback
87
+
88
+ # Initialize Gemini client for audio processing
89
+ self.api_key = api_key or os.getenv("GEMINI_API_KEY")
90
+ if not self.api_key:
91
+ raise ValueError("GEMINI_API_KEY must be provided or set in environment")
92
+
93
+ self.model_name = model_name or os.getenv("GEMINI_MODEL", "gemini-1.5-flash")
94
+ self.client = genai.Client(api_key=self.api_key)
95
+
96
+ # Initialize Google Cloud STT client (lazy loading)
97
+ self._google_stt_client = None
98
+
99
+ logger.info(
100
+ f"STTService initialized with provider: {self.provider.value}, "
101
+ f"model: {self.model_name}, language: {self.language_code}, "
102
+ f"fallback: {self.enable_fallback}"
103
+ )
104
+
105
+ def transcribe(self, audio_path: str) -> str:
106
+ """
107
+ Transcribe audio file to text.
108
+
109
+ Args:
110
+ audio_path: Path to audio file
111
+
112
+ Returns:
113
+ Transcribed text
114
+
115
+ Raises:
116
+ AudioProcessingError: If transcription fails with all providers
117
+ FileNotFoundError: If audio file doesn't exist
118
+ """
119
+ # Validate audio file exists
120
+ audio_file = Path(audio_path)
121
+ if not audio_file.exists():
122
+ raise FileNotFoundError(f"Audio file not found: {audio_path}")
123
+
124
+ # Validate audio format
125
+ if not self._validate_audio_format(audio_file):
126
+ raise AudioProcessingError(
127
+ f"Unsupported audio format: {audio_file.suffix}. "
128
+ f"Supported formats: {[fmt.value for fmt in AudioFormat]}"
129
+ )
130
+
131
+ logger.info(f"Transcribing audio file: {audio_path} using {self.provider.value}")
132
+
133
+ # Try primary provider
134
+ try:
135
+ if self.provider == STTProvider.GEMINI:
136
+ return self._transcribe_with_gemini(audio_path)
137
+ elif self.provider == STTProvider.GOOGLE_STT:
138
+ return self._transcribe_with_google_stt(audio_path)
139
+ except Exception as e:
140
+ logger.warning(f"Primary STT provider ({self.provider.value}) failed: {e}")
141
+
142
+ # Try fallback if enabled
143
+ if self.enable_fallback:
144
+ logger.info("Attempting fallback STT provider...")
145
+ try:
146
+ if self.provider == STTProvider.GEMINI:
147
+ # Fallback to Google STT
148
+ return self._transcribe_with_google_stt(audio_path)
149
+ else:
150
+ # Fallback to Gemini
151
+ return self._transcribe_with_gemini(audio_path)
152
+ except Exception as fallback_error:
153
+ logger.error(f"Fallback STT provider also failed: {fallback_error}")
154
+ raise AudioProcessingError(
155
+ f"All STT providers failed. Primary: {e}, Fallback: {fallback_error}"
156
+ ) from fallback_error
157
+ else:
158
+ raise AudioProcessingError(f"STT transcription failed: {e}") from e
159
+
160
+ def _validate_audio_format(self, audio_file: Path) -> bool:
161
+ """
162
+ Validate that audio file format is supported.
163
+
164
+ Args:
165
+ audio_file: Path to audio file
166
+
167
+ Returns:
168
+ True if format is supported, False otherwise
169
+ """
170
+ suffix = audio_file.suffix.lstrip('.').lower()
171
+ supported_formats = [fmt.value for fmt in AudioFormat]
172
+ return suffix in supported_formats
173
+
174
+ def _transcribe_with_gemini(self, audio_path: str) -> str:
175
+ """
176
+ Transcribe audio using Gemini native audio support.
177
+
178
+ Args:
179
+ audio_path: Path to audio file
180
+
181
+ Returns:
182
+ Transcribed text
183
+
184
+ Raises:
185
+ Exception: If Gemini API call fails
186
+ """
187
+ logger.debug(f"Transcribing with Gemini: {audio_path}")
188
+
189
+ try:
190
+ # Upload audio file to Gemini
191
+ audio_file = self.client.files.upload(file=audio_path)
192
+ logger.debug(f"Audio file uploaded: {audio_file.name}")
193
+
194
+ # Create prompt for transcription
195
+ prompt = (
196
+ "Transcribe this audio accurately. "
197
+ "Return only the transcribed text without any additional commentary or formatting."
198
+ )
199
+
200
+ # Generate content with audio
201
+ response = self.client.models.generate_content(
202
+ model=self.model_name,
203
+ contents=[prompt, audio_file]
204
+ )
205
+
206
+ # Extract transcribed text
207
+ if response.text is None:
208
+ logger.warning("Gemini returned None for transcription")
209
+ logger.debug(f"Response object: {response}")
210
+ # Check if there are candidates with parts
211
+ if hasattr(response, 'candidates') and response.candidates:
212
+ logger.debug(f"Response has {len(response.candidates)} candidates")
213
+ for i, candidate in enumerate(response.candidates):
214
+ logger.debug(f"Candidate {i}: {candidate}")
215
+ transcript = ""
216
+ else:
217
+ transcript = response.text.strip()
218
+
219
+ if transcript:
220
+ logger.info(f"Gemini transcription successful: '{transcript[:50]}...'")
221
+ else:
222
+ logger.warning("Gemini transcription returned empty result")
223
+
224
+ # Clean up uploaded file
225
+ try:
226
+ self.client.files.delete(name=audio_file.name)
227
+ logger.debug(f"Deleted uploaded audio file: {audio_file.name}")
228
+ except Exception as cleanup_error:
229
+ logger.warning(f"Failed to delete uploaded audio file: {cleanup_error}")
230
+
231
+ return transcript
232
+
233
+ except Exception as e:
234
+ logger.error(f"Gemini transcription failed: {type(e).__name__}: {e}")
235
+ raise
236
+
237
+ def _transcribe_with_google_stt(self, audio_path: str) -> str:
238
+ """
239
+ Transcribe audio using Google Cloud Speech-to-Text API.
240
+
241
+ Args:
242
+ audio_path: Path to audio file
243
+
244
+ Returns:
245
+ Transcribed text
246
+
247
+ Raises:
248
+ Exception: If Google STT API call fails
249
+ ImportError: If google-cloud-speech is not installed
250
+ """
251
+ logger.debug(f"Transcribing with Google STT: {audio_path}")
252
+
253
+ try:
254
+ from google.cloud import speech_v1
255
+ except ImportError:
256
+ raise ImportError(
257
+ "google-cloud-speech is not installed. "
258
+ "Install it with: pip install google-cloud-speech"
259
+ )
260
+
261
+ # Initialize Google STT client (lazy loading)
262
+ if self._google_stt_client is None:
263
+ self._google_stt_client = speech_v1.SpeechClient()
264
+ logger.debug("Google STT client initialized")
265
+
266
+ # Read audio file
267
+ with open(audio_path, "rb") as audio_file:
268
+ audio_content = audio_file.read()
269
+
270
+ # Determine audio encoding from file extension
271
+ audio_path_obj = Path(audio_path)
272
+ suffix = audio_path_obj.suffix.lstrip('.').lower()
273
+
274
+ encoding_map = {
275
+ "wav": speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
276
+ "mp3": speech_v1.RecognitionConfig.AudioEncoding.MP3,
277
+ "flac": speech_v1.RecognitionConfig.AudioEncoding.FLAC,
278
+ "ogg": speech_v1.RecognitionConfig.AudioEncoding.OGG_OPUS,
279
+ "webm": speech_v1.RecognitionConfig.AudioEncoding.WEBM_OPUS,
280
+ }
281
+
282
+ encoding = encoding_map.get(suffix, speech_v1.RecognitionConfig.AudioEncoding.LINEAR16)
283
+
284
+ # Configure recognition
285
+ audio = speech_v1.RecognitionAudio(content=audio_content)
286
+ config = speech_v1.RecognitionConfig(
287
+ encoding=encoding,
288
+ language_code=self.language_code,
289
+ enable_automatic_punctuation=True,
290
+ )
291
+
292
+ # Perform transcription
293
+ try:
294
+ response = self._google_stt_client.recognize(config=config, audio=audio)
295
+
296
+ # Extract transcript from results
297
+ if not response.results:
298
+ logger.warning("Google STT returned no results")
299
+ return ""
300
+
301
+ # Combine all alternatives (usually just one)
302
+ transcript = " ".join(
303
+ result.alternatives[0].transcript
304
+ for result in response.results
305
+ if result.alternatives
306
+ )
307
+
308
+ logger.info(f"Google STT transcription successful: '{transcript[:50]}...'")
309
+ return transcript.strip()
310
+
311
+ except Exception as e:
312
+ logger.error(f"Google STT transcription failed: {type(e).__name__}: {e}")
313
+ raise
314
+
315
+ def configure(
316
+ self,
317
+ provider: Optional[STTProvider] = None,
318
+ language_code: Optional[str] = None,
319
+ enable_fallback: Optional[bool] = None
320
+ ):
321
+ """
322
+ Reconfigure STT service settings.
323
+
324
+ Args:
325
+ provider: New STT provider to use
326
+ language_code: New language code
327
+ enable_fallback: Whether to enable fallback
328
+ """
329
+ if provider is not None:
330
+ self.provider = provider
331
+ logger.info(f"STT provider changed to: {provider.value}")
332
+
333
+ if language_code is not None:
334
+ self.language_code = language_code
335
+ logger.info(f"Language code changed to: {language_code}")
336
+
337
+ if enable_fallback is not None:
338
+ self.enable_fallback = enable_fallback
339
+ logger.info(f"Fallback {'enabled' if enable_fallback else 'disabled'}")
340
+
341
+
342
+ # Example usage
343
+ if __name__ == "__main__":
344
+ import sys
345
+
346
+ # Configure logging for testing
347
+ logging.basicConfig(
348
+ level=logging.INFO,
349
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
350
+ )
351
+
352
+ # Check for audio file argument
353
+ if len(sys.argv) < 2:
354
+ print("Usage: python -m mortis.stt_service <audio_file>")
355
+ print("Example: python -m mortis.stt_service test_audio.wav")
356
+ sys.exit(1)
357
+
358
+ audio_file = sys.argv[1]
359
+
360
+ try:
361
+ # Create STT service
362
+ stt_service = STTService()
363
+
364
+ # Transcribe audio
365
+ print(f"\nTranscribing: {audio_file}")
366
+ print("-" * 60)
367
+ transcript = stt_service.transcribe(audio_file)
368
+ print(f"Transcript: {transcript}")
369
+ print("-" * 60)
370
+
371
+ except FileNotFoundError as e:
372
+ print(f"Error: {e}")
373
+ sys.exit(1)
374
+ except AudioProcessingError as e:
375
+ print(f"Audio processing error: {e}")
376
+ sys.exit(1)
377
+ except ValueError as e:
378
+ print(f"Configuration error: {e}")
379
+ print("Please set GEMINI_API_KEY in your .env file")
380
+ sys.exit(1)
381
+ except Exception as e:
382
+ print(f"Unexpected error: {type(e).__name__}: {e}")
383
+ sys.exit(1)
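Selecting the fallback provider explicitly, either through the environment or at runtime (a sketch; it assumes google-cloud-speech is installed with credentials configured, and the audio path is a placeholder):

    # Environment-based selection, read once at construction time:
    #   export STT_PROVIDER=google_stt
    stt = STTService()

    # Or reconfigure a live instance:
    stt.configure(provider=STTProvider.GOOGLE_STT, language_code="en-GB")
    text = stt.transcribe("recording.wav")  # placeholder audio file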
src/mortis/tools.py ADDED
@@ -0,0 +1,418 @@
1
+ """
2
+ LLM integration for Mortis conversational AI.
3
+
4
+ This module provides the ask_mortis() function that integrates with the Gemini API
5
+ to generate character-driven responses and coordinate gesture execution.
6
+ """
7
+
8
+ import logging
9
+ import time
10
+ from typing import Tuple, Optional
11
+ from pathlib import Path
12
+
13
+ from .robot import MortisArm
14
+ from .gemini_client import GeminiClient
15
+ from .models import GeminiResponse
16
+
17
+ # Configure logging
18
+ logger = logging.getLogger(__name__)
19
+
20
+ # Global instances
21
+ mortis_arm = MortisArm()
22
+ gemini_client = None # Lazy initialization
23
+ stt_service = None # Lazy initialization
24
+ tts_service = None # Lazy initialization
25
+ intent_router = None # Lazy initialization
26
+ smolvla_executor = None # Lazy initialization
27
+
28
+
29
+ def _get_gemini_client() -> GeminiClient:
30
+ """
31
+ Get or create the global GeminiClient instance.
32
+
33
+ Returns:
34
+ GeminiClient instance
35
+ """
36
+ global gemini_client
37
+ if gemini_client is None:
38
+ gemini_client = GeminiClient()
39
+ logger.info("GeminiClient initialized")
40
+ return gemini_client
41
+
42
+
43
+ def _get_stt_service():
44
+ """
45
+ Get or create the global STTService instance.
46
+
47
+ Returns:
48
+ STTService instance
49
+ """
50
+ global stt_service
51
+ if stt_service is None:
52
+ from .stt_service import STTService
53
+ stt_service = STTService()
54
+ logger.info("STTService initialized")
55
+ return stt_service
56
+
57
+
58
+ def _get_tts_service():
59
+ """
60
+ Get or create the global TTSService instance.
61
+
62
+ Returns:
63
+ TTSService instance
64
+ """
65
+ global tts_service
66
+ if tts_service is None:
67
+ from .tts_service import get_tts_service
68
+ tts_service = get_tts_service()
69
+ logger.info("TTSService initialized")
70
+ return tts_service
71
+
72
+
73
+ def _get_intent_router():
74
+ """
75
+ Get or create the global IntentRouter instance.
76
+
77
+ Returns:
78
+ IntentRouter instance
79
+ """
80
+ global intent_router
81
+ if intent_router is None:
82
+ from .intent_router import IntentRouter
83
+ intent_router = IntentRouter()
84
+ logger.info("IntentRouter initialized")
85
+ return intent_router
86
+
87
+
88
+ def _get_smolvla_executor():
89
+ """
90
+ Get or create the global SmolVLAExecutor instance.
91
+
92
+ Returns:
93
+ SmolVLAExecutor instance or None if not configured
94
+ """
95
+ global smolvla_executor
96
+ if smolvla_executor is None:
97
+ import os
98
+
99
+ # Check if we're in simulation mode
100
+ robot_mode = os.getenv("ROBOT_MODE", "physical").lower()
101
+ if robot_mode == "simulation":
102
+ logger.info("SmolVLA disabled in simulation mode")
103
+ smolvla_executor = None
104
+ return None
105
+
106
+ checkpoint_path = os.getenv("SMOLVLA_CHECKPOINT_PATH")
107
+
108
+ if checkpoint_path:
109
+ try:
110
+ from .smolvla_executor import SmolVLAExecutor
111
+ smolvla_executor = SmolVLAExecutor(
112
+ checkpoint_path=checkpoint_path,
113
+ robot_arm=mortis_arm
114
+ )
115
+ logger.info(f"SmolVLAExecutor initialized with checkpoint: {checkpoint_path}")
116
+ except Exception as e:
117
+ logger.warning(f"Failed to initialize SmolVLAExecutor: {e}")
118
+ logger.warning("Manipulation commands will fall back to gestures")
119
+ smolvla_executor = None
120
+ else:
121
+ logger.info("SMOLVLA_CHECKPOINT_PATH not set, manipulation commands will use gestures")
122
+ smolvla_executor = None
123
+
124
+ return smolvla_executor
125
+
126
+
127
+ def ask_mortis(
128
+ user_msg: Optional[str] = None,
129
+ model_name: Optional[str] = None,
130
+ audio_path: Optional[str] = None
131
+ ) -> Tuple[str, str, str]:
132
+ """
133
+ Send user message to Gemini API and get Mortis response with gesture.
134
+
135
+ This function supports both text and voice input through a unified interface.
136
+ It implements the complete voice-to-text-to-Gemini-to-TTS pipeline with
137
+ latency monitoring.
138
+
139
+ Processing flow:
140
+ 1. If audio_path provided, transcribe to text using STT
141
+ 2. Connect to robot arm if not already connected
142
+ 3. Send text message to Gemini API
143
+ 4. Parse structured JSON response
144
+ 5. Return message, mood, and gesture for execution
145
+
146
+ Args:
147
+ user_msg: User's input message text (optional if audio_path provided)
148
+ model_name: Optional Gemini model name (uses default from env if not provided)
149
+ audio_path: Optional path to audio file for voice input
150
+
151
+ Returns:
152
+ Tuple of (message, mood, gesture) where:
153
+ - message: Text response from Mortis
154
+ - mood: Emotional mood (e.g., "ominous", "playful")
155
+ - gesture: Gesture to execute (e.g., "wave", "idle")
156
+
157
+ Raises:
158
+ ValueError: If neither user_msg nor audio_path is provided
159
+
160
+ Note:
161
+ This function maintains backward compatibility with the previous API.
162
+ The gesture is executed immediately via mortis_arm.move_arm() when the arm
163
+ is connected; it is also returned so the caller can display or replay it.
164
+
165
+ Latency monitoring logs are generated for the voice processing pipeline.
166
+ """
167
+ pipeline_start = time.time()
168
+
169
+ # Validate input
170
+ if user_msg is None and audio_path is None:
171
+ raise ValueError("Either user_msg or audio_path must be provided")
172
+
173
+ # Voice input processing
174
+ if audio_path is not None:
175
+ logger.info(f"🎤 Processing voice input from: {audio_path}")
176
+ stt_start = time.time()
177
+
178
+ try:
179
+ # Get STT service
180
+ stt = _get_stt_service()
181
+
182
+ # Transcribe audio to text
183
+ user_msg = stt.transcribe(audio_path)
184
+
185
+ stt_latency = time.time() - stt_start
186
+ logger.info(f"⏱️ STT latency: {stt_latency:.2f}s")
187
+ logger.info(f"📝 Transcribed: '{user_msg[:50]}...'")
188
+
189
+ if not user_msg or not user_msg.strip():
190
+ logger.warning("⚠️ STT returned empty transcription")
191
+ return "I couldn't hear you... speak again.", "nervous", "idle"
192
+
193
+ except Exception as e:
194
+ logger.error(f"❌ Voice input processing failed: {e}")
195
+ return "The spirits couldn't understand... try again.", "ominous", "idle"
196
+
197
+ # Ensure robot is connected
198
+ if not mortis_arm.connected:
199
+ try:
200
+ mortis_arm.connect()
201
+ logger.info("Robot arm connected")
202
+ except Exception as e:
203
+ logger.error(f"Failed to connect to robot arm: {e}")
204
+ # Continue anyway - we can still generate responses
205
+
206
+ # Get Gemini client
207
+ client = _get_gemini_client()
208
+
209
+ # Reconfigure model if specified
210
+ if model_name:
211
+ client.configure_model(model_name=model_name)
212
+ logger.info(f"Using Gemini model: {model_name}")
213
+
214
+ # Send message to Gemini
215
+ logger.info(f"💬 Asking Mortis: {user_msg[:50]}...")
216
+ gemini_start = time.time()
217
+
218
+ response_json = client.send_message(user_msg)
219
+
220
+ gemini_latency = time.time() - gemini_start
221
+ logger.info(f"⏱️ Gemini latency: {gemini_latency:.2f}s")
222
+
223
+ # Parse response using IntentRouter
224
+ try:
225
+ # Get intent router
226
+ router = _get_intent_router()
227
+
228
+ # Parse Gemini response into Intent
229
+ intent = router.parse_gemini_response(response_json)
230
+
231
+ # Extract fields for return
232
+ message = intent.message
233
+ mood = intent.mood
234
+ gesture = intent.gesture if intent.gesture else "idle"
235
+
236
+ # Route based on intent type
237
+ execution_path = router.route_intent(intent)
238
+
239
+ if execution_path == "manipulation":
240
+ # Valid manipulation command - attempt SmolVLA execution
241
+ logger.info(f"🤖 Manipulation command detected: '{intent.command}'")
242
+
243
+ # Try to get SmolVLA executor
244
+ executor = _get_smolvla_executor()
245
+
246
+ if executor is not None:
247
+ try:
248
+ # Execute manipulation task
249
+ logger.info(f"Executing manipulation task: {intent.command}")
250
+ success = executor.execute(intent.command)
251
+
252
+ if success:
253
+ logger.info(f"✅ Manipulation task completed successfully")
254
+ else:
255
+ logger.warning(f"⚠️ Manipulation task did not complete fully")
256
+
257
+ # Return with "manipulation" as gesture to indicate manipulation was executed
258
+ gesture = "manipulation"
259
+
260
+ except Exception as e:
261
+ logger.error(f"❌ SmolVLA execution failed: {e}")
262
+ logger.info("Falling back to gesture execution")
263
+
264
+ # Fallback to gesture execution
265
+ gesture = "idle"
266
+ if mortis_arm.connected:
267
+ mortis_arm.move_arm(gesture)
268
+ else:
269
+ # No SmolVLA executor available, fall back to gesture
270
+ logger.warning("SmolVLA executor not available, falling back to gesture")
271
+ gesture = "idle"
272
+ if mortis_arm.connected:
273
+ mortis_arm.move_arm(gesture)
274
+
275
+ elif execution_path == "gesture":
276
+ # Conversational response with gesture
277
+ logger.info(f"💬 Conversation with gesture: {gesture}")
278
+
279
+ # Execute gesture immediately
280
+ if mortis_arm.connected:
281
+ try:
282
+ mortis_arm.move_arm(gesture)
283
+ except Exception as e:
284
+ logger.error(f"Failed to execute gesture '{gesture}': {e}")
285
+
286
+ elif execution_path == "invalid":
287
+ # Invalid intent - fall back to gesture
288
+ logger.warning(f"⚠️ Invalid intent: {intent.validation_error}")
289
+ logger.info("Falling back to conversational gesture")
290
+
291
+ # Use gesture from intent or default to idle
292
+ gesture = intent.gesture if intent.gesture else "idle"
293
+
294
+ # Execute gesture
295
+ if mortis_arm.connected:
296
+ try:
297
+ mortis_arm.move_arm(gesture)
298
+ except Exception as e:
299
+ logger.error(f"Failed to execute fallback gesture '{gesture}': {e}")
300
+
301
+ # Calculate total pipeline latency
302
+ total_latency = time.time() - pipeline_start
303
+ logger.info(f"⏱️ Total pipeline latency: {total_latency:.2f}s")
304
+ logger.info(f"👻 Mortis responds (path: {execution_path}, mood: {mood}, gesture: {gesture})")
305
+
306
+ return message, mood, gesture
307
+
308
+ except (ValueError, KeyError) as e:
309
+ # If parsing fails, return safe defaults
310
+ logger.error(f"Failed to parse Gemini response: {e}")
311
+ logger.error(f"Response JSON: {response_json}")
312
+
313
+ # Return fallback response
314
+ return "The spirits are confused... try again.", "ominous", "idle"
315
+
316
+
317
+ def ask_mortis_with_voice(
318
+ user_msg: Optional[str] = None,
319
+ model_name: Optional[str] = None,
320
+ audio_path: Optional[str] = None,
321
+ generate_audio: bool = True
322
+ ) -> Tuple[str, str, str, Optional[str]]:
323
+ """
324
+ Complete voice-to-text-to-Gemini-to-TTS pipeline with audio output.
325
+
326
+ This is a convenience function that wraps ask_mortis() and adds TTS
327
+ generation for the response. It provides the full multi-modal experience.
328
+
329
+ Args:
330
+ user_msg: User's input message text (optional if audio_path provided)
331
+ model_name: Optional Gemini model name
332
+ audio_path: Optional path to audio file for voice input
333
+ generate_audio: Whether to generate audio output (default: True)
334
+
335
+ Returns:
336
+ Tuple of (message, mood, gesture, audio_path) where:
337
+ - message: Text response from Mortis
338
+ - mood: Emotional mood
339
+ - gesture: Gesture to execute
340
+ - audio_path: Path to generated audio file (None if generation fails)
341
+
342
+ Note:
343
+ This function logs latency for the complete voice processing pipeline
344
+ including STT, Gemini inference, and TTS generation.
345
+ """
346
+ pipeline_start = time.time()
347
+
348
+ # Get text response from Gemini (handles STT if audio_path provided)
349
+ message, mood, gesture = ask_mortis(
350
+ user_msg=user_msg,
351
+ model_name=model_name,
352
+ audio_path=audio_path
353
+ )
354
+
355
+ # Generate audio response if requested
356
+ response_audio_path = None
357
+ if generate_audio:
358
+ tts_start = time.time()
359
+
360
+ try:
361
+ # Get TTS service
362
+ tts = _get_tts_service()
363
+
364
+ # Generate audio
365
+ response_audio_path = tts.synthesize(message)
366
+
367
+ tts_latency = time.time() - tts_start
368
+ logger.info(f"⏱️ TTS latency: {tts_latency:.2f}s")
369
+
370
+ if response_audio_path:
371
+ logger.info(f"🔊 Audio generated: {response_audio_path}")
372
+ else:
373
+ logger.warning("⚠️ TTS returned None")
374
+
375
+ except Exception as e:
376
+ logger.error(f"❌ TTS generation failed: {e}")
377
+ # Continue without audio - text response is still valid
378
+
379
+ # Log total pipeline latency including TTS
380
+ total_latency = time.time() - pipeline_start
381
+ logger.info(f"⏱️ Complete voice pipeline latency: {total_latency:.2f}s")
382
+
383
+ return message, mood, gesture, response_audio_path
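A minimal round-trip through the wrapper above (a sketch; the recording path is a placeholder and valid API credentials are assumed):

    message, mood, gesture, audio_out = ask_mortis_with_voice(
        audio_path="command.wav",  # placeholder recording
        generate_audio=True,
    )
    print(message, mood, gesture)
    if audio_out:
        print(f"Play response from: {audio_out}")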
384
+
385
+
386
+
387
+ if __name__ == "__main__":
388
+ # Configure logging for testing
389
+ logging.basicConfig(level=logging.INFO)
390
+
391
+ # Test conversational interactions
392
+ print("=== Test 1: Greeting ===")
393
+ message, mood, gesture = ask_mortis("Mortis, someone is entering the lab… act!")
394
+ print(f"Message: {message}")
395
+ print(f"Mood: {mood}")
396
+ print(f"Gesture: {gesture}")
397
+ print()
398
+
399
+ print("=== Test 2: Introduction ===")
400
+ message, mood, gesture = ask_mortis("Introduce yourself with a sinister bow.")
401
+ print(f"Message: {message}")
402
+ print(f"Mood: {mood}")
403
+ print(f"Gesture: {gesture}")
404
+ print()
405
+
406
+ print("=== Test 3: Action sequence ===")
407
+ message, mood, gesture = ask_mortis("Grab the cursed vial and then release it.")
408
+ print(f"Message: {message}")
409
+ print(f"Mood: {mood}")
410
+ print(f"Gesture: {gesture}")
411
+ print()
412
+
413
+ print("=== Test 4: Manipulation command ===")
414
+ message, mood, gesture = ask_mortis("Can you move the skull to the green cup?")
415
+ print(f"Message: {message}")
416
+ print(f"Mood: {mood}")
417
+ print(f"Gesture: {gesture}")
418
+ print()
src/mortis/tts_service.py ADDED
@@ -0,0 +1,225 @@
1
+ """
2
+ Text-to-Speech service for Mortis voice output.
3
+
4
+ Provides TTS capabilities using Google Cloud Text-to-Speech API with
5
+ fallback to local gTTS for offline scenarios.
6
+ """
7
+
8
+ import os
9
+ import time
10
+ import logging
11
+ from pathlib import Path
12
+ from typing import Optional
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+
17
+ class TTSService:
18
+ """
19
+ Text-to-Speech service for converting Mortis responses to audio.
20
+
21
+ Uses Google Cloud TTS as primary service with gTTS as fallback.
22
+ Configured for a deep, ominous voice suitable for Mortis character.
23
+ """
24
+
25
+ def __init__(
26
+ self,
27
+ output_dir: str = "outputs",
28
+ use_google_tts: bool = True,
29
+ voice_name: str = "en-US-Neural2-D",
30
+ speaking_rate: float = 0.9,
31
+ pitch: float = -2.0
32
+ ):
33
+ """
34
+ Initialize TTS service.
35
+
36
+ Args:
37
+ output_dir: Directory for generated audio files
38
+ use_google_tts: Whether to use Google Cloud TTS (requires credentials)
39
+ voice_name: Google TTS voice name (Neural2-D is deep male voice)
40
+ speaking_rate: Speech speed (0.9 = slightly slower for ominous effect)
41
+ pitch: Voice pitch (-2.0 = lower for spooky voice)
42
+ """
43
+ self.output_dir = Path(output_dir)
44
+ self.output_dir.mkdir(parents=True, exist_ok=True)
45
+
46
+ self.use_google_tts = use_google_tts
47
+ self.voice_name = voice_name
48
+ self.speaking_rate = speaking_rate
49
+ self.pitch = pitch
50
+
51
+ # Try to initialize Google TTS client
52
+ self.google_client = None
53
+ self.texttospeech = None
54
+ if self.use_google_tts:
55
+ try:
56
+ from google.cloud import texttospeech
57
+ self.google_client = texttospeech.TextToSpeechClient()
58
+ self.texttospeech = texttospeech
59
+ logger.info("Google Cloud TTS initialized successfully")
60
+ except ImportError as e:
61
+ logger.warning(f"Google Cloud TTS not available: {e}. Will use gTTS fallback.")
62
+ self.use_google_tts = False
63
+ except Exception as e:
64
+ logger.warning(f"Failed to initialize Google TTS: {e}. Will use gTTS fallback.")
65
+ self.use_google_tts = False
66
+
67
+ logger.info(f"TTS Service initialized (Google TTS: {self.use_google_tts})")
68
+
69
+ def synthesize(self, text: str, filename: Optional[str] = None) -> Optional[str]:
70
+ """
71
+ Convert text to speech audio file.
72
+
73
+ Args:
74
+ text: Text to convert to speech
75
+ filename: Optional custom filename (without extension)
76
+
77
+ Returns:
78
+ Path to generated audio file, or None if synthesis fails
79
+ """
80
+ if not text or not text.strip():
81
+ logger.warning("Empty text provided to TTS service")
82
+ return None
83
+
84
+ # Generate filename if not provided
85
+ if filename is None:
86
+ timestamp = int(time.time() * 1000)
87
+ filename = f"mortis_response_{timestamp}"
88
+
89
+ # Try Google TTS first
90
+ if self.use_google_tts and self.google_client:
91
+ try:
92
+ audio_path = self._synthesize_google_tts(text, filename)
93
+ logger.info(f"Generated audio with Google TTS: {audio_path}")
94
+ return audio_path
95
+ except Exception as e:
96
+ logger.error(f"Google TTS failed: {e}. Falling back to gTTS.")
97
+
98
+ # Fallback to gTTS
99
+ try:
100
+ audio_path = self._synthesize_gtts(text, filename)
101
+ logger.info(f"Generated audio with gTTS: {audio_path}")
102
+ return audio_path
103
+ except Exception as e:
104
+ logger.error(f"gTTS also failed: {e}. No audio generated.")
105
+ return None
106
+
107
+ def _synthesize_google_tts(self, text: str, filename: str) -> str:
108
+ """
109
+ Synthesize speech using Google Cloud TTS.
110
+
111
+ Args:
112
+ text: Text to synthesize
113
+ filename: Base filename (without extension)
114
+
115
+ Returns:
116
+ Path to generated MP3 file
117
+ """
118
+ # Prepare synthesis input
119
+ synthesis_input = self.texttospeech.SynthesisInput(text=text)
120
+
121
+ # Configure voice parameters for Mortis character
122
+ voice = self.texttospeech.VoiceSelectionParams(
123
+ language_code="en-US",
124
+ name=self.voice_name,
125
+ ssml_gender=self.texttospeech.SsmlVoiceGender.MALE
126
+ )
127
+
128
+ # Configure audio output
129
+ audio_config = self.texttospeech.AudioConfig(
130
+ audio_encoding=self.texttospeech.AudioEncoding.MP3,
131
+ speaking_rate=self.speaking_rate,
132
+ pitch=self.pitch
133
+ )
134
+
135
+ # Perform synthesis
136
+ response = self.google_client.synthesize_speech(
137
+ input=synthesis_input,
138
+ voice=voice,
139
+ audio_config=audio_config
140
+ )
141
+
142
+ # Save audio file
143
+ output_path = self.output_dir / f"{filename}.mp3"
144
+ with open(output_path, "wb") as out:
145
+ out.write(response.audio_content)
146
+
147
+ return str(output_path)
148
+
149
+ def _synthesize_gtts(self, text: str, filename: str) -> str:
150
+ """
151
+ Synthesize speech using gTTS (local fallback).
152
+
153
+ Args:
154
+ text: Text to synthesize
155
+ filename: Base filename (without extension)
156
+
157
+ Returns:
158
+ Path to generated MP3 file
159
+ """
160
+ from gtts import gTTS
161
+
162
+ # Create TTS object with slower speech for ominous effect
163
+ tts = gTTS(text=text, lang='en', slow=True)
164
+
165
+ # Save audio file
166
+ output_path = self.output_dir / f"{filename}.mp3"
167
+ tts.save(str(output_path))
168
+
169
+ return str(output_path)
170
+
171
+ def cleanup_old_files(self, max_age_seconds: int = 3600):
172
+ """
173
+ Remove old audio files to prevent disk space issues.
174
+
175
+ Args:
176
+ max_age_seconds: Maximum age of files to keep (default: 1 hour)
177
+ """
178
+ current_time = time.time()
179
+ removed_count = 0
180
+
181
+ for audio_file in self.output_dir.glob("mortis_response_*.mp3"):
182
+ try:
183
+ file_age = current_time - audio_file.stat().st_mtime
184
+ if file_age > max_age_seconds:
185
+ audio_file.unlink()
186
+ removed_count += 1
187
+ except Exception as e:
188
+ logger.warning(f"Failed to remove old file {audio_file}: {e}")
189
+
190
+ if removed_count > 0:
191
+ logger.info(f"Cleaned up {removed_count} old audio files")
192
+
193
+
194
+ # Global TTS service instance
195
+ _tts_service: Optional[TTSService] = None
196
+
197
+
198
+ def get_tts_service() -> TTSService:
199
+ """
200
+ Get or create global TTS service instance.
201
+
202
+ Returns:
203
+ Singleton TTSService instance
204
+ """
205
+ global _tts_service
206
+ if _tts_service is None:
207
+ # Check if Google Cloud credentials are available
208
+ use_google = bool(os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))
209
+ _tts_service = TTSService(use_google_tts=use_google)
210
+ return _tts_service
211
+
212
+
213
+ def synthesize_speech(text: str, filename: Optional[str] = None) -> Optional[str]:
214
+ """
215
+ Convenience function to synthesize speech using global TTS service.
216
+
217
+ Args:
218
+ text: Text to convert to speech
219
+ filename: Optional custom filename
220
+
221
+ Returns:
222
+ Path to generated audio file, or None if synthesis fails
223
+ """
224
+ service = get_tts_service()
225
+ return service.synthesize(text, filename)
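End-to-end, the module-level helper is the simplest entry point (a sketch):

    audio_path = synthesize_speech("The spirits stir tonight...")
    if audio_path:
        print(f"Audio written to: {audio_path}")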