Deploy Computer Agent v2.0 full stack

Files changed (9)

.gitignore +6 -0
README.md +93 -0
core_agent.py +707 -0
e2bqwen.py +500 -0
eval_harness.py +366 -0
mcp_tools.py +479 -0
requirements.txt +25 -0
templates/viewer.html +753 -0
voice_interface.py +137 -0

.gitignore ADDED

	@@ -0,0 +1,6 @@

+.local/
+__pycache__/
+*.pyc
+memory_db/
+tmp/
+eval_results/

README.md ADDED

	@@ -0,0 +1,93 @@

+---
+title: Computer Agent v2.0
+emoji: 🤖
+colorFrom: purple
+colorTo: blue
+sdk: gradio
+sdk_version: "5.0.0"
+app_file: app.py
+pinned: false
+license: apache-2.0
+short_description: "Enhanced universal computer agent with planner, MCP, memory & voice"
+---
+# 🤖 Open Computer Agent v2.0
+An **enhanced** universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev).
+## What's New in v2.0
+| Feature | Description |
+|---------|-------------|
+| 🧠 **Hierarchical Planner** | Breaks goals into subtasks before execution using a cheap text model |
+| 🔌 **Playwright MCP** | Semantic browser control (click by text/role, extract tables/links, evaluate JS) |
+| 🎯 **Multi-Model Router** | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
+| 🧩 **Set-of-Marks Vision** | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
+| 🗄️ **Long-Term Memory** | ChromaDB vector store retrieves similar past tasks and proven strategies |
+| 🔍 **Verifier Agent** | Checks subtask completion and triggers recovery loops |
+| 🛑 **Human-in-the-Loop** | Pauses on sensitive actions (payments, emails, deletes) for user approval |
+| 🎙️ **Voice I/O** | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
+| 💰 **Cost Dashboard** | Real-time $/task, token usage, and latency tracking |
+| 📹 **Session Recording** | Saves every step as replayable macros with GIF/MP4 export potential |
+| 🧪 **Enhanced Eval** | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
+## Architecture
+```
+User Input (Text / Voice / File)
+       |
+       v
+[Intelligence Router] ----> Planner (JSON DAG)
+       |
+       v
+[Memory Retrieval] (ChromaDB)
+       |
+       v
+[Plan Executor]
+       |
+       +---> [Browser Sub-Agent] (Playwright MCP)
+       +---> [Desktop Sub-Agent] (E2B + SoM Vision)
+       +---> [Coder Sub-Agent] (Code Interpreter)
+       +---> [HF Hub Sub-Agent] (Search / Upload)
+       |
+       v
+[Verifier] -> Retry / Alternative / Continue
+       |
+       v
+[Macro Saver] + Cost Report + Session Recording
+```
+## Quick Start
+1. Set your **HF_TOKEN** and **E2B_API_KEY** in the Space Secrets.
+2. Type a task (or speak it) and hit **🚀 Let's go!**.
+3. Watch the agent plan, execute, verify, and report costs.
+## Sensitive Actions
+By default, the agent pauses before:
+- Payments, purchases, subscriptions
+- Sending emails/messages/posts
+- Deleting files or uninstalling software
+- Password/credit-card fields
+Enable **Auto-approve all actions** in Advanced Options to disable HITL.
+## Cost Budget
+Default budget is **$2.00 USD per session**. The router automatically downgrades to cheaper models as the budget is consumed.
+## Benchmarks
+Run the built-in eval suite:
+```python
+from eval_harness import EvaluationHarness
+# See eval_harness.py for usage
+```
+## Credits
+- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
+- [E2B](https://e2b.dev) for secure sandboxed desktops
+- [Playwright](https://playwright.dev) for browser automation
+- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
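
The Cost Budget section of the README says the router downgrades to cheaper models as the budget is consumed. A minimal sketch of that rule, assuming the 90% threshold that `IntelligenceRouter.select_model` applies in core_agent.py. `BudgetRouter` and `downgrade_at` are hypothetical names used only for illustration, and no model is actually called:

```python
from dataclasses import dataclass


@dataclass
class BudgetRouter:
    """Toy stand-in for IntelligenceRouter.select_model (no network calls)."""

    budget_usd: float = 2.0
    spent_usd: float = 0.0
    downgrade_at: float = 0.9  # fraction of the budget; mirrors the 0.9 in core_agent.py

    def select(self, has_images: bool, complexity: str = "medium") -> str:
        # Once the threshold is crossed, always pick the cheapest tier.
        if self.spent_usd >= self.budget_usd * self.downgrade_at:
            return "fast_vision" if has_images else "fast_text"
        if has_images:
            hard = complexity in ("high", "complex", "spatial")
            return "powerful_vision" if hard else "fast_vision"
        hard = complexity in ("high", "complex", "reasoning")
        return "powerful_text" if hard else "fast_text"


router = BudgetRouter(budget_usd=2.0)
before = router.select(has_images=True, complexity="high")
router.spent_usd = 1.90  # 95% of the budget has been spent
after = router.select(has_images=True, complexity="high")
print(before, "->", after)
```

Under these assumptions the downgrade is a single step at the threshold rather than a gradual slide: the same high-complexity vision request is served by the powerful tier until 90% of the budget is spent, and by the fast tier afterwards.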

core_agent.py ADDED

	@@ -0,0 +1,707 @@

+"""
+core_agent.py — Enhanced Computer Agent Brain
+=============================================
+Hierarchical Planner + Verifier + Multi-Model Router + Long-Term Memory
+"""
+import os
+import json
+import time
+import uuid
+from datetime import datetime
+from typing import Any, Dict, List, Optional, Tuple
+from dataclasses import dataclass, field
+import numpy as np
+from PIL import Image, ImageDraw, ImageFont
+# Smolagents
+from smolagents import CodeAgent, tool
+from smolagents.agent_types import AgentImage
+from smolagents.memory import ActionStep, TaskStep
+from smolagents.models import ChatMessage, Model, HfApiModel
+from smolagents.monitoring import LogLevel
+# Local model fallback
+from huggingface_hub import InferenceClient
+# Try ChromaDB for memory
+try:
+    import chromadb
+    from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
+    HAS_CHROMA = True
+except ImportError:
+    HAS_CHROMA = False
+# Try sentence-transformers for embeddings
+try:
+    from sentence_transformers import SentenceTransformer
+    HAS_ST = True
+except ImportError:
+    HAS_ST = False
+# ---------------------------------------------------------------------------
+# Data models
+# ---------------------------------------------------------------------------
+@dataclass
+class Subtask:
+    id: str
+    description: str
+    status: str = "pending"  # pending | running | completed | failed
+    strategy: str = "auto"   # browser | desktop | code | vision
+    depends_on: List[str] = field(default_factory=list)
+    result: Any = None
+    retries: int = 0
+    max_retries: int = 2
+@dataclass
+class Plan:
+    goal: str
+    subtasks: List[Subtask]
+    created_at: float = field(default_factory=time.time)
+@dataclass
+class ModelCall:
+    model_id: str
+    tokens_in: int = 0
+    tokens_out: int = 0
+    latency_ms: float = 0.0
+    cost_usd: float = 0.0
+    timestamp: float = field(default_factory=time.time)
+# ---------------------------------------------------------------------------
+# Multi-Model Intelligence Router
+# ---------------------------------------------------------------------------
+MODEL_REGISTRY = {
+    "fast_vision": {
+        "model_id": "Qwen/Qwen2.5-VL-7B-Instruct",
+        "endpoint": None,  # Use HF Inference API
+        "type": "vision",
+        "cost_per_1k_in": 0.0001,
+        "cost_per_1k_out": 0.0002,
+        "max_tokens": 2048,
+    },
+    "powerful_vision": {
+        "model_id": "Qwen/Qwen2.5-VL-72B-Instruct",
+        "endpoint": None,
+        "type": "vision",
+        "cost_per_1k_in": 0.001,
+        "cost_per_1k_out": 0.002,
+        "max_tokens": 4096,
+    },
+    "fast_text": {
+        "model_id": "Qwen/Qwen2.5-32B-Instruct",
+        "endpoint": None,
+        "type": "text",
+        "cost_per_1k_in": 0.0002,
+        "cost_per_1k_out": 0.0004,
+        "max_tokens": 4096,
+    },
+    "powerful_text": {
+        "model_id": "Qwen/Qwen3-235B-A22B",
+        "endpoint": None,
+        "type": "text",
+        "cost_per_1k_in": 0.0015,
+        "cost_per_1k_out": 0.003,
+        "max_tokens": 8192,
+    },
+}
+class IntelligenceRouter(Model):
+    """Routes tasks to the optimal model based on complexity, modality, and cost."""
+    def __init__(
+        self,
+        hf_token: Optional[str] = None,
+        default_vision: str = "powerful_vision",
+        default_text: str = "fast_text",
+        cost_budget_usd: float = 1.0,
+    ):
+        super().__init__()
+        self.hf_token = hf_token or os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
+        self.default_vision = default_vision
+        self.default_text = default_text
+        self.cost_budget_usd = cost_budget_usd
+        self.cost_so_far_usd = 0.0
+        self.call_history: List[ModelCall] = []
+        self._clients: Dict[str, InferenceClient] = {}
+    def _get_client(self, model_key: str) -> InferenceClient:
+        if model_key not in self._clients:
+            cfg = MODEL_REGISTRY[model_key]
+            self._clients[model_key] = InferenceClient(
+                model=cfg["model_id"],
+                token=self.hf_token,
+            )
+        return self._clients[model_key]
+    def select_model(
+        self,
+        task_type: str = "vision",
+        complexity: str = "medium",
+        has_images: bool = False,
+    ) -> str:
+        """Select the best model for a given task."""
+        if self.cost_so_far_usd >= self.cost_budget_usd * 0.9:
+            # Budget nearly exhausted — use cheapest
+            return "fast_vision" if has_images else "fast_text"
+        if has_images or task_type == "vision":
+            if complexity in ("high", "complex", "spatial"):
+                return self.default_vision
+            return "fast_vision"
+        if complexity in ("high", "complex", "reasoning"):
+            return "powerful_text"
+        return self.default_text
+    def __call__(
+        self,
+        messages: List[Dict[str, Any]],
+        stop_sequences: Optional[List[str]] = None,
+        task_type: str = "vision",
+        complexity: str = "medium",
+        has_images: bool = False,
+        **kwargs,
+    ) -> ChatMessage:
+        model_key = self.select_model(task_type, complexity, has_images)
+        cfg = MODEL_REGISTRY[model_key]
+        client = self._get_client(model_key)
+        start = time.time()
+        try:
+            # HF InferenceClient chat_completion
+            response = client.chat_completion(
+                messages=messages,
+                max_tokens=cfg["max_tokens"],
+                stop=stop_sequences,
+            )
+            latency = (time.time() - start) * 1000
+            # Estimate cost (rough token counting)
+            content = response.choices[0].message.content or ""
+            tok_in = self._estimate_tokens(messages)
+            tok_out = len(content.split()) * 1.3  # rough
+            cost = (tok_in / 1000) * cfg["cost_per_1k_in"] + (tok_out / 1000) * cfg["cost_per_1k_out"]
+            self.cost_so_far_usd += cost
+            self.call_history.append(ModelCall(
+                model_id=cfg["model_id"],
+                tokens_in=int(tok_in),
+                tokens_out=int(tok_out),
+                latency_ms=latency,
+                cost_usd=cost,
+            ))
+            return ChatMessage(role="assistant", content=content)
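+        except Exception as e:
+            # Fall back to the default vision/text model
+            fallback = self.default_vision if has_images else self.default_text
+            if model_key == fallback:
+                raise
+            print(f"[{model_key}] failed: {e}. Falling back to {fallback}")
+            # Call the fallback client directly: re-entering __call__ with the
+            # same arguments would re-select the failing model and recurse forever.
+            # Note: this fallback call is not added to cost_so_far_usd or call_history.
+            fb_cfg = MODEL_REGISTRY[fallback]
+            response = self._get_client(fallback).chat_completion(
+                messages=messages,
+                max_tokens=fb_cfg["max_tokens"],
+                stop=stop_sequences,
+            )
+            return ChatMessage(role="assistant", content=response.choices[0].message.content or "")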
+    def _estimate_tokens(self, messages: List[Dict[str, Any]]) -> int:
+        # Very rough estimate: 4 chars ~= 1 token
+        total = 0
+        for msg in messages:
+            content = msg.get("content", "")
+            if isinstance(content, str):
+                total += len(content) // 4
+            elif isinstance(content, list):
+                for item in content:
+                    if isinstance(item, dict) and "text" in item:
+                        total += len(item["text"]) // 4
+        return max(total, 1)
+    def get_cost_report(self) -> Dict[str, Any]:
+        return {
+            "budget_usd": self.cost_budget_usd,
+            "spent_usd": round(self.cost_so_far_usd, 6),
+            "remaining_usd": round(self.cost_budget_usd - self.cost_so_far_usd, 6),
+            "calls": len(self.call_history),
+            "by_model": self._aggregate_by_model(),
+        }
+    def _aggregate_by_model(self) -> Dict[str, Dict[str, float]]:
+        agg = {}
+        for c in self.call_history:
+            agg.setdefault(c.model_id, {"calls": 0, "tokens_in": 0, "tokens_out": 0, "cost": 0.0})
+            agg[c.model_id]["calls"] += 1
+            agg[c.model_id]["tokens_in"] += c.tokens_in
+            agg[c.model_id]["tokens_out"] += c.tokens_out
+            agg[c.model_id]["cost"] += c.cost_usd
+        return agg
+# ---------------------------------------------------------------------------
+# Hierarchical Planner
+# ---------------------------------------------------------------------------
+PLANNER_SYSTEM_PROMPT = """You are a Task Planner for a computer automation agent.
+Given a user's high-level goal, break it into a JSON list of subtasks.
+Each subtask must have:
+- description: concise action description
+- strategy: one of [browser, desktop, code, vision]
+- depends_on: list of subtask indices (0-based) that must finish before this one
+Rules:
+1. Use "browser" for web navigation, "desktop" for OS-level GUI actions,
+   "code" for writing/running scripts, "vision" for visual reasoning.
+2. Keep subtasks atomic (1-3 actions each).
+3. Start with gathering info, then acting, then verifying.
+4. Output ONLY valid JSON. No markdown fences.
+Example input: "Find Hugging Face HQ in Paris using Google Maps"
+Example output:
+[
+  {"description": "Open Google Maps in browser", "strategy": "browser", "depends_on": []},
+  {"description": "Search for 'Hugging Face Paris'", "strategy": "browser", "depends_on": [0]},
+  {"description": "Extract the address from the result card", "strategy": "vision", "depends_on": [1]},
+  {"description": "Verify the address contains 'Paris'", "strategy": "code", "depends_on": [2]}
+]
+"""
+class HierarchicalPlanner:
+    """Breaks a user goal into a DAG of subtasks using a cheap text model."""
+    def __init__(self, router: IntelligenceRouter):
+        self.router = router
+    def plan(self, goal: str, context: str = "") -> Plan:
+        messages = [
+            {"role": "system", "content": PLANNER_SYSTEM_PROMPT},
+            {"role": "user", "content": f"Goal: {goal}\nContext: {context}\n\nGenerate the subtask JSON list."},
+        ]
+        response = self.router(
+            messages,
+            task_type="text",
+            complexity="medium",
+            has_images=False,
+        )
+        raw = response.content.strip()
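+        # Strip markdown fences if present
+        if raw.startswith("```"):
+            # Keep the text between the opening and closing fences;
+            # index [-1] would return the empty string after the closing fence.
+            raw = raw.split("```", 2)[1]
+            if raw.startswith("json"):
+                raw = raw[4:]
+        raw = raw.strip()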
+        try:
+            data = json.loads(raw)
+        except json.JSONDecodeError:
+            # Fallback: single subtask with the whole goal
+            data = [{"description": goal, "strategy": "auto", "depends_on": []}]
+        subtasks = []
+        for i, item in enumerate(data):
+            subtasks.append(Subtask(
+                id=f"st_{i:03d}",
+                description=item.get("description", str(item)),
+                strategy=item.get("strategy", "auto"),
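+                # The planner emits 0-based indices; map them to Subtask ids (st_000, ...)
+                depends_on=[f"st_{int(d):03d}" for d in item.get("depends_on", [])],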
+            ))
+        return Plan(goal=goal, subtasks=subtasks)
+# ---------------------------------------------------------------------------
+# Verifier & Recovery
+# ---------------------------------------------------------------------------
+VERIFIER_SYSTEM_PROMPT = """You are a Verifier agent. Given a subtask description, the agent's action trace, and a screenshot, determine if the subtask was completed successfully.
+Respond with ONLY a JSON object:
+{"success": true/false, "reason": "short explanation", "next_action": "continue|retry|alternative"}
+Rules:
+- success=true if the intended outcome is clearly visible in the screenshot or trace.
+- next_action=retry if the agent seems close but missed a click.
+- next_action=alternative if the approach is fundamentally wrong.
+"""
+class VerifierAgent:
+    """Checks if a subtask succeeded and suggests recovery."""
+    def __init__(self, router: IntelligenceRouter):
+        self.router = router
+    def verify(
+        self,
+        subtask: Subtask,
+        action_trace: List[str],
+        screenshot: Optional[Image.Image] = None,
+    ) -> Dict[str, Any]:
+        trace_text = "\n".join(action_trace[-10:])  # last 10 actions
+        content = [
+            {"type": "text", "text": f"Subtask: {subtask.description}\nAction trace:\n{trace_text}\n\nWas this completed successfully?"},
+        ]
+        if screenshot:
+            # In a real implementation we'd base64 encode the image
+            content.append({"type": "text", "text": "[Screenshot available — analyze it]"})
+        messages = [
+            {"role": "system", "content": VERIFIER_SYSTEM_PROMPT},
+            {"role": "user", "content": content},
+        ]
+        response = self.router(
+            messages,
+            task_type="vision" if screenshot else "text",
+            complexity="medium",
+            has_images=screenshot is not None,
+        )
+        raw = response.content.strip()
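+        if raw.startswith("```"):
+            # Keep the text between the opening and closing fences;
+            # index [-1] would return the empty string after the closing fence.
+            raw = raw.split("```", 2)[1]
+            if raw.startswith("json"):
+                raw = raw[4:]
+        raw = raw.strip()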
+        try:
+            return json.loads(raw)
+        except json.JSONDecodeError:
+            return {"success": True, "reason": "Parsing failed, assuming success", "next_action": "continue"}
+# ---------------------------------------------------------------------------
+# Long-Term Memory (ChromaDB)
+# ---------------------------------------------------------------------------
+class AgentMemory:
+    """Stores and retrieves past task trajectories for few-shot prompting."""
+    def __init__(self, persist_dir: str = "./memory_db"):
+        self.persist_dir = persist_dir
+        os.makedirs(persist_dir, exist_ok=True)
+        self.collection = None
+        self.embedder = None
+        # Always defined, so embed() and get_domain_tips() never raise AttributeError
+        self._memories: List[Dict] = []
+        if HAS_CHROMA and HAS_ST:
+            self.client = chromadb.PersistentClient(path=persist_dir)
+            self.ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
+            self.collection = self.client.get_or_create_collection(
+                name="task_memory",
+                embedding_function=self.ef,
+            )
+        elif HAS_ST:
+            # Fallback: in-memory similarity with numpy
+            self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
+    def embed(self, text: str) -> List[float]:
+        # The embedder is only loaded on the non-Chroma fallback path
+        if self.embedder is not None:
+            return self.embedder.encode(text).tolist()
+        return []
+    def add_task(
+        self,
+        task: str,
+        strategy_summary: str,
+        success: bool,
+        final_answer: str = "",
+        domain: str = "general",
+    ):
+        entry = {
+            "task": task,
+            "strategy_summary": strategy_summary,
+            "success": success,
+            "final_answer": final_answer,
+            "domain": domain,
+            "timestamp": time.time(),
+        }
+        if self.collection:
+            self.collection.add(
+                documents=[task],
+                metadatas=[entry],
+                ids=[str(uuid.uuid4())],
+            )
+        else:
+            self._memories.append(entry)
+    def retrieve_similar(
+        self,
+        query: str,
+        n_results: int = 3,
+        filter_success: bool = True,
+    ) -> List[Dict[str, Any]]:
+        if self.collection:
+            where = {"success": True} if filter_success else None
+            results = self.collection.query(
+                query_texts=[query],
+                n_results=n_results,
+                where=where,
+            )
+            out = []
+            for meta in results.get("metadatas", [[]])[0]:
+                out.append(meta)
+            return out
+        else:
+            # Simple exact/contains match fallback
+            query_lower = query.lower()
+            scored = []
+            for m in self._memories:
+                score = 0
+                if query_lower in m["task"].lower():
+                    score += 10
+                if m.get("domain", "") in query_lower:
+                    score += 5
+                if filter_success and not m.get("success", False):
+                    score -= 100
+                scored.append((score, m))
+            scored.sort(key=lambda x: x[0], reverse=True)
+            return [x[1] for x in scored[:n_results]]
+    def get_domain_tips(self, domain: str) -> List[str]:
+        tips = []
+        for m in self._memories:
+            if m.get("domain") == domain and m.get("success"):
+                tips.append(m.get("strategy_summary", ""))
+        return tips[:5]
+# ---------------------------------------------------------------------------
+# Set-of-Marks (SoM) Preprocessor
+# ---------------------------------------------------------------------------
+class SoMPreprocessor:
+    """Overlays numbered bounding boxes on UI elements for the agent to reference by ID."""
+    def __init__(self, use_icon_detection: bool = False):
+        self.use_icon_detection = use_icon_detection
+        self.element_registry: Dict[int, Tuple[int, int, int, int]] = {}
+        self.next_id = 1
+    def detect_elements(self, image: Image.Image) -> List[Tuple[int, int, int, int]]:
+        """Lightweight heuristic element detection.
+        In production, replace with OmniParser or seeclick model.
+        """
+        # Simple grid-based + edge heuristic fallback
+        w, h = image.size
+        boxes = []
+        # Detect potential buttons/links by looking for rectangular regions
+        # This is a placeholder — real implementation would use a vision model
+        # For now, divide screen into a coarse grid and let agent pick grid cells
+        cols, rows = 8, 6
+        cell_w, cell_h = w // cols, h // rows
+        for r in range(rows):
+            for c in range(cols):
+                x1, y1 = c * cell_w, r * cell_h
+                x2, y2 = x1 + cell_w, y1 + cell_h
+                boxes.append((x1, y1, x2, y2))
+        return boxes
+    def preprocess(self, image: Image.Image) -> Tuple[Image.Image, Dict[int, Tuple[int, int, int, int]]]:
+        """Return annotated image + element registry mapping ID -> bbox."""
+        boxes = self.detect_elements(image)
+        annotated = image.copy()
+        draw = ImageDraw.Draw(annotated)
+        registry = {}
+        try:
+            font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 14)
+        except Exception:
+            font = ImageFont.load_default()
+        for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
+            registry[i] = (x1, y1, x2, y2)
+            # Draw bounding box
+            draw.rectangle([x1, y1, x2, y2], outline="#00FF00", width=2)
+            # Draw label background
+            label = str(i)
+            bbox = draw.textbbox((0, 0), label, font=font)
+            tw, th = bbox[2] - bbox[0], bbox[3] - bbox[1]
+            draw.rectangle([x1, y1, x1 + tw + 4, y1 + th + 4], fill="#00FF00")
+            draw.text((x1 + 2, y1 + 2), label, fill="#000000", font=font)
+        self.element_registry = registry
+        self.next_id = len(registry) + 1
+        return annotated, registry
+    def get_center(self, element_id: int) -> Tuple[int, int]:
+        x1, y1, x2, y2 = self.element_registry[element_id]
+        return (x1 + x2) // 2, (y1 + y2) // 2
+# ---------------------------------------------------------------------------
+# Session Recorder & Macro Saver
+# ---------------------------------------------------------------------------
+@dataclass
+class SessionFrame:
+    step: int
+    screenshot_path: Optional[str]
+    action: str
+    observation: str
+    timestamp: float
+class SessionRecorder:
+    """Records every step for replay, GIF generation, and macro creation."""
+    def __init__(self, session_id: str, output_dir: str = "./sessions"):
+        self.session_id = session_id
+        self.output_dir = os.path.join(output_dir, session_id)
+        os.makedirs(self.output_dir, exist_ok=True)
+        self.frames: List[SessionFrame] = []
+        self.start_time = time.time()
+    def log_step(
+        self,
+        step: int,
+        screenshot: Optional[Image.Image],
+        action: str,
+        observation: str,
+    ):
+        path = None
+        if screenshot:
+            path = os.path.join(self.output_dir, f"step_{step:03d}.png")
+            screenshot.save(path)
+        frame = SessionFrame(
+            step=step,
+            screenshot_path=path,
+            action=action,
+            observation=observation,
+            timestamp=time.time(),
+        )
+        self.frames.append(frame)
+        # Also append to JSONL
+        with open(os.path.join(self.output_dir, "session.jsonl"), "a") as f:
+            f.write(json.dumps({
+                "step": step,
+                "action": action,
+                "observation": observation,
+                "timestamp": frame.timestamp,
+                "screenshot": path,
+            }) + "\n")
+    def save_macro(self, name: str) -> str:
+        """Save successful trajectory as a replayable macro."""
+        macro = {
+            "name": name,
+            "session_id": self.session_id,
+            "frames": [
+                {"action": f.action, "observation": f.observation, "timestamp": f.timestamp}
+                for f in self.frames
+            ],
+        }
+        path = os.path.join(self.output_dir, f"macro_{name}.json")
+        with open(path, "w") as f:
+            json.dump(macro, f, indent=2)
+        return path
+    def generate_summary(self) -> Dict[str, Any]:
+        duration = time.time() - self.start_time
+        actions = [f.action for f in self.frames]
+        return {
+            "session_id": self.session_id,
+            "duration_sec": round(duration, 2),
+            "steps": len(self.frames),
+            "actions": actions,
+        }
+# ---------------------------------------------------------------------------
+# HITL (Human-in-the-Loop) Checkpoint
+# ---------------------------------------------------------------------------
+class HITLCheckpoint:
+    """Defines categories of actions that require human approval."""
+    SENSITIVE_KEYWORDS = [
+        "password", "credit card", "ssn", "social security",
+        "payment", "checkout", "buy", "purchase", "subscribe",
+        "delete", "remove", "uninstall", "format",
+        "send email", "send message", "post to", "tweet",
+    ]
+    def __init__(self, auto_approve: bool = False):
+        self.auto_approve = auto_approve
+        self.pending_approvals: List[Dict[str, Any]] = []
+    def check_action(self, action: str, context: str = "") -> Tuple[bool, Optional[str]]:
+        """Returns (approved, reason). If not approved, reason explains why."""
+        if self.auto_approve:
+            return True, None
+        action_lower = action.lower()
+        for kw in self.SENSITIVE_KEYWORDS:
+            if kw in action_lower:
+                return False, f"Sensitive action detected: '{kw}'. Requires human approval."
+        return True, None
+    def request_approval(self, action: str, screenshot_path: Optional[str] = None) -> Dict[str, Any]:
+        req = {
+            "id": str(uuid.uuid4()),
+            "action": action,
+            "screenshot": screenshot_path,
+            "status": "pending",
+            "requested_at": time.time(),
+        }
+        self.pending_approvals.append(req)
+        return req
+# ---------------------------------------------------------------------------
+# Cost Tracker
+# ---------------------------------------------------------------------------
+class CostTracker:
+    """Tracks per-task and cumulative costs across all model calls."""
+    def __init__(self):
+        self.tasks: Dict[str, List[ModelCall]] = {}
+    def start_task(self, task_id: str):
+        self.tasks[task_id] = []
+    def log_call(self, task_id: str, call: ModelCall):
+        self.tasks.setdefault(task_id, []).append(call)
+    def get_task_report(self, task_id: str) -> Dict[str, Any]:
+        calls = self.tasks.get(task_id, [])
+        total_cost = sum(c.cost_usd for c in calls)
+        total_tokens = sum(c.tokens_in + c.tokens_out for c in calls)
+        total_latency = sum(c.latency_ms for c in calls)
+        return {
+            "task_id": task_id,
+            "calls": len(calls),
+            "total_cost_usd": round(total_cost, 6),
+            "total_tokens": total_tokens,
+            "avg_latency_ms": round(total_latency / max(len(calls), 1), 2),
+            "by_model": self._aggregate(calls),
+        }
+    def _aggregate(self, calls: List[ModelCall]) -> Dict[str, Dict[str, float]]:
+        agg = {}
+        for c in calls:
+            agg.setdefault(c.model_id, {"calls": 0, "cost": 0.0, "tokens": 0})
+            agg[c.model_id]["calls"] += 1
+            agg[c.model_id]["cost"] += c.cost_usd
+            agg[c.model_id]["tokens"] += c.tokens_in + c.tokens_out
+        return agg
+# ---------------------------------------------------------------------------
+# Convenience: Compose everything into an AgentConfig
+# ---------------------------------------------------------------------------
+@dataclass
+class AgentConfig:
+    hf_token: Optional[str] = None
+    cost_budget_usd: float = 2.0
+    use_planner: bool = True
+    use_verifier: bool = True
+    use_memory: bool = True
+    use_som: bool = True
+    use_hitl: bool = True
+    use_recorder: bool = True
+    memory_dir: str = "./memory_db"
+    auto_approve: bool = False
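
The planner prompt in core_agent.py asks for `depends_on` as 0-based indices, and `HierarchicalPlanner` names subtasks `st_000`, `st_001`, and so on. The part that runs subtasks in dependency order is not shown in this file, so the following is only a sketch of how such an order could be derived, using the standard-library `graphlib` (Python 3.9+). `execution_order` is a hypothetical helper, not a function from the commit:

```python
from graphlib import TopologicalSorter

# Planner-style output: depends_on holds 0-based indices of earlier subtasks.
plan = [
    {"description": "Open Google Maps in browser", "depends_on": []},
    {"description": "Search for 'Hugging Face Paris'", "depends_on": [0]},
    {"description": "Extract the address from the result card", "depends_on": [1]},
    {"description": "Verify the address contains 'Paris'", "depends_on": [2]},
]


def execution_order(subtasks):
    """Return subtask ids (st_000, st_001, ...) in a dependency-respecting order."""
    ids = [f"st_{i:03d}" for i in range(len(subtasks))]
    # Map each subtask id to the set of ids it must wait for.
    graph = {
        ids[i]: {ids[d] for d in item.get("depends_on", [])}
        for i, item in enumerate(subtasks)
    }
    # static_order() raises graphlib.CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(plan)
print(order)
```

A cyclic plan (for example two subtasks that depend on each other) surfaces as a `graphlib.CycleError`, which gives a caller a natural point to fall back to a single-subtask plan.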

e2bqwen.py ADDED

	@@ -0,0 +1,500 @@

+import os
+import time
+import unicodedata
+import spaces
+from datetime import datetime
+from io import BytesIO
+from typing import Any, Dict, List, Optional
+# E2B imports
+from e2b_desktop import Sandbox
+from PIL import Image, ImageDraw
+# smolagents imports
+from smolagents import CodeAgent, HfApiModel, tool
+from smolagents.agent_types import AgentImage
+from smolagents.memory import ActionStep, TaskStep
+from smolagents.models import ChatMessage, Model
+from smolagents.monitoring import LogLevel
+E2B_SYSTEM_PROMPT_TEMPLATE = """You are a desktop automation assistant that can control a remote desktop environment. The current date is <<current_date>>.
+<action_process>
+You will be given a task to solve in several steps. At each step you will perform an action.
+After each action, you'll receive an updated screenshot.
+Then you will proceed as follows, with these sections: don't skip any!
+Short term goal: ...
+What I see: ...
+Reflection: ...
+Action:
+```python
+click(254, 308)
+```<end_code>
+Always format your action ('Action:' part) as Python code blocks as shown above.
+</action_process>
+<tools>
+On top of performing computations in the Python code snippets that you create, you only have access to these tools to interact with the desktop, no additional ones:
+{%- for tool in tools.values() %}
+- {{ tool.name }}: {{ tool.description }}
+    Takes inputs: {{tool.inputs}}
+    Returns an output of type: {{tool.output_type}}
+{%- endfor %}
+</tools>
+<click_guidelines>
+Look at elements on the screen to determine what to click or interact with.
+The desktop has a resolution of <<resolution_x>>x<<resolution_y>> pixels, take it into account to decide clicking coordinates. NEVER USE HYPOTHETICAL OR ASSUMED COORDINATES, USE TRUE COORDINATES that you can see from the screenshot.
+Use precise coordinates based on the current screenshot for mouse movements and clicks.
+Whenever you click, MAKE SURE to click in the middle of the button, text, link or any other clickable element. Not under, not on the side. IN THE MIDDLE, otherwise you risk missing it.
+In menus it is always better to click in the middle of the text rather than on the tiny icon. Calculate the coordinates extremely carefully. A mistake here can make the whole task fail.
+Sometimes you may have missed a click, so never assume that you're on the right page, always make sure that your previous action worked.
+In the screenshot you will see a green crosshair displayed over the position of your last click: this way you can inspect whether the mouse pointer is off the targeted element, so pay special attention to it.
+</click_guidelines>
+<task_resolution_example>
+For a task like "Open a text editor and type 'Hello World'":
+Step 1:
+Short term goal: I want to open a text editor.
+What I see: I am on the homepage of my desktop. I see the 'Applications' menu.
+Reflection: I think that a notes application would fit in the Applications menu, let's open it. I'll carefully click in the middle of the text 'Applications'.
+Action:
+```python
+click(51, 8)
+```<end_code>
+Step 2:
+Short term goal: I want to open a text editor.
+What I see: I am on the homepage of my desktop, with the applications menu open. I see an Accessories section, I see it is a section in the menu thanks to the tiny white triangle after the text accessories.
+Reflection: I think that a notes application would fit the Accessories section. I SHOULD NOT try to move through the menus with scroll, it won't work:
+I'll look for Accessories and click on it being very precise, clicking in the middle of the text 'Accessories'.
+Action:
+```python
+click(76, 195)
+```<end_code>
+Step 3:
+Short term goal: I want to open a text editor.
+What I see: I am under the Accessories menu. Under the open submenu Accessories, I've found 'Text Editor'.
+Reflection: This must be my notes app. I remember that menus are navigated through clicking. I will now click on it being very precise, clicking in the middle of the text 'Text Editor'.
+Action:
+```python
+click(251, 441)
+```<end_code>
+Step 4:
+Short term goal: I want to open a text editor.
+What I see: I am still under the Accessories menu. Nothing has changed compared to the previous screenshot. Under the open submenu Accessories, I still see 'Text Editor'. The green cross is off from the element.
+Reflection: My last click must have been off. Let's correct this. I will click the correct place, right in the middle of the element.
+Action:
+```python
+click(241, 441)
+```<end_code>
+Step 5:
+Short term goal: I want to type 'Hello World'.
+What I see: I have opened a Notepad. The Notepad app is open on an empty page
+Reflection: Now Notepad is open as intended, time to type text.
+Action:
+```python
+type_text("Hello World")
+```<end_code>
+Step 6:
+Short term goal: I want to type 'Hello World'.
+What I see: The Notepad app displays 'Hello World'
+Reflection: Now that I've 1. Opened the notepad and 2. typed 'Hello World', and 3. the result seems correct, I think the Task is completed. I will return a confirmation that the task is completed.
+Action:
+```python
+final_answer("Done")
+```<end_code>
+</task_resolution_example>
+<general_guidelines>
+Always analyze the latest screenshot carefully before performing actions.
+You can wait for appropriate loading times using the wait() tool. But don't wait forever, sometimes you've just misclicked and the process didn't launch.
+Execute one action at a time: don't try to pack a click and typing in one action.
+On each step, look at the last screenshot and action to validate if previous steps worked and decide the next action. If you repeated an action already without effect, it means that this action is useless: don't repeat it and try something else.
+Use click to move through menus on the desktop and scroll for web and specific applications.
+Desktop menus usually expand with more options: the tiny triangle next to some text in a menu means that the menu expands. For example, Office in the Applications menu expands to show presentation or writing applications.
+NEVER CLICK THE WEB BROWSER ICON TO OPEN THE WEB BROWSER: use open_url directly.
+In the browser, ignore any sign-in popups as long as they don't interfere with the elements you want to interact with.
+</general_guidelines>
+""".replace("<<current_date>>", datetime.now().strftime("%A, %d-%B-%Y"))
+@spaces.GPU
+def draw_marker_on_image(image_copy, click_coordinates):
+    x, y = click_coordinates
+    draw = ImageDraw.Draw(image_copy)
+    cross_size, linewidth = 10, 3
+    # Draw cross
+    draw.line((x - cross_size, y, x + cross_size, y), fill="green", width=linewidth)
+    draw.line((x, y - cross_size, x, y + cross_size), fill="green", width=linewidth)
+    # Add a circle around it for better visibility
+    draw.ellipse(
+        (
+            x - cross_size * 2,
+            y - cross_size * 2,
+            x + cross_size * 2,
+            y + cross_size * 2,
+        ),
+        outline="green",
+        width=linewidth,
+    )
+    return image_copy
+@spaces.GPU
+def get_agent_summary_erase_images(agent):
+    for memory_step in agent.memory.steps:
+        if hasattr(memory_step, "observations_images"):
+            memory_step.observations_images = None
+        if hasattr(memory_step, "task_images"):
+            memory_step.task_images = None
+    return agent.write_memory_to_messages()
+class E2BVisionAgent(CodeAgent):
+    """Agent for e2b desktop automation with Qwen2.5VL vision capabilities"""
+    def __init__(
+        self,
+        model: HfApiModel,
+        data_dir: str,
+        desktop: Sandbox,
+        tools: List[tool] = None,
+        max_steps: int = 200,
+        verbosity_level: LogLevel = 2,
+        planning_interval: Optional[int] = None,
+        use_v1_prompt: bool = False,
+        **kwargs,
+    ):
+        self.desktop = desktop
+        self.data_dir = data_dir
+        self.planning_interval = planning_interval
+        # Initialize Desktop
+        self.width, self.height = self.desktop.get_screen_size()
+        print(f"Screen size: {self.width}x{self.height}")
+        # Set up temp directory
+        os.makedirs(self.data_dir, exist_ok=True)
+        print(f"Screenshots and steps will be saved to: {self.data_dir}")
+        self.use_v1_prompt = use_v1_prompt
+        # Initialize base agent
+        super().__init__(
+            tools=tools or [],
+            model=model,
+            max_steps=max_steps,
+            verbosity_level=verbosity_level,
+            planning_interval=self.planning_interval,
+            **kwargs,
+        )
+        self.prompt_templates["system_prompt"] = E2B_SYSTEM_PROMPT_TEMPLATE.replace(
+            "<<resolution_x>>", str(self.width)
+        ).replace("<<resolution_y>>", str(self.height))
+        # Add screen info to state
+        self.state["screen_width"] = self.width
+        self.state["screen_height"] = self.height
+        # Add default tools
+        self.logger.log("Setting up agent tools...")
+        self._setup_desktop_tools()
+        self.step_callbacks.append(self.take_screenshot_callback)
+    def _setup_desktop_tools(self):
+        """Register all desktop tools"""
+        @tool
+        def click(x: int, y: int) -> str:
+            """
+            Performs a left-click at the specified coordinates
+            Args:
+                x: The x coordinate (horizontal position)
+                y: The y coordinate (vertical position)
+            """
+            self.desktop.move_mouse(x, y)
+            self.desktop.left_click()
+            self.click_coordinates = [x, y]
+            self.logger.log(f"Clicked at coordinates ({x}, {y})")
+            return f"Clicked at coordinates ({x}, {y})"
+        @tool
+        def right_click(x: int, y: int) -> str:
+            """
+            Performs a right-click at the specified coordinates
+            Args:
+                x: The x coordinate (horizontal position)
+                y: The y coordinate (vertical position)
+            """
+            self.desktop.move_mouse(x, y)
+            self.desktop.right_click()
+            self.click_coordinates = [x, y]
+            self.logger.log(f"Right-clicked at coordinates ({x}, {y})")
+            return f"Right-clicked at coordinates ({x}, {y})"
+        @tool
+        def double_click(x: int, y: int) -> str:
+            """
+            Performs a double-click at the specified coordinates
+            Args:
+                x: The x coordinate (horizontal position)
+                y: The y coordinate (vertical position)
+            """
+            self.desktop.move_mouse(x, y)
+            self.desktop.double_click()
+            self.click_coordinates = [x, y]
+            self.logger.log(f"Double-clicked at coordinates ({x}, {y})")
+            return f"Double-clicked at coordinates ({x}, {y})"
+        @tool
+        def move_mouse(x: int, y: int) -> str:
+            """
+            Moves the mouse cursor to the specified coordinates
+            Args:
+                x: The x coordinate (horizontal position)
+                y: The y coordinate (vertical position)
+            """
+            self.desktop.move_mouse(x, y)
+            self.logger.log(f"Moved mouse to coordinates ({x}, {y})")
+            return f"Moved mouse to coordinates ({x}, {y})"
+        def normalize_text(text):
+            return "".join(
+                c
+                for c in unicodedata.normalize("NFD", text)
+                if not unicodedata.combining(c)
+            )
+        @tool
+        def type_text(text: str) -> str:
+            """
+            Types the specified text at the current cursor position.
+            Args:
+                text: The text to type
+            """
+            clean_text = normalize_text(text)
+            self.desktop.write(clean_text, delay_in_ms=75)
+            self.logger.log(f"Typed text: '{clean_text}'")
+            return f"Typed text: '{clean_text}'"
+        @tool
+        def press_key(key: str) -> str:
+            """
+            Presses a keyboard key
+            Args:
+                key: The key to press (e.g. "enter", "space", "backspace", etc.).
+            """
+            self.desktop.press(key)
+            self.logger.log(f"Pressed key: {key}")
+            return f"Pressed key: {key}"
+        @tool
+        def go_back() -> str:
+            """
+            Goes back to the previous page in the browser. If using this tool doesn't work, just click the button directly.
+            Args:
+            """
+            self.desktop.press(["alt", "left"])
+            self.logger.log("Went back one page")
+            return "Went back one page"
+        @tool
+        def drag_and_drop(x1: int, y1: int, x2: int, y2: int) -> str:
+            """
+            Clicks at [x1, y1], drags the mouse to [x2, y2], then releases the click.
+            Args:
+                x1: origin x coordinate
+                y1: origin y coordinate
+                x2: end x coordinate
+                y2: end y coordinate
+            """
+            self.desktop.drag([x1, y1], [x2, y2])
+            message = f"Dragged and dropped from [{x1}, {y1}] to [{x2}, {y2}]"
+            self.logger.log(message)
+            return message
+        @tool
+        def scroll(x: int, y: int, direction: str = "down", amount: int = 2) -> str:
+            """
+            Moves the mouse to selected coordinates, then uses the scroll button: this could scroll the page or zoom, depending on the app. DO NOT use scroll to move through linux desktop menus.
+            Args:
+                x: The x coordinate (horizontal position) of the element to scroll/zoom
+                y: The y coordinate (vertical position) of the element to scroll/zoom
+                direction: The direction to scroll ("up" or "down"), defaults to "down". For zoom, "up" zooms in, "down" zooms out.
+                amount: The amount to scroll. A good amount is 1 or 2.
+            """
+            self.desktop.move_mouse(x, y)
+            self.desktop.scroll(direction=direction, amount=amount)
+            message = f"Scrolled {direction} by {amount}"
+            self.logger.log(message)
+            return message
+        @tool
+        def wait(seconds: float) -> str:
+            """
+            Waits for the specified number of seconds. Very useful when the previous action is still executing (for example when starting heavy applications like browsers or office apps)
+            Args:
+                seconds: Number of seconds to wait, generally 3 is enough.
+            """
+            time.sleep(seconds)
+            self.logger.log(f"Waited for {seconds} seconds")
+            return f"Waited for {seconds} seconds"
+        @tool
+        def open_url(url: str) -> str:
+            """
+            Directly opens a browser with the specified url: use this at start of web searches rather than trying to click the browser.
+            Args:
+                url: The URL to open
+            """
+            # Make sure URL has http/https prefix
+            if not url.startswith(("http://", "https://")):
+                url = "https://" + url
+            self.desktop.open(url)
+            # Give it time to load
+            time.sleep(2)
+            self.logger.log(f"Opening URL: {url}")
+            return f"Opened URL: {url}"
+        @tool
+        def find_on_page_ctrl_f(search_string: str) -> str:
+            """
+            Scroll the browser viewport to the first occurrence of the search string. This is equivalent to Ctrl+F. Use this to search on a pdf for instance.
+            Args:
+                search_string: The string to search for on the page.
+            """
+            self.desktop.press(["ctrl", "f"])
+            time.sleep(0.3)
+            clean_text = normalize_text(search_string)
+            self.desktop.write(clean_text, delay_in_ms=75)
+            time.sleep(0.3)
+            self.desktop.press("enter")
+            time.sleep(0.3)
+            self.desktop.press("esc")
+            output_message = f"Scrolled to the first occurrence of '{clean_text}'"
+            self.logger.log(output_message)
+            return output_message
+        # Register the tools
+        self.tools["click"] = click
+        self.tools["right_click"] = right_click
+        self.tools["double_click"] = double_click
+        self.tools["move_mouse"] = move_mouse
+        self.tools["type_text"] = type_text
+        self.tools["press_key"] = press_key
+        self.tools["scroll"] = scroll
+        self.tools["wait"] = wait
+        self.tools["open_url"] = open_url
+        self.tools["go_back"] = go_back
+        self.tools["drag_and_drop"] = drag_and_drop
+        self.tools["find_on_page_ctrl_f"] = find_on_page_ctrl_f
+    def take_screenshot_callback(self, memory_step: ActionStep, agent=None) -> None:
+        """Callback that takes a screenshot + memory snapshot after a step completes"""
+        self.logger.log("Analyzing screen content...")
+        current_step = memory_step.step_number
+        time.sleep(2.5)  # Let things happen on the desktop
+        screenshot_bytes = self.desktop.screenshot(format="bytes")
+        image = Image.open(BytesIO(screenshot_bytes))
+        # Create a filename with step number
+        screenshot_path = os.path.join(self.data_dir, f"step_{current_step:03d}.png")
+        image.save(screenshot_path)
+        image_copy = image.copy()
+        if getattr(self, "click_coordinates", None):
+            print("DRAWING MARKER")
+            image_copy = draw_marker_on_image(image_copy, self.click_coordinates)
+        self.last_marked_screenshot = AgentImage(screenshot_path)
+        print(f"Saved screenshot for step {current_step} to {screenshot_path}")
+        for previous_memory_step in (
+            agent.memory.steps
+        ):  # Remove previous screenshots from logs for lean processing
+            if (
+                isinstance(previous_memory_step, ActionStep)
+                and previous_memory_step.step_number <= current_step - 1
+            ):
+                previous_memory_step.observations_images = None
+            elif isinstance(previous_memory_step, TaskStep):
+                previous_memory_step.task_images = None
+            if (
+                isinstance(previous_memory_step, ActionStep)
+                and previous_memory_step.step_number == current_step - 1
+            ):
+                if (
+                    previous_memory_step.tool_calls
+                    and getattr(previous_memory_step.tool_calls[0], "arguments", None)
+                    and memory_step.tool_calls
+                    and getattr(memory_step.tool_calls[0], "arguments", None)
+                ):
+                    if (
+                        previous_memory_step.tool_calls[0].arguments
+                        == memory_step.tool_calls[0].arguments
+                    ):
+                        # observations may still be None if the step produced no logs
+                        memory_step.observations = (memory_step.observations or "") + "\nWARNING: You've executed the same action several times in a row. MAKE SURE TO NOT UNNECESSARILY REPEAT ACTIONS."
+        # Add the marker-edited image to the current memory step
+        memory_step.observations_images = [image_copy]
+        # memory_step.observations_images = [screenshot_path] # IF YOU USE THIS INSTEAD OF ABOVE, LAUNCHING A SECOND TASK BREAKS
+        self.click_coordinates = None  # Reset click marker
+    def close(self):
+        """Clean up resources"""
+        if self.desktop:
+            print("Stopping e2b stream and killing sandbox...")
+            self.desktop.stream.stop()
+            self.desktop.kill()
+            print("E2B sandbox terminated")
+class QwenVLAPIModel(Model):
+    """Model wrapper for Qwen2.5VL API with fallback mechanism"""
+    def __init__(
+        self,
+        model_id: str = "Qwen/Qwen2.5-VL-72B-Instruct",
+        hf_token: Optional[str] = None,
+    ):
+        super().__init__()
+        self.model_id = model_id
+        self.base_model = HfApiModel(
+            model_id="https://n5wr7lfx6wp94tvl.us-east-1.aws.endpoints.huggingface.cloud",
+            token=hf_token,
+            max_tokens=4096,
+        )
+        self.fallback_model = HfApiModel(
+            model_id="https://ahbeihft09ulicbf.us-east-1.aws.endpoints.huggingface.cloud",
+            token=hf_token,
+            max_tokens=4096,
+        )
+    def __call__(
+        self,
+        messages: List[Dict[str, Any]],
+        stop_sequences: Optional[List[str]] = None,
+        **kwargs,
+    ) -> ChatMessage:
+        try:
+            message = self.base_model(messages, stop_sequences, **kwargs)
+            return message
+        except Exception as e:
+            print(f"Base model failed with error: {e}. Calling fallback model.")
+        # Continue to fallback
+        try:
+            message = self.fallback_model(messages, stop_sequences, **kwargs)
+            return message
+        except Exception as e:
+            raise Exception(f"Both endpoints failed. Last error: {e}")
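
Note on the `QwenVLAPIModel.__call__` fallback above: the same try-primary-then-fallback pattern can be isolated from smolagents and exercised with plain callables. The sketch below is only an illustration under that assumption; `call_with_fallback`, `flaky_endpoint` and `healthy_endpoint` are hypothetical names that do not exist in this commit, and the error is re-raised with `from` so the fallback traceback is kept.

```python
from typing import Any, Callable


def call_with_fallback(
    primary: Callable[..., Any],
    fallback: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Call `primary`; if it raises, call `fallback` with the same arguments."""
    try:
        return primary(*args, **kwargs)
    except Exception as primary_error:
        print(f"Primary failed with error: {primary_error}. Calling fallback.")
    try:
        return fallback(*args, **kwargs)
    except Exception as fallback_error:
        # Chain the exception so the fallback traceback is preserved.
        raise RuntimeError(
            f"Both endpoints failed. Last error: {fallback_error}"
        ) from fallback_error


# Hypothetical stand-ins for the two inference endpoints.
def flaky_endpoint(prompt: str) -> str:
    raise ConnectionError("endpoint unavailable")


def healthy_endpoint(prompt: str) -> str:
    return f"echo: {prompt}"


result = call_with_fallback(flaky_endpoint, healthy_endpoint, "hello")
print(result)
```

Only the second failure propagates to the caller; the first one is logged and swallowed, which matches the behavior of the wrapper in the diff.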

eval_harness.py ADDED Viewed

	@@ -0,0 +1,366 @@

+"""
+eval_harness.py — Enhanced Evaluation Framework
+================================================
+Supports custom benchmarks, WebArena-style tasks, GAIA-style tasks,
+A/B testing, and LLM-as-a-judge grading.
+"""
+import os
+import json
+import time
+import random
+from concurrent.futures import ThreadPoolExecutor
+from typing import Any, Dict, List, Optional, Callable
+from dataclasses import dataclass, field, asdict
+# ---------------------------------------------------------------------------
+# Benchmark Tasks
+# ---------------------------------------------------------------------------
+@dataclass
+class BenchmarkTask:
+    id: str
+    category: str
+    description: str
+    expected_answer: Optional[str] = None
+    expected_contains: Optional[List[str]] = None
+    max_steps: int = 50
+    setup_script: Optional[str] = None  # Shell commands to prep the sandbox
+    teardown_script: Optional[str] = None
+    weight: float = 1.0
+DEFAULT_BENCHMARKS: List[BenchmarkTask] = [
+    # Web navigation
+    BenchmarkTask(
+        id="puppies",
+        category="web_search",
+        description="Find me pictures of cute puppies",
+        expected_contains=["puppy", "dog", "image"],
+        max_steps=30,
+    ),
+    BenchmarkTask(
+        id="gmaps_hf_hq",
+        category="web_navigation",
+        description="Use Google Maps to find the Hugging Face HQ in Paris",
+        expected_contains=["Paris", "Hugging Face", "5/7"],
+        max_steps=40,
+    ),
+    BenchmarkTask(
+        id="wikipedia_april4",
+        category="web_research",
+        description="Go to Wikipedia and find what happened on April 4th",
+        expected_contains=["April", "4"],
+        max_steps=30,
+    ),
+    BenchmarkTask(
+        id="commute_bern_basel",
+        category="web_navigation",
+        description="Find out the travel time by train from Bern to Basel on Google Maps",
+        expected_contains=["Bern", "Basel", "hour", "min"],
+        max_steps=40,
+    ),
+    BenchmarkTask(
+        id="hf_flux_gpu",
+        category="hf_ecosystem",
+        description="Go to Hugging Face Spaces and find the Space flux.1 schnell. Use it to generate an image of a GPU",
+        expected_contains=["GPU", "image"],
+        max_steps=60,
+    ),
+    BenchmarkTask(
+        id="github_trending",
+        category="web_research",
+        description="Go to GitHub trending and find the top Python repository today",
+        expected_contains=["Python", "github.com"],
+        max_steps=35,
+    ),
+    BenchmarkTask(
+        id="pdf_extract",
+        category="document",
+        description="Download a sample PDF from the internet and extract the first paragraph",
+        expected_contains=["PDF", "paragraph"],
+        max_steps=40,
+    ),
+    BenchmarkTask(
+        id="calc_sum",
+        category="code_execution",
+        description="Calculate the sum of the first 100 prime numbers using Python",
+        expected_answer="24133",
+        max_steps=20,
+    ),
+    BenchmarkTask(
+        id="dark_mode_maps",
+        category="web_navigation",
+        description="Open Google Maps and switch to dark mode if available",
+        expected_contains=["dark", "theme"],
+        max_steps=30,
+    ),
+    BenchmarkTask(
+        id="hf_model_search",
+        category="hf_ecosystem",
+        description="Search Hugging Face Hub for 'text-to-video' models and list the top 3 by downloads",
+        expected_contains=["text-to-video", "model"],
+        max_steps=35,
+    ),
+]
+# ---------------------------------------------------------------------------
+# LLM-as-a-Judge
+# ---------------------------------------------------------------------------
+class LLMJudge:
+    """Grades agent outputs using a language model."""
+    def __init__(self, model_call: Callable[[List[Dict[str, Any]]], str]):
+        self.model_call = model_call
+    def grade_exact(self, predicted: str, expected: str) -> float:
+        return 1.0 if expected.lower().strip() in predicted.lower().strip() else 0.0
+    def grade_contains(self, predicted: str, expected_list: List[str]) -> float:
+        if not expected_list:
+            return 1.0
+        matched = sum(1 for e in expected_list if e.lower() in predicted.lower())
+        return matched / len(expected_list)
+    def grade_semantic(
+        self,
+        task_description: str,
+        agent_trace: str,
+        predicted: str,
+        expected: Optional[str] = None,
+        expected_contains: Optional[List[str]] = None,
+    ) -> Dict[str, Any]:
+        """Use an LLM to judge success on a 0-1 scale."""
+        prompt = f"""You are an expert evaluator. A computer agent was given this task:
+Task: {task_description}
+The agent's final response / trace summary:
+{predicted[:2000]}
+Expected answer (if any): {expected or 'N/A'}
+Expected keywords (if any): {expected_contains or 'N/A'}
+Rate the agent's success on a scale from 0.0 to 1.0, where:
+- 1.0 = fully completed and correct
+- 0.5 = partially correct or incomplete
+- 0.0 = completely wrong or failed
+Respond ONLY with a JSON object:
+{{"score": float, "reason": "short explanation", "missing": "what was missing"}}
+"""
+        response = self.model_call([{"role": "user", "content": prompt}])
+        content = response.strip()
+        if content.startswith("```"):
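+            # After splitting on the fence, index 1 is the fenced body
+            # (index -1 would be the empty text after the closing fence).
+            content = content.split("```", 2)[1]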
+            if content.startswith("json"):
+                content = content[4:]
+        content = content.strip()
+        try:
+            result = json.loads(content)
+            return {
+                "score": float(result.get("score", 0.0)),
+                "reason": result.get("reason", ""),
+                "missing": result.get("missing", ""),
+            }
+        except (json.JSONDecodeError, ValueError, TypeError, AttributeError):
+            # Fallback heuristic
+            score = 0.5 if "success" in predicted.lower() or "done" in predicted.lower() else 0.0
+            return {"score": score, "reason": "LLM judge parsing failed, heuristic fallback", "missing": ""}
+# ---------------------------------------------------------------------------
+# Evaluation Harness
+# ---------------------------------------------------------------------------
+@dataclass
+class TaskResult:
+    task_id: str
+    success: bool
+    score: float
+    duration_sec: float
+    steps_taken: int
+    final_output: str
+    error: Optional[str] = None
+    judge_reason: Optional[str] = None
+@dataclass
+class EvalSummary:
+    total_tasks: int
+    passed: int
+    failed: int
+    avg_score: float
+    avg_duration: float
+    by_category: Dict[str, Dict[str, Any]]
+    results: List[TaskResult]
+    timestamp: float = field(default_factory=time.time)
+class EvaluationHarness:
+    """Run benchmarks against the agent and produce reports."""
+    def __init__(
+        self,
+        agent_factory: Callable[[], Any],
+        judge_model_call: Optional[Callable] = None,
+        output_dir: str = "./eval_results",
+    ):
+        self.agent_factory = agent_factory
+        self.judge = LLMJudge(judge_model_call) if judge_model_call else None
+        self.output_dir = output_dir
+        os.makedirs(output_dir, exist_ok=True)
+    def run_task(
+        self,
+        task: BenchmarkTask,
+        num_runs: int = 1,
+    ) -> List[TaskResult]:
+        results = []
+        for run_idx in range(num_runs):
+            start = time.time()
+            agent = self.agent_factory()
+            try:
+                # Run the agent
+                output = agent.run(task.description, max_steps=task.max_steps)
+                duration = time.time() - start
+                # Grade
+                if self.judge:
+                    judge_result = self.judge.grade_semantic(
+                        task.description,
+                        str(output),
+                        str(output),
+                        task.expected_answer,
+                        task.expected_contains,
+                    )
+                    score = judge_result["score"]
+                    reason = judge_result["reason"]
+                else:
+                    # No LLM judge configured: use the string-matching graders,
+                    # which never call the model, so no model callable is needed.
+                    heuristic = LLMJudge(model_call=None)
+                    if task.expected_answer:
+                        score = heuristic.grade_exact(str(output), task.expected_answer)
+                    elif task.expected_contains:
+                        score = heuristic.grade_contains(str(output), task.expected_contains)
+                    else:
+                        score = 0.5
+                    reason = "Heuristic grading (no LLM judge)"
+                success = score >= 0.7
+                results.append(TaskResult(
+                    task_id=f"{task.id}_run{run_idx}",
+                    success=success,
+                    score=score,
+                    duration_sec=round(duration, 2),
+                    steps_taken=getattr(agent, "step_number", 0),
+                    final_output=str(output)[:2000],
+                    error=None,
+                    judge_reason=reason,
+                ))
+            except Exception as e:
+                duration = time.time() - start
+                results.append(TaskResult(
+                    task_id=f"{task.id}_run{run_idx}",
+                    success=False,
+                    score=0.0,
+                    duration_sec=round(duration, 2),
+                    steps_taken=0,
+                    final_output="",
+                    error=str(e),
+                    judge_reason="Exception during execution",
+                ))
+        return results
+    def run_suite(
+        self,
+        tasks: Optional[List[BenchmarkTask]] = None,
+        num_runs: int = 1,
+        max_parallel: int = 2,
+    ) -> EvalSummary:
+        tasks = tasks or DEFAULT_BENCHMARKS
+        all_results: List[TaskResult] = []
+        def run_single(task):
+            return self.run_task(task, num_runs=num_runs)
+        with ThreadPoolExecutor(max_workers=max_parallel) as executor:
+            futures = [executor.submit(run_single, t) for t in tasks]
+            for future in futures:
+                all_results.extend(future.result())
+        # Aggregate
+        passed = sum(1 for r in all_results if r.success)
+        total = len(all_results)
+        avg_score = sum(r.score for r in all_results) / max(total, 1)
+        avg_duration = sum(r.duration_sec for r in all_results) / max(total, 1)
+        by_category: Dict[str, Any] = {}
+        for r in all_results:
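+            # Map back to category by stripping the "_run<N>" suffix from task_id,
+            # so that one task id being a prefix of another cannot cause a mismatch
+            base_id = r.task_id.rsplit("_run", 1)[0]
+            cat = "unknown"
+            for t in tasks:
+                if base_id == t.id:
+                    cat = t.category
+                    break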
+            by_category.setdefault(cat, {"count": 0, "passed": 0, "avg_score": 0.0, "scores": []})
+            by_category[cat]["count"] += 1
+            if r.success:
+                by_category[cat]["passed"] += 1
+            by_category[cat]["scores"].append(r.score)
+        for cat, data in by_category.items():
+            data["avg_score"] = round(sum(data["scores"]) / max(len(data["scores"]), 1), 3)
+            del data["scores"]
+        summary = EvalSummary(
+            total_tasks=total,
+            passed=passed,
+            failed=total - passed,
+            avg_score=round(avg_score, 3),
+            avg_duration=round(avg_duration, 2),
+            by_category=by_category,
+            results=all_results,
+        )
+        # Save
+        ts = int(time.time())
+        path = os.path.join(self.output_dir, f"eval_summary_{ts}.json")
+        with open(path, "w") as f:
+            json.dump(asdict(summary), f, indent=2, default=str)
+        print(f"Evaluation saved to {path}")
+        return summary
+    def compare_strategies(
+        self,
+        strategy_a_factory: Callable[[], Any],
+        strategy_b_factory: Callable[[], Any],
+        tasks: Optional[List[BenchmarkTask]] = None,
+        num_runs: int = 3,
+    ) -> Dict[str, Any]:
+        """A/B test two agent configurations."""
+        print("Running Strategy A...")
+        old_factory = self.agent_factory
+        self.agent_factory = strategy_a_factory
+        results_a = self.run_suite(tasks, num_runs=num_runs, max_parallel=1)
+        print("Running Strategy B...")
+        self.agent_factory = strategy_b_factory
+        results_b = self.run_suite(tasks, num_runs=num_runs, max_parallel=1)
+        self.agent_factory = old_factory
+        return {
+            "strategy_a": {
+                "avg_score": results_a.avg_score,
+                "pass_rate": results_a.passed / max(results_a.total_tasks, 1),
+                "avg_duration": results_a.avg_duration,
+            },
+            "strategy_b": {
+                "avg_score": results_b.avg_score,
+                "pass_rate": results_b.passed / max(results_b.total_tasks, 1),
+                "avg_duration": results_b.avg_duration,
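+            },
+            # Equal average scores are reported as a tie instead of defaulting to B
+            "winner": (
+                "A" if results_a.avg_score > results_b.avg_score
+                else "B" if results_b.avg_score > results_a.avg_score
+                else "tie"
+            ),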
+        }
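
Note on `LLMJudge` above: judge models often wrap their JSON verdict in a Markdown code fence, so the reply has to be unwrapped before `json.loads`. The sketch below shows one way to do that, alongside the keyword-coverage score used by `grade_contains`. It is a standalone illustration, not the harness code itself; `strip_code_fence`, `parse_judge_reply` and `keyword_coverage` are hypothetical helper names introduced here.

```python
import json
from typing import Any, Dict, List


def strip_code_fence(text: str) -> str:
    """Return the body of a leading ``` fence, or the stripped text if unfenced."""
    content = text.strip()
    if content.startswith("```"):
        # Parts: "" before the opening fence, the fenced body, the remainder.
        content = content.split("```", 2)[1]
        if content.startswith("json"):
            content = content[4:]
    return content.strip()


def parse_judge_reply(text: str) -> Dict[str, Any]:
    """Parse a judge reply into score/reason/missing, scoring 0.0 if unparseable."""
    try:
        result = json.loads(strip_code_fence(text))
        return {
            "score": float(result.get("score", 0.0)),
            "reason": result.get("reason", ""),
            "missing": result.get("missing", ""),
        }
    except (json.JSONDecodeError, ValueError, TypeError, AttributeError):
        return {"score": 0.0, "reason": "unparseable judge reply", "missing": ""}


def keyword_coverage(predicted: str, expected: List[str]) -> float:
    """Fraction of expected keywords found in the prediction, case-insensitive."""
    if not expected:
        return 1.0
    lowered = predicted.lower()
    return sum(1 for word in expected if word.lower() in lowered) / len(expected)


fenced = '```json\n{"score": 0.8, "reason": "ok", "missing": ""}\n```'
print(parse_judge_reply(fenced))
print(keyword_coverage("Found a puppy image", ["puppy", "dog", "image"]))
```

Taking index 1 of the split keeps the fenced body whether or not the closing fence is present. Unlike the harness, this sketch scores an unparseable reply as 0.0 rather than applying a keyword heuristic.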

mcp_tools.py ADDED Viewed

	@@ -0,0 +1,479 @@

+"""
+mcp_tools.py — MCP Bridge for Enhanced Computer Control
+=======================================================
+Playwright Browser MCP + Code Execution + FileSystem + HF Hub MCP
+"""
+import os
+import json
+import time
+import base64
+import tempfile
+from typing import Any, Dict, List, Optional, Tuple
+from dataclasses import dataclass
+from io import BytesIO
+from PIL import Image
+# Smolagents tool decorator
+from smolagents import tool
+# Playwright
+try:
+    from playwright.sync_api import sync_playwright, Page, Browser, BrowserContext
+    HAS_PLAYWRIGHT = True
+except ImportError:
+    HAS_PLAYWRIGHT = False
+    sync_playwright = None
+    Page = Browser = BrowserContext = Any
+# E2B code execution
+try:
+    from e2b_code_interpreter import Sandbox as CodeSandbox
+    HAS_E2B_CODE = True
+except ImportError:
+    HAS_E2B_CODE = False
+    CodeSandbox = Any
+# ---------------------------------------------------------------------------
+# Playwright Browser MCP
+# ---------------------------------------------------------------------------
+class BrowserMCP:
+    """High-level browser automation via Playwright.
+    Replaces raw coordinate clicking with semantic selectors.
+    """
+    def __init__(self, headless: bool = True, browser_type: str = "chromium"):
+        self.headless = headless
+        self.browser_type = browser_type
+        self._playwright = None
+        self._browser: Optional[Browser] = None
+        self._context: Optional[BrowserContext] = None
+        self._page: Optional[Page] = None
+        self._closed = True
+    def start(self):
+        if not HAS_PLAYWRIGHT:
+            raise RuntimeError("Playwright not installed. Run: pip install playwright && playwright install chromium")
+        self._playwright = sync_playwright().start()
+        browser_cls = getattr(self._playwright, self.browser_type)
+        self._browser = browser_cls.launch(headless=self.headless)
+        self._context = self._browser.new_context(
+            viewport={"width": 1280, "height": 720},
+            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+        )
+        self._page = self._context.new_page()
+        self._closed = False
+        return self._page
+    def close(self):
+        if self._context:
+            self._context.close()
+        if self._browser:
+            self._browser.close()
+        if self._playwright:
+            self._playwright.stop()
+        self._closed = True
+    def ensure_page(self) -> Page:
+        if self._closed or self._page is None:
+            self.start()
+        return self._page
+    def goto(self, url: str, wait_until: str = "networkidle") -> str:
+        page = self.ensure_page()
+        if not url.startswith(("http://", "https://")):
+            url = "https://" + url
+        page.goto(url, wait_until=wait_until, timeout=30000)
+        return f"Navigated to {url}"
+    def click(self, selector: str, by: str = "css") -> str:
+        page = self.ensure_page()
+        if by == "text":
+            page.get_by_text(selector).first.click()
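+        elif by == "role":
+            # Expect "role::name"; a bare role (no "::") matches by role alone.
+            role, sep, name = selector.partition("::")
+            locator = page.get_by_role(role.strip(), name=name.strip()) if sep else page.get_by_role(role.strip())
+            locator.first.click()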
+        else:
+            page.locator(selector).first.click()
+        return f"Clicked element: {selector}"
+    def fill(self, selector: str, text: str, by: str = "css") -> str:
+        page = self.ensure_page()
+        if by == "text":
+            el = page.get_by_text(selector).first
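+        elif by == "role":
+            # Expect "role::name"; a bare role (no "::") matches by role alone.
+            role, sep, name = selector.partition("::")
+            locator = page.get_by_role(role.strip(), name=name.strip()) if sep else page.get_by_role(role.strip())
+            el = locator.first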
+        else:
+            el = page.locator(selector).first
+        el.fill(text)
+        return f"Filled '{selector}' with '{text}'"
+    def press(self, key: str) -> str:
+        page = self.ensure_page()
+        page.keyboard.press(key)
+        return f"Pressed key: {key}"
+    def scroll(self, direction: str = "down", amount: int = 500) -> str:
+        page = self.ensure_page()
+        if direction == "down":
+            page.mouse.wheel(0, amount)
+        else:
+            page.mouse.wheel(0, -amount)
+        return f"Scrolled {direction} by {amount}"
+    def get_text(self, selector: str = "body") -> str:
+        page = self.ensure_page()
+        return page.locator(selector).first.inner_text()
+    def get_html(self) -> str:
+        page = self.ensure_page()
+        return page.content()
+    def screenshot(self, path: Optional[str] = None) -> str:
+        page = self.ensure_page()
+        if path:
+            page.screenshot(path=path, full_page=True)
+            return f"Screenshot saved to {path}"
+        else:
+            buf = page.screenshot(full_page=True)
+            return base64.b64encode(buf).decode("utf-8")
+    def find_and_click(self, text: str) -> str:
+        """Semantic find-and-click by visible text."""
+        page = self.ensure_page()
+        page.get_by_text(text).first.click()
+        return f"Found and clicked text: {text}"
+    def search_on_page(self, query: str) -> str:
+        page = self.ensure_page()
+        page.keyboard.press("Control+f")
+        page.keyboard.insert_text(query)
+        page.keyboard.press("Enter")
+        time.sleep(0.5)
+        page.keyboard.press("Escape")
+        return f"Searched for '{query}' on page"
+    def download_file(self, url: str, save_path: str) -> str:
+        page = self.ensure_page()
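+        with page.expect_download() as dl_info:
+            try:
+                page.goto(url)
+            except Exception:
+                # goto() usually raises when the navigation turns into a download;
+                # expect_download() still captures the download itself.
+                pass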
+        dl = dl_info.value
+        dl.save_as(save_path)
+        return f"Downloaded to {save_path}"
+    def extract_links(self) -> List[Dict[str, str]]:
+        page = self.ensure_page()
+        links = page.eval_on_selector_all("a", """elements => elements.map(a => ({href: a.href, text: a.innerText.trim()}))""")
+        return links
+    def extract_tables(self) -> List[List[List[str]]]:
+        page = self.ensure_page()
+        tables = page.eval_on_selector_all("table", """
+            tables => tables.map(t => {
+                return Array.from(t.querySelectorAll('tr')).map(row =>
+                    Array.from(row.querySelectorAll('td, th')).map(cell => cell.innerText.trim())
+                );
+            })
+        """)
+        return tables
+    def evaluate_js(self, script: str) -> Any:
+        page = self.ensure_page()
+        return page.evaluate(script)
+# ---------------------------------------------------------------------------
+# Tool factory for smolagents integration
+# ---------------------------------------------------------------------------
+def make_browser_tools(browser_mcp: BrowserMCP) -> Dict[str, Any]:
+    """Generate smolagents @tool functions from BrowserMCP."""
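+    # smolagents' @tool builds each tool schema from the docstring, so every
+    # parameter is described under "Args:".
+    @tool
+    def browser_goto(url: str) -> str:
+        """Navigate the browser to a URL. Prefer this over clicking browser icons.
+        Args:
+            url: Address to open. A scheme is added when it is missing.
+        """
+        return browser_mcp.goto(url)
+    @tool
+    def browser_click(selector: str, by: str = "css") -> str:
+        """Click an element by CSS selector, text content, or ARIA role.
+        Args:
+            selector: CSS selector, visible text, or a role selector in role::name format.
+            by: Lookup strategy, one of 'css', 'text', or 'role'.
+        """
+        return browser_mcp.click(selector, by)
+    @tool
+    def browser_fill(selector: str, text: str, by: str = "css") -> str:
+        """Fill a form field with text.
+        Args:
+            selector: CSS selector, visible text, or a role selector in role::name format.
+            text: Text to enter into the field.
+            by: Lookup strategy, one of 'css', 'text', or 'role'.
+        """
+        return browser_mcp.fill(selector, text, by)
+    @tool
+    def browser_press_key(key: str) -> str:
+        """Press a keyboard key.
+        Args:
+            key: Key name such as 'Enter', 'Tab', or 'Escape'.
+        """
+        return browser_mcp.press(key)
+    @tool
+    def browser_scroll(direction: str = "down", amount: int = 500) -> str:
+        """Scroll the page up or down.
+        Args:
+            direction: Use 'down' to scroll down. Any other value scrolls up.
+            amount: Scroll distance in pixels.
+        """
+        return browser_mcp.scroll(direction, amount)
+    @tool
+    def browser_get_text(selector: str = "body") -> str:
+        """Extract text content from a page element.
+        Args:
+            selector: CSS selector of the element to read. Defaults to the whole body.
+        """
+        return browser_mcp.get_text(selector)
+    @tool
+    def browser_find_and_click(text: str) -> str:
+        """Find an element by its visible text and click it.
+        Args:
+            text: Visible text of the element to click.
+        """
+        return browser_mcp.find_and_click(text)
+    @tool
+    def browser_screenshot(path: str = "") -> str:
+        """Take a full-page screenshot of the current page.
+        Args:
+            path: File path to save the image to. If empty, the image is returned as base64.
+        """
+        return browser_mcp.screenshot(path or None)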
+    @tool
+    def browser_extract_links() -> str:
+        """Extract all links from the current page as JSON."""
+        links = browser_mcp.extract_links()
+        return json.dumps(links[:50], indent=2)  # Limit to 50
+    @tool
+    def browser_extract_tables() -> str:
+        """Extract all tables from the current page as JSON."""
+        tables = browser_mcp.extract_tables()
+        return json.dumps(tables[:5], indent=2)
+    @tool
+    def browser_evaluate_js(script: str) -> str:
+        """Execute JavaScript in the browser context and return the result."""
+        result = browser_mcp.evaluate_js(script)
+        return json.dumps(result, default=str)
+    return {
+        "browser_goto": browser_goto,
+        "browser_click": browser_click,
+        "browser_fill": browser_fill,
+        "browser_press_key": browser_press_key,
+        "browser_scroll": browser_scroll,
+        "browser_get_text": browser_get_text,
+        "browser_find_and_click": browser_find_and_click,
+        "browser_screenshot": browser_screenshot,
+        "browser_extract_links": browser_extract_links,
+        "browser_extract_tables": browser_extract_tables,
+        "browser_evaluate_js": browser_evaluate_js,
+    }
+# ---------------------------------------------------------------------------
+# Code Execution MCP (E2B Code Interpreter)
+# ---------------------------------------------------------------------------
+class CodeExecutionMCP:
+    """Sandboxed Python and shell execution via the E2B code interpreter."""
+    def __init__(self, api_key: Optional[str] = None):
+        self.api_key = api_key or os.getenv("E2B_API_KEY")
+        self._sandbox: Optional[Any] = None
+    def _get_sandbox(self):
+        if not HAS_E2B_CODE:
+            raise RuntimeError("e2b_code_interpreter not installed")
+        if self._sandbox is None:
+            self._sandbox = CodeSandbox(api_key=self.api_key)
+        return self._sandbox
+    def run_python(self, code: str, timeout: int = 30) -> Dict[str, Any]:
+        sb = self._get_sandbox()
+        execution = sb.run_code(code, timeout=timeout)
+        return {
+            "stdout": execution.logs.stdout,
+            "stderr": execution.logs.stderr,
+            "results": [str(r) for r in execution.results],
+            "error": execution.error,
+        }
+    def run_shell(self, command: str, timeout: int = 30) -> Dict[str, Any]:
+        sb = self._get_sandbox()
+        execution = sb.run_code(f"!{command}", timeout=timeout)
+        return {
+            "stdout": execution.logs.stdout,
+            "stderr": execution.logs.stderr,
+            "error": execution.error,
+        }
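+    def install_package(self, package: str) -> str:
+        result = self.run_shell(f"pip install {package}")
+        if result["error"]:
+            return f"Failed to install {package}: {result['error']}"
+        # stdout is a list of strings; join it before truncating to 500 characters.
+        return f"Installed {package}: {''.join(result['stdout'])[:500]}"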
+    def close(self):
+        if self._sandbox:
+            self._sandbox.kill()
+            self._sandbox = None
+def make_code_tools(code_mcp: CodeExecutionMCP) -> Dict[str, Any]:
+    # smolagents' @tool builds each tool schema from the docstring, so every
+    # parameter is described under "Args:".
+    @tool
+    def execute_python(code: str) -> str:
+        """Execute Python code in a sandboxed environment. Use for data processing, calculations, or parsing.
+        Args:
+            code: Python source to run in the sandbox.
+        """
+        result = code_mcp.run_python(code)
+        if result["error"]:
+            return f"Error: {result['error']}\nStderr: {result['stderr']}"
+        out = "\n".join(result["stdout"])
+        if result["results"]:
+            out += f"\nResults: {result['results']}"
+        return out[:3000]
+    @tool
+    def execute_shell(command: str) -> str:
+        """Execute a shell command in the sandbox.
+        Args:
+            command: Shell command line to run.
+        """
+        result = code_mcp.run_shell(command)
+        if result["error"]:
+            return f"Error: {result['error']}"
+        return "\n".join(result["stdout"])[:3000]
+    @tool
+    def install_python_package(package: str) -> str:
+        """Install a Python package in the sandbox.
+        Args:
+            package: Package name as accepted by pip install.
+        """
+        return code_mcp.install_package(package)
+    return {
+        "execute_python": execute_python,
+        "execute_shell": execute_shell,
+        "install_python_package": install_python_package,
+    }
+# ---------------------------------------------------------------------------
+# FileSystem MCP (Local + E2B)
+# ---------------------------------------------------------------------------
+class FileSystemMCP:
+    """Read/write files inside a local workspace directory (base_dir)."""
+    def __init__(self, base_dir: str = "./workspace"):
+        self.base_dir = os.path.abspath(base_dir)
+        os.makedirs(self.base_dir, exist_ok=True)
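+    def _safe_path(self, path: str) -> str:
+        abs_path = os.path.abspath(os.path.join(self.base_dir, path))
+        # Compare whole path components; a plain startswith() would also accept
+        # sibling directories such as "<base_dir>-evil".
+        if os.path.commonpath([abs_path, self.base_dir]) != self.base_dir:
+            raise ValueError("Path traversal attempt detected")
+        return abs_path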
+    def read_file(self, path: str) -> str:
+        sp = self._safe_path(path)
+        with open(sp, "r", encoding="utf-8", errors="ignore") as f:
+            return f.read()
+    def write_file(self, path: str, content: str) -> str:
+        sp = self._safe_path(path)
+        os.makedirs(os.path.dirname(sp), exist_ok=True)
+        with open(sp, "w", encoding="utf-8") as f:
+            f.write(content)
+        return f"Wrote {len(content)} chars to {path}"
+    def list_dir(self, path: str = ".") -> List[str]:
+        sp = self._safe_path(path)
+        return os.listdir(sp)
+    def read_image(self, path: str) -> Image.Image:
+        sp = self._safe_path(path)
+        return Image.open(sp)
+def make_fs_tools(fs_mcp: FileSystemMCP) -> Dict[str, Any]:
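+    # smolagents' @tool builds each tool schema from the docstring, so every
+    # parameter is described under "Args:".
+    @tool
+    def fs_read(path: str) -> str:
+        """Read a text file from the workspace.
+        Args:
+            path: File path relative to the workspace directory.
+        """
+        return fs_mcp.read_file(path)
+    @tool
+    def fs_write(path: str, content: str) -> str:
+        """Write text content to a file in the workspace.
+        Args:
+            path: File path relative to the workspace directory.
+            content: Text to write. An existing file is overwritten.
+        """
+        return fs_mcp.write_file(path, content)
+    @tool
+    def fs_list(path: str = ".") -> str:
+        """List files in a workspace directory.
+        Args:
+            path: Directory path relative to the workspace directory.
+        """
+        return json.dumps(fs_mcp.list_dir(path))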
+    return {
+        "fs_read": fs_read,
+        "fs_write": fs_write,
+        "fs_list": fs_list,
+    }
+# ---------------------------------------------------------------------------
+# HF Hub MCP (Hugging Face ecosystem integration)
+# ---------------------------------------------------------------------------
+class HFHubMCP:
+    """Interact with the Hugging Face Hub from within the agent."""
+    def __init__(self, token: Optional[str] = None):
+        self.token = token or os.getenv("HF_TOKEN")
+        from huggingface_hub import HfApi, upload_file, create_repo
+        self.api = HfApi(token=self.token)
+        self._upload_file = upload_file
+        self._create_repo = create_repo
+    def search_models(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
+        models = self.api.list_models(search=query, limit=limit, sort="downloads")
+        return [{"id": m.id, "downloads": m.downloads, "tags": m.tags} for m in models]
+    def search_datasets(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
+        datasets = self.api.list_datasets(search=query, limit=limit)
+        return [{"id": d.id, "tags": d.tags} for d in datasets]
+    def search_spaces(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
+        spaces = self.api.list_spaces(search=query, limit=limit)
+        return [{"id": s.id, "sdk": getattr(s, "sdk", "unknown")} for s in spaces]
+    def upload_to_dataset(self, repo_id: str, file_path: str, path_in_repo: str) -> str:
+        self._upload_file(
+            path_or_fileobj=file_path,
+            path_in_repo=path_in_repo,
+            repo_id=repo_id,
+            repo_type="dataset",
+            token=self.token,
+        )
+        return f"Uploaded {file_path} to {repo_id}/{path_in_repo}"
+def make_hf_tools(hf_mcp: HFHubMCP) -> Dict[str, Any]:
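+    # smolagents' @tool builds each tool schema from the docstring, so every
+    # parameter is described under "Args:".
+    @tool
+    def hf_search_models(query: str, limit: int = 10) -> str:
+        """Search Hugging Face Hub for models.
+        Args:
+            query: Search string matched against model names.
+            limit: Maximum number of results to return.
+        """
+        results = hf_mcp.search_models(query, limit)
+        return json.dumps(results, indent=2)
+    @tool
+    def hf_search_datasets(query: str, limit: int = 10) -> str:
+        """Search Hugging Face Hub for datasets.
+        Args:
+            query: Search string matched against dataset names.
+            limit: Maximum number of results to return.
+        """
+        results = hf_mcp.search_datasets(query, limit)
+        return json.dumps(results, indent=2)
+    @tool
+    def hf_search_spaces(query: str, limit: int = 10) -> str:
+        """Search Hugging Face Hub for Spaces.
+        Args:
+            query: Search string matched against Space names.
+            limit: Maximum number of results to return.
+        """
+        results = hf_mcp.search_spaces(query, limit)
+        return json.dumps(results, indent=2)
+    @tool
+    def hf_upload_dataset_file(repo_id: str, file_path: str, path_in_repo: str) -> str:
+        """Upload a file to a Hugging Face dataset repository.
+        Args:
+            repo_id: Target dataset repository in namespace/name form.
+            file_path: Local path of the file to upload.
+            path_in_repo: Destination path inside the repository.
+        """
+        return hf_mcp.upload_to_dataset(repo_id, file_path, path_in_repo)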
+    return {
+        "hf_search_models": hf_search_models,
+        "hf_search_datasets": hf_search_datasets,
+        "hf_search_spaces": hf_search_spaces,
+        "hf_upload_dataset_file": hf_upload_dataset_file,
+    }
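
Review note on `FileSystemMCP._safe_path`: a containment check of this kind can be sketched with `os.path.commonpath`, which compares whole path components, so a sibling directory such as `workspace-evil` is not mistaken for a child of `workspace`. This is a minimal standalone sketch and not part of the PR; the helper name `is_within` and the example paths are hypothetical, and `os.path.realpath` is used here on the assumption that symlinks should be resolved before comparing.

```python
import os


def is_within(base_dir: str, candidate: str) -> bool:
    """Return True if `candidate`, resolved against `base_dir`, stays inside it.

    Hypothetical helper for illustration only.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, candidate))
    # commonpath() works on path components, not on raw string prefixes.
    return os.path.commonpath([base, target]) == base


if __name__ == "__main__":
    print(is_within("/tmp/workspace", "notes/a.txt"))
    print(is_within("/tmp/workspace", "../workspace-evil/x"))
    print(is_within("/tmp/workspace", "../../etc/passwd"))
```

A prefix test such as `str.startswith` would accept the second path, because `/tmp/workspace-evil/x` begins with the string `/tmp/workspace`; comparing components avoids that.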

requirements.txt ADDED Viewed

	@@ -0,0 +1,25 @@

+smolagents==1.14.0
+e2b_desktop==1.6.5
+Pillow
+huggingface_hub
+openai
+gradio_modal
+spaces
+python-dotenv
+# Enhanced stack
+playwright>=1.40.0
+chromadb>=0.6.0
+sentence-transformers>=3.0.0
+numpy>=1.26.0
+tiktoken>=0.7.0
+pydantic>=2.0.0
+aiohttp>=3.9.0
+httpx>=0.27.0
+soundfile>=0.12.0
+# Voice
+faster-whisper>=1.0.0
+# Optional: E2B code interpreter (falls back gracefully if missing)
+e2b_code_interpreter>=1.0.0

templates/viewer.html ADDED Viewed

	@@ -0,0 +1,753 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Computer Agent Evaluation Viewer</title>
+    <style>
+        /* CSS styles here */
+        body {
+            font-family: Arial, sans-serif;
+            margin: 0;
+            padding: 20px;
+            background-color: #f5f5f5;
+        }
+        .container {
+            max-width: 1200px;
+            margin: 0 auto;
+            background-color: #fff;
+            padding: 20px;
+            border-radius: 8px;
+            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
+        }
+        h1, h2, h3 {
+            color: #333;
+        }
+        select, input, button {
+            padding: 8px 12px;
+            margin: 5px 0;
+            border: 1px solid #ddd;
+            border-radius: 4px;
+        }
+        button {
+            background-color: #4a6cf7;
+            color: white;
+            cursor: pointer;
+            border: none;
+        }
+        button:hover {
+            background-color: #3a5ce5;
+        }
+        button:disabled {
+            background-color: #cccccc;
+            cursor: not-allowed;
+        }
+        .row {
+            display: flex;
+            margin-bottom: 20px;
+        }
+        .col {
+            flex: 1;
+            padding: 0 10px;
+        }
+        .image-viewer {
+            width: 100%;
+            max-height: 500px;
+            border: 1px solid #ddd;
+            border-radius: 4px;
+            overflow: hidden;
+            margin-bottom: 10px;
+            position: relative;
+        }
+        .image-viewer img {
+            max-width: 100%;
+            max-height: 450px;
+            display: block;
+            margin: 0 auto;
+        }
+        .image-controls {
+            display: flex;
+            justify-content: space-between;
+            align-items: center;
+            margin-top: 10px;
+        }
+        .nav-buttons {
+            display: flex;
+            gap: 10px;
+        }
+        .step {
+            border: 1px solid #ddd;
+            border-radius: 4px;
+            margin-bottom: 10px;
+            overflow: hidden;
+        }
+        .step-header {
+            background-color: #f0f0f0;
+            padding: 10px;
+            font-weight: bold;
+            cursor: pointer;
+            display: flex;
+            justify-content: space-between;
+        }
+        .step-content {
+            padding: 15px;
+            white-space: pre-wrap;
+            font-family: monospace;
+            background-color: #f9f9f9;
+            max-height: 300px;
+            overflow-y: auto;
+        }
+        .hidden {
+            display: none;
+        }
+        .status-success {
+            color: #22c55e;
+            font-weight: bold;
+        }
+        .status-failure {
+            color: #ef4444;
+            font-weight: bold;
+        }
+        .tabs {
+            display: flex;
+            border-bottom: 1px solid #ddd;
+            margin-bottom: 20px;
+        }
+        .tab {
+            padding: 10px 20px;
+            cursor: pointer;
+            border-bottom: 2px solid transparent;
+        }
+        .tab.active {
+            border-bottom-color: #4a6cf7;
+            font-weight: bold;
+        }
+        .tab-content {
+            display: none;
+        }
+        .tab-content.active {
+            display: block;
+        }
+        pre {
+            background-color: #f0f0f0;
+            padding: 10px;
+            border-radius: 4px;
+            overflow-x: auto;
+            white-space: pre-wrap;
+        }
+        .error-message {
+            background-color: #fee2e2;
+            color: #b91c1c;
+            padding: 10px;
+            border-radius: 4px;
+            margin: 10px 0;
+        }
+        .loading {
+            display: inline-block;
+            width: 20px;
+            height: 20px;
+            border: 2px solid #f3f3f3;
+            border-top: 2px solid #3498db;
+            border-radius: 50%;
+            animation: spin 1s linear infinite;
+            margin-left: 10px;
+        }
+        @keyframes spin {
+            0% { transform: rotate(0deg); }
+            100% { transform: rotate(360deg); }
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+        <h1>Computer Agent Evaluation Viewer</h1>
+        <!-- Path and Eval Selection -->
+        <div style="margin-bottom: 20px; padding: 15px; background-color: #f0f0f0; border-radius: 8px;">
+            <h2>Load Evaluation Data</h2>
+            <div style="display: flex; gap: 10px; margin-top: 10px;">
+                <input type="text" id="base-path" placeholder="Base directory path (leave empty for default)"
+                       style="flex-grow: 1; padding: 8px; border: 1px solid #ddd; border-radius: 4px;">
+                <button id="refresh-evals-btn">Refresh</button>
+            </div>
+            <div style="margin-top: 10px;">
+                <label for="eval-select">Select Evaluation:</label>
+                <select id="eval-select" style="min-width: 300px;"></select>
+            </div>
+            <div id="load-status" style="margin-top: 10px; font-style: italic;"></div>
+        </div>
+        <!-- Example and Run Selectors -->
+        <div class="row">
+            <div class="col">
+                <label for="example-select">Select Example:</label>
+                <select id="example-select">
+                    <option value="">-- Select Example --</option>
+                </select>
+            </div>
+            <div class="col">
+                <label for="run-select">Select Run:</label>
+                <select id="run-select" disabled>
+                    <option value="">-- Select Run --</option>
+                </select>
+            </div>
+        </div>
+        <!-- Task & Status Display -->
+        <div id="run-details" class="hidden">
+            <div>
+                <h2>Task</h2>
+                <pre id="task-text"></pre>
+            </div>
+            <div>
+                <h2>Run Status</h2>
+                <div id="status-display"></div>
+            </div>
+            <!-- Tabs -->
+            <div class="tabs">
+                <div class="tab active" data-tab="screenshots">Screenshots</div>
+                <div class="tab" data-tab="agent-trace">Agent Trace</div>
+                <div class="tab" data-tab="raw-json">Raw JSON</div>
+            </div>
+            <!-- Screenshots Tab -->
+            <div id="screenshots-tab" class="tab-content active">
+                <div id="no-images" class="hidden">
+                    <p>No screenshots available for this run.</p>
+                </div>
+                <div id="image-container" class="image-viewer hidden">
+                    <img id="current-image" src="" alt="Screenshot">
+                    <p id="image-caption" class="text-center"></p>
+                </div>
+                <div class="image-controls hidden" id="image-controls">
+                    <div class="nav-buttons">
+                        <button id="prev-image">Previous</button>
+                        <span id="image-counter">0 / 0</span>
+                        <button id="next-image">Next</button>
+                    </div>
+                    <input type="range" id="image-slider" min="0" max="0" value="0" style="width: 100%">
+                </div>
+            </div>
+            <!-- Agent Trace Tab -->
+            <div id="agent-trace-tab" class="tab-content">
+                <div id="agent-steps"></div>
+            </div>
+            <!-- Raw JSON Tab -->
+            <div id="raw-json-tab" class="tab-content">
+                <div id="json-loading-indicator" class="hidden">
+                    <p>Loading metadata... <span class="loading"></span></p>
+                </div>
+                <div id="json-error" class="error-message hidden"></div>
+                <pre id="raw-json"></pre>
+            </div>
+        </div>
+    </div>
+    <script>
+        // Application state
+        const appState = {
+            basePath: '',
+            evalId: null,
+            currentExampleId: null,
+            currentRunId: null,
+            currentImages: [],
+            currentImageIndex: 0,
+            loadedData: {
+                examples: {},
+                runs: {},
+                metadata: {},
+                screenshots: {}
+            }
+        };
+        // DOM elements
+        const basePathInput = document.getElementById('base-path');
+        const refreshEvalsBtn = document.getElementById('refresh-evals-btn');
+        const evalSelect = document.getElementById('eval-select');
+        const loadStatusDisplay = document.getElementById('load-status');
+        const exampleSelect = document.getElementById('example-select');
+        const runSelect = document.getElementById('run-select');
+        const runDetails = document.getElementById('run-details');
+        const taskText = document.getElementById('task-text');
+        const statusDisplay = document.getElementById('status-display');
+        const imageContainer = document.getElementById('image-container');
+        const noImages = document.getElementById('no-images');
+        const imageControls = document.getElementById('image-controls');
+        const currentImage = document.getElementById('current-image');
+        const imageCaption = document.getElementById('image-caption');
+        const imageCounter = document.getElementById('image-counter');
+        const imageSlider = document.getElementById('image-slider');
+        const prevImage = document.getElementById('prev-image');
+        const nextImage = document.getElementById('next-image');
+        const agentSteps = document.getElementById('agent-steps');
+        const rawJson = document.getElementById('raw-json');
+        const jsonLoadingIndicator = document.getElementById('json-loading-indicator');
+        const jsonError = document.getElementById('json-error');
+        // Initialize by loading available evaluations
+        refreshEvalsBtn.addEventListener('click', loadEvaluations);
+        // Load evaluations from server
+        async function loadEvaluations() {
+            appState.basePath = basePathInput.value.trim();
+            loadStatusDisplay.textContent = 'Loading evaluations...';
+            refreshEvalsBtn.disabled = true;
+            try {
+                const response = await fetch(`/api/evals?path=${encodeURIComponent(appState.basePath)}`);
+                if (!response.ok) {
+                    const errorData = await response.json();
+                    throw new Error(errorData.error || 'Failed to load evaluations');
+                }
+                const evals = await response.json();
+                // Clear existing options
+                evalSelect.innerHTML = '<option value="">-- Select Evaluation --</option>';
+                // Add new options
+                evals.forEach(evalId => {
+                    const option = document.createElement('option');
+                    option.value = evalId;
+                    option.textContent = evalId;
+                    evalSelect.appendChild(option);
+                });
+                loadStatusDisplay.textContent = `Loaded ${evals.length} evaluations`;
+                // AUTO-SELECT LATEST EVALUATION
+                if (evals.length > 0) {
+                    // Sort evaluations to get the latest one
+                    evals.sort().reverse();
+                    evalSelect.value = evals[0];
+                    // Trigger change event to load examples
+                    evalSelect.dispatchEvent(new Event('change'));
+                }
+            } catch (err) {
+                console.error('Error loading evaluations:', err);
+                loadStatusDisplay.textContent = `Error: ${err.message}`;
+            } finally {
+                refreshEvalsBtn.disabled = false;
+            }
+        }
+        // Handle evaluation selection
+        evalSelect.addEventListener('change', async () => {
+            appState.evalId = evalSelect.value;
+            if (!appState.evalId) {
+                exampleSelect.innerHTML = '<option value="">-- Select Example --</option>';
+                exampleSelect.disabled = true;
+                runSelect.innerHTML = '<option value="">-- Select Run --</option>';
+                runSelect.disabled = true;
+                runDetails.classList.add('hidden');
+                return;
+            }
+            try {
+                loadStatusDisplay.textContent = 'Loading examples...';
+                evalSelect.disabled = true;
+                const response = await fetch(`/api/eval/${encodeURIComponent(appState.evalId)}/examples?path=${encodeURIComponent(appState.basePath)}`);
+                if (!response.ok) {
+                    const errorData = await response.json();
+                    throw new Error(errorData.error || 'Failed to load examples');
+                }
+                const examples = await response.json();
+                appState.loadedData.examples = examples;
+                // Update example dropdown
+                exampleSelect.innerHTML = '<option value="">-- Select Example --</option>';
+                for (const [exampleId, task] of Object.entries(examples)) {
+                    const option = document.createElement('option');
+                    option.value = exampleId;
+                    option.textContent = exampleId;
+                    option.title = task; // Show task as tooltip
+                    exampleSelect.appendChild(option);
+                }
+                exampleSelect.disabled = false;
+                runSelect.innerHTML = '<option value="">-- Select Run --</option>';
+                runSelect.disabled = true;
+                runDetails.classList.add('hidden');
+                loadStatusDisplay.textContent = `Loaded ${Object.keys(examples).length} examples`;
+                // AUTO-SELECT FIRST EXAMPLE
+                if (Object.keys(examples).length > 0) {
+                    const firstExampleId = Object.keys(examples)[0];
+                    exampleSelect.value = firstExampleId;
+                    // Trigger change event to load runs
+                    exampleSelect.dispatchEvent(new Event('change'));
+                }
+            } catch (err) {
+                console.error('Error loading examples:', err);
+                loadStatusDisplay.textContent = `Error: ${err.message}`;
+            } finally {
+                evalSelect.disabled = false;
+            }
+        });
+        // Example selection
+        exampleSelect.addEventListener('change', async () => {
+            appState.currentExampleId = exampleSelect.value;
+            // Reset run selection
+            runSelect.innerHTML = '<option value="">-- Select Run --</option>';
+            if (!appState.currentExampleId) {
+                runSelect.disabled = true;
+                runDetails.classList.add('hidden');
+                return;
+            }
+            try {
+                loadStatusDisplay.textContent = 'Loading runs...';
+                exampleSelect.disabled = true;
+                const response = await fetch(`/api/eval/${appState.evalId}/example/${appState.currentExampleId}/runs?path=${encodeURIComponent(appState.basePath)}`);
+                if (!response.ok) {
+                    // Tolerate non-JSON error bodies so the fallback message below is used
+                    const errorData = await response.json().catch(() => ({}));
+                    throw new Error(errorData.error || 'Failed to load runs');
+                }
+                const runs = await response.json();
+                appState.loadedData.runs[appState.currentExampleId] = runs;
+                // SORT RUNS by ID (assuming run IDs have timestamps or sequence numbers)
+                runs.sort((a, b) => a.id.localeCompare(b.id, undefined, {numeric: true}));
+                // Update run dropdown with sorted runs
+                runSelect.innerHTML = '<option value="">-- Select Run --</option>';
+                runs.forEach(run => {
+                    const option = document.createElement('option');
+                    option.value = run.id;
+                    option.textContent = `${run.id} (${run.status})`;
+                    option.dataset.status = run.status;
+                    runSelect.appendChild(option);
+                });
+                runSelect.disabled = false;
+                runDetails.classList.add('hidden');
+                loadStatusDisplay.textContent = `Loaded ${runs.length} runs`;
+                // AUTO-SELECT FIRST RUN
+                if (runs.length > 0) {
+                    runSelect.value = runs[0].id;
+                    // Trigger change event to load run data
+                    runSelect.dispatchEvent(new Event('change'));
+                }
+            } catch (err) {
+                console.error('Error loading runs:', err);
+                loadStatusDisplay.textContent = `Error: ${err.message}`;
+            } finally {
+                exampleSelect.disabled = false;
+            }
+        });
+        // Run selection
+        runSelect.addEventListener('change', () => {
+            appState.currentRunId = runSelect.value;
+            if (appState.currentRunId && appState.currentExampleId) {
+                loadRunData(appState.currentExampleId, appState.currentRunId);
+                runDetails.classList.remove('hidden');
+            } else {
+                runDetails.classList.add('hidden');
+            }
+        });
+        // Load run data
+        async function loadRunData(exampleId, runId) {
+            loadStatusDisplay.textContent = 'Loading run data...';
+            runSelect.disabled = true;
+            jsonLoadingIndicator.classList.remove('hidden');
+            jsonError.classList.add('hidden');
+            try {
+                // Get metadata
+                const metadataResponse = await fetch(`/api/eval/${appState.evalId}/example/${exampleId}/run/${runId}/metadata?path=${encodeURIComponent(appState.basePath)}`);
+                let metadata;
+                if (metadataResponse.ok) {
+                    metadata = await metadataResponse.json();
+                } else {
+                    // Tolerate non-JSON error bodies so screenshots can still load
+                    const errorData = await metadataResponse.json().catch(() => ({}));
+                    console.error('Error loading metadata:', errorData);
+                    jsonError.textContent = `Error loading metadata: ${errorData.error || 'Unknown error'}`;
+                    jsonError.classList.remove('hidden');
+                    metadata = null;
+                }
+                appState.loadedData.metadata[exampleId] = appState.loadedData.metadata[exampleId] || {};
+                appState.loadedData.metadata[exampleId][runId] = metadata;
+                // Display task
+                const task = appState.loadedData.examples[exampleId];
+                taskText.textContent = task || "No task available";
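+                // Display status (built from DOM nodes so error text is never parsed as HTML)
+                statusDisplay.innerHTML = '';
+                const addStatusLine = (text, className) => {
+                    const p = document.createElement('p');
+                    const target = className ? p.appendChild(document.createElement('span')) : p;
+                    if (className) target.className = className;
+                    target.textContent = text;
+                    statusDisplay.appendChild(p);
+                };
+                if (metadata) {
+                    if (metadata.status === 'completed') {
+                        addStatusLine('✓ Completed successfully', 'status-success');
+                    } else {
+                        addStatusLine('✗ Failed', 'status-failure');
+                        if (metadata.error_message) {
+                            addStatusLine(`Error: ${metadata.error_message}`);
+                        }
+                    }
+                } else {
+                    addStatusLine('Status information not available');
+                }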
+                // Get screenshots
+                const screenshotsResponse = await fetch(`/api/eval/${appState.evalId}/example/${exampleId}/run/${runId}/screenshots?path=${encodeURIComponent(appState.basePath)}`);
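+                // Fall back to an empty list if the request fails or the payload is not an array
+                const screenshotsData = screenshotsResponse.ok ? await screenshotsResponse.json() : [];
+                const screenshots = Array.isArray(screenshotsData) ? screenshotsData : [];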
+                appState.loadedData.screenshots[exampleId] = appState.loadedData.screenshots[exampleId] || {};
+                appState.loadedData.screenshots[exampleId][runId] = screenshots;
+                // Load screenshots
+                loadScreenshots(exampleId, runId);
+                // Load agent trace
+                renderAgentTrace(metadata);
+                // Display raw JSON
+                if (metadata) {
+                    rawJson.textContent = JSON.stringify(metadata, null, 2);
+                } else {
+                    rawJson.textContent = "No metadata available";
+                }
+                // Show screenshots tab by default
+                document.querySelector('.tab[data-tab="screenshots"]').click();
+                loadStatusDisplay.textContent = 'Run data loaded successfully';
+            } catch (err) {
+                console.error('Error loading run data:', err);
+                loadStatusDisplay.textContent = `Error: ${err.message}`;
+                jsonError.textContent = `Error loading data: ${err.message}`;
+                jsonError.classList.remove('hidden');
+            } finally {
+                jsonLoadingIndicator.classList.add('hidden');
+                runSelect.disabled = false;
+            }
+        }
+        // Load screenshots
+        function loadScreenshots(exampleId, runId) {
+            appState.currentImages = appState.loadedData.screenshots[exampleId]?.[runId] || [];
+            if (appState.currentImages.length === 0) {
+                imageContainer.classList.add('hidden');
+                imageControls.classList.add('hidden');
+                noImages.classList.remove('hidden');
+                return;
+            }
+            // Setup image viewer
+            noImages.classList.add('hidden');
+            imageContainer.classList.remove('hidden');
+            imageControls.classList.remove('hidden');
+            // Configure slider
+            imageSlider.min = 0;
+            imageSlider.max = appState.currentImages.length - 1;
+            imageSlider.value = 0;
+            // Reset to first image
+            appState.currentImageIndex = 0;
+            updateImageDisplay();
+        }
+        // Update image display
+        function updateImageDisplay() {
+            if (appState.currentImages.length === 0) return;
+            const image = appState.currentImages[appState.currentImageIndex];
+            currentImage.src = image.path;
+            imageCaption.textContent = image.name;
+            imageCounter.textContent = `${appState.currentImageIndex + 1} / ${appState.currentImages.length}`;
+            imageSlider.value = appState.currentImageIndex;
+            // Update button states
+            prevImage.disabled = appState.currentImageIndex === 0;
+            nextImage.disabled = appState.currentImageIndex === appState.currentImages.length - 1;
+        }
+        // Image navigation
+        prevImage.addEventListener('click', () => {
+            if (appState.currentImageIndex > 0) {
+                appState.currentImageIndex--;
+                updateImageDisplay();
+            }
+        });
+        nextImage.addEventListener('click', () => {
+            if (appState.currentImageIndex < appState.currentImages.length - 1) {
+                appState.currentImageIndex++;
+                updateImageDisplay();
+            }
+        });
+        imageSlider.addEventListener('input', () => {
+            appState.currentImageIndex = parseInt(imageSlider.value, 10);
+            updateImageDisplay();
+        });
+        // Tab handling
+        document.querySelectorAll('.tab').forEach(tab => {
+            tab.addEventListener('click', () => {
+                // Set active tab
+                document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
+                tab.classList.add('active');
+                // Show active content
+                const tabId = tab.getAttribute('data-tab');
+                document.querySelectorAll('.tab-content').forEach(content => {
+                    content.classList.remove('active');
+                });
+                document.getElementById(`${tabId}-tab`).classList.add('active');
+            });
+        });
+        // Render agent trace - UPDATED to show all sections expanded and remove duplicated task title
+        function renderAgentTrace(metadata) {
+            agentSteps.innerHTML = '';
+            if (!metadata || !metadata.summary || metadata.summary.length === 0) {
+                agentSteps.innerHTML = '<p>No agent trace data available</p>';
+                return;
+            }
+            // Process each step
+            metadata.summary.forEach((step, index) => {
+                const stepDiv = document.createElement('div');
+                stepDiv.className = 'step';
+                // Create step header
+                const headerDiv = document.createElement('div');
+                headerDiv.className = 'step-header';
+                let headerText = `Step ${index}`;
+                if (index === 0 && step.task) {
+                    headerText = 'Task';
+                } else if (step.model_output_message) {
+                    headerText = 'Planning';
+                } else if (step.tool_calls) {
+                    headerText = `Action ${index}`;
+                } else if (step.error) {
+                    headerText = 'Error';
+                }
+                headerDiv.innerHTML = `<span>${headerText}</span><span>▲</span>`;
+                stepDiv.appendChild(headerDiv);
+                // Create step content
+                const contentDiv = document.createElement('div');
+                contentDiv.className = 'step-content';
+                // Make all sections visible by default
+                contentDiv.style.display = 'block';
+                // Accumulate plain text: the result is assigned via textContent, so HTML tags would be shown literally
+                let contentText = '';
+                // Task information - don't duplicate the title
+                if (index === 0 && step.task) {
+                    // Just show the task content without the "Task:" title
+                    contentText += `${step.task}\n\n`;
+                }
+                // Model output and planning
+                if (step.model_output_message && step.model_output_message.content) {
+                    contentText += `Model Output:\n${step.model_output_message.content}\n\n`;
+                    if (step.plan) {
+                        contentText += `Plan:\n${step.plan}\n\n`;
+                    }
+                }
+                // Tool calls
+                if (step.tool_calls && step.tool_calls.length > 0) {
+                    step.tool_calls.forEach(toolCall => {
+                        if (toolCall.function) {
+                            contentText += `Tool Call: ${toolCall.function.name}\n`;
+                            if (toolCall.function.arguments) {
+                                contentText += `Arguments:\n${toolCall.function.arguments}\n\n`;
+                            }
+                        }
+                    });
+                }
+                // Model reasoning
+                if (step.model_output) {
+                    contentText += `Model Reasoning:\n${step.model_output}\n\n`;
+                }
+                // Observations
+                if (step.observations) {
+                    contentText += `Observations:\n${step.observations}\n\n`;
+                }
+                // Action output
+                if (step.action_output) {
+                    contentText += `Action Output:\n${step.action_output}\n\n`;
+                }
+                // Errors
+                if (step.error) {
+                    contentText += `Error Type: ${step.error.type || 'Unknown'}\n`;
+                    if (step.error.message) {
+                        contentText += `Error Message: ${step.error.message}\n`;
+                    }
+                }
+                contentDiv.textContent = contentText || "No content available for this step";
+                stepDiv.appendChild(contentDiv);
+                // Add click handler to toggle content
+                headerDiv.addEventListener('click', () => {
+                    const isHidden = contentDiv.style.display === 'none';
+                    contentDiv.style.display = isHidden ? 'block' : 'none';
+                    headerDiv.querySelector('span:last-child').textContent = isHidden ? '▲' : '▼';
+                });
+                agentSteps.appendChild(stepDiv);
+            });
+            // No need to expand the first step by default since all are now expanded
+        }
+        // Handle keyboard navigation for images
+        document.addEventListener('keydown', (e) => {
+            if (!appState.currentImages || appState.currentImages.length === 0) return;
+            // Check if the screenshots tab is active
+            const screenshotsTab = document.getElementById('screenshots-tab');
+            if (!screenshotsTab.classList.contains('active')) return;
+            if (e.key === 'ArrowLeft' && appState.currentImageIndex > 0) {
+                appState.currentImageIndex--;
+                updateImageDisplay();
+            } else if (e.key === 'ArrowRight' && appState.currentImageIndex < appState.currentImages.length - 1) {
+                appState.currentImageIndex++;
+                updateImageDisplay();
+            }
+        });
+        // Load evaluations on page load
+        document.addEventListener('DOMContentLoaded', loadEvaluations);
+    </script>
+</body>
+</html>

voice_interface.py ADDED Viewed

	@@ -0,0 +1,137 @@

+"""
+voice_interface.py — Voice I/O for the Computer Agent
+======================================================
+Speech-to-Text (Whisper / Faster-Whisper) and TTS (HF Inference API)
+"""
+import os
+import tempfile
+from typing import Optional, Dict, Any
+import numpy as np
+# STT
+try:
+    from faster_whisper import WhisperModel
+    HAS_FASTER_WHISPER = True
+except ImportError:
+    HAS_FASTER_WHISPER = False
+# TTS via HF Inference
+try:
+    from huggingface_hub import InferenceClient
+    HAS_HF_INFERENCE = True
+except ImportError:
+    HAS_HF_INFERENCE = False
+class VoiceInterface:
+    """Handles audio input (STT) and output (TTS) for the agent."""
+    def __init__(
+        self,
+        stt_model_size: str = "base",
+        tts_model: str = "hexgrad/Kokoro-82M",
+        hf_token: Optional[str] = None,
+    ):
+        self.stt_model_size = stt_model_size
+        self.tts_model = tts_model
+        self.hf_token = hf_token or os.getenv("HF_TOKEN")
+        self._stt: Optional[Any] = None
+        self._tts_client: Optional[Any] = None
+    # ------------------------------------------------------------------
+    # STT
+    # ------------------------------------------------------------------
+    def _load_stt(self) -> Any:
+        if self._stt is None:
+            if HAS_FASTER_WHISPER:
+                # Use CPU with int8 quantization for Spaces compatibility
+                self._stt = WhisperModel(self.stt_model_size, device="cpu", compute_type="int8")
+            else:
+                raise RuntimeError("faster-whisper not installed. Run: pip install faster-whisper")
+        return self._stt
+    def transcribe(self, audio_np: np.ndarray, sample_rate: int = 16000) -> Dict[str, Any]:
+        """Transcribe audio waveform to text.
+        audio_np: numpy array of float32 audio samples
+        """
+        model = self._load_stt()
+        # Write the samples to a temporary WAV file and pass its path to faster-whisper
+        import soundfile as sf
+        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
+            tmp_path = f.name
+        try:
+            sf.write(tmp_path, audio_np, sample_rate)
+            segments, info = model.transcribe(tmp_path, beam_size=5)
+            # segments is a lazy generator; consume it before the file is removed
+            text = " ".join(seg.text for seg in segments)
+        finally:
+            # Always clean up the temp file, even if transcription fails
+            os.unlink(tmp_path)
+        return {
+            "text": text.strip(),
+            "language": info.language,
+            "probability": info.language_probability,
+        }
+    def transcribe_from_file(self, file_path: str) -> Dict[str, Any]:
+        model = self._load_stt()
+        segments, info = model.transcribe(file_path, beam_size=5)
+        text = " ".join([seg.text for seg in segments])
+        return {
+            "text": text.strip(),
+            "language": info.language,
+            "probability": info.language_probability,
+        }
+    # ------------------------------------------------------------------
+    # TTS
+    # ------------------------------------------------------------------
+    def _load_tts(self) -> Any:
+        if self._tts_client is None:
+            if HAS_HF_INFERENCE:
+                self._tts_client = InferenceClient(model=self.tts_model, token=self.hf_token)
+            else:
+                raise RuntimeError("huggingface_hub not installed")
+        return self._tts_client
+    def synthesize(self, text: str, voice: str = "af") -> bytes:
+        """Synthesize text to speech bytes.
+        Returns raw audio bytes (usually WAV or MP3 depending on model).
+        Note: `voice` is currently unused; it is not forwarded to the endpoint.
+        """
+        client = self._load_tts()
+        try:
+            audio = client.text_to_speech(text, model=self.tts_model)
+            if hasattr(audio, "read"):
+                return audio.read()
+            return audio
+        except Exception:
+            # Primary model failed; fall back to a standard TTS model
+            alt_client = InferenceClient(token=self.hf_token)
+            audio = alt_client.text_to_speech(text, model="espnet/kan-bayashi_ljspeech_vits")
+            if hasattr(audio, "read"):
+                return audio.read()
+            return audio
+    def synthesize_to_file(self, text: str, output_path: str, voice: str = "af") -> str:
+        audio_bytes = self.synthesize(text, voice)
+        with open(output_path, "wb") as f:
+            f.write(audio_bytes)
+        return output_path
+    # ------------------------------------------------------------------
+    # Gradio helpers
+    # ------------------------------------------------------------------
+    def process_gradio_audio(self, audio_tuple) -> str:
+        """Process Gradio audio input (tuple of sample_rate, numpy_array)."""
+        if audio_tuple is None:
+            return ""
+        sample_rate, audio_np = audio_tuple
+        # Gradio typically delivers integer PCM; scale it to floats in [-1.0, 1.0]
+        if np.issubdtype(audio_np.dtype, np.integer):
+            scale = float(np.iinfo(audio_np.dtype).max)
+            audio_np = audio_np.astype(np.float32) / scale
+        elif audio_np.dtype != np.float32:
+            audio_np = audio_np.astype(np.float32)
+        # Downmix to mono if needed
+        if audio_np.ndim > 1:
+            audio_np = audio_np.mean(axis=1)
+        result = self.transcribe(audio_np, sample_rate)
+        return result["text"]
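
Note on the Gradio helper: the conversion step in `process_gradio_audio` can be sketched on its own as below. This is a minimal illustration, not part of the commit. The helper name `to_mono_float32` is hypothetical, and it assumes integer input is signed PCM (for example int16) laid out as `(samples, channels)`.

```python
import numpy as np


def to_mono_float32(audio_np: np.ndarray) -> np.ndarray:
    """Hypothetical helper: convert PCM samples to mono float32 in [-1.0, 1.0]."""
    if np.issubdtype(audio_np.dtype, np.integer):
        # Assumption: signed integer PCM, so divide by the dtype's max value
        scale = float(np.iinfo(audio_np.dtype).max)
        audio = audio_np.astype(np.float32) / scale
    else:
        audio = audio_np.astype(np.float32)
    if audio.ndim > 1:
        # Average the channels; assumes shape (samples, channels)
        audio = audio.mean(axis=1)
    return audio


# Two stereo frames of int16 PCM
stereo_int16 = np.array([[32767, 0], [-32767, -32767]], dtype=np.int16)
mono = to_mono_float32(stereo_int16)
```

Scaling before writing the WAV matters because float samples are generally expected to lie in the [-1.0, 1.0] range; casting int16 values to float32 without scaling would leave them far outside it.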