jkorstad committed on
Commit
31e5b3a
·
verified ·
1 Parent(s): ce0a449

Deploy Computer Agent v2.0 full stack

Files changed (9)
  1. .gitignore +6 -0
  2. README.md +93 -0
  3. core_agent.py +707 -0
  4. e2bqwen.py +500 -0
  5. eval_harness.py +366 -0
  6. mcp_tools.py +479 -0
  7. requirements.txt +25 -0
  8. templates/viewer.html +753 -0
  9. voice_interface.py +137 -0
.gitignore ADDED
@@ -0,0 +1,6 @@
+ .local/
+ __pycache__/
+ *.pyc
+ memory_db/
+ tmp/
+ eval_results/
README.md ADDED
@@ -0,0 +1,93 @@
+ ---
+ title: Computer Agent v2.0
+ emoji: 🤖
+ colorFrom: purple
+ colorTo: blue
+ sdk: gradio
+ sdk_version: "5.0.0"
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ short_description: "Enhanced universal computer agent with planner, MCP, memory & voice"
+ ---
+
+ # 🤖 Open Computer Agent v2.0
+
+ An **enhanced** universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev).
+
+ ## What's New in v2.0
+
+ | Feature | Description |
+ |---------|-------------|
+ | 🧠 **Hierarchical Planner** | Breaks goals into subtasks before execution using a cheap text model |
+ | 🔌 **Playwright MCP** | Semantic browser control (click by text/role, extract tables/links, evaluate JS) |
+ | 🎯 **Multi-Model Router** | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
+ | 🧩 **Set-of-Marks Vision** | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
+ | 🗄️ **Long-Term Memory** | ChromaDB vector store retrieves similar past tasks and proven strategies |
+ | 🔍 **Verifier Agent** | Checks subtask completion and triggers recovery loops |
+ | 🛑 **Human-in-the-Loop** | Pauses on sensitive actions (payments, emails, deletes) for user approval |
+ | 🎙️ **Voice I/O** | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
+ | 💰 **Cost Dashboard** | Real-time $/task, token usage, and latency tracking |
+ | 📹 **Session Recording** | Saves every step as replayable macros with GIF/MP4 export potential |
+ | 🧪 **Enhanced Eval** | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
+
+ ## Architecture
+
+ ```
+ User Input (Text / Voice / File)
+     |
+     v
+ [Intelligence Router] ----> Planner (JSON DAG)
+     |
+     v
+ [Memory Retrieval] (ChromaDB)
+     |
+     v
+ [Plan Executor]
+     |
+     +---> [Browser Sub-Agent] (Playwright MCP)
+     +---> [Desktop Sub-Agent] (E2B + SoM Vision)
+     +---> [Coder Sub-Agent] (Code Interpreter)
+     +---> [HF Hub Sub-Agent] (Search / Upload)
+     |
+     v
+ [Verifier] -> Retry / Alternative / Continue
+     |
+     v
+ [Macro Saver] + Cost Report + Session Recording
+ ```
+
+ ## Quick Start
+
+ 1. Set your **HF_TOKEN** and **E2B_API_KEY** in the Space Secrets.
+ 2. Type a task (or speak it) and hit **🚀 Let's go!**.
+ 3. Watch the agent plan, execute, verify, and report costs.
+
+ ## Sensitive Actions
+
+ By default, the agent pauses before:
+ - Payments, purchases, subscriptions
+ - Sending emails/messages/posts
+ - Deleting files or uninstalling software
+ - Password/credit-card fields
+
+ Enable **Auto-approve all actions** in Advanced Options to disable HITL.
+
+ ## Cost Budget
+
+ Default budget is **$2.00 USD per session**. Once 90% of the budget has been spent, the router falls back to the cheapest model for each modality (fast vision or fast text).
+
+ ## Benchmarks
+
+ Run the built-in eval suite:
+ ```python
+ from eval_harness import EvaluationHarness
+ # See eval_harness.py for usage
+ ```
+
+ ## Credits
+
+ - [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
+ - [E2B](https://e2b.dev) for secure sandboxed desktops
+ - [Playwright](https://playwright.dev) for browser automation
+ - [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
core_agent.py ADDED
@@ -0,0 +1,707 @@
+ """
+ core_agent.py - Enhanced Computer Agent Brain
+ =============================================
+ Hierarchical Planner + Verifier + Multi-Model Router + Long-Term Memory
+ """
+
+ import os
+ import json
+ import time
+ import uuid
+ from datetime import datetime
+ from typing import Any, Dict, List, Optional, Tuple
+ from dataclasses import dataclass, field
+
+ import numpy as np
+ from PIL import Image, ImageDraw, ImageFont
+
+ # Smolagents
+ from smolagents import CodeAgent, tool
+ from smolagents.agent_types import AgentImage
+ from smolagents.memory import ActionStep, TaskStep
+ from smolagents.models import ChatMessage, Model, HfApiModel
+ from smolagents.monitoring import LogLevel
+
+ # Local model fallback
+ from huggingface_hub import InferenceClient
+
+ # Try ChromaDB for memory
+ try:
+     import chromadb
+     from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
+     HAS_CHROMA = True
+ except ImportError:
+     HAS_CHROMA = False
+
+ # Try sentence-transformers for embeddings
+ try:
+     from sentence_transformers import SentenceTransformer
+     HAS_ST = True
+ except ImportError:
+     HAS_ST = False
+
+
+ # ---------------------------------------------------------------------------
+ # Data models
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class Subtask:
+     id: str
+     description: str
+     status: str = "pending"  # pending | running | completed | failed
+     strategy: str = "auto"  # browser | desktop | code | vision
+     depends_on: List[str] = field(default_factory=list)
+     result: Any = None
+     retries: int = 0
+     max_retries: int = 2
+
+
+ @dataclass
+ class Plan:
+     goal: str
+     subtasks: List[Subtask]
+     created_at: float = field(default_factory=time.time)
+
+
+ @dataclass
+ class ModelCall:
+     model_id: str
+     tokens_in: int = 0
+     tokens_out: int = 0
+     latency_ms: float = 0.0
+     cost_usd: float = 0.0
+     timestamp: float = field(default_factory=time.time)
+
+
+ # ---------------------------------------------------------------------------
+ # Multi-Model Intelligence Router
+ # ---------------------------------------------------------------------------
+
+ MODEL_REGISTRY = {
+     "fast_vision": {
+         "model_id": "Qwen/Qwen2.5-VL-7B-Instruct",
+         "endpoint": None,  # Use HF Inference API
+         "type": "vision",
+         "cost_per_1k_in": 0.0001,
+         "cost_per_1k_out": 0.0002,
+         "max_tokens": 2048,
+     },
+     "powerful_vision": {
+         "model_id": "Qwen/Qwen2.5-VL-72B-Instruct",
+         "endpoint": None,
+         "type": "vision",
+         "cost_per_1k_in": 0.001,
+         "cost_per_1k_out": 0.002,
+         "max_tokens": 4096,
+     },
+     "fast_text": {
+         "model_id": "Qwen/Qwen2.5-32B-Instruct",
+         "endpoint": None,
+         "type": "text",
+         "cost_per_1k_in": 0.0002,
+         "cost_per_1k_out": 0.0004,
+         "max_tokens": 4096,
+     },
+     "powerful_text": {
+         "model_id": "Qwen/Qwen3-235B-A22B",
+         "endpoint": None,
+         "type": "text",
+         "cost_per_1k_in": 0.0015,
+         "cost_per_1k_out": 0.003,
+         "max_tokens": 8192,
+     },
+ }
+
+
+ class IntelligenceRouter(Model):
+     """Routes tasks to the optimal model based on complexity, modality, and cost."""
+
+     def __init__(
+         self,
+         hf_token: Optional[str] = None,
+         default_vision: str = "powerful_vision",
+         default_text: str = "fast_text",
+         cost_budget_usd: float = 1.0,
+     ):
+         super().__init__()
+         self.hf_token = hf_token or os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
+         self.default_vision = default_vision
+         self.default_text = default_text
+         self.cost_budget_usd = cost_budget_usd
+         self.cost_so_far_usd = 0.0
+         self.call_history: List[ModelCall] = []
+         self._clients: Dict[str, InferenceClient] = {}
+
+     def _get_client(self, model_key: str) -> InferenceClient:
+         if model_key not in self._clients:
+             cfg = MODEL_REGISTRY[model_key]
+             self._clients[model_key] = InferenceClient(
+                 model=cfg["model_id"],
+                 token=self.hf_token,
+             )
+         return self._clients[model_key]
+
+     def select_model(
+         self,
+         task_type: str = "vision",
+         complexity: str = "medium",
+         has_images: bool = False,
+     ) -> str:
+         """Select the best model for a given task."""
+         if self.cost_so_far_usd >= self.cost_budget_usd * 0.9:
+             # Budget nearly exhausted: use cheapest
+             return "fast_vision" if has_images else "fast_text"
+
+         if has_images or task_type == "vision":
+             if complexity in ("high", "complex", "spatial"):
+                 return self.default_vision
+             return "fast_vision"
+
+         if complexity in ("high", "complex", "reasoning"):
+             return "powerful_text"
+         return self.default_text
+
+     def __call__(
+         self,
+         messages: List[Dict[str, Any]],
+         stop_sequences: Optional[List[str]] = None,
+         task_type: str = "vision",
+         complexity: str = "medium",
+         has_images: bool = False,
+         model_key: Optional[str] = None,
+         **kwargs,
+     ) -> ChatMessage:
+         # An explicit model_key (used by the fallback path below) bypasses routing
+         model_key = model_key or self.select_model(task_type, complexity, has_images)
+         cfg = MODEL_REGISTRY[model_key]
+         client = self._get_client(model_key)
+
+         start = time.time()
+         try:
+             # HF InferenceClient chat_completion
+             response = client.chat_completion(
+                 messages=messages,
+                 max_tokens=cfg["max_tokens"],
+                 stop=stop_sequences,
+             )
+             latency = (time.time() - start) * 1000
+
+             # Estimate cost (rough token counting)
+             content = response.choices[0].message.content or ""
+             tok_in = self._estimate_tokens(messages)
+             tok_out = len(content.split()) * 1.3  # rough
+             cost = (tok_in / 1000) * cfg["cost_per_1k_in"] + (tok_out / 1000) * cfg["cost_per_1k_out"]
+             self.cost_so_far_usd += cost
+
+             self.call_history.append(ModelCall(
+                 model_id=cfg["model_id"],
+                 tokens_in=int(tok_in),
+                 tokens_out=int(tok_out),
+                 latency_ms=latency,
+                 cost_usd=cost,
+             ))
+
+             return ChatMessage(role="assistant", content=content)
+         except Exception as e:
+             # Fall back to the default vision/text model. The fallback key is passed
+             # explicitly; re-running select_model would pick the failing model again.
+             fallback = self.default_vision if has_images else self.default_text
+             if model_key == fallback:
+                 raise
+             print(f"[{model_key}] failed: {e}. Falling back to {fallback}")
+             return self.__call__(
+                 messages, stop_sequences, task_type, complexity, has_images,
+                 model_key=fallback, **kwargs
+             )
+
+     def _estimate_tokens(self, messages: List[Dict[str, Any]]) -> int:
+         # Very rough estimate: 4 chars ~= 1 token
+         total = 0
+         for msg in messages:
+             content = msg.get("content", "")
+             if isinstance(content, str):
+                 total += len(content) // 4
+             elif isinstance(content, list):
+                 for item in content:
+                     if isinstance(item, dict) and "text" in item:
+                         total += len(item["text"]) // 4
+         return max(total, 1)
+
+     def get_cost_report(self) -> Dict[str, Any]:
+         return {
+             "budget_usd": self.cost_budget_usd,
+             "spent_usd": round(self.cost_so_far_usd, 6),
+             "remaining_usd": round(self.cost_budget_usd - self.cost_so_far_usd, 6),
+             "calls": len(self.call_history),
+             "by_model": self._aggregate_by_model(),
+         }
+
+     def _aggregate_by_model(self) -> Dict[str, Dict[str, float]]:
+         agg = {}
+         for c in self.call_history:
+             agg.setdefault(c.model_id, {"calls": 0, "tokens_in": 0, "tokens_out": 0, "cost": 0.0})
+             agg[c.model_id]["calls"] += 1
+             agg[c.model_id]["tokens_in"] += c.tokens_in
+             agg[c.model_id]["tokens_out"] += c.tokens_out
+             agg[c.model_id]["cost"] += c.cost_usd
+         return agg
+
+
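The routing rules in `select_model` can be read as a pure function of spend, budget, modality, and complexity. The sketch below restates them outside the class so they can be checked without network access; `pick_model_key` is a hypothetical helper name, not part of this commit, and the model keys are the ones from `MODEL_REGISTRY`.

```python
HIGH_VISION = {"high", "complex", "spatial"}
HIGH_TEXT = {"high", "complex", "reasoning"}


def pick_model_key(
    spent_usd: float,
    budget_usd: float,
    has_images: bool,
    task_type: str = "vision",
    complexity: str = "medium",
    default_vision: str = "powerful_vision",
    default_text: str = "fast_text",
) -> str:
    """Restatement of the routing rules in the diff as a pure function."""
    # Past 90% of the budget, always pick the cheapest model for the modality.
    if spent_usd >= budget_usd * 0.9:
        return "fast_vision" if has_images else "fast_text"
    # Vision requests use the powerful model only for high-complexity work.
    if has_images or task_type == "vision":
        return default_vision if complexity in HIGH_VISION else "fast_vision"
    # Text requests escalate only for high-complexity or reasoning work.
    return "powerful_text" if complexity in HIGH_TEXT else default_text


early = pick_model_key(0.10, 2.0, has_images=True, complexity="spatial")
late = pick_model_key(1.85, 2.0, has_images=True, complexity="spatial")
text = pick_model_key(0.10, 2.0, has_images=False, task_type="text", complexity="reasoning")
print(early, late, text)
```

Because `task_type` defaults to `"vision"`, a text-only caller that forgets to pass `task_type="text"` is routed to a vision model even when `has_images` is false.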
+ # ---------------------------------------------------------------------------
+ # Hierarchical Planner
+ # ---------------------------------------------------------------------------
+
+ PLANNER_SYSTEM_PROMPT = """You are a Task Planner for a computer automation agent.
+ Given a user's high-level goal, break it into a JSON list of subtasks.
+ Each subtask must have:
+ - description: concise action description
+ - strategy: one of [browser, desktop, code, vision]
+ - depends_on: list of subtask indices (0-based) that must finish before this one
+
+ Rules:
+ 1. Use "browser" for web navigation, "desktop" for OS-level GUI actions,
+    "code" for writing/running scripts, "vision" for visual reasoning.
+ 2. Keep subtasks atomic (1-3 actions each).
+ 3. Start with gathering info, then acting, then verifying.
+ 4. Output ONLY valid JSON. No markdown fences.
+
+ Example input: "Find Hugging Face HQ in Paris using Google Maps"
+ Example output:
+ [
+   {"description": "Open Google Maps in browser", "strategy": "browser", "depends_on": []},
+   {"description": "Search for 'Hugging Face Paris'", "strategy": "browser", "depends_on": [0]},
+   {"description": "Extract the address from the result card", "strategy": "vision", "depends_on": [1]},
+   {"description": "Verify the address contains 'Paris'", "strategy": "code", "depends_on": [2]}
+ ]
+ """
+
+
+ class HierarchicalPlanner:
+     """Breaks a user goal into a DAG of subtasks using a cheap text model."""
+
+     def __init__(self, router: IntelligenceRouter):
+         self.router = router
+
+     def plan(self, goal: str, context: str = "") -> Plan:
+         messages = [
+             {"role": "system", "content": PLANNER_SYSTEM_PROMPT},
+             {"role": "user", "content": f"Goal: {goal}\nContext: {context}\n\nGenerate the subtask JSON list."},
+         ]
+         response = self.router(
+             messages,
+             task_type="text",
+             complexity="medium",
+             has_images=False,
+         )
+         raw = response.content.strip()
+         # Strip markdown fences if present
+         if raw.startswith("```"):
+             # Index 1 is the text between the opening and closing fences;
+             # index -1 would be the (empty) text after the closing fence.
+             raw = raw.split("```", 2)[1]
+             if raw.startswith("json"):
+                 raw = raw[4:]
+         raw = raw.strip()
+
+         try:
+             data = json.loads(raw)
+         except json.JSONDecodeError:
+             # Fallback: single subtask with the whole goal
+             data = [{"description": goal, "strategy": "auto", "depends_on": []}]
+
+         subtasks = []
+         for i, item in enumerate(data):
+             subtasks.append(Subtask(
+                 id=f"st_{i:03d}",
+                 description=item.get("description", str(item)),
+                 strategy=item.get("strategy", "auto"),
+                 depends_on=item.get("depends_on", []),
+             ))
+         return Plan(goal=goal, subtasks=subtasks)
+
+
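The planner prompt asks for 0-based `depends_on` indices, which makes the plan a dependency graph. The executor is not part of this file, so the following is only a sketch of how such a plan could be ordered, using the standard-library `graphlib` (Python 3.9+). `execution_order` is a hypothetical helper, and the sample data is the example output from the prompt.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, List

plan_items: List[Dict[str, Any]] = [
    {"description": "Open Google Maps in browser", "strategy": "browser", "depends_on": []},
    {"description": "Search for 'Hugging Face Paris'", "strategy": "browser", "depends_on": [0]},
    {"description": "Extract the address from the result card", "strategy": "vision", "depends_on": [1]},
    {"description": "Verify the address contains 'Paris'", "strategy": "code", "depends_on": [2]},
]


def execution_order(items: List[Dict[str, Any]]) -> List[int]:
    """Return subtask indices in an order that respects every depends_on edge."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {i: set(item.get("depends_on", [])) for i, item in enumerate(items)}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(plan_items)
print(order)

# A plan whose subtasks depend on each other cannot be ordered.
cyclic = [{"depends_on": [1]}, {"depends_on": [0]}]
try:
    execution_order(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

A model-generated plan can contain cycles or out-of-range indices, so validating the graph before execution seems prudent.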
+ # ---------------------------------------------------------------------------
+ # Verifier & Recovery
+ # ---------------------------------------------------------------------------
+
+ VERIFIER_SYSTEM_PROMPT = """You are a Verifier agent. Given a subtask description, the agent's action trace, and a screenshot, determine if the subtask was completed successfully.
+
+ Respond with ONLY a JSON object:
+ {"success": true/false, "reason": "short explanation", "next_action": "continue|retry|alternative"}
+
+ Rules:
+ - success=true if the intended outcome is clearly visible in the screenshot or trace.
+ - next_action=retry if the agent seems close but missed a click.
+ - next_action=alternative if the approach is fundamentally wrong.
+ """
+
+
+ class VerifierAgent:
+     """Checks if a subtask succeeded and suggests recovery."""
+
+     def __init__(self, router: IntelligenceRouter):
+         self.router = router
+
+     def verify(
+         self,
+         subtask: Subtask,
+         action_trace: List[str],
+         screenshot: Optional[Image.Image] = None,
+     ) -> Dict[str, Any]:
+         trace_text = "\n".join(action_trace[-10:])  # last 10 actions
+         content = [
+             {"type": "text", "text": f"Subtask: {subtask.description}\nAction trace:\n{trace_text}\n\nWas this completed successfully?"},
+         ]
+         if screenshot:
+             # In a real implementation we'd base64 encode the image
+             content.append({"type": "text", "text": "[Screenshot available - analyze it]"})
+
+         messages = [
+             {"role": "system", "content": VERIFIER_SYSTEM_PROMPT},
+             {"role": "user", "content": content},
+         ]
+         response = self.router(
+             messages,
+             task_type="vision" if screenshot else "text",
+             complexity="medium",
+             has_images=screenshot is not None,
+         )
+         raw = response.content.strip()
+         if raw.startswith("```"):
+             # Index 1 is the text between the fences; index -1 would be empty
+             raw = raw.split("```", 2)[1]
+             if raw.startswith("json"):
+                 raw = raw[4:]
+         raw = raw.strip()
+         try:
+             return json.loads(raw)
+         except json.JSONDecodeError:
+             return {"success": True, "reason": "Parsing failed, assuming success", "next_action": "continue"}
+
+
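Both the planner and the verifier strip an optional Markdown fence before calling `json.loads`. A minimal standalone sketch of that step follows; `extract_json` is a hypothetical helper name, and the fence marker is built programmatically only to keep this example easy to embed.

```python
import json
from typing import Any

FENCE = "`" * 3  # the three-backtick Markdown fence marker


def extract_json(reply: str) -> Any:
    """Parse a model reply that may or may not be wrapped in a Markdown fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # After splitting at most twice, index 1 holds the fenced body;
        # this also works when the closing fence is missing.
        text = text.split(FENCE, 2)[1]
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text.strip())


fenced = FENCE + 'json\n{"success": true, "next_action": "continue"}\n' + FENCE
plain = '{"success": false, "next_action": "retry"}'
print(extract_json(fenced))
print(extract_json(plain))
```

Taking the last element of the split instead of index 1 yields the empty string after the closing fence, so every fenced reply would fail to parse and fall through to the `except` branch.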
+ # ---------------------------------------------------------------------------
+ # Long-Term Memory (ChromaDB)
+ # ---------------------------------------------------------------------------
+
+ class AgentMemory:
+     """Stores and retrieves past task trajectories for few-shot prompting."""
+
+     def __init__(self, persist_dir: str = "./memory_db"):
+         self.persist_dir = persist_dir
+         os.makedirs(persist_dir, exist_ok=True)
+         self.collection = None
+         # Defined in every branch so embed() and get_domain_tips() never hit
+         # a missing attribute
+         self.embedder = None
+         self._memories: List[Dict] = []
+         if HAS_CHROMA and HAS_ST:
+             self.client = chromadb.PersistentClient(path=persist_dir)
+             self.ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
+             self.collection = self.client.get_or_create_collection(
+                 name="task_memory",
+                 embedding_function=self.ef,
+             )
+         elif HAS_ST:
+             # Fallback: keep an embedder; retrieval below uses keyword matching
+             self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
+
+     def embed(self, text: str) -> List[float]:
+         if self.embedder is not None:
+             return self.embedder.encode(text).tolist()
+         return []
+
+     def add_task(
+         self,
+         task: str,
+         strategy_summary: str,
+         success: bool,
+         final_answer: str = "",
+         domain: str = "general",
+     ):
+         entry = {
+             "task": task,
+             "strategy_summary": strategy_summary,
+             "success": success,
+             "final_answer": final_answer,
+             "domain": domain,
+             "timestamp": time.time(),
+         }
+         if self.collection is not None:
+             self.collection.add(
+                 documents=[task],
+                 metadatas=[entry],
+                 ids=[str(uuid.uuid4())],
+             )
+         else:
+             self._memories.append(entry)
+
+     def retrieve_similar(
+         self,
+         query: str,
+         n_results: int = 3,
+         filter_success: bool = True,
+     ) -> List[Dict[str, Any]]:
+         if self.collection is not None:
+             where = {"success": True} if filter_success else None
+             results = self.collection.query(
+                 query_texts=[query],
+                 n_results=n_results,
+                 where=where,
+             )
+             out = []
+             for meta in results.get("metadatas", [[]])[0]:
+                 out.append(meta)
+             return out
+         else:
+             # Simple exact/contains match fallback
+             query_lower = query.lower()
+             scored = []
+             for m in self._memories:
+                 score = 0
+                 if query_lower in m["task"].lower():
+                     score += 10
+                 if m.get("domain", "") in query_lower:
+                     score += 5
+                 if filter_success and not m.get("success", False):
+                     score -= 100
+                 scored.append((score, m))
+             scored.sort(key=lambda x: x[0], reverse=True)
+             return [x[1] for x in scored[:n_results]]
+
+     def get_domain_tips(self, domain: str) -> List[str]:
+         tips = []
+         for m in self._memories:
+             if m.get("domain") == domain and m.get("success"):
+                 tips.append(m.get("strategy_summary", ""))
+         return tips[:5]
+
+
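When ChromaDB is unavailable, the fallback path above ranks entries by substring matching rather than by embedding similarity. As a point of comparison, here is a sketch of cosine-similarity ranking over stored vectors in pure Python. The three-dimensional vectors are hand-written stand-ins for real sentence embeddings, and `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    memories: List[Dict[str, Any]],
    k: int = 3,
    only_success: bool = True,
) -> List[Dict[str, Any]]:
    """Rank stored entries by similarity to the query, optionally dropping failures."""
    pool = [m for m in memories if m["success"] or not only_success]
    ranked = sorted(pool, key=lambda m: cosine(query_vec, m["vector"]), reverse=True)
    return ranked[:k]


memories = [
    {"task": "book a flight", "vector": [1.0, 0.0, 0.0], "success": True},
    {"task": "find a hotel", "vector": [0.8, 0.6, 0.0], "success": True},
    {"task": "compile a kernel", "vector": [0.0, 0.0, 1.0], "success": False},
]

best = top_k([0.9, 0.1, 0.0], memories, k=2)
print([m["task"] for m in best])
```

Filtering failed entries out of the pool, rather than subtracting a penalty from their score, guarantees they never appear in the results even when fewer than `k` successful entries exist.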
+ # ---------------------------------------------------------------------------
+ # Set-of-Marks (SoM) Preprocessor
+ # ---------------------------------------------------------------------------
+
+ class SoMPreprocessor:
+     """Overlays numbered bounding boxes on UI elements for the agent to reference by ID."""
+
+     def __init__(self, use_icon_detection: bool = False):
+         self.use_icon_detection = use_icon_detection
+         self.element_registry: Dict[int, Tuple[int, int, int, int]] = {}
+         self.next_id = 1
+
+     def detect_elements(self, image: Image.Image) -> List[Tuple[int, int, int, int]]:
+         """Lightweight heuristic element detection.
+         In production, replace with OmniParser or seeclick model.
+         """
+         # Simple grid-based + edge heuristic fallback
+         w, h = image.size
+         boxes = []
+         # Detect potential buttons/links by looking for rectangular regions
+         # This is a placeholder; a real implementation would use a vision model
+         # For now, divide screen into a coarse grid and let agent pick grid cells
+         cols, rows = 8, 6
+         cell_w, cell_h = w // cols, h // rows
+         for r in range(rows):
+             for c in range(cols):
+                 x1, y1 = c * cell_w, r * cell_h
+                 x2, y2 = x1 + cell_w, y1 + cell_h
+                 boxes.append((x1, y1, x2, y2))
+         return boxes
+
+     def preprocess(self, image: Image.Image) -> Tuple[Image.Image, Dict[int, Tuple[int, int, int, int]]]:
+         """Return annotated image + element registry mapping ID -> bbox."""
+         boxes = self.detect_elements(image)
+         annotated = image.copy()
+         draw = ImageDraw.Draw(annotated)
+         registry = {}
+         try:
+             font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 14)
+         except Exception:
+             font = ImageFont.load_default()
+
+         for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
+             registry[i] = (x1, y1, x2, y2)
+             # Draw bounding box
+             draw.rectangle([x1, y1, x2, y2], outline="#00FF00", width=2)
+             # Draw label background
+             label = str(i)
+             bbox = draw.textbbox((0, 0), label, font=font)
+             tw, th = bbox[2] - bbox[0], bbox[3] - bbox[1]
+             draw.rectangle([x1, y1, x1 + tw + 4, y1 + th + 4], fill="#00FF00")
+             draw.text((x1 + 2, y1 + 2), label, fill="#000000", font=font)
+
+         self.element_registry = registry
+         self.next_id = len(registry) + 1
+         return annotated, registry
+
+     def get_center(self, element_id: int) -> Tuple[int, int]:
+         x1, y1, x2, y2 = self.element_registry[element_id]
+         return (x1 + x2) // 2, (y1 + y2) // 2
+
+
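The placeholder detector numbers an 8 by 6 grid in row-major order, starting from 1. The arithmetic can be checked without Pillow; in this sketch `grid_registry` and `center` are hypothetical helper names, and the 800x600 screen size is an assumption chosen for round numbers.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]


def grid_registry(width: int, height: int, cols: int = 8, rows: int = 6) -> Dict[int, Box]:
    """Map 1-based, row-major element IDs to (x1, y1, x2, y2) grid cells."""
    cell_w, cell_h = width // cols, height // rows
    registry: Dict[int, Box] = {}
    for r in range(rows):
        for c in range(cols):
            element_id = r * cols + c + 1
            x1, y1 = c * cell_w, r * cell_h
            registry[element_id] = (x1, y1, x1 + cell_w, y1 + cell_h)
    return registry


def center(box: Box) -> Tuple[int, int]:
    """Integer midpoint of a bounding box, the point a click would target."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2


registry = grid_registry(800, 600)
print(len(registry))         # 48 cells
print(registry[10])          # second row, second column
print(center(registry[10]))
```

Because the cell size uses integer division, a screen whose width or height is not a multiple of the grid leaves a thin strip on the right or bottom edge that no cell covers.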
+ # ---------------------------------------------------------------------------
+ # Session Recorder & Macro Saver
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class SessionFrame:
+     step: int
+     screenshot_path: Optional[str]
+     action: str
+     observation: str
+     timestamp: float
+
+
+ class SessionRecorder:
+     """Records every step for replay, GIF generation, and macro creation."""
+
+     def __init__(self, session_id: str, output_dir: str = "./sessions"):
+         self.session_id = session_id
+         self.output_dir = os.path.join(output_dir, session_id)
+         os.makedirs(self.output_dir, exist_ok=True)
+         self.frames: List[SessionFrame] = []
+         self.start_time = time.time()
+
+     def log_step(
+         self,
+         step: int,
+         screenshot: Optional[Image.Image],
+         action: str,
+         observation: str,
+     ):
+         path = None
+         if screenshot:
+             path = os.path.join(self.output_dir, f"step_{step:03d}.png")
+             screenshot.save(path)
+         frame = SessionFrame(
+             step=step,
+             screenshot_path=path,
+             action=action,
+             observation=observation,
+             timestamp=time.time(),
+         )
+         self.frames.append(frame)
+         # Also append to JSONL
+         with open(os.path.join(self.output_dir, "session.jsonl"), "a") as f:
+             f.write(json.dumps({
+                 "step": step,
+                 "action": action,
+                 "observation": observation,
+                 "timestamp": frame.timestamp,
+                 "screenshot": path,
+             }) + "\n")
+
+     def save_macro(self, name: str) -> str:
+         """Save successful trajectory as a replayable macro."""
+         macro = {
+             "name": name,
+             "session_id": self.session_id,
+             "frames": [
+                 {"action": f.action, "observation": f.observation, "timestamp": f.timestamp}
+                 for f in self.frames
+             ],
+         }
+         path = os.path.join(self.output_dir, f"macro_{name}.json")
+         with open(path, "w") as f:
+             json.dump(macro, f, indent=2)
+         return path
+
+     def generate_summary(self) -> Dict[str, Any]:
+         duration = time.time() - self.start_time
+         actions = [f.action for f in self.frames]
+         return {
+             "session_id": self.session_id,
+             "duration_sec": round(duration, 2),
+             "steps": len(self.frames),
+             "actions": actions,
+         }
+
+
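`log_step` appends one JSON object per line to `session.jsonl`, so a session can be read back line by line. The sketch below writes two records in the same shape to a temporary directory and loads them again; `load_session` is a hypothetical reader, not something this commit provides, and the sample actions and observations are made up for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_session(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL session log back into a list of step records."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.jsonl")
    # Write records with the same keys that log_step emits.
    for step, action in enumerate(["click(254, 308)", "click(400, 120)"], start=1):
        with open(path, "a") as f:
            f.write(json.dumps({
                "step": step,
                "action": action,
                "observation": "screenshot updated",
                "timestamp": float(step),
                "screenshot": None,
            }) + "\n")
    frames = load_session(path)

print([frame["action"] for frame in frames])
```

Appending one line per step means a crash mid-session still leaves every earlier step readable, which a single JSON document rewritten at the end would not.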
+ # ---------------------------------------------------------------------------
+ # HITL (Human-in-the-Loop) Checkpoint
+ # ---------------------------------------------------------------------------
+
+ class HITLCheckpoint:
+     """Defines categories of actions that require human approval."""
+
+     SENSITIVE_KEYWORDS = [
+         "password", "credit card", "ssn", "social security",
+         "payment", "checkout", "buy", "purchase", "subscribe",
+         "delete", "remove", "uninstall", "format",
+         "send email", "send message", "post to", "tweet",
+     ]
+
+     def __init__(self, auto_approve: bool = False):
+         self.auto_approve = auto_approve
+         self.pending_approvals: List[Dict[str, Any]] = []
+
+     def check_action(self, action: str, context: str = "") -> Tuple[bool, Optional[str]]:
+         """Returns (approved, reason). If not approved, reason explains why."""
+         if self.auto_approve:
+             return True, None
+         action_lower = action.lower()
+         for kw in self.SENSITIVE_KEYWORDS:
+             if kw in action_lower:
+                 return False, f"Sensitive action detected: '{kw}'. Requires human approval."
+         return True, None
+
+     def request_approval(self, action: str, screenshot_path: Optional[str] = None) -> Dict[str, Any]:
+         req = {
+             "id": str(uuid.uuid4()),
+             "action": action,
+             "screenshot": screenshot_path,
+             "status": "pending",
+             "requested_at": time.time(),
+         }
+         self.pending_approvals.append(req)
+         return req
+
+
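`check_action` uses plain substring matching, so a keyword such as "format" also fires inside "information". The sketch below contrasts that behaviour with a word-boundary match; `substring_hit` and `word_hit` are hypothetical helpers, and the keyword list is a small subset of `SENSITIVE_KEYWORDS`.

```python
import re
from typing import Optional

SENSITIVE = ["format", "delete", "buy", "send email"]


def substring_hit(action: str) -> Optional[str]:
    """Same rule as check_action: the first keyword found anywhere in the text."""
    lowered = action.lower()
    return next((kw for kw in SENSITIVE if kw in lowered), None)


def word_hit(action: str) -> Optional[str]:
    """Stricter variant: the keyword must appear as a whole word or phrase."""
    lowered = action.lower()
    for kw in SENSITIVE:
        if re.search(rf"\b{re.escape(kw)}\b", lowered):
            return kw
    return None


harmless = "Read the contact information on the page"
risky = "Delete the draft and send email to the team"

print(substring_hit(harmless), word_hit(harmless))
print(substring_hit(risky), word_hit(risky))
```

The trade-off runs both ways: word boundaries avoid pausing on "information", but they also miss inflected forms such as "deleting", which substring matching catches. For a safety gate, extra pauses may be the cheaper error.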
+ # ---------------------------------------------------------------------------
+ # Cost Tracker
+ # ---------------------------------------------------------------------------
+
+ class CostTracker:
+     """Tracks per-task and cumulative costs across all model calls."""
+
+     def __init__(self):
+         self.tasks: Dict[str, List[ModelCall]] = {}
+
+     def start_task(self, task_id: str):
+         self.tasks[task_id] = []
+
+     def log_call(self, task_id: str, call: ModelCall):
+         self.tasks.setdefault(task_id, []).append(call)
+
+     def get_task_report(self, task_id: str) -> Dict[str, Any]:
+         calls = self.tasks.get(task_id, [])
+         total_cost = sum(c.cost_usd for c in calls)
+         total_tokens = sum(c.tokens_in + c.tokens_out for c in calls)
+         total_latency = sum(c.latency_ms for c in calls)
+         return {
+             "task_id": task_id,
+             "calls": len(calls),
+             "total_cost_usd": round(total_cost, 6),
+             "total_tokens": total_tokens,
+             "avg_latency_ms": round(total_latency / max(len(calls), 1), 2),
+             "by_model": self._aggregate(calls),
+         }
+
+     def _aggregate(self, calls: List[ModelCall]) -> Dict[str, Dict[str, float]]:
+         agg = {}
+         for c in calls:
+             agg.setdefault(c.model_id, {"calls": 0, "cost": 0.0, "tokens": 0})
+             agg[c.model_id]["calls"] += 1
+             agg[c.model_id]["cost"] += c.cost_usd
+             agg[c.model_id]["tokens"] += c.tokens_in + c.tokens_out
+         return agg
+
+
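The per-call cost recorded in each `ModelCall` comes from the router's rough estimate: about four characters per input token, about 1.3 tokens per output word, and the per-1k prices from `MODEL_REGISTRY`. A worked example of that arithmetic follows; `estimate_cost` is a hypothetical helper, the prices are those listed for `powerful_vision`, and the prompt and reply sizes are made up.

```python
from typing import Tuple


def estimate_cost(
    prompt_chars: int,
    reply_words: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> Tuple[int, float, float]:
    """Return (tokens_in, tokens_out, cost_usd) using the heuristics in the diff."""
    tokens_in = max(prompt_chars // 4, 1)   # ~4 characters per token
    tokens_out = reply_words * 1.3          # ~1.3 tokens per word
    cost = tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k
    return tokens_in, tokens_out, cost


# 12,000 prompt characters and a 500-word reply at powerful_vision prices
tokens_in, tokens_out, cost = estimate_cost(12_000, 500, 0.001, 0.002)
print(tokens_in, tokens_out, round(cost, 6))
```

Only text items are counted by `_estimate_tokens`, so a request whose payload is mostly screenshots is charged far less than it actually consumes, and the budget guard triggers later than the figures suggest.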
+ # ---------------------------------------------------------------------------
+ # Convenience: Compose everything into an AgentConfig
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class AgentConfig:
+     hf_token: Optional[str] = None
+     cost_budget_usd: float = 2.0
+     use_planner: bool = True
+     use_verifier: bool = True
+     use_memory: bool = True
+     use_som: bool = True
+     use_hitl: bool = True
+     use_recorder: bool = True
+     memory_dir: str = "./memory_db"
+     auto_approve: bool = False
e2bqwen.py ADDED
@@ -0,0 +1,500 @@
1
+ import os
2
+ import time
3
+ import unicodedata
4
+ import spaces
5
+ from datetime import datetime
6
+ from io import BytesIO
7
+ from typing import Any, Dict, List, Optional
8
+
9
+ # E2B imports
10
+ from e2b_desktop import Sandbox
11
+ from PIL import Image, ImageDraw
12
+
13
+ # SmolaAgents imports
14
+ from smolagents import CodeAgent, HfApiModel, tool
15
+ from smolagents.agent_types import AgentImage
16
+ from smolagents.memory import ActionStep, TaskStep
17
+ from smolagents.models import ChatMessage, Model
18
+ from smolagents.monitoring import LogLevel
19
+
20
+ E2B_SYSTEM_PROMPT_TEMPLATE = """You are a desktop automation assistant that can control a remote desktop environment. The current date is <<current_date>>.
21
+
22
+ <action_process>
23
+ You will be given a task to solve in several steps. At each step you will perform an action.
24
+ After each action, you'll receive an updated screenshot.
25
+ Then you will proceed as follows, with these sections: don't skip any!
26
+
27
+ Short term goal: ...
28
+ What I see: ...
29
+ Reflection: ...
30
+ Action:
31
+ ```python
32
+ click(254, 308)
33
+ ```<end_code>
34
+
35
+ Always format your action ('Action:' part) as Python code blocks as shown above.
36
+ </action_process>
37
+
38
+ <tools>
39
+ On top of performing computations in the Python code snippets that you create, you only have access to these tools to interact with the desktop, no additional ones:
40
+ {%- for tool in tools.values() %}
41
+ - {{ tool.name }}: {{ tool.description }}
42
+ Takes inputs: {{tool.inputs}}
43
+ Returns an output of type: {{tool.output_type}}
44
+ {%- endfor %}
45
+ </tools>
46
+
47
+ <click_guidelines>
48
+ Look at elements on the screen to determine what to click or interact with.
49
+ The desktop has a resolution of <<resolution_x>>x<<resolution_y>> pixels, take it into account to decide clicking coordinates. NEVER USE HYPOTHETICAL OR ASSUMED COORDINATES, USE TRUE COORDINATES that you can see from the screenshot.
50
+ Use precise coordinates based on the current screenshot for mouse movements and clicks.
51
+ Whenever you click, MAKE SURE to click in the middle of the button, text, link or any other clickable element. Not under, not on the side. IN THE MIDDLE, otherwise you risk missing it.
52
+ In menus it is always better to click in the middle of the text rather than on the tiny icon. Calculate the coordinates extremely carefully. A mistake here can make the full task fail.
53
+ Sometimes you may have missed a click, so never assume that you're on the right page, always make sure that your previous action worked.
54
+ In the screenshot you will see a green crosshair displayed over the position of your last click: this way you can check whether the click landed off the targeted element, pay special attention to it.
55
+ </click_guidelines>
56
+
57
+ <task_resolution_example>
58
+ For a task like "Open a text editor and type 'Hello World'":
59
+ Step 1:
60
+ Short term goal: I want to open a text editor.
61
+ What I see: I am on the homepage of my desktop. I see the applications
62
+ Reflection: I think that a notes application would fit in the Applications menu, let's open it. I'll carefully click in the middle of the text 'Applications'.
63
+ Action:
64
+ ```python
65
+ click(51, 8)
66
+ ```<end_code>
67
+
68
+ Step 2:
69
+ Short term goal: I want to open a text editor.
70
+ What I see: I am on the homepage of my desktop, with the applications menu open. I see an Accessories section, I see it is a section in the menu thanks to the tiny white triangle after the text accessories.
71
+ Reflection: I think that a notes application would fit the Accessories section. I SHOULD NOT try to move through the menus with scroll, it won't work:
72
+ I'll look for Accessories and click on it being very precise, clicking in the middle of the text 'Accessories'.
73
+ Action:
74
+ ```python
75
+ click(76, 195)
76
+ ```<end_code>
77
+
78
+ Step 3:
79
+ Short term goal: I want to open a text editor.
80
+ What I see: I am under the Accessories menu. Under the open submenu Accessories, I've found 'Text Editor'.
81
+ Reflection: This must be my notes app. I remember that menus are navigated through clicking. I will now click on it being very precise, clicking in the middle of the text 'Text Editor'.
82
+ Action:
83
+ ```python
84
+ click(251, 441)
85
+ ```<end_code>
86
+
87
+ Step 4:
88
+ Short term goal: I want to open a text editor.
89
+ What I see: I am still under the Accessories menu. Nothing has changed compared to previous screenshot. Under the open submenu Accessories, I still see 'Text Editor'. The green cross is off from the element.
90
+ Reflection: My last click must have been off. Let's correct this. I will click the correct place, right in the middle of the element.
91
+ Action:
92
+ ```python
93
+ click(241, 441)
94
+ ```<end_code>
95
+
96
+ Step 5:
97
+ Short term goal: I want to type 'Hello World'.
98
+ What I see: I have opened a Notepad. The Notepad app is open on an empty page
99
+ Reflection: Now Notepad is open as intended, time to type text.
100
+ Action:
101
+ ```python
102
+ type_text("Hello World")
103
+ ```<end_code>
104
+
105
+ Step 6:
106
+ Short term goal: I want to type 'Hello World'.
107
+ What I see: The Notepad app displays 'Hello World'
108
+ Reflection: Now that I've 1. Opened the notepad and 2. typed 'Hello World', and 3. the result seems correct, I think the Task is completed. I will return a confirmation that the task is completed.
109
+ Action:
110
+ ```python
111
+ final_answer("Done")
112
+ ```<end_code>
113
+ </task_resolution_example>
114
+
115
+ <general_guidelines>
116
+ Always analyze the latest screenshot carefully before performing actions.
117
+ You can wait for appropriate loading times using the wait() tool. But don't wait forever, sometimes you've just misclicked and the process didn't launch.
118
+ Execute one action at a time: don't try to pack a click and typing in one action.
119
+ On each step, look at the last screenshot and action to validate if previous steps worked and decide the next action. If you repeated an action already without effect, it means that this action is useless: don't repeat it and try something else.
120
+ Use click to move through menus on the desktop and scroll for web and specific applications.
122
+ Desktop menus usually expand with more options: the tiny triangle next to some text in a menu means that the menu expands. For example, 'Office' in the Applications menu expands to show presentation or writing applications.
123
+ NEVER CLICK THE WEB BROWSER ICON TO OPEN THE WEB BROWSER: use open_url directly.
124
+ In the browser, ignore any sign-in popups as long as they don't interfere with the elements you want to interact with.
125
+ </general_guidelines>
126
+ """.replace("<<current_date>>", datetime.now().strftime("%A, %d-%B-%Y"))
127
+
128
+ @spaces.GPU
129
+ def draw_marker_on_image(image_copy, click_coordinates):
130
+ x, y = click_coordinates
131
+ draw = ImageDraw.Draw(image_copy)
132
+ cross_size, linewidth = 10, 3
133
+ # Draw cross
134
+ draw.line((x - cross_size, y, x + cross_size, y), fill="green", width=linewidth)
135
+ draw.line((x, y - cross_size, x, y + cross_size), fill="green", width=linewidth)
136
+ # Add a circle around it for better visibility
137
+ draw.ellipse(
138
+ (
139
+ x - cross_size * 2,
140
+ y - cross_size * 2,
141
+ x + cross_size * 2,
142
+ y + cross_size * 2,
143
+ ),
144
+ outline="green",
145
+ width=linewidth,
146
+ )
147
+ return image_copy
148
+
149
+ @spaces.GPU
150
+ def get_agent_summary_erase_images(agent):
151
+ for memory_step in agent.memory.steps:
152
+ if hasattr(memory_step, "observations_images"):
153
+ memory_step.observations_images = None
154
+ if hasattr(memory_step, "task_images"):
155
+ memory_step.task_images = None
156
+ return agent.write_memory_to_messages()
157
+
158
+ @spaces.GPU
159
+ class E2BVisionAgent(CodeAgent):
160
+ """Agent for e2b desktop automation with Qwen2.5VL vision capabilities"""
161
+
162
+ def __init__(
163
+ self,
164
+ model: HfApiModel,
165
+ data_dir: str,
166
+ desktop: Sandbox,
167
+ tools: List[tool] = None,
168
+ max_steps: int = 200,
169
+ verbosity_level: LogLevel = 2,
170
+ planning_interval: int = None,
171
+ use_v1_prompt: bool = False,
172
+ **kwargs,
173
+ ):
174
+ self.desktop = desktop
175
+ self.data_dir = data_dir
176
+ self.planning_interval = planning_interval
177
+ # Initialize Desktop
178
+ self.width, self.height = self.desktop.get_screen_size()
179
+ print(f"Screen size: {self.width}x{self.height}")
180
+
181
+ # Set up temp directory
182
+ os.makedirs(self.data_dir, exist_ok=True)
183
+ print(f"Screenshots and steps will be saved to: {self.data_dir}")
184
+
185
+ self.use_v1_prompt = use_v1_prompt
186
+ # Initialize base agent
187
+ super().__init__(
188
+ tools=tools or [],
189
+ model=model,
190
+ max_steps=max_steps,
191
+ verbosity_level=verbosity_level,
192
+ planning_interval=self.planning_interval,
193
+ **kwargs,
194
+ )
195
+ self.prompt_templates["system_prompt"] = E2B_SYSTEM_PROMPT_TEMPLATE.replace(
196
+ "<<resolution_x>>", str(self.width)
197
+ ).replace("<<resolution_y>>", str(self.height))
198
+
199
+ # Add screen info to state
200
+ self.state["screen_width"] = self.width
201
+ self.state["screen_height"] = self.height
202
+
203
+ # Add default tools
204
+ self.logger.log("Setting up agent tools...")
205
+ self._setup_desktop_tools()
206
+ self.step_callbacks.append(self.take_screenshot_callback)
207
+
208
+ def _setup_desktop_tools(self):
209
+ """Register all desktop tools"""
210
+
211
+ @tool
212
+ def click(x: int, y: int) -> str:
213
+ """
214
+ Performs a left-click at the specified coordinates
215
+ Args:
216
+ x: The x coordinate (horizontal position)
217
+ y: The y coordinate (vertical position)
218
+ """
219
+ self.desktop.move_mouse(x, y)
220
+ self.desktop.left_click()
221
+ self.click_coordinates = [x, y]
222
+ self.logger.log(f"Clicked at coordinates ({x}, {y})")
223
+ return f"Clicked at coordinates ({x}, {y})"
224
+
225
+ @tool
226
+ def right_click(x: int, y: int) -> str:
227
+ """
228
+ Performs a right-click at the specified coordinates
229
+ Args:
230
+ x: The x coordinate (horizontal position)
231
+ y: The y coordinate (vertical position)
232
+ """
233
+ self.desktop.move_mouse(x, y)
234
+ self.desktop.right_click()
235
+ self.click_coordinates = [x, y]
236
+ self.logger.log(f"Right-clicked at coordinates ({x}, {y})")
237
+ return f"Right-clicked at coordinates ({x}, {y})"
238
+
239
+ @tool
240
+ def double_click(x: int, y: int) -> str:
241
+ """
242
+ Performs a double-click at the specified coordinates
243
+ Args:
244
+ x: The x coordinate (horizontal position)
245
+ y: The y coordinate (vertical position)
246
+ """
247
+ self.desktop.move_mouse(x, y)
248
+ self.desktop.double_click()
249
+ self.click_coordinates = [x, y]
250
+ self.logger.log(f"Double-clicked at coordinates ({x}, {y})")
251
+ return f"Double-clicked at coordinates ({x}, {y})"
252
+
253
+ @tool
254
+ def move_mouse(x: int, y: int) -> str:
255
+ """
256
+ Moves the mouse cursor to the specified coordinates
257
+ Args:
258
+ x: The x coordinate (horizontal position)
259
+ y: The y coordinate (vertical position)
260
+ """
261
+ self.desktop.move_mouse(x, y)
262
+ self.logger.log(f"Moved mouse to coordinates ({x}, {y})")
263
+ return f"Moved mouse to coordinates ({x}, {y})"
264
+
265
+ def normalize_text(text):
266
+ return "".join(
267
+ c
268
+ for c in unicodedata.normalize("NFD", text)
269
+ if not unicodedata.combining(c)
270
+ )
271
+
272
+ @tool
273
+ def type_text(text: str) -> str:
274
+ """
275
+ Types the specified text at the current cursor position.
276
+ Args:
277
+ text: The text to type
278
+ """
279
+ clean_text = normalize_text(text)
280
+ self.desktop.write(clean_text, delay_in_ms=75)
281
+ self.logger.log(f"Typed text: '{clean_text}'")
282
+ return f"Typed text: '{clean_text}'"
283
+
284
+ @tool
285
+ def press_key(key: str) -> str:
286
+ """
287
+ Presses a keyboard key
288
+ Args:
289
+ key: The key to press (e.g. "enter", "space", "backspace", etc.).
290
+ """
291
+ self.desktop.press(key)
292
+ self.logger.log(f"Pressed key: {key}")
293
+ return f"Pressed key: {key}"
294
+
295
+ @tool
296
+ def go_back() -> str:
297
+ """
298
+ Goes back to the previous page in the browser. If using this tool doesn't work, just click the button directly.
299
+ Args:
300
+ """
301
+ self.desktop.press(["alt", "left"])
302
+ self.logger.log("Went back one page")
303
+ return "Went back one page"
304
+
305
+ @tool
306
+ def drag_and_drop(x1: int, y1: int, x2: int, y2: int) -> str:
307
+ """
308
+ Clicks [x1, y1], drags mouse to [x2, y2], then release click.
309
+ Args:
310
+ x1: origin x coordinate
311
+ y1: origin y coordinate
312
+ x2: end x coordinate
313
+ y2: end y coordinate
314
+ """
315
+ self.desktop.drag([x1, y1], [x2, y2])
316
+ message = f"Dragged and dropped from [{x1}, {y1}] to [{x2}, {y2}]"
317
+ self.logger.log(message)
318
+ return message
319
+
320
+ @tool
321
+ def scroll(x: int, y: int, direction: str = "down", amount: int = 2) -> str:
322
+ """
323
+ Moves the mouse to selected coordinates, then uses the scroll button: this could scroll the page or zoom, depending on the app. DO NOT use scroll to move through linux desktop menus.
324
+ Args:
325
+ x: The x coordinate (horizontal position) of the element to scroll/zoom
326
+ y: The y coordinate (vertical position) of the element to scroll/zoom
327
+ direction: The direction to scroll ("up" or "down"), defaults to "down". For zoom, "up" zooms in, "down" zooms out.
328
+ amount: The amount to scroll. A good amount is 1 or 2.
329
+ """
330
+ self.desktop.move_mouse(x, y)
331
+ self.desktop.scroll(direction=direction, amount=amount)
332
+ message = f"Scrolled {direction} by {amount}"
333
+ self.logger.log(message)
334
+ return message
335
+
336
+ @tool
337
+ def wait(seconds: float) -> str:
338
+ """
339
+ Waits for the specified number of seconds. Very useful when the previous action is still executing (for example, starting heavy applications like browsers or office apps).
340
+ Args:
341
+ seconds: Number of seconds to wait, generally 3 is enough.
342
+ """
343
+ time.sleep(seconds)
344
+ self.logger.log(f"Waited for {seconds} seconds")
345
+ return f"Waited for {seconds} seconds"
346
+
347
+ @tool
348
+ def open_url(url: str) -> str:
349
+ """
350
+ Directly opens a browser with the specified url: use this at start of web searches rather than trying to click the browser.
351
+ Args:
352
+ url: The URL to open
353
+ """
354
+ # Make sure URL has http/https prefix
355
+ if not url.startswith(("http://", "https://")):
356
+ url = "https://" + url
357
+
358
+ self.desktop.open(url)
359
+ # Give it time to load
360
+ time.sleep(2)
361
+ self.logger.log(f"Opening URL: {url}")
362
+ return f"Opened URL: {url}"
363
+
364
+ @tool
365
+ def find_on_page_ctrl_f(search_string: str) -> str:
366
+ """
367
+ Scroll the browser viewport to the first occurrence of the search string. This is equivalent to Ctrl+F. Use this to search on a pdf for instance.
368
+ Args:
369
+ search_string: The string to search for on the page.
370
+ """
371
+ self.desktop.press(["ctrl", "f"])
372
+ time.sleep(0.3)
373
+ clean_text = normalize_text(search_string)
374
+ self.desktop.write(clean_text, delay_in_ms=75)
375
+ time.sleep(0.3)
376
+ self.desktop.press("enter")
377
+ time.sleep(0.3)
378
+ self.desktop.press("esc")
379
+ output_message = f"Scrolled to the first occurrence of '{clean_text}'"
380
+ self.logger.log(output_message)
381
+ return output_message
382
+
383
+ # Register the tools
384
+ self.tools["click"] = click
385
+ self.tools["right_click"] = right_click
386
+ self.tools["double_click"] = double_click
387
+ self.tools["move_mouse"] = move_mouse
388
+ self.tools["type_text"] = type_text
389
+ self.tools["press_key"] = press_key
390
+ self.tools["scroll"] = scroll
391
+ self.tools["wait"] = wait
392
+ self.tools["open_url"] = open_url
393
+ self.tools["go_back"] = go_back
394
+ self.tools["drag_and_drop"] = drag_and_drop
395
+ self.tools["find_on_page_ctrl_f"] = find_on_page_ctrl_f
396
+
397
+ def take_screenshot_callback(self, memory_step: ActionStep, agent=None) -> None:
398
+ """Callback that takes a screenshot + memory snapshot after a step completes"""
399
+ self.logger.log("Analyzing screen content...")
400
+
401
+ current_step = memory_step.step_number
402
+
403
+ time.sleep(2.5) # Let things happen on the desktop
404
+ screenshot_bytes = self.desktop.screenshot(format="bytes")
405
+ image = Image.open(BytesIO(screenshot_bytes))
406
+
407
+ # Create a filename with step number
408
+ screenshot_path = os.path.join(self.data_dir, f"step_{current_step:03d}.png")
409
+ image.save(screenshot_path)
410
+
411
+ image_copy = image.copy()
412
+
413
+ if getattr(self, "click_coordinates", None):
414
+ print("DRAWING MARKER")
415
+ image_copy = draw_marker_on_image(image_copy, self.click_coordinates)
416
+
417
+ self.last_marked_screenshot = AgentImage(screenshot_path)
418
+ print(f"Saved screenshot for step {current_step} to {screenshot_path}")
419
+
420
+ for previous_memory_step in (
421
+ (agent or self).memory.steps
422
+ ): # Remove previous screenshots from logs for lean processing
423
+ if (
424
+ isinstance(previous_memory_step, ActionStep)
425
+ and previous_memory_step.step_number <= current_step - 1
426
+ ):
427
+ previous_memory_step.observations_images = None
428
+ elif isinstance(previous_memory_step, TaskStep):
429
+ previous_memory_step.task_images = None
430
+
431
+ if (
432
+ isinstance(previous_memory_step, ActionStep)
433
+ and previous_memory_step.step_number == current_step - 1
434
+ ):
435
+ if (
436
+ previous_memory_step.tool_calls
437
+ and getattr(previous_memory_step.tool_calls[0], "arguments", None)
438
+ and memory_step.tool_calls
439
+ and getattr(memory_step.tool_calls[0], "arguments", None)
440
+ ):
441
+ if (
442
+ previous_memory_step.tool_calls[0].arguments
443
+ == memory_step.tool_calls[0].arguments
444
+ ):
445
+ memory_step.observations = (memory_step.observations or "") + "\nWARNING: You've executed the same action several times in a row. MAKE SURE TO NOT UNNECESSARILY REPEAT ACTIONS."
446
+
447
+ # Add the marker-edited image to the current memory step
448
+ memory_step.observations_images = [image_copy]
449
+
450
+ # memory_step.observations_images = [screenshot_path] # IF YOU USE THIS INSTEAD OF ABOVE, LAUNCHING A SECOND TASK BREAKS
451
+
452
+ self.click_coordinates = None # Reset click marker
453
+
454
+ def close(self):
455
+ """Clean up resources"""
456
+ if self.desktop:
457
+ print("Stopping e2b stream and killing sandbox...")
458
+ self.desktop.stream.stop()
459
+ self.desktop.kill()
460
+ print("E2B sandbox terminated")
461
+
462
+
463
+ class QwenVLAPIModel(Model):
464
+ """Model wrapper for Qwen2.5VL API with fallback mechanism"""
465
+
466
+ def __init__(
467
+ self,
468
+ model_id: str = "Qwen/Qwen2.5-VL-72B-Instruct",
469
+ hf_token: str = None,
470
+ ):
471
+ super().__init__()
472
+ self.model_id = model_id
473
+ self.base_model = HfApiModel(
474
+ model_id="https://n5wr7lfx6wp94tvl.us-east-1.aws.endpoints.huggingface.cloud",
475
+ token=hf_token,
476
+ max_tokens=4096,
477
+ )
478
+ self.fallback_model = HfApiModel(
479
+ model_id="https://ahbeihft09ulicbf.us-east-1.aws.endpoints.huggingface.cloud",
480
+ token=hf_token,
481
+ max_tokens=4096,
482
+ )
483
+
484
+ def __call__(
485
+ self,
486
+ messages: List[Dict[str, Any]],
487
+ stop_sequences: Optional[List[str]] = None,
488
+ **kwargs,
489
+ ) -> ChatMessage:
490
+ try:
491
+ message = self.base_model(messages, stop_sequences, **kwargs)
492
+ return message
493
+ except Exception as e:
494
+ print(f"Base model failed with error: {e}. Calling fallback model.")
495
+ # Continue to fallback
496
+ try:
497
+ message = self.fallback_model(messages, stop_sequences, **kwargs)
498
+ return message
499
+ except Exception as e:
500
+ raise Exception(f"Both endpoints failed. Last error: {e}") from e
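`QwenVLAPIModel.__call__` above tries a primary endpoint and, if that raises, a fallback endpoint. The same idea can be sketched for any ordered list of callables. The helper `call_with_fallback` and the two toy backends below are illustrative assumptions, not part of the commit or of smolagents.

```python
from typing import Any, Callable, Sequence


def call_with_fallback(backends: Sequence[Callable[..., Any]], *args, **kwargs) -> Any:
    """Try each backend in order and return the first successful result."""
    last_error = None
    for backend in backends:
        try:
            return backend(*args, **kwargs)
        except Exception as err:  # broad on purpose: any failure triggers the next backend
            print(f"{getattr(backend, '__name__', backend)} failed: {err}")
            last_error = err
    # Chain the last failure so the traceback shows the underlying cause.
    raise RuntimeError("All backends failed") from last_error


# Toy backends standing in for the primary and fallback endpoints (hypothetical).
def flaky(prompt: str) -> str:
    raise TimeoutError("endpoint unavailable")


def stable(prompt: str) -> str:
    return f"echo: {prompt}"


print(call_with_fallback([flaky, stable], "hello"))
```

Raising with `from last_error` preserves the original exception as `__cause__`, which a formatted error string alone would lose.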
eval_harness.py ADDED
@@ -0,0 +1,366 @@
1
+ """
2
+ eval_harness.py — Enhanced Evaluation Framework
3
+ ================================================
4
+ Supports custom benchmarks, WebArena-style tasks, GAIA-style tasks,
5
+ A/B testing, and LLM-as-a-judge grading.
6
+ """
7
+
8
+ import os
9
+ import json
10
+ import time
11
+ import random
12
+ from concurrent.futures import ThreadPoolExecutor
13
+ from typing import Any, Dict, List, Optional, Callable
14
+ from dataclasses import dataclass, field, asdict
15
+
16
+
17
+ # ---------------------------------------------------------------------------
18
+ # Benchmark Tasks
19
+ # ---------------------------------------------------------------------------
20
+
21
+ @dataclass
22
+ class BenchmarkTask:
23
+ id: str
24
+ category: str
25
+ description: str
26
+ expected_answer: Optional[str] = None
27
+ expected_contains: Optional[List[str]] = None
28
+ max_steps: int = 50
29
+ setup_script: Optional[str] = None # Shell commands to prep the sandbox
30
+ teardown_script: Optional[str] = None
31
+ weight: float = 1.0
32
+
33
+
34
+ DEFAULT_BENCHMARKS: List[BenchmarkTask] = [
35
+ # Web navigation
36
+ BenchmarkTask(
37
+ id="puppies",
38
+ category="web_search",
39
+ description="Find me pictures of cute puppies",
40
+ expected_contains=["puppy", "dog", "image"],
41
+ max_steps=30,
42
+ ),
43
+ BenchmarkTask(
44
+ id="gmaps_hf_hq",
45
+ category="web_navigation",
46
+ description="Use Google Maps to find the Hugging Face HQ in Paris",
47
+ expected_contains=["Paris", "Hugging Face", "5/7"],
48
+ max_steps=40,
49
+ ),
50
+ BenchmarkTask(
51
+ id="wikipedia_april4",
52
+ category="web_research",
53
+ description="Go to Wikipedia and find what happened on April 4th",
54
+ expected_contains=["April", "4"],
55
+ max_steps=30,
56
+ ),
57
+ BenchmarkTask(
58
+ id="commute_bern_basel",
59
+ category="web_navigation",
60
+ description="Find out the travel time by train from Bern to Basel on Google Maps",
61
+ expected_contains=["Bern", "Basel", "hour", "min"],
62
+ max_steps=40,
63
+ ),
64
+ BenchmarkTask(
65
+ id="hf_flux_gpu",
66
+ category="hf_ecosystem",
67
+ description="Go to Hugging Face Spaces and find the Space flux.1 schnell. Use it to generate an image of a GPU",
68
+ expected_contains=["GPU", "image"],
69
+ max_steps=60,
70
+ ),
71
+ BenchmarkTask(
72
+ id="github_trending",
73
+ category="web_research",
74
+ description="Go to GitHub trending and find the top Python repository today",
75
+ expected_contains=["Python", "github.com"],
76
+ max_steps=35,
77
+ ),
78
+ BenchmarkTask(
79
+ id="pdf_extract",
80
+ category="document",
81
+ description="Download a sample PDF from the internet and extract the first paragraph",
82
+ expected_contains=["PDF", "paragraph"],
83
+ max_steps=40,
84
+ ),
85
+ BenchmarkTask(
86
+ id="calc_sum",
87
+ category="code_execution",
88
+ description="Calculate the sum of the first 100 prime numbers using Python",
89
+ expected_answer="24133",
90
+ max_steps=20,
91
+ ),
92
+ BenchmarkTask(
93
+ id="dark_mode_maps",
94
+ category="web_navigation",
95
+ description="Open Google Maps and switch to dark mode if available",
96
+ expected_contains=["dark", "theme"],
97
+ max_steps=30,
98
+ ),
99
+ BenchmarkTask(
100
+ id="hf_model_search",
101
+ category="hf_ecosystem",
102
+ description="Search Hugging Face Hub for 'text-to-video' models and list the top 3 by downloads",
103
+ expected_contains=["text-to-video", "model"],
104
+ max_steps=35,
105
+ ),
106
+ ]
107
+
108
+
109
+ # ---------------------------------------------------------------------------
110
+ # LLM-as-a-Judge
111
+ # ---------------------------------------------------------------------------
112
+
113
+ class LLMJudge:
114
+ """Grades agent outputs using a language model."""
115
+
116
+ def __init__(self, model_call: Callable[[List[Dict[str, Any]]], str]):
117
+ self.model_call = model_call
118
+
119
+ def grade_exact(self, predicted: str, expected: str) -> float:
120
+ return 1.0 if expected.lower().strip() in predicted.lower().strip() else 0.0
121
+
122
+ def grade_contains(self, predicted: str, expected_list: List[str]) -> float:
123
+ if not expected_list:
124
+ return 1.0
125
+ matched = sum(1 for e in expected_list if e.lower() in predicted.lower())
126
+ return matched / len(expected_list)
127
+
128
+ def grade_semantic(
129
+ self,
130
+ task_description: str,
131
+ agent_trace: str,
132
+ predicted: str,
133
+ expected: Optional[str] = None,
134
+ expected_contains: Optional[List[str]] = None,
135
+ ) -> Dict[str, Any]:
136
+ """Use an LLM to judge success on a 0-1 scale."""
137
+ prompt = f"""You are an expert evaluator. A computer agent was given this task:
138
+
139
+ Task: {task_description}
140
+
141
+ The agent's final response / trace summary:
142
+ {predicted[:2000]}
143
+
144
+ Expected answer (if any): {expected or 'N/A'}
145
+ Expected keywords (if any): {expected_contains or 'N/A'}
146
+
147
+ Rate the agent's success on a scale from 0.0 to 1.0, where:
148
+ - 1.0 = fully completed and correct
149
+ - 0.5 = partially correct or incomplete
150
+ - 0.0 = completely wrong or failed
151
+
152
+ Respond ONLY with a JSON object:
153
+ {{"score": float, "reason": "short explanation", "missing": "what was missing"}}
154
+ """
155
+ response = self.model_call([{"role": "user", "content": prompt}])
156
+ content = response.strip()
157
+ if content.startswith("```"):
158
+ content = content.split("```", 2)[1]
159
+ if content.startswith("json"):
160
+ content = content[4:]
161
+ content = content.strip()
162
+ try:
163
+ result = json.loads(content)
164
+ return {
165
+ "score": float(result.get("score", 0.0)),
166
+ "reason": result.get("reason", ""),
167
+ "missing": result.get("missing", ""),
168
+ }
169
+ except (json.JSONDecodeError, ValueError):
170
+ # Fallback heuristic
171
+ score = 0.5 if "success" in predicted.lower() or "done" in predicted.lower() else 0.0
172
+ return {"score": score, "reason": "LLM judge parsing failed, heuristic fallback", "missing": ""}
173
+
174
+
175
+ # ---------------------------------------------------------------------------
176
+ # Evaluation Harness
177
+ # ---------------------------------------------------------------------------
178
+
179
+ @dataclass
180
+ class TaskResult:
181
+ task_id: str
182
+ success: bool
183
+ score: float
184
+ duration_sec: float
185
+ steps_taken: int
186
+ final_output: str
187
+ error: Optional[str] = None
188
+ judge_reason: Optional[str] = None
189
+
190
+
191
+ @dataclass
192
+ class EvalSummary:
193
+ total_tasks: int
194
+ passed: int
195
+ failed: int
196
+ avg_score: float
197
+ avg_duration: float
198
+ by_category: Dict[str, Dict[str, Any]]
199
+ results: List[TaskResult]
200
+ timestamp: float = field(default_factory=time.time)
201
+
202
+
203
+ class EvaluationHarness:
204
+ """Run benchmarks against the agent and produce reports."""
205
+
206
+ def __init__(
207
+ self,
208
+ agent_factory: Callable[[], Any],
209
+ judge_model_call: Optional[Callable] = None,
210
+ output_dir: str = "./eval_results",
211
+ ):
212
+ self.agent_factory = agent_factory
213
+ self.judge = LLMJudge(judge_model_call) if judge_model_call else None
214
+ self.output_dir = output_dir
215
+ os.makedirs(output_dir, exist_ok=True)
216
+
217
+ def run_task(
218
+ self,
219
+ task: BenchmarkTask,
220
+ num_runs: int = 1,
221
+ ) -> List[TaskResult]:
222
+ results = []
223
+ for run_idx in range(num_runs):
224
+ start = time.time()
225
+ agent = self.agent_factory()
226
+ try:
227
+ # Run the agent
228
+ output = agent.run(task.description, max_steps=task.max_steps)
229
+ duration = time.time() - start
230
+
231
+ # Grade
232
+ if self.judge:
233
+ judge_result = self.judge.grade_semantic(
234
+ task.description,
235
+ str(output),
236
+ str(output),
237
+ task.expected_answer,
238
+ task.expected_contains,
239
+ )
240
+ score = judge_result["score"]
241
+ reason = judge_result["reason"]
242
+ else:
243
+ if task.expected_answer:
244
+ score = LLMJudge(None).grade_exact(str(output), task.expected_answer)  # string heuristics need no model
244
+ elif task.expected_contains:
245
+ score = LLMJudge(None).grade_contains(str(output), task.expected_contains)
247
+ else:
248
+ score = 0.5
249
+ reason = "Heuristic grading (no LLM judge)"
250
+
251
+ success = score >= 0.7
252
+ results.append(TaskResult(
253
+ task_id=f"{task.id}_run{run_idx}",
254
+ success=success,
255
+ score=score,
256
+ duration_sec=round(duration, 2),
257
+ steps_taken=getattr(agent, "step_number", 0),
258
+ final_output=str(output)[:2000],
259
+ error=None,
260
+ judge_reason=reason,
261
+ ))
262
+ except Exception as e:
263
+ duration = time.time() - start
264
+ results.append(TaskResult(
265
+ task_id=f"{task.id}_run{run_idx}",
266
+ success=False,
267
+ score=0.0,
268
+ duration_sec=round(duration, 2),
269
+ steps_taken=0,
270
+ final_output="",
271
+ error=str(e),
272
+ judge_reason="Exception during execution",
273
+ ))
274
+ return results
275
+
276
+ def run_suite(
277
+ self,
278
+ tasks: Optional[List[BenchmarkTask]] = None,
279
+ num_runs: int = 1,
280
+ max_parallel: int = 2,
281
+ ) -> EvalSummary:
282
+ tasks = tasks or DEFAULT_BENCHMARKS
283
+ all_results: List[TaskResult] = []
284
+
285
+ def run_single(task):
286
+ return self.run_task(task, num_runs=num_runs)
287
+
288
+ with ThreadPoolExecutor(max_workers=max_parallel) as executor:
289
+ futures = [executor.submit(run_single, t) for t in tasks]
290
+ for future in futures:
291
+ all_results.extend(future.result())
292
+
293
+ # Aggregate
294
+ passed = sum(1 for r in all_results if r.success)
295
+ total = len(all_results)
296
+ avg_score = sum(r.score for r in all_results) / max(total, 1)
297
+ avg_duration = sum(r.duration_sec for r in all_results) / max(total, 1)
298
+
299
+ by_category: Dict[str, Any] = {}
300
+ for r in all_results:
301
+ # Map back to category from task_id prefix
302
+ cat = "unknown"
303
+ for t in tasks:
304
+ if r.task_id.rsplit("_run", 1)[0] == t.id:
305
+ cat = t.category
306
+ break
307
+ by_category.setdefault(cat, {"count": 0, "passed": 0, "avg_score": 0.0, "scores": []})
308
+ by_category[cat]["count"] += 1
309
+ if r.success:
310
+ by_category[cat]["passed"] += 1
311
+ by_category[cat]["scores"].append(r.score)
312
+
313
+ for cat, data in by_category.items():
314
+ data["avg_score"] = round(sum(data["scores"]) / max(len(data["scores"]), 1), 3)
315
+ del data["scores"]
316
+
317
+ summary = EvalSummary(
318
+ total_tasks=total,
319
+ passed=passed,
320
+ failed=total - passed,
321
+ avg_score=round(avg_score, 3),
322
+ avg_duration=round(avg_duration, 2),
323
+ by_category=by_category,
324
+ results=all_results,
325
+ )
326
+
327
+ # Save
328
+ ts = int(time.time())
329
+ path = os.path.join(self.output_dir, f"eval_summary_{ts}.json")
330
+ with open(path, "w") as f:
331
+ json.dump(asdict(summary), f, indent=2, default=str)
332
+ print(f"Evaluation saved to {path}")
333
+ return summary
334
+
335
+ def compare_strategies(
336
+ self,
337
+ strategy_a_factory: Callable[[], Any],
338
+ strategy_b_factory: Callable[[], Any],
339
+ tasks: Optional[List[BenchmarkTask]] = None,
340
+ num_runs: int = 3,
341
+ ) -> Dict[str, Any]:
342
+ """A/B test two agent configurations."""
343
+ print("Running Strategy A...")
344
+ old_factory = self.agent_factory
345
+ self.agent_factory = strategy_a_factory
346
+ results_a = self.run_suite(tasks, num_runs=num_runs, max_parallel=1)
347
+
348
+ print("Running Strategy B...")
349
+ self.agent_factory = strategy_b_factory
350
+ results_b = self.run_suite(tasks, num_runs=num_runs, max_parallel=1)
351
+
352
+ self.agent_factory = old_factory
353
+
354
+ return {
355
+ "strategy_a": {
356
+ "avg_score": results_a.avg_score,
357
+ "pass_rate": results_a.passed / max(results_a.total_tasks, 1),
358
+ "avg_duration": results_a.avg_duration,
359
+ },
360
+ "strategy_b": {
361
+ "avg_score": results_b.avg_score,
362
+ "pass_rate": results_b.passed / max(results_b.total_tasks, 1),
363
+ "avg_duration": results_b.avg_duration,
364
+ },
365
+ "winner": "A" if results_a.avg_score > results_b.avg_score else ("B" if results_b.avg_score > results_a.avg_score else "tie"),
366
+ }
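Reviewer note on `compare_strategies`: the method swaps `self.agent_factory` and restores it only after both suites have returned, so an exception raised inside `run_suite` would leave the harness pointing at the wrong factory. Below is a minimal sketch of one way to make the swap exception-safe. The `swapped_factory` helper and the stand-in `Harness` class are hypothetical and are not part of this commit.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class Harness:
    """Stand-in for the evaluation harness; holds only the attribute we swap."""

    def __init__(self, agent_factory: Callable[[], Any]):
        self.agent_factory = agent_factory


@contextmanager
def swapped_factory(harness: Harness, factory: Callable[[], Any]) -> Iterator[Harness]:
    """Temporarily replace harness.agent_factory and restore it on exit."""
    old_factory = harness.agent_factory
    harness.agent_factory = factory
    try:
        yield harness
    finally:
        # Runs on normal exit and also when the body raises.
        harness.agent_factory = old_factory
```

With a helper like this, each strategy run could be wrapped as `with swapped_factory(self, strategy_a_factory): ...`, which keeps the restore step next to the swap.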
mcp_tools.py ADDED
@@ -0,0 +1,479 @@
1
+ """
2
+ mcp_tools.py — MCP Bridge for Enhanced Computer Control
3
+ =======================================================
4
+ Playwright Browser MCP + Code Execution + FileSystem + HF Hub MCP
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import time
10
+ import base64
11
+ import tempfile
12
+ from typing import Any, Dict, List, Optional, Tuple
13
+ from dataclasses import dataclass
14
+ from io import BytesIO
15
+
16
+ from PIL import Image
17
+
18
+ # Smolagents tool decorator
19
+ from smolagents import tool
20
+
21
+ # Playwright
22
+ try:
23
+ from playwright.sync_api import sync_playwright, Page, Browser, BrowserContext
24
+ HAS_PLAYWRIGHT = True
25
+ except ImportError:
26
+ HAS_PLAYWRIGHT = False
27
+ sync_playwright = None
28
+ Page = Browser = BrowserContext = Any
29
+
30
+ # E2B code execution
31
+ try:
32
+ from e2b_code_interpreter import Sandbox as CodeSandbox
33
+ HAS_E2B_CODE = True
34
+ except ImportError:
35
+ HAS_E2B_CODE = False
36
+ CodeSandbox = Any
37
+
38
+
39
+ # ---------------------------------------------------------------------------
40
+ # Playwright Browser MCP
41
+ # ---------------------------------------------------------------------------
42
+
43
+ class BrowserMCP:
44
+ """High-level browser automation via Playwright.
45
+ Replaces raw coordinate clicking with semantic selectors.
46
+ """
47
+
48
+ def __init__(self, headless: bool = True, browser_type: str = "chromium"):
49
+ self.headless = headless
50
+ self.browser_type = browser_type
51
+ self._playwright = None
52
+ self._browser: Optional[Browser] = None
53
+ self._context: Optional[BrowserContext] = None
54
+ self._page: Optional[Page] = None
55
+ self._closed = True
56
+
57
+ def start(self):
58
+ if not HAS_PLAYWRIGHT:
59
+ raise RuntimeError("Playwright not installed. Run: pip install playwright && playwright install chromium")
60
+ self._playwright = sync_playwright().start()
61
+ browser_cls = getattr(self._playwright, self.browser_type)
62
+ self._browser = browser_cls.launch(headless=self.headless)
63
+ self._context = self._browser.new_context(
64
+ viewport={"width": 1280, "height": 720},
65
+ user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
66
+ )
67
+ self._page = self._context.new_page()
68
+ self._closed = False
69
+ return self._page
70
+
71
+ def close(self):
72
+ if self._context:
73
+ self._context.close()
74
+ if self._browser:
75
+ self._browser.close()
76
+ if self._playwright:
77
+ self._playwright.stop()
78
+ self._closed = True
79
+
80
+ def ensure_page(self) -> Page:
81
+ if self._closed or self._page is None:
82
+ self.start()
83
+ return self._page
84
+
85
+ def goto(self, url: str, wait_until: str = "networkidle") -> str:
86
+ page = self.ensure_page()
87
+ if not url.startswith(("http://", "https://")):
88
+ url = "https://" + url
89
+ page.goto(url, wait_until=wait_until, timeout=30000)
90
+ return f"Navigated to {url}"
91
+
92
+ def click(self, selector: str, by: str = "css") -> str:
93
+ page = self.ensure_page()
94
+ if by == "text":
95
+ page.get_by_text(selector).first.click()
96
+ elif by == "role":
97
+ role, name = selector.split("::", 1)
98
+ page.get_by_role(role.strip(), name=name.strip()).first.click()
99
+ else:
100
+ page.locator(selector).first.click()
101
+ return f"Clicked element: {selector}"
102
+
103
+ def fill(self, selector: str, text: str, by: str = "css") -> str:
104
+ page = self.ensure_page()
105
+ if by == "text":
106
+ el = page.get_by_text(selector).first
107
+ elif by == "role":
108
+ role, name = selector.split("::", 1)
109
+ el = page.get_by_role(role.strip(), name=name.strip()).first
110
+ else:
111
+ el = page.locator(selector).first
112
+ el.fill(text)
113
+ return f"Filled '{selector}' with '{text}'"
114
+
115
+ def press(self, key: str) -> str:
116
+ page = self.ensure_page()
117
+ page.keyboard.press(key)
118
+ return f"Pressed key: {key}"
119
+
120
+ def scroll(self, direction: str = "down", amount: int = 500) -> str:
121
+ page = self.ensure_page()
122
+ if direction == "down":
123
+ page.mouse.wheel(0, amount)
124
+ else:
125
+ page.mouse.wheel(0, -amount)
126
+ return f"Scrolled {direction} by {amount}"
127
+
128
+ def get_text(self, selector: str = "body") -> str:
129
+ page = self.ensure_page()
130
+ return page.locator(selector).first.inner_text()
131
+
132
+ def get_html(self) -> str:
133
+ page = self.ensure_page()
134
+ return page.content()
135
+
136
+ def screenshot(self, path: Optional[str] = None) -> str:
137
+ page = self.ensure_page()
138
+ if path:
139
+ page.screenshot(path=path, full_page=True)
140
+ return f"Screenshot saved to {path}"
141
+ else:
142
+ buf = page.screenshot(full_page=True)
143
+ return base64.b64encode(buf).decode("utf-8")
144
+
145
+ def find_and_click(self, text: str) -> str:
146
+ """Semantic find-and-click by visible text."""
147
+ page = self.ensure_page()
148
+ page.get_by_text(text).first.click()
149
+ return f"Found and clicked text: {text}"
150
+
151
+ def search_on_page(self, query: str) -> str:
152
+ page = self.ensure_page()
153
+ page.keyboard.press("Control+f")
154
+ page.keyboard.insert_text(query)
155
+ page.keyboard.press("Enter")
156
+ time.sleep(0.5)
157
+ page.keyboard.press("Escape")
158
+ return f"Searched for '{query}' on page"
159
+
160
+ def download_file(self, url: str, save_path: str) -> str:
161
+ page = self.ensure_page()
162
+ with page.expect_download() as dl_info:
163
+ page.goto(url)
164
+ dl = dl_info.value
165
+ dl.save_as(save_path)
166
+ return f"Downloaded to {save_path}"
167
+
168
+ def extract_links(self) -> List[Dict[str, str]]:
169
+ page = self.ensure_page()
170
+ links = page.eval_on_selector_all("a", """elements => elements.map(a => ({href: a.href, text: a.innerText.trim()}))""")
171
+ return links
172
+
173
+ def extract_tables(self) -> List[List[List[str]]]:
174
+ page = self.ensure_page()
175
+ tables = page.eval_on_selector_all("table", """
176
+ tables => tables.map(t => {
177
+ return Array.from(t.querySelectorAll('tr')).map(row =>
178
+ Array.from(row.querySelectorAll('td, th')).map(cell => cell.innerText.trim())
179
+ );
180
+ })
181
+ """)
182
+ return tables
183
+
184
+ def evaluate_js(self, script: str) -> Any:
185
+ page = self.ensure_page()
186
+ return page.evaluate(script)
187
+
188
+
189
+ # ---------------------------------------------------------------------------
190
+ # Tool factory for smolagents integration
191
+ # ---------------------------------------------------------------------------
192
+
193
+ def make_browser_tools(browser_mcp: BrowserMCP) -> Dict[str, Any]:
194
+ """Generate smolagents @tool functions from BrowserMCP."""
195
+
196
+ @tool
197
+ def browser_goto(url: str) -> str:
198
+ """Navigate the browser to a URL. Prefer this over clicking browser icons."""
199
+ return browser_mcp.goto(url)
200
+
201
+ @tool
202
+ def browser_click(selector: str, by: str = "css") -> str:
203
+ """Click an element by CSS selector, text content, or ARIA role.
204
+ by can be 'css', 'text', or 'role' (role::name format)."""
205
+ return browser_mcp.click(selector, by)
206
+
207
+ @tool
208
+ def browser_fill(selector: str, text: str, by: str = "css") -> str:
209
+ """Fill a form field with text."""
210
+ return browser_mcp.fill(selector, text, by)
211
+
212
+ @tool
213
+ def browser_press_key(key: str) -> str:
214
+ """Press a keyboard key (e.g., 'Enter', 'Tab', 'Escape')."""
215
+ return browser_mcp.press(key)
216
+
217
+ @tool
218
+ def browser_scroll(direction: str = "down", amount: int = 500) -> str:
219
+ """Scroll the page up or down."""
220
+ return browser_mcp.scroll(direction, amount)
221
+
222
+ @tool
223
+ def browser_get_text(selector: str = "body") -> str:
224
+ """Extract text content from a page element."""
225
+ return browser_mcp.get_text(selector)
226
+
227
+ @tool
228
+ def browser_find_and_click(text: str) -> str:
229
+ """Find an element by its visible text and click it."""
230
+ return browser_mcp.find_and_click(text)
231
+
232
+ @tool
233
+ def browser_screenshot(path: str = "") -> str:
234
+ """Take a screenshot of the current page. If path is empty, returns base64."""
235
+ return browser_mcp.screenshot(path or None)
236
+
237
+ @tool
238
+ def browser_extract_links() -> str:
239
+ """Extract all links from the current page as JSON."""
240
+ links = browser_mcp.extract_links()
241
+ return json.dumps(links[:50], indent=2) # Limit to 50
242
+
243
+ @tool
244
+ def browser_extract_tables() -> str:
245
+ """Extract all tables from the current page as JSON."""
246
+ tables = browser_mcp.extract_tables()
247
+ return json.dumps(tables[:5], indent=2)
248
+
249
+ @tool
250
+ def browser_evaluate_js(script: str) -> str:
251
+ """Execute JavaScript in the browser context and return the result."""
252
+ result = browser_mcp.evaluate_js(script)
253
+ return json.dumps(result, default=str)
254
+
255
+ return {
256
+ "browser_goto": browser_goto,
257
+ "browser_click": browser_click,
258
+ "browser_fill": browser_fill,
259
+ "browser_press_key": browser_press_key,
260
+ "browser_scroll": browser_scroll,
261
+ "browser_get_text": browser_get_text,
262
+ "browser_find_and_click": browser_find_and_click,
263
+ "browser_screenshot": browser_screenshot,
264
+ "browser_extract_links": browser_extract_links,
265
+ "browser_extract_tables": browser_extract_tables,
266
+ "browser_evaluate_js": browser_evaluate_js,
267
+ }
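Reviewer note on role selectors: `click` and `fill` expect role selectors in `role::name` form and unpack the result of a split on `::`, so a selector without the separator fails with a generic unpacking error. Below is a minimal sketch of a parser that reports the problem explicitly. The name `parse_role_selector` is hypothetical and does not appear in the commit.

```python
from typing import Tuple


def parse_role_selector(selector: str) -> Tuple[str, str]:
    """Split 'role::name' into (role, name), splitting at the first '::' only."""
    role, separator, name = selector.partition("::")
    if not separator or not role.strip():
        raise ValueError(f"Expected 'role::name', got {selector!r}")
    return role.strip(), name.strip()
```

The returned pair could then be passed on as `page.get_by_role(role, name=name)`, matching how the existing methods call Playwright.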
268
+
269
+
270
+ # ---------------------------------------------------------------------------
271
+ # Code Execution MCP (E2B Code Interpreter)
272
+ # ---------------------------------------------------------------------------
273
+
274
+ class CodeExecutionMCP:
275
+ """Sandboxed Python/JS code execution via E2B."""
276
+
277
+ def __init__(self, api_key: Optional[str] = None):
278
+ self.api_key = api_key or os.getenv("E2B_API_KEY")
279
+ self._sandbox: Optional[Any] = None
280
+
281
+ def _get_sandbox(self):
282
+ if not HAS_E2B_CODE:
283
+ raise RuntimeError("e2b_code_interpreter not installed")
284
+ if self._sandbox is None:
285
+ self._sandbox = CodeSandbox(api_key=self.api_key)
286
+ return self._sandbox
287
+
288
+ def run_python(self, code: str, timeout: int = 30) -> Dict[str, Any]:
289
+ sb = self._get_sandbox()
290
+ execution = sb.run_code(code, timeout=timeout)
291
+ return {
292
+ "stdout": execution.logs.stdout,
293
+ "stderr": execution.logs.stderr,
294
+ "results": [str(r) for r in execution.results],
295
+ "error": execution.error,
296
+ }
297
+
298
+ def run_shell(self, command: str, timeout: int = 30) -> Dict[str, Any]:
299
+ sb = self._get_sandbox()
300
+ execution = sb.run_code(f"!{command}", timeout=timeout)
301
+ return {
302
+ "stdout": execution.logs.stdout,
303
+ "stderr": execution.logs.stderr,
304
+ "error": execution.error,
305
+ }
306
+
307
+ def install_package(self, package: str) -> str:
308
+ result = self.run_shell(f"pip install {package}")
309
+ return f"Installed {package}: {''.join(result['stdout'])[:500]}"
310
+
311
+ def close(self):
312
+ if self._sandbox:
313
+ self._sandbox.kill()
314
+ self._sandbox = None
315
+
316
+
317
+ def make_code_tools(code_mcp: CodeExecutionMCP) -> Dict[str, Any]:
318
+
319
+ @tool
320
+ def execute_python(code: str) -> str:
321
+ """Execute Python code in a sandboxed environment. Use for data processing, calculations, or parsing."""
322
+ result = code_mcp.run_python(code)
323
+ if result["error"]:
324
+ return f"Error: {result['error']}\nStderr: {result['stderr']}"
325
+ out = "\n".join(result["stdout"])
326
+ if result["results"]:
327
+ out += f"\nResults: {result['results']}"
328
+ return out[:3000]
329
+
330
+ @tool
331
+ def execute_shell(command: str) -> str:
332
+ """Execute a shell command in the sandbox."""
333
+ result = code_mcp.run_shell(command)
334
+ if result["error"]:
335
+ return f"Error: {result['error']}"
336
+ return "\n".join(result["stdout"])[:3000]
337
+
338
+ @tool
339
+ def install_python_package(package: str) -> str:
340
+ """Install a Python package in the sandbox."""
341
+ return code_mcp.install_package(package)
342
+
343
+ return {
344
+ "execute_python": execute_python,
345
+ "execute_shell": execute_shell,
346
+ "install_python_package": install_python_package,
347
+ }
348
+
349
+
350
+ # ---------------------------------------------------------------------------
351
+ # FileSystem MCP (Local + E2B)
352
+ # ---------------------------------------------------------------------------
353
+
354
+ class FileSystemMCP:
355
+ """Read/write files either locally or in the E2B sandbox."""
356
+
357
+ def __init__(self, base_dir: str = "./workspace"):
358
+ self.base_dir = os.path.abspath(base_dir)
359
+ os.makedirs(self.base_dir, exist_ok=True)
360
+
361
+ def _safe_path(self, path: str) -> str:
362
+ abs_path = os.path.abspath(os.path.join(self.base_dir, path))
363
+ if os.path.commonpath([self.base_dir, abs_path]) != self.base_dir:
364
+ raise ValueError("Path traversal attempt detected")
365
+ return abs_path
366
+
367
+ def read_file(self, path: str) -> str:
368
+ sp = self._safe_path(path)
369
+ with open(sp, "r", encoding="utf-8", errors="ignore") as f:
370
+ return f.read()
371
+
372
+ def write_file(self, path: str, content: str) -> str:
373
+ sp = self._safe_path(path)
374
+ os.makedirs(os.path.dirname(sp), exist_ok=True)
375
+ with open(sp, "w", encoding="utf-8") as f:
376
+ f.write(content)
377
+ return f"Wrote {len(content)} chars to {path}"
378
+
379
+ def list_dir(self, path: str = ".") -> List[str]:
380
+ sp = self._safe_path(path)
381
+ return os.listdir(sp)
382
+
383
+ def read_image(self, path: str) -> Image.Image:
384
+ sp = self._safe_path(path)
385
+ return Image.open(sp)
386
+
387
+
388
+ def make_fs_tools(fs_mcp: FileSystemMCP) -> Dict[str, Any]:
389
+
390
+ @tool
391
+ def fs_read(path: str) -> str:
392
+ """Read a text file from the workspace."""
393
+ return fs_mcp.read_file(path)
394
+
395
+ @tool
396
+ def fs_write(path: str, content: str) -> str:
397
+ """Write text content to a file in the workspace."""
398
+ return fs_mcp.write_file(path, content)
399
+
400
+ @tool
401
+ def fs_list(path: str = ".") -> str:
402
+ """List files in a workspace directory."""
403
+ return json.dumps(fs_mcp.list_dir(path))
404
+
405
+ return {
406
+ "fs_read": fs_read,
407
+ "fs_write": fs_write,
408
+ "fs_list": fs_list,
409
+ }
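Reviewer note on workspace path safety: a plain string-prefix comparison on paths would accept a sibling directory such as `workspace_evil`, because that string begins with `workspace`. Below is a minimal sketch of a containment check built on `os.path.commonpath`. The helper name `resolve_inside` is hypothetical, and the use of `os.path.realpath` to also resolve symlinks is an added assumption rather than something the commit does.

```python
import os


def resolve_inside(base_dir: str, relative_path: str) -> str:
    """Return the absolute path for relative_path, refusing anything outside base_dir."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, relative_path))
    # commonpath compares whole path components, so '/a/workspace_evil'
    # is not treated as being inside '/a/workspace'.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"Path escapes the workspace: {relative_path!r}")
    return candidate
```

Absolute inputs such as `/etc/passwd` are rejected as well, because `os.path.join` discards the base when the second argument is absolute and the common path then differs from the base.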
410
+
411
+
412
+ # ---------------------------------------------------------------------------
413
+ # HF Hub MCP (Hugging Face ecosystem integration)
414
+ # ---------------------------------------------------------------------------
415
+
416
+ class HFHubMCP:
417
+ """Interact with the Hugging Face Hub from within the agent."""
418
+
419
+ def __init__(self, token: Optional[str] = None):
420
+ self.token = token or os.getenv("HF_TOKEN")
421
+ from huggingface_hub import HfApi, upload_file, create_repo
422
+ self.api = HfApi(token=self.token)
423
+ self._upload_file = upload_file
424
+ self._create_repo = create_repo
425
+
426
+ def search_models(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
427
+ models = self.api.list_models(search=query, limit=limit, sort="downloads")
428
+ return [{"id": m.id, "downloads": m.downloads, "tags": m.tags} for m in models]
429
+
430
+ def search_datasets(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
431
+ datasets = self.api.list_datasets(search=query, limit=limit)
432
+ return [{"id": d.id, "tags": d.tags} for d in datasets]
433
+
434
+ def search_spaces(self, query: str, limit: int = 10) -> List[Dict[str, Any]]:
435
+ spaces = self.api.list_spaces(search=query, limit=limit)
436
+ return [{"id": s.id, "sdk": getattr(s, "sdk", "unknown")} for s in spaces]
437
+
438
+ def upload_to_dataset(self, repo_id: str, file_path: str, path_in_repo: str) -> str:
439
+ self._upload_file(
440
+ path_or_fileobj=file_path,
441
+ path_in_repo=path_in_repo,
442
+ repo_id=repo_id,
443
+ repo_type="dataset",
444
+ token=self.token,
445
+ )
446
+ return f"Uploaded {file_path} to {repo_id}/{path_in_repo}"
447
+
448
+
449
+ def make_hf_tools(hf_mcp: HFHubMCP) -> Dict[str, Any]:
450
+
451
+ @tool
452
+ def hf_search_models(query: str, limit: int = 10) -> str:
453
+ """Search Hugging Face Hub for models."""
454
+ results = hf_mcp.search_models(query, limit)
455
+ return json.dumps(results, indent=2)
456
+
457
+ @tool
458
+ def hf_search_datasets(query: str, limit: int = 10) -> str:
459
+ """Search Hugging Face Hub for datasets."""
460
+ results = hf_mcp.search_datasets(query, limit)
461
+ return json.dumps(results, indent=2)
462
+
463
+ @tool
464
+ def hf_search_spaces(query: str, limit: int = 10) -> str:
465
+ """Search Hugging Face Hub for Spaces."""
466
+ results = hf_mcp.search_spaces(query, limit)
467
+ return json.dumps(results, indent=2)
468
+
469
+ @tool
470
+ def hf_upload_dataset_file(repo_id: str, file_path: str, path_in_repo: str) -> str:
471
+ """Upload a file to a Hugging Face dataset repository."""
472
+ return hf_mcp.upload_to_dataset(repo_id, file_path, path_in_repo)
473
+
474
+ return {
475
+ "hf_search_models": hf_search_models,
476
+ "hf_search_datasets": hf_search_datasets,
477
+ "hf_search_spaces": hf_search_spaces,
478
+ "hf_upload_dataset_file": hf_upload_dataset_file,
479
+ }
requirements.txt ADDED
@@ -0,0 +1,25 @@
1
+ smolagents==1.14.0
2
+ e2b_desktop==1.6.5
3
+ Pillow
4
+ huggingface_hub
5
+ openai
6
+ gradio_modal
7
+ spaces
8
+ python-dotenv
9
+
10
+ # Enhanced stack
11
+ playwright>=1.40.0
12
+ chromadb>=0.6.0
13
+ sentence-transformers>=3.0.0
14
+ numpy>=1.26.0
15
+ tiktoken>=0.7.0
16
+ pydantic>=2.0.0
17
+ aiohttp>=3.9.0
18
+ httpx>=0.27.0
19
+ soundfile>=0.12.0
20
+
21
+ # Voice
22
+ faster-whisper>=1.0.0
23
+
24
+ # Optional: E2B code interpreter (falls back gracefully if missing)
25
+ e2b_code_interpreter>=1.0.0
templates/viewer.html ADDED
@@ -0,0 +1,753 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Computer Agent Evaluation Viewer</title>
7
+ <style>
8
+ /* CSS styles here */
9
+ body {
10
+ font-family: Arial, sans-serif;
11
+ margin: 0;
12
+ padding: 20px;
13
+ background-color: #f5f5f5;
14
+ }
15
+ .container {
16
+ max-width: 1200px;
17
+ margin: 0 auto;
18
+ background-color: #fff;
19
+ padding: 20px;
20
+ border-radius: 8px;
21
+ box-shadow: 0 2px 10px rgba(0,0,0,0.1);
22
+ }
23
+ h1, h2, h3 {
24
+ color: #333;
25
+ }
26
+ select, input, button {
27
+ padding: 8px 12px;
28
+ margin: 5px 0;
29
+ border: 1px solid #ddd;
30
+ border-radius: 4px;
31
+ }
32
+ button {
33
+ background-color: #4a6cf7;
34
+ color: white;
35
+ cursor: pointer;
36
+ border: none;
37
+ }
38
+ button:hover {
39
+ background-color: #3a5ce5;
40
+ }
41
+ button:disabled {
42
+ background-color: #cccccc;
43
+ cursor: not-allowed;
44
+ }
45
+ .row {
46
+ display: flex;
47
+ margin-bottom: 20px;
48
+ }
49
+ .col {
50
+ flex: 1;
51
+ padding: 0 10px;
52
+ }
53
+ .image-viewer {
54
+ width: 100%;
55
+ max-height: 500px;
56
+ border: 1px solid #ddd;
57
+ border-radius: 4px;
58
+ overflow: hidden;
59
+ margin-bottom: 10px;
60
+ position: relative;
61
+ }
62
+ .image-viewer img {
63
+ max-width: 100%;
64
+ max-height: 450px;
65
+ display: block;
66
+ margin: 0 auto;
67
+ }
68
+ .image-controls {
69
+ display: flex;
70
+ justify-content: space-between;
71
+ align-items: center;
72
+ margin-top: 10px;
73
+ }
74
+ .nav-buttons {
75
+ display: flex;
76
+ gap: 10px;
77
+ }
78
+ .step {
79
+ border: 1px solid #ddd;
80
+ border-radius: 4px;
81
+ margin-bottom: 10px;
82
+ overflow: hidden;
83
+ }
84
+ .step-header {
85
+ background-color: #f0f0f0;
86
+ padding: 10px;
87
+ font-weight: bold;
88
+ cursor: pointer;
89
+ display: flex;
90
+ justify-content: space-between;
91
+ }
92
+ .step-content {
93
+ padding: 15px;
94
+ white-space: pre-wrap;
95
+ font-family: monospace;
96
+ background-color: #f9f9f9;
97
+ max-height: 300px;
98
+ overflow-y: auto;
99
+ }
100
+ .hidden {
101
+ display: none;
102
+ }
103
+ .status-success {
104
+ color: #22c55e;
105
+ font-weight: bold;
106
+ }
107
+ .status-failure {
108
+ color: #ef4444;
109
+ font-weight: bold;
110
+ }
111
+ .tabs {
112
+ display: flex;
113
+ border-bottom: 1px solid #ddd;
114
+ margin-bottom: 20px;
115
+ }
116
+ .tab {
117
+ padding: 10px 20px;
118
+ cursor: pointer;
119
+ border-bottom: 2px solid transparent;
120
+ }
121
+ .tab.active {
122
+ border-bottom-color: #4a6cf7;
123
+ font-weight: bold;
124
+ }
125
+ .tab-content {
126
+ display: none;
127
+ }
128
+ .tab-content.active {
129
+ display: block;
130
+ }
131
+ pre {
132
+ background-color: #f0f0f0;
133
+ padding: 10px;
134
+ border-radius: 4px;
135
+ overflow-x: auto;
136
+ white-space: pre-wrap;
137
+ }
138
+ .error-message {
139
+ background-color: #fee2e2;
140
+ color: #b91c1c;
141
+ padding: 10px;
142
+ border-radius: 4px;
143
+ margin: 10px 0;
144
+ }
145
+ .loading {
146
+ display: inline-block;
147
+ width: 20px;
148
+ height: 20px;
149
+ border: 2px solid #f3f3f3;
150
+ border-top: 2px solid #3498db;
151
+ border-radius: 50%;
152
+ animation: spin 1s linear infinite;
153
+ margin-left: 10px;
154
+ }
155
+ @keyframes spin {
156
+ 0% { transform: rotate(0deg); }
157
+ 100% { transform: rotate(360deg); }
158
+ }
159
+ </style>
160
+ </head>
161
+ <body>
162
+ <div class="container">
163
+ <h1>Computer Agent Evaluation Viewer</h1>
164
+
165
+ <!-- Path and Eval Selection -->
166
+ <div style="margin-bottom: 20px; padding: 15px; background-color: #f0f0f0; border-radius: 8px;">
167
+ <h2>Load Evaluation Data</h2>
168
+ <div style="display: flex; gap: 10px; margin-top: 10px;">
169
+ <input type="text" id="base-path" placeholder="Base directory path (leave empty for default)"
170
+ style="flex-grow: 1; padding: 8px; border: 1px solid #ddd; border-radius: 4px;">
171
+ <button id="refresh-evals-btn">Refresh</button>
172
+ </div>
173
+ <div style="margin-top: 10px;">
174
+ <label for="eval-select">Select Evaluation:</label>
175
+ <select id="eval-select" style="min-width: 300px;"></select>
176
+ </div>
177
+ <div id="load-status" style="margin-top: 10px; font-style: italic;"></div>
178
+ </div>
179
+
180
+ <!-- Example and Run Selectors -->
181
+ <div class="row">
182
+ <div class="col">
183
+ <label for="example-select">Select Example:</label>
184
+ <select id="example-select">
185
+ <option value="">-- Select Example --</option>
186
+ </select>
187
+ </div>
188
+ <div class="col">
189
+ <label for="run-select">Select Run:</label>
190
+ <select id="run-select" disabled>
191
+ <option value="">-- Select Run --</option>
192
+ </select>
193
+ </div>
194
+ </div>
195
+
196
+ <!-- Task & Status Display -->
197
+ <div id="run-details" class="hidden">
198
+ <div>
199
+ <h2>Task</h2>
200
+ <pre id="task-text"></pre>
201
+ </div>
202
+
203
+ <div>
204
+ <h2>Run Status</h2>
205
+ <div id="status-display"></div>
206
+ </div>
207
+
208
+ <!-- Tabs -->
209
+ <div class="tabs">
210
+ <div class="tab active" data-tab="screenshots">Screenshots</div>
211
+ <div class="tab" data-tab="agent-trace">Agent Trace</div>
212
+ <div class="tab" data-tab="raw-json">Raw JSON</div>
213
+ </div>
214
+
215
+ <!-- Screenshots Tab -->
216
+ <div id="screenshots-tab" class="tab-content active">
217
+ <div id="no-images" class="hidden">
218
+ <p>No screenshots available for this run.</p>
219
+ </div>
220
+ <div id="image-container" class="image-viewer hidden">
221
+ <img id="current-image" src="" alt="Screenshot">
222
+ <p id="image-caption" class="text-center"></p>
223
+ </div>
224
+ <div class="image-controls hidden" id="image-controls">
225
+ <div class="nav-buttons">
226
+ <button id="prev-image">Previous</button>
227
+ <span id="image-counter">0 / 0</span>
228
+ <button id="next-image">Next</button>
229
+ </div>
230
+ <input type="range" id="image-slider" min="0" max="0" value="0" style="width: 100%">
231
+ </div>
232
+ </div>
233
+
234
+ <!-- Agent Trace Tab -->
235
+ <div id="agent-trace-tab" class="tab-content">
236
+ <div id="agent-steps"></div>
237
+ </div>
238
+
239
+ <!-- Raw JSON Tab -->
240
+ <div id="raw-json-tab" class="tab-content">
241
+ <div id="json-loading-indicator" class="hidden">
242
+ <p>Loading metadata... <span class="loading"></span></p>
243
+ </div>
244
+ <div id="json-error" class="error-message hidden"></div>
245
+ <pre id="raw-json"></pre>
246
+ </div>
247
+ </div>
248
+ </div>
249
+
250
+ <script>
251
+ // Application state
252
+ const appState = {
253
+ basePath: '',
254
+ evalId: null,
255
+ currentExampleId: null,
256
+ currentRunId: null,
257
+ currentImages: [],
258
+ currentImageIndex: 0,
259
+ loadedData: {
260
+ examples: {},
261
+ runs: {},
262
+ metadata: {},
263
+ screenshots: {}
264
+ }
265
+ };
266
+
267
+ // DOM elements
268
+ const basePathInput = document.getElementById('base-path');
269
+ const refreshEvalsBtn = document.getElementById('refresh-evals-btn');
270
+ const evalSelect = document.getElementById('eval-select');
271
+ const loadStatusDisplay = document.getElementById('load-status');
272
+ const exampleSelect = document.getElementById('example-select');
273
+ const runSelect = document.getElementById('run-select');
274
+ const runDetails = document.getElementById('run-details');
275
+ const taskText = document.getElementById('task-text');
276
+ const statusDisplay = document.getElementById('status-display');
277
+ const imageContainer = document.getElementById('image-container');
278
+ const noImages = document.getElementById('no-images');
279
+ const imageControls = document.getElementById('image-controls');
280
+ const currentImage = document.getElementById('current-image');
281
+ const imageCaption = document.getElementById('image-caption');
282
+ const imageCounter = document.getElementById('image-counter');
283
+ const imageSlider = document.getElementById('image-slider');
284
+ const prevImage = document.getElementById('prev-image');
285
+ const nextImage = document.getElementById('next-image');
286
+ const agentSteps = document.getElementById('agent-steps');
287
+ const rawJson = document.getElementById('raw-json');
288
+ const jsonLoadingIndicator = document.getElementById('json-loading-indicator');
289
+ const jsonError = document.getElementById('json-error');
290
+
291
+ // Initialize by loading available evaluations
292
+ refreshEvalsBtn.addEventListener('click', loadEvaluations);
293
+
294
+ // Load evaluations from server
295
+ async function loadEvaluations() {
296
+ appState.basePath = basePathInput.value.trim();
297
+ loadStatusDisplay.textContent = 'Loading evaluations...';
298
+ refreshEvalsBtn.disabled = true;
299
+
300
+ try {
301
+ const response = await fetch(`/api/evals?path=${encodeURIComponent(appState.basePath)}`);
302
+ if (!response.ok) {
303
+ const errorData = await response.json();
304
+ throw new Error(errorData.error || 'Failed to load evaluations');
305
+ }
306
+
307
+ const evals = await response.json();
308
+
309
+ // Clear existing options
310
+ evalSelect.innerHTML = '<option value="">-- Select Evaluation --</option>';
311
+
312
+ // Add new options
313
+ evals.forEach(evalId => {
314
+ const option = document.createElement('option');
315
+ option.value = evalId;
316
+ option.textContent = evalId;
317
+ evalSelect.appendChild(option);
318
+ });
319
+
320
+ loadStatusDisplay.textContent = `Loaded ${evals.length} evaluations`;
321
+
322
+ // AUTO-SELECT LATEST EVALUATION
323
+ if (evals.length > 0) {
324
+ // Sort evaluations to get the latest one
325
+ evals.sort().reverse();
326
+ evalSelect.value = evals[0];
327
+ // Trigger change event to load examples
328
+ evalSelect.dispatchEvent(new Event('change'));
329
+ }
330
+ } catch (err) {
331
+ console.error('Error loading evaluations:', err);
332
+ loadStatusDisplay.textContent = `Error: ${err.message}`;
333
+ } finally {
334
+ refreshEvalsBtn.disabled = false;
335
+ }
336
+ }
337
+
338
+ // Handle evaluation selection
339
+ evalSelect.addEventListener('change', async () => {
340
+ appState.evalId = evalSelect.value;
341
+
342
+ if (!appState.evalId) {
343
+ exampleSelect.innerHTML = '<option value="">-- Select Example --</option>';
344
+ exampleSelect.disabled = true;
345
+ runSelect.innerHTML = '<option value="">-- Select Run --</option>';
346
+ runSelect.disabled = true;
347
+ runDetails.classList.add('hidden');
348
+ return;
349
+ }
350
+
351
+ try {
352
+ loadStatusDisplay.textContent = 'Loading examples...';
353
+ evalSelect.disabled = true;
354
+
355
+ const response = await fetch(`/api/eval/${encodeURIComponent(appState.evalId)}/examples?path=${encodeURIComponent(appState.basePath)}`);
356
+ if (!response.ok) {
357
+ const errorData = await response.json();
358
+ throw new Error(errorData.error || 'Failed to load examples');
359
+ }
360
+
361
+ const examples = await response.json();
362
+ appState.loadedData.examples = examples;
363
+
364
+ // Update example dropdown
365
+ exampleSelect.innerHTML = '<option value="">-- Select Example --</option>';
366
+
367
+ for (const [exampleId, task] of Object.entries(examples)) {
368
+ const option = document.createElement('option');
369
+ option.value = exampleId;
370
+ option.textContent = exampleId;
371
+ option.title = task; // Show task as tooltip
372
+ exampleSelect.appendChild(option);
373
+ }
374
+
375
+ exampleSelect.disabled = false;
376
+ runSelect.innerHTML = '<option value="">-- Select Run --</option>';
377
+ runSelect.disabled = true;
378
+ runDetails.classList.add('hidden');
379
+
380
+ loadStatusDisplay.textContent = `Loaded ${Object.keys(examples).length} examples`;
381
+
382
+ // AUTO-SELECT FIRST EXAMPLE
383
+ if (Object.keys(examples).length > 0) {
384
+ const firstExampleId = Object.keys(examples)[0];
385
+ exampleSelect.value = firstExampleId;
386
+ // Trigger change event to load runs
387
+ exampleSelect.dispatchEvent(new Event('change'));
388
+ }
389
+ } catch (err) {
390
+ console.error('Error loading examples:', err);
391
+ loadStatusDisplay.textContent = `Error: ${err.message}`;
392
+ } finally {
393
+ evalSelect.disabled = false;
394
+ }
395
+ });
396
+
397
+ // Example selection
398
+ exampleSelect.addEventListener('change', async () => {
399
+ appState.currentExampleId = exampleSelect.value;
400
+
401
+ // Reset run selection
402
+ runSelect.innerHTML = '<option value="">-- Select Run --</option>';
403
+
404
+ if (!appState.currentExampleId) {
405
+ runSelect.disabled = true;
406
+ runDetails.classList.add('hidden');
407
+ return;
408
+ }
409
+
410
+ try {
411
+ loadStatusDisplay.textContent = 'Loading runs...';
412
+ exampleSelect.disabled = true;
413
+
414
+ const response = await fetch(`/api/eval/${appState.evalId}/example/${appState.currentExampleId}/runs?path=${encodeURIComponent(appState.basePath)}`);
415
+ if (!response.ok) {
416
+ const errorData = await response.json();
417
+ throw new Error(errorData.error || 'Failed to load runs');
418
+ }
419
+
420
+ const runs = await response.json();
421
+ appState.loadedData.runs[appState.currentExampleId] = runs;
422
+
423
+ // SORT RUNS by ID (assuming run IDs have timestamps or sequence numbers)
424
+ runs.sort((a, b) => a.id.localeCompare(b.id, undefined, {numeric: true}));
425
+
426
+ // Update run dropdown with sorted runs
427
+ runSelect.innerHTML = '<option value="">-- Select Run --</option>';
428
+ runs.forEach(run => {
429
+ const option = document.createElement('option');
430
+ option.value = run.id;
431
+ option.textContent = `${run.id} (${run.status})`;
432
+ option.dataset.status = run.status;
433
+ runSelect.appendChild(option);
434
+ });
435
+
436
+ runSelect.disabled = false;
437
+ runDetails.classList.add('hidden');
438
+
439
+ loadStatusDisplay.textContent = `Loaded ${runs.length} runs`;
440
+
441
+ // AUTO-SELECT FIRST RUN
442
+ if (runs.length > 0) {
443
+ runSelect.value = runs[0].id;
444
+ // Trigger change event to load run data
445
+ runSelect.dispatchEvent(new Event('change'));
446
+ }
447
+ } catch (err) {
448
+ console.error('Error loading runs:', err);
449
+ loadStatusDisplay.textContent = `Error: ${err.message}`;
450
+ } finally {
451
+ exampleSelect.disabled = false;
452
+ }
453
+ });
454
+
455
+ // Run selection
456
+ runSelect.addEventListener('change', () => {
457
+ appState.currentRunId = runSelect.value;
458
+
459
+ if (appState.currentRunId && appState.currentExampleId) {
460
+ loadRunData(appState.currentExampleId, appState.currentRunId);
461
+ runDetails.classList.remove('hidden');
462
+ } else {
463
+ runDetails.classList.add('hidden');
464
+ }
465
+ });
466
+
467
+ // Load run data
468
+ async function loadRunData(exampleId, runId) {
469
+ loadStatusDisplay.textContent = 'Loading run data...';
470
+ runSelect.disabled = true;
471
+ jsonLoadingIndicator.classList.remove('hidden');
472
+ jsonError.classList.add('hidden');
473
+
474
+ try {
475
+ // Get metadata
476
+ const metadataResponse = await fetch(`/api/eval/${appState.evalId}/example/${exampleId}/run/${runId}/metadata?path=${encodeURIComponent(appState.basePath)}`);
477
+ let metadata;
478
+
479
+ if (metadataResponse.ok) {
480
+ metadata = await metadataResponse.json();
481
+ } else {
482
+ const errorData = await metadataResponse.json();
483
+ console.error('Error loading metadata:', errorData);
484
+ jsonError.textContent = `Error loading metadata: ${errorData.error || 'Unknown error'}`;
485
+ jsonError.classList.remove('hidden');
486
+ metadata = null;
487
+ }
488
+
489
+ appState.loadedData.metadata[exampleId] = appState.loadedData.metadata[exampleId] || {};
490
+ appState.loadedData.metadata[exampleId][runId] = metadata;
491
+
492
+ // Display task
493
+ const task = appState.loadedData.examples[exampleId];
494
+ taskText.textContent = task || "No task available";
495
+
496
+ // Display status
497
+ let statusHtml = "";
498
+
499
+ if (metadata) {
500
+ if (metadata.status === 'completed') {
501
+ statusHtml = `<p><span class="status-success">✓ Completed successfully</span></p>`;
502
+ } else {
503
+ statusHtml = `<p><span class="status-failure">✗ Failed</span></p>`;
504
+ if (metadata.error_message) {
505
+ statusHtml += `<p>Error: ${metadata.error_message}</p>`;
506
+ }
507
+ }
508
+ } else {
509
+ statusHtml = "<p>Status information not available</p>";
510
+ }
511
+
512
+ statusDisplay.innerHTML = statusHtml;
513
+
514
+ // Get screenshots
515
+ const screenshotsResponse = await fetch(`/api/eval/${appState.evalId}/example/${exampleId}/run/${runId}/screenshots?path=${encodeURIComponent(appState.basePath)}`);
516
+ const screenshots = await screenshotsResponse.json();
517
+
518
+ appState.loadedData.screenshots[exampleId] = appState.loadedData.screenshots[exampleId] || {};
519
+ appState.loadedData.screenshots[exampleId][runId] = screenshots;
520
+
521
+ // Load screenshots
522
+ loadScreenshots(exampleId, runId);
523
+
524
+ // Load agent trace
525
+ renderAgentTrace(metadata);
526
+
527
+ // Display raw JSON
528
+ if (metadata) {
529
+ rawJson.textContent = JSON.stringify(metadata, null, 2);
530
+ } else {
531
+ rawJson.textContent = "No metadata available";
532
+ }
533
+
534
+ // Show screenshots tab by default
535
+ document.querySelector('.tab[data-tab="screenshots"]').click();
536
+
537
+ loadStatusDisplay.textContent = 'Run data loaded successfully';
538
+ } catch (err) {
539
+ console.error('Error loading run data:', err);
540
+ loadStatusDisplay.textContent = `Error: ${err.message}`;
541
+ jsonError.textContent = `Error loading data: ${err.message}`;
542
+ jsonError.classList.remove('hidden');
543
+ } finally {
544
+ jsonLoadingIndicator.classList.add('hidden');
545
+ runSelect.disabled = false;
546
+ }
547
+ }
548
+
549
+ // Load screenshots
550
+ function loadScreenshots(exampleId, runId) {
551
+ appState.currentImages = appState.loadedData.screenshots[exampleId]?.[runId] || [];
552
+
553
+ if (appState.currentImages.length === 0) {
554
+ imageContainer.classList.add('hidden');
555
+ imageControls.classList.add('hidden');
556
+ noImages.classList.remove('hidden');
557
+ return;
558
+ }
559
+
560
+ // Setup image viewer
561
+ noImages.classList.add('hidden');
562
+ imageContainer.classList.remove('hidden');
563
+ imageControls.classList.remove('hidden');
564
+
565
+ // Configure slider
566
+ imageSlider.min = 0;
567
+ imageSlider.max = appState.currentImages.length - 1;
568
+ imageSlider.value = 0;
569
+
570
+ // Reset to first image
571
+ appState.currentImageIndex = 0;
572
+ updateImageDisplay();
573
+ }
574
+
575
+ // Update image display
576
+ function updateImageDisplay() {
577
+ if (appState.currentImages.length === 0) return;
578
+
579
+ const image = appState.currentImages[appState.currentImageIndex];
580
+ currentImage.src = image.path;
581
+ imageCaption.textContent = image.name;
582
+ imageCounter.textContent = `${appState.currentImageIndex + 1} / ${appState.currentImages.length}`;
583
+ imageSlider.value = appState.currentImageIndex;
584
+
585
+ // Update button states
586
+ prevImage.disabled = appState.currentImageIndex === 0;
587
+ nextImage.disabled = appState.currentImageIndex === appState.currentImages.length - 1;
588
+ }
589
+
590
+ // Image navigation
591
+ prevImage.addEventListener('click', () => {
592
+ if (appState.currentImageIndex > 0) {
593
+ appState.currentImageIndex--;
594
+ updateImageDisplay();
595
+ }
596
+ });
597
+
598
+ nextImage.addEventListener('click', () => {
599
+ if (appState.currentImageIndex < appState.currentImages.length - 1) {
600
+ appState.currentImageIndex++;
601
+ updateImageDisplay();
602
+ }
603
+ });
604
+
605
+ imageSlider.addEventListener('input', () => {
606
+ appState.currentImageIndex = parseInt(imageSlider.value, 10);
607
+ updateImageDisplay();
608
+ });
609
+
610
+ // Tab handling
611
+ document.querySelectorAll('.tab').forEach(tab => {
612
+ tab.addEventListener('click', () => {
613
+ // Set active tab
614
+ document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
615
+ tab.classList.add('active');
616
+
617
+ // Show active content
618
+ const tabId = tab.getAttribute('data-tab');
619
+ document.querySelectorAll('.tab-content').forEach(content => {
620
+ content.classList.remove('active');
621
+ });
622
+ document.getElementById(`${tabId}-tab`).classList.add('active');
623
+ });
624
+ });
625
+
626
+ // Render agent trace: every step starts expanded, and the task title is not repeated in the body
627
+ function renderAgentTrace(metadata) {
628
+ agentSteps.innerHTML = '';
629
+
630
+ if (!metadata || !metadata.summary || metadata.summary.length === 0) {
631
+ agentSteps.innerHTML = '<p>No agent trace data available</p>';
632
+ return;
633
+ }
634
+
635
+ // Process each step
636
+ metadata.summary.forEach((step, index) => {
637
+ const stepDiv = document.createElement('div');
638
+ stepDiv.className = 'step';
639
+
640
+ // Create step header
641
+ const headerDiv = document.createElement('div');
642
+ headerDiv.className = 'step-header';
643
+
644
+ let headerText = `Step ${index}`;
645
+ if (index === 0 && step.task) {
646
+ headerText = 'Task';
647
+ } else if (step.model_output_message) {
648
+ headerText = 'Planning';
649
+ } else if (step.tool_calls) {
650
+ headerText = `Action ${index}`;
651
+ } else if (step.error) {
652
+ headerText = 'Error';
653
+ }
654
+
655
+ headerDiv.innerHTML = `<span>${headerText}</span><span>▲</span>`;
656
+ stepDiv.appendChild(headerDiv);
657
+
658
+ // Create step content
659
+ const contentDiv = document.createElement('div');
660
+ contentDiv.className = 'step-content';
661
+ // Make all sections visible by default
662
+ contentDiv.style.display = 'block';
663
+
664
+ let contentHtml = '';
665
+
666
+ // Task information - don't duplicate the title
667
+ if (index === 0 && step.task) {
668
+ // Just show the task content without the "Task:" title
669
+ contentHtml += `${step.task}\n\n`;
670
+ }
671
+
672
+ // Model output and planning
673
+ if (step.model_output_message && step.model_output_message.content) {
674
+ contentHtml += `<strong>Model Output:</strong>\n${step.model_output_message.content}\n\n`;
675
+
676
+ if (step.plan) {
677
+ contentHtml += `<strong>Plan:</strong>\n${step.plan}\n\n`;
678
+ }
679
+ }
680
+
681
+ // Tool calls
682
+ if (step.tool_calls && step.tool_calls.length > 0) {
683
+ step.tool_calls.forEach(toolCall => {
684
+ if (toolCall.function) {
685
+ contentHtml += `<strong>Tool Call:</strong> ${toolCall.function.name}\n`;
686
+ if (toolCall.function.arguments) {
687
+ contentHtml += `<strong>Arguments:</strong>\n${toolCall.function.arguments}\n\n`;
688
+ }
689
+ }
690
+ });
691
+ }
692
+
693
+ // Model reasoning
694
+ if (step.model_output) {
695
+ contentHtml += `<strong>Model Reasoning:</strong>\n${step.model_output}\n\n`;
696
+ }
697
+
698
+ // Observations
699
+ if (step.observations) {
700
+ contentHtml += `<strong>Observations:</strong>\n${step.observations}\n\n`;
701
+ }
702
+
703
+ // Action output
704
+ if (step.action_output) {
705
+ contentHtml += `<strong>Action Output:</strong>\n${step.action_output}\n\n`;
706
+ }
707
+
708
+ // Errors
709
+ if (step.error) {
710
+ contentHtml += `<strong>Error Type:</strong> ${step.error.type || 'Unknown'}\n`;
711
+ if (step.error.message) {
712
+ contentHtml += `<strong>Error Message:</strong> ${step.error.message}\n`;
713
+ }
714
+ }
715
+
716
+ contentDiv.textContent = contentHtml || "No content available for this step";
717
+ stepDiv.appendChild(contentDiv);
718
+
719
+ // Add click handler to toggle content
720
+ headerDiv.addEventListener('click', () => {
721
+ const isHidden = contentDiv.style.display === 'none';
722
+ contentDiv.style.display = isHidden ? 'block' : 'none';
723
+ headerDiv.querySelector('span:last-child').textContent = isHidden ? '▲' : '▼';
724
+ });
725
+
726
+ agentSteps.appendChild(stepDiv);
727
+ });
728
+
729
+ // No need to expand the first step by default since all are now expanded
730
+ }
731
+
732
+ // Handle keyboard navigation for images
733
+ document.addEventListener('keydown', (e) => {
734
+ if (!appState.currentImages || appState.currentImages.length === 0) return;
735
+
736
+ // Check if the screenshots tab is active
737
+ const screenshotsTab = document.getElementById('screenshots-tab');
738
+ if (!screenshotsTab.classList.contains('active')) return;
739
+
740
+ if (e.key === 'ArrowLeft' && appState.currentImageIndex > 0) {
741
+ appState.currentImageIndex--;
742
+ updateImageDisplay();
743
+ } else if (e.key === 'ArrowRight' && appState.currentImageIndex < appState.currentImages.length - 1) {
744
+ appState.currentImageIndex++;
745
+ updateImageDisplay();
746
+ }
747
+ });
748
+
749
+ // Load evaluations on page load
750
+ document.addEventListener('DOMContentLoaded', loadEvaluations);
751
+ </script>
752
+ </body>
753
+ </html>
voice_interface.py ADDED
@@ -0,0 +1,137 @@
1
+ """
2
+ voice_interface.py — Voice I/O for the Computer Agent
3
+ ======================================================
4
+ Speech-to-Text (Whisper / Faster-Whisper) and TTS (HF Inference API)
5
+ """
6
+
7
+ import os
8
+ import io
9
+ import tempfile
10
+ import base64
11
+ from typing import Optional, Dict, Any
12
+
13
+ import numpy as np
14
+
15
+ # STT
16
+ try:
17
+ from faster_whisper import WhisperModel
18
+ HAS_FASTER_WHISPER = True
19
+ except ImportError:
20
+ HAS_FASTER_WHISPER = False
21
+
22
+ # TTS via HF Inference
23
+ try:
24
+ from huggingface_hub import InferenceClient
25
+ HAS_HF_INFERENCE = True
26
+ except ImportError:
27
+ HAS_HF_INFERENCE = False
28
+
29
+
30
+ class VoiceInterface:
31
+ """Handles audio input (STT) and output (TTS) for the agent."""
32
+
33
+ def __init__(
34
+ self,
35
+ stt_model_size: str = "base",
36
+ tts_model: str = "hexgrad/Kokoro-82M",
37
+ hf_token: Optional[str] = None,
38
+ ):
39
+ self.stt_model_size = stt_model_size
40
+ self.tts_model = tts_model
41
+ self.hf_token = hf_token or os.getenv("HF_TOKEN")
42
+ self._stt: Optional[Any] = None
43
+ self._tts_client: Optional[Any] = None
44
+
45
+ # ------------------------------------------------------------------
46
+ # STT
47
+ # ------------------------------------------------------------------
48
+
49
+ def _load_stt(self) -> Any:
50
+ if self._stt is None:
51
+ if HAS_FASTER_WHISPER:
52
+ # Use CPU for Spaces compatibility; int8 compute keeps memory use low
53
+ self._stt = WhisperModel(self.stt_model_size, device="cpu", compute_type="int8")
54
+ else:
55
+ raise RuntimeError("faster-whisper not installed. Run: pip install faster-whisper")
56
+ return self._stt
57
+
58
+ def transcribe(self, audio_np: np.ndarray, sample_rate: int = 16000) -> Dict[str, Any]:
59
+ """Transcribe audio waveform to text.
60
+ audio_np: numpy array of float32 audio samples
61
+ """
62
+ model = self._load_stt()
63
+ # faster-whisper accepts a file path, file-like object, or NumPy array; a temporary WAV file is used here
64
+ import soundfile as sf
65
+ with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
66
+ sf.write(f.name, audio_np, sample_rate)
67
+ segments, info = model.transcribe(f.name, beam_size=5)
68
+ text = " ".join([seg.text for seg in segments])
69
+ os.unlink(f.name)
70
+ return {
71
+ "text": text.strip(),
72
+ "language": info.language,
73
+ "probability": info.language_probability,
74
+ }
75
+
76
+ def transcribe_from_file(self, file_path: str) -> Dict[str, Any]:
77
+ model = self._load_stt()
78
+ segments, info = model.transcribe(file_path, beam_size=5)
79
+ text = " ".join([seg.text for seg in segments])
80
+ return {
81
+ "text": text.strip(),
82
+ "language": info.language,
83
+ "probability": info.language_probability,
84
+ }
85
+
86
+ # ------------------------------------------------------------------
87
+ # TTS
88
+ # ------------------------------------------------------------------
89
+
90
+ def _load_tts(self) -> Any:
91
+ if self._tts_client is None:
92
+ if HAS_HF_INFERENCE:
93
+ self._tts_client = InferenceClient(model=self.tts_model, token=self.hf_token)
94
+ else:
95
+ raise RuntimeError("huggingface_hub not installed")
96
+ return self._tts_client
97
+
98
+ def synthesize(self, text: str, voice: str = "af") -> bytes:
99
+ """Synthesize text to speech bytes.
100
+ Returns raw audio bytes; the container format depends on the model. `voice` is currently not forwarded to the API.
101
+ """
102
+ client = self._load_tts()
103
+ try:
104
+ audio = client.text_to_speech(text, model=self.tts_model)
105
+ if hasattr(audio, "read"):
106
+ return audio.read()
107
+ return audio
108
+ except Exception:
109
+ # Fallback to standard TTS endpoint
110
+ alt_client = InferenceClient(token=self.hf_token)
111
+ audio = alt_client.text_to_speech(text, model="espnet/kan-bayashi_ljspeech_vits")
112
+ if hasattr(audio, "read"):
113
+ return audio.read()
114
+ return audio
115
+
116
+ def synthesize_to_file(self, text: str, output_path: str, voice: str = "af") -> str:
117
+ audio_bytes = self.synthesize(text, voice)
118
+ with open(output_path, "wb") as f:
119
+ f.write(audio_bytes)
120
+ return output_path
121
+
122
+ # ------------------------------------------------------------------
123
+ # Gradio helpers
124
+ # ------------------------------------------------------------------
125
+
126
+ def process_gradio_audio(self, audio_tuple) -> str:
127
+ """Process Gradio audio input (tuple of sample_rate, numpy_array)."""
128
+ if audio_tuple is None:
129
+ return ""
130
+ sample_rate, audio_np = audio_tuple
131
+ # Scale integer PCM (e.g. int16) to [-1, 1), then downmix to mono float32
132
+ if np.issubdtype(audio_np.dtype, np.integer):
133
+ audio_np = audio_np.astype(np.float32) / (np.iinfo(audio_np.dtype).max + 1)
134
+ if audio_np.ndim > 1:
135
+ audio_np = audio_np.mean(axis=1)
136
+ result = self.transcribe(audio_np.astype(np.float32), sample_rate)
137
+ return result["text"]
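
A note on the temporary file in `transcribe`: the file is removed only when transcription succeeds, so an exception in between would leave it on disk. The sketch below shows one way to make the cleanup unconditional. It is a minimal illustration, not part of this commit; the helper name `temp_audio_path` is hypothetical, and it uses only the standard library.

```python
import contextlib
import os
import tempfile
from typing import Iterator


@contextlib.contextmanager
def temp_audio_path(suffix: str = ".wav") -> Iterator[str]:
    """Yield the path of a closed temporary file and delete it on exit.

    Hypothetical helper for illustration; not defined in voice_interface.py.
    """
    # mkstemp returns an open OS-level descriptor. Close it immediately so
    # another library can reopen the path by name (this matters on Windows).
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        # Runs on success and on error, so the file does not linger.
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with temp_audio_path() as wav_path:
        # Placeholder bytes only; a real caller would write audio here.
        with open(wav_path, "wb") as handle:
            handle.write(b"RIFF")
        print("inside block, file exists:", os.path.exists(wav_path))
    print("after block, file exists:", os.path.exists(wav_path))
```

Under this assumption, `transcribe` could wrap its write and `model.transcribe` call in `with temp_audio_path() as path:` and join the segments inside the block, since the segments are read lazily from the file.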