---
title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: "Computer agent with planner, multi-model router, MCP, memory"
---

# 🤖 Open Computer Agent v2.0

An **enhanced** universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev). It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.

## ✨ What's New in v2.0

| Feature | Description |
|---------|-------------|
| 🧠 **Hierarchical Planner** | Breaks goals into subtask DAGs using a cheap text model before execution |
| πŸ”Œ **Playwright MCP** | Semantic browser control β€” click by text/role, extract tables/links, evaluate JS |
| 🎯 **Multi-Model Router** | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
| 🧩 **Set-of-Marks Vision** | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| πŸ—„οΈ **Long-Term Memory** | ChromaDB vector store retrieves similar past tasks and proven strategies |
| πŸ” **Verifier Agent** | Checks subtask completion and triggers recovery loops automatically |
| πŸ›‘ **Human-in-the-Loop** | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| πŸŽ™οΈ **Voice I/O** | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| πŸ’° **Cost Dashboard** | Real-time $/task, token usage, and latency tracking |
| πŸ“Ή **Session Recording** | Saves every step as replayable macros with full trace export |
| πŸ§ͺ **Enhanced Eval** | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
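
To make the subtask-DAG idea behind the Hierarchical Planner concrete, here is a minimal sketch that orders a small plan by its dependencies. The JSON shape (`id`, `task`, `depends_on`) and the function name are assumptions made for this example; the format that `core_agent.py` actually emits may differ.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical plan format: each subtask lists the ids it depends on.
PLAN_JSON = """
[
  {"id": "open_site", "task": "Open the pricing page", "depends_on": []},
  {"id": "extract", "task": "Extract the pricing table", "depends_on": ["open_site"]},
  {"id": "save", "task": "Save the table as CSV", "depends_on": ["extract"]}
]
"""


def execution_order(plan_json: str) -> list[str]:
    """Return subtask ids in an order that respects every dependency."""
    subtasks = json.loads(plan_json)
    # Map each subtask id to the set of ids that must finish first.
    graph = {step["id"]: set(step["depends_on"]) for step in subtasks}
    # static_order() raises graphlib.CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PLAN_JSON)
print(order)
```

Sorting topologically up front means a cyclic plan is rejected before any subtask runs, rather than partway through execution.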

## 🏗️ Architecture

```
User Input (Text / Voice / File)
       |
       v
[IntelligenceRouter] ----> Planner (JSON DAG)
       |
       v
[Memory Retrieval] (ChromaDB)
       |
       v
[Plan Executor]
       |
       +---> [Browser Sub-Agent] (Playwright MCP)
       +---> [Desktop Sub-Agent] (E2B + SoM Vision)
       +---> [Coder Sub-Agent] (Code Interpreter)
       +---> [HF Hub Sub-Agent] (Search / Upload)
       |
       v
[Verifier] -> Retry / Alternative / Continue
       |
       v
[Macro Saver] + Cost Report + Session Recording
```
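
The `Retry / Alternative / Continue` branch after the Verifier can be pictured with the sketch below. It is an illustration under assumptions, not the project's code: `run_with_verification`, the strategy callables, and the `max_retries` default are all hypothetical names chosen for this example.

```python
from typing import Callable


def run_with_verification(
    strategies: list[Callable[[], str]],
    verify: Callable[[str], bool],
    max_retries: int = 2,
) -> tuple[str, str]:
    """Try each strategy up to max_retries times; return (outcome, result)."""
    for strategy in strategies:
        for _attempt in range(max_retries):
            result = strategy()
            if verify(result):
                # Continue: the subtask is verified, so the caller can move on.
                return ("continue", result)
            # Retry: loop again with the same strategy.
        # Alternative: attempts exhausted, fall through to the next strategy.
    return ("failed", "")


# A stub strategy that fails once and then succeeds.
attempts = {"count": 0}


def flaky_click() -> str:
    attempts["count"] += 1
    return "clicked" if attempts["count"] >= 2 else "timeout"


outcome, result = run_with_verification([flaky_click], verify=lambda r: r == "clicked")
print(outcome, result)
```

Keeping the verifier as a plain callable makes it easy to swap a cheap rule-based check for a model-based one without touching the loop.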

## 🚀 Quick Start

### 1. Secrets Setup

Go to **Space Settings → Secrets** and add:

| Secret Name | Value | Required? |
|-------------|-------|-----------|
| `E2B_API_KEY` | Your key from [e2b.dev](https://e2b.dev) | **Yes** for desktop automation |
| `HF_TOKEN` | Your Hugging Face token | **Yes** for model inference & Hub tools |

Then **Factory Rebuild** the Space.
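
Space secrets are exposed to the running app as environment variables, so a startup check along these lines can catch a missing key early. This is a sketch only; `missing_secrets` is a hypothetical helper, and whether `app.py` performs a similar check is not stated here.

```python
import os

REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env=None) -> list[str]:
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets()
if missing:
    print(f"Missing secrets: {', '.join(missing)}")
```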

### 2. Run a Task

1. Type a task (or click 🎙️ to speak it)
2. Hit **🚀 Let's go!**
3. Watch the agent:
   - 🧠 Generate a plan in the left panel
   - 🖥️ Control the sandbox desktop in real time
   - 💰 Update cost tracking live
   - ✅ Verify completion at the end

## 🛡️ Sensitive Actions

By default, the agent pauses before:
- Payments, purchases, subscriptions
- Sending emails/messages/posts
- Deleting files or uninstalling software
- Password/credit-card fields

Enable **Auto-approve all actions** in ⚙️ Advanced Options to disable HITL.
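
One simple way to gate the categories listed above is keyword matching on the description of the pending action. The patterns, category names, and function names below are assumptions made for illustration; the agent's actual detection logic may work differently.

```python
import re

# Hypothetical keyword patterns, one per category from the list above.
SENSITIVE_PATTERNS = {
    "payment": r"\b(pay|purchase|buy|subscribe|checkout)\b",
    "messaging": r"\b(send|post|email|message)\b",
    "deletion": r"\b(delete|remove|uninstall)\b",
    "credentials": r"\b(password|credit[- ]card)\b",
}


def sensitive_categories(action_description: str) -> list[str]:
    """Return the categories whose keywords appear in the description."""
    text = action_description.lower()
    return [
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if re.search(pattern, text)
    ]


def needs_approval(action_description: str, auto_approve: bool = False) -> bool:
    """True when the action should pause for the user (HITL enabled and a match found)."""
    return not auto_approve and bool(sensitive_categories(action_description))


print(needs_approval("Click the Delete account button"))
print(needs_approval("Scroll down the page"))
```

Keyword matching is cheap but can miss paraphrased actions, which is one reason to keep the approval prompt on by default.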

## 💰 Cost Budget

Default budget is **$2.00 USD per session**. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
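
The sketch below shows how a token-based estimate and a budget-driven downgrade could fit together. The tier names, per-token prices, and the 50% downgrade threshold are placeholders invented for this example; they are not real model names, HF pricing, or the router's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1k_input: float
    usd_per_1k_output: float


# Placeholder tiers and prices, for illustration only.
POWERFUL = ModelTier("powerful-vision", 0.010, 0.030)
FAST = ModelTier("fast-vision", 0.001, 0.002)


def estimate_cost(tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    return (
        (input_tokens / 1000) * tier.usd_per_1k_input
        + (output_tokens / 1000) * tier.usd_per_1k_output
    )


def pick_tier(spent_usd: float, budget_usd: float = 2.00, downgrade_at: float = 0.5) -> ModelTier:
    """Use the powerful tier until the given fraction of the budget is spent."""
    return POWERFUL if spent_usd < budget_usd * downgrade_at else FAST


print(pick_tier(spent_usd=0.25).name)
print(pick_tier(spent_usd=1.50).name)
```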

## 🧪 Running Benchmarks

```python
from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
from app import build_session_components

# Build the judge's router once so session components are not rebuilt on every judge call
judge_router = build_session_components("eval_session", "./tmp/eval")["router"]

# Create harness with a factory that builds agents
harness = EvaluationHarness(
    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
    judge_model_call=lambda msgs: judge_router(msgs).content,
)

# Run full suite
summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
print(f"Avg score: {summary.avg_score}")
```

Or run a quick A/B test between two configurations:

```python
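# Assumes `harness` from the previous example. make_agent_v1 and make_agent_v2
# are placeholders for your own agent factories; they are not defined in this README.
results = harness.compare_strategies(
    strategy_a_factory=make_agent_v1,
    strategy_b_factory=make_agent_v2,
    num_runs=3,
)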
print(f"Winner: Strategy {results['winner']}")
```

## 🎙️ Voice Input

1. Click the **microphone** icon next to the task box
2. Speak your task clearly
3. The transcribed text appears in the task box automatically
4. Hit **Run**

Voice requires `faster-whisper` (optional dependency). If unavailable, a text fallback is provided.
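
The optional-dependency behavior can be sketched as a guarded import with a text fallback. `resolve_task_text` and its parameters are hypothetical names for this example, and the `"base"` model size is an arbitrary choice; how `voice_interface.py` handles this may differ.

```python
from typing import Optional

try:
    from faster_whisper import WhisperModel  # optional dependency
    STT_AVAILABLE = True
except ImportError:
    STT_AVAILABLE = False


def resolve_task_text(
    audio_path: Optional[str],
    typed_text: str,
    model_size: str = "base",
) -> str:
    """Return transcribed speech when possible, otherwise the typed text."""
    if audio_path is None or not STT_AVAILABLE:
        # Text fallback: no recording was made, or faster-whisper is not installed.
        return typed_text
    model = WhisperModel(model_size)
    # transcribe() returns a lazy iterator of segments plus transcription info.
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)


print(resolve_task_text(None, "Open the browser and search for smolagents"))
```

In a real app the model would be loaded once and reused rather than constructed on every call.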

## 🧩 MCP Tools Reference

| Tool | Description |
|------|-------------|
| `browser_goto(url)` | Navigate browser to URL |
| `browser_click(selector, by)` | Click by CSS/text/role |
| `browser_fill(selector, text)` | Fill form fields |
| `browser_find_and_click(text)` | Click by visible text |
| `browser_extract_links()` | Get all page links as JSON |
| `browser_extract_tables()` | Get all page tables as JSON |
| `browser_evaluate_js(script)` | Run JS in browser context |
| `hf_search_models(query)` | Search HF Hub for models |
| `hf_search_datasets(query)` | Search HF Hub for datasets |
| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
| `fs_read(path)` | Read a workspace file |
| `fs_write(path, content)` | Write a workspace file |
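
As a rough picture of what a `by` argument like the one on `browser_click(selector, by)` could map to, the sketch below dispatches to Playwright's `get_by_text`, `get_by_role`, and `locator` methods. `click_element` is a hypothetical helper and the accepted `by` values are assumptions; the real tool in `mcp_tools.py` may be implemented differently.

```python
def click_element(page, selector: str, by: str = "css") -> None:
    """Click the first element matching `selector`, interpreted according to `by`.

    `page` is expected to behave like a Playwright Page object.
    """
    if by == "text":
        locator = page.get_by_text(selector)
    elif by == "role":
        locator = page.get_by_role(selector)
    elif by == "css":
        locator = page.locator(selector)
    else:
        raise ValueError(f"Unsupported lookup mode: {by!r}")
    # .first avoids a strict-mode error when several elements match.
    locator.first.click()


# Usage with a real browser (requires `pip install playwright` and `playwright install`):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     page = p.chromium.launch().new_page()
#     page.goto("https://example.com")
#     click_element(page, "a", by="css")
```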

## 📁 Project Structure

```
├── app.py              # Gradio UI + event orchestration
├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
├── voice_interface.py  # STT + TTS with WebGPU detection
├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
├── e2bqwen.py          # Original E2B vision agent (preserved)
├── requirements.txt
└── README.md
```

## 🀝 Credits

- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
- [E2B](https://e2b.dev) for secure sandboxed desktops
- [Playwright](https://playwright.dev) for browser automation
- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
- [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) for TTS

## 📄 License

Apache 2.0