---
title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: "Computer agent with planner, multi-model router, MCP, memory"
---
# 🤖 Open Computer Agent v2.0
An **enhanced** universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev). It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.
## ✨ What's New in v2.0
| Feature | Description |
|---------|-------------|
| 🧠 **Hierarchical Planner** | Breaks goals into subtask DAGs using a cheap text model before execution |
| 🌐 **Playwright MCP** | Semantic browser control: click by text/role, extract tables/links, evaluate JS |
| 🎯 **Multi-Model Router** | Auto-selects the cheapest capable model (fast vision → powerful vision → fast text → powerful text) |
| 🧩 **Set-of-Marks Vision** | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| 🗄️ **Long-Term Memory** | ChromaDB vector store retrieves similar past tasks and proven strategies |
| 🔍 **Verifier Agent** | Checks subtask completion and triggers recovery loops automatically |
| 🛑 **Human-in-the-Loop** | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| 🎙️ **Voice I/O** | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| 💰 **Cost Dashboard** | Real-time $/task, token usage, and latency tracking |
| 📹 **Session Recording** | Saves every step as replayable macros with full trace export |
| 🧪 **Enhanced Eval** | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
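
The Verifier row describes a check-then-recover loop. The snippet below is a minimal sketch of that idea, not the code in `core_agent.py`: the function name `run_with_recovery`, the `max_retries` parameter, and the two toy strategies are all hypothetical. Each strategy is retried a fixed number of times, and the loop then moves to the next alternative.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_recovery(
    strategies: Sequence[Callable[[], T]],
    verify: Callable[[T], bool],
    max_retries: int = 1,
) -> T:
    """Run strategies in order until one produces a verified result."""
    for strategy in strategies:
        # First attempt plus up to `max_retries` retries of the same strategy.
        for _attempt in range(1 + max_retries):
            result = strategy()
            if verify(result):
                return result  # verified: the caller can continue
        # Retries exhausted: fall through to the next alternative strategy.
    raise RuntimeError("No strategy passed verification")


# Toy usage: the first strategy always fails, the alternative succeeds.
calls = []


def click_by_coordinates() -> str:
    calls.append("coordinates")
    return "wrong page"


def click_by_text() -> str:
    calls.append("text")
    return "checkout page"


outcome = run_with_recovery(
    [click_by_coordinates, click_by_text],
    verify=lambda page: page == "checkout page",
)
print(outcome, calls)
```

Separating "retry the same strategy" from "switch strategy" keeps the recovery policy explicit; a real verifier would likely inspect screenshots or page state rather than compare strings.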
## 🏗️ Architecture
```
User Input (Text / Voice / File)
        |
        v
[IntelligenceRouter] ----> Planner (JSON DAG)
        |
        v
[Memory Retrieval] (ChromaDB)
        |
        v
[Plan Executor]
        |
        +---> [Browser Sub-Agent] (Playwright MCP)
        +---> [Desktop Sub-Agent] (E2B + SoM Vision)
        +---> [Coder Sub-Agent]   (Code Interpreter)
        +---> [HF Hub Sub-Agent]  (Search / Upload)
        |
        v
[Verifier] -> Retry / Alternative / Continue
        |
        v
[Macro Saver] + Cost Report + Session Recording
```
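
The hand-off from Planner to Plan Executor can be pictured as a dependency-ordered walk over the plan. The sketch below assumes the JSON DAG is a list of subtasks with `id`, `agent`, and `depends_on` fields; that schema, the sample plan, and the helper `execution_order` are illustrative assumptions, not the planner's actual output format.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical plan format; the real planner's schema may differ.
PLAN_JSON = """
[
  {"id": "open_site", "agent": "browser", "depends_on": []},
  {"id": "extract_table", "agent": "browser", "depends_on": ["open_site"]},
  {"id": "analyze", "agent": "coder", "depends_on": ["extract_table"]},
  {"id": "upload", "agent": "hf_hub", "depends_on": ["analyze"]}
]
"""


def execution_order(plan_json: str) -> list[str]:
    """Return subtask ids in an order that respects every dependency."""
    subtasks = json.loads(plan_json)
    # Map each subtask to the set of subtasks that must finish before it.
    graph = {task["id"]: set(task["depends_on"]) for task in subtasks}
    # static_order() raises graphlib.CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PLAN_JSON)
print(order)
```

Using the standard-library `graphlib` module (Python 3.9+) also gives cycle detection for free, which is a cheap sanity check on a plan produced by a language model.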
## 🚀 Quick Start
### 1. Secrets Setup
Go to **Space Settings → Secrets** and add:
| Secret Name | Value | Required? |
|-------------|-------|-----------|
| `E2B_API_KEY` | Your key from [e2b.dev](https://e2b.dev) | **Yes** for desktop automation |
| `HF_TOKEN` | Your Hugging Face token | **Yes** for model inference & Hub tools |
Then **Factory Rebuild** the Space.
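
Spaces expose secrets to the running app as environment variables, so a startup check can catch a missing key before the agent tries to open a sandbox. A minimal sketch, assuming a helper named `missing_secrets` (hypothetical, not part of `app.py`):

```python
import os

# The two secrets listed in the table above.
REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env=None) -> list[str]:
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing secrets:", ", ".join(missing))
    else:
        print("All required secrets are set.")
```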
### 2. Run a Task
1. Type a task (or click 🎙️ to speak it)
2. Hit **🚀 Let's go!**
3. Watch the agent:
   - 🧠 Generate a plan in the left panel
   - 🖥️ Control the sandbox desktop in real time
   - 💰 Update cost tracking live
   - ✅ Verify completion at the end
## 🛡️ Sensitive Actions
By default, the agent pauses before:
- Payments, purchases, subscriptions
- Sending emails/messages/posts
- Deleting files or uninstalling software
- Password/credit-card fields
Enable **Auto-approve all actions** in ⚙️ Advanced Options to disable these human-in-the-loop (HITL) pauses.
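
One way to picture the approval gate is a keyword check on the description of the next action. The sketch below is an assumption about how such a gate could work, not the agent's actual detection logic; the pattern table and the names `sensitive_categories` and `needs_approval` are hypothetical.

```python
import re

# Hypothetical keyword patterns, loosely mirroring the categories listed above.
SENSITIVE_PATTERNS = {
    "payment": r"\b(pay|purchase|buy|checkout|subscribe)\b",
    "outbound_message": r"\b(send|post|publish|reply)\b",
    "destructive": r"\b(delete|remove|uninstall)\b",
    "credentials": r"\b(password|credit card|cvv)\b",
}


def sensitive_categories(action: str) -> list[str]:
    """Return every category whose pattern matches the action description."""
    text = action.lower()
    return [
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if re.search(pattern, text)
    ]


def needs_approval(action: str, auto_approve: bool = False) -> bool:
    """Pause for the user unless auto-approve is on or nothing matched."""
    return not auto_approve and bool(sensitive_categories(action))


print(needs_approval("Click the Delete button for report.pdf"))
print(needs_approval("Scroll down the results page"))
```

Keyword matching is cheap but produces false positives and misses paraphrases, so a production gate would probably combine it with model-based classification.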
## 💰 Cost Budget
The default budget is **$2.00 USD per session**. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
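
The budget behaviour can be illustrated with a token-based cost estimate and a threshold rule for downgrading. Everything in this sketch is a placeholder: the tier names, the per-token prices, and the 50% / 20% thresholds are invented for illustration and are not the router's real configuration or real HF pricing. It also ignores capability requirements, which the actual router takes into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float  # placeholder price, not a real quote


# Ordered from most to least expensive; names and prices are illustrative only.
TIERS = [
    ModelTier("powerful-vision", 4.00),
    ModelTier("fast-vision", 1.00),
    ModelTier("fast-text", 0.20),
]


def estimate_cost(tokens: int, tier: ModelTier) -> float:
    """Estimate the USD cost of a call from its total token count."""
    return tokens / 1_000_000 * tier.usd_per_million_tokens


def pick_tier(spent_usd: float, budget_usd: float = 2.00) -> ModelTier:
    """Downgrade to cheaper tiers as the remaining budget shrinks."""
    remaining_fraction = max(0.0, 1.0 - spent_usd / budget_usd)
    if remaining_fraction > 0.5:
        return TIERS[0]
    if remaining_fraction > 0.2:
        return TIERS[1]
    return TIERS[2]


print(pick_tier(0.0).name, pick_tier(1.2).name, pick_tier(1.9).name)
```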
## 🧪 Running Benchmarks
```python
from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
from app import build_session_components

# Create harness with a factory that builds agents
harness = EvaluationHarness(
    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
    judge_model_call=lambda msgs: build_session_components("eval_session", "./tmp/eval")["router"](msgs).content,
)

# Run full suite
summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
print(f"Avg score: {summary.avg_score}")
```
Or run a quick A/B test between two configurations:
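```python
# Reuses `harness` from the previous snippet. `make_agent_v1` and `make_agent_v2`
# are agent factories you define yourself, one per configuration.
results = harness.compare_strategies(
    strategy_a_factory=make_agent_v1,
    strategy_b_factory=make_agent_v2,
    num_runs=3,
)
print(f"Winner: Strategy {results['winner']}")
```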
## 🎙️ Voice Input
1. Click the **microphone** icon next to the task box
2. Speak your task clearly
3. The transcribed text appears in the task box automatically
4. Hit **Run**
Voice input requires the optional `faster-whisper` dependency. If it is not installed, a text fallback is provided.
## 🧩 MCP Tools Reference
| Tool | Description |
|------|-------------|
| `browser_goto(url)` | Navigate browser to URL |
| `browser_click(selector, by)` | Click by CSS/text/role |
| `browser_fill(selector, text)` | Fill form fields |
| `browser_find_and_click(text)` | Click by visible text |
| `browser_extract_links()` | Get all page links as JSON |
| `browser_extract_tables()` | Get all page tables as JSON |
| `browser_evaluate_js(script)` | Run JS in browser context |
| `hf_search_models(query)` | Search HF Hub for models |
| `hf_search_datasets(query)` | Search HF Hub for datasets |
| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
| `fs_read(path)` | Read a workspace file |
| `fs_write(path, content)` | Write a workspace file |
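
To give a feel for the "links as JSON" shape, the sketch below extracts links from a static HTML string with the standard library. It is a stand-in technique, not how `browser_extract_links()` is implemented (that tool runs against a live Playwright page), and the output fields `href` and `text` are assumed for illustration.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # link being read, or None outside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract_links(html: str) -> str:
    """Return all links in the HTML as a JSON array (assumed output shape)."""
    collector = LinkCollector()
    collector.feed(html)
    return json.dumps(collector.links)


sample = '<p><a href="https://e2b.dev">E2B</a> and <a href="/docs"> Docs </a></p>'
print(extract_links(sample))
```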
## 📁 Project Structure
```
├── app.py              # Gradio UI + event orchestration
├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
├── voice_interface.py  # STT + TTS with WebGPU detection
├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
├── e2bqwen.py          # Original E2B vision agent (preserved)
├── requirements.txt
└── README.md
```
## 🤝 Credits
- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
- [E2B](https://e2b.dev) for secure sandboxed desktops
- [Playwright](https://playwright.dev) for browser automation
- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
- [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) for TTS
## 📄 License
Apache 2.0