---
title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: "Computer agent with planner, multi-model router, MCP, memory"
---
# 🤖 Open Computer Agent v2.0
An **enhanced** universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev). It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.
## ✨ What's New in v2.0
| Feature | Description |
|---------|-------------|
| 🧠 **Hierarchical Planner** | Breaks goals into subtask DAGs using a cheap text model before execution |
| πŸ”Œ **Playwright MCP** | Semantic browser control β€” click by text/role, extract tables/links, evaluate JS |
| 🎯 **Multi-Model Router** | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
| 🧩 **Set-of-Marks Vision** | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| πŸ—„οΈ **Long-Term Memory** | ChromaDB vector store retrieves similar past tasks and proven strategies |
| 🔍 **Verifier Agent** | Checks subtask completion and triggers recovery loops automatically |
| 🛑 **Human-in-the-Loop** | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| 🎙️ **Voice I/O** | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| 💰 **Cost Dashboard** | Real-time $/task, token usage, and latency tracking |
| 📹 **Session Recording** | Saves every step as replayable macros with full trace export |
| 🧪 **Enhanced Eval** | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
## 🏗️ Architecture
```
User Input (Text / Voice / File)
        |
        v
[IntelligenceRouter] ----> Planner (JSON DAG)
        |
        v
[Memory Retrieval] (ChromaDB)
        |
        v
[Plan Executor]
        |
        +---> [Browser Sub-Agent] (Playwright MCP)
        +---> [Desktop Sub-Agent] (E2B + SoM Vision)
        +---> [Coder Sub-Agent]   (Code Interpreter)
        +---> [HF Hub Sub-Agent]  (Search / Upload)
        |
        v
[Verifier] -> Retry / Alternative / Continue
        |
        v
[Macro Saver] + Cost Report + Session Recording
```
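The planner, executor, and verifier stages in the diagram can be illustrated with a minimal sketch. It assumes a plan is a mapping from subtask id to an agent name and a list of dependencies; that shape, the `run_plan` helper, and the stand-in `execute` and `verify` callables are all hypothetical and are not taken from `core_agent.py`.

```python
from graphlib import TopologicalSorter

# Hypothetical plan shape: subtask id -> sub-agent name and the ids it depends on.
plan = {
    "open_site": {"agent": "browser", "depends_on": []},
    "extract_table": {"agent": "browser", "depends_on": ["open_site"]},
    "save_csv": {"agent": "coder", "depends_on": ["extract_table"]},
}


def run_plan(plan, execute, verify, max_retries=2):
    """Run subtasks in dependency order, retrying any that fail verification."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {task_id: set(spec["depends_on"]) for task_id, spec in plan.items()}
    results = {}
    for task_id in TopologicalSorter(graph).static_order():
        for _attempt in range(max_retries + 1):
            output = execute(task_id, plan[task_id]["agent"])
            if verify(task_id, output):
                results[task_id] = output
                break
        else:
            # Every attempt failed verification.
            raise RuntimeError(f"Subtask {task_id!r} failed after {max_retries + 1} attempts")
    return results


# Stand-in callables; a real run would dispatch to sub-agents and a model-based verifier.
results = run_plan(
    plan,
    execute=lambda task_id, agent: f"{agent} finished {task_id}",
    verify=lambda task_id, output: "finished" in output,
)
print(list(results))
```

Sorting the DAG topologically means each subtask runs only after its dependencies have produced a verified result.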
## 🚀 Quick Start
### 1. Secrets Setup
Go to **Space Settings → Secrets** and add:
| Secret Name | Value | Required? |
|-------------|-------|-----------|
| `E2B_API_KEY` | Your key from [e2b.dev](https://e2b.dev) | **Yes** for desktop automation |
| `HF_TOKEN` | Your Hugging Face token | **Yes** for model inference & Hub tools |
Then **Factory Rebuild** the Space.
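Spaces expose secrets to the app as environment variables, so a startup check along these lines can surface a missing key early. This is a minimal sketch; the `missing_secrets` helper is hypothetical and is not part of `app.py`.

```python
import os

# Secret names from the table above.
REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env=os.environ):
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets()
if missing:
    print(f"Missing secrets: {', '.join(missing)}")
```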
### 2. Run a Task
1. Type a task (or click 🎙️ to speak it)
2. Hit **🚀 Let's go!**
3. Watch the agent:
   - 🧠 Generate a plan in the left panel
   - 🖥️ Control the sandbox desktop in real time
   - 💰 Update cost tracking live
   - ✅ Verify completion at the end
## 🛡️ Sensitive Actions
By default, the agent pauses before:
- Payments, purchases, subscriptions
- Sending emails/messages/posts
- Deleting files or uninstalling software
- Password/credit-card fields
Enable **Auto-approve all actions** in ⚙️ Advanced Options to disable the human-in-the-loop (HITL) pauses.
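One way to picture the gating is a keyword check over the description of the next action. The sketch below is an assumption for illustration only: the patterns, the `sensitive_categories` and `needs_approval` helpers, and the `auto_approve` flag are hypothetical, and the agent's actual detection logic may differ.

```python
import re

# Hypothetical keyword patterns, one per category in the list above.
SENSITIVE_PATTERNS = {
    "payment": r"\b(pay|purchase|buy|subscribe|checkout)\b",
    "message": r"\b(send|post|publish)\b",
    "deletion": r"\b(delete|remove|uninstall)\b",
    "credentials": r"\b(password|credit card|cvv)\b",
}


def sensitive_categories(action_description):
    """Return the categories whose keywords appear in the action description."""
    text = action_description.lower()
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if re.search(pattern, text)]


def needs_approval(action_description, auto_approve=False):
    """Pause for the user unless auto-approve is on or nothing sensitive matched."""
    return not auto_approve and bool(sensitive_categories(action_description))


print(needs_approval("Click the Delete button"))
print(needs_approval("Scroll down the page"))
```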
## 💰 Cost Budget
The default budget is **$2.00 USD per session**. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
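The downgrade behaviour can be sketched as a tier lookup driven by the fraction of budget already spent. Everything here is assumed for illustration: the tier names, the per-token prices, and the equal-thirds thresholds are hypothetical and do not reflect the router's real policy or any real pricing.

```python
# Hypothetical tiers, most capable first: (tier name, assumed USD per 1K tokens).
MODEL_TIERS = [
    ("powerful-vision", 0.010),
    ("fast-vision", 0.002),
    ("fast-text", 0.0005),
]


def estimate_cost(tokens, usd_per_1k):
    """Estimate spend from a token count and a per-1K-token price."""
    return tokens / 1000 * usd_per_1k


def pick_tier(spent_usd, budget_usd=2.00):
    """Step down one tier each time another third of the budget is consumed."""
    fraction_used = min(max(spent_usd / budget_usd, 0.0), 1.0)
    index = min(int(fraction_used * len(MODEL_TIERS)), len(MODEL_TIERS) - 1)
    return MODEL_TIERS[index][0]


for spent in (0.0, 1.0, 1.9):
    print(f"${spent:.2f} spent -> {pick_tier(spent)}")
```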
## 🧪 Running Benchmarks
```python
from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
from app import build_session_components

# Create a harness with a factory that builds agents
harness = EvaluationHarness(
    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
    judge_model_call=lambda msgs: build_session_components("eval_session", "./tmp/eval")["router"](msgs).content,
)

# Run the full suite
summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
print(f"Avg score: {summary.avg_score}")
```
Or run a quick A/B test between two configurations:
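```python
# Reuses `harness` from the snippet above.
# make_agent_v1 and make_agent_v2 are agent factories you define for the two configurations.
results = harness.compare_strategies(
    strategy_a_factory=make_agent_v1,
    strategy_b_factory=make_agent_v2,
    num_runs=3,
)
print(f"Winner: Strategy {results['winner']}")
```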
## 🎙️ Voice Input
1. Click the **microphone** icon next to the task box
2. Speak your task clearly
3. The transcribed text appears in the task box automatically
4. Hit **Run**
Voice requires `faster-whisper` (optional dependency). If unavailable, a text fallback is provided.
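A common way to handle an optional dependency like this is to probe for the package before enabling the feature. The sketch below assumes that approach; the `voice_available` and `input_mode` helpers are hypothetical and are not taken from `voice_interface.py`.

```python
import importlib.util


def voice_available():
    """True when the optional faster-whisper package (module `faster_whisper`) is installed."""
    return importlib.util.find_spec("faster_whisper") is not None


def input_mode():
    """Pick voice input when the dependency is present, otherwise fall back to text."""
    return "voice" if voice_available() else "text"


print(f"Input mode: {input_mode()}")
```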
## 🧩 MCP Tools Reference
| Tool | Description |
|------|-------------|
| `browser_goto(url)` | Navigate browser to URL |
| `browser_click(selector, by)` | Click by CSS/text/role |
| `browser_fill(selector, text)` | Fill form fields |
| `browser_find_and_click(text)` | Click by visible text |
| `browser_extract_links()` | Get all page links as JSON |
| `browser_extract_tables()` | Get all page tables as JSON |
| `browser_evaluate_js(script)` | Run JS in browser context |
| `hf_search_models(query)` | Search HF Hub for models |
| `hf_search_datasets(query)` | Search HF Hub for datasets |
| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
| `fs_read(path)` | Read a workspace file |
| `fs_write(path, content)` | Write a workspace file |
## 📁 Project Structure
```
├── app.py              # Gradio UI + event orchestration
├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
├── voice_interface.py  # STT + TTS with WebGPU detection
├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
├── e2bqwen.py          # Original E2B vision agent (preserved)
├── requirements.txt
└── README.md
```
## 🀝 Credits
- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
- [E2B](https://e2b.dev) for secure sandboxed desktops
- [Playwright](https://playwright.dev) for browser automation
- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
- [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) for TTS
## 📄 License
Apache 2.0