title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: Computer agent with planner, multi-model router, MCP, memory
Open Computer Agent v2.0
An enhanced universal computer-use agent built on smolagents, E2B Desktop, and Playwright. It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.
What's New in v2.0
| Feature | Description |
|---|---|
| Hierarchical Planner | Breaks goals into subtask DAGs using a cheap text model before execution |
| Playwright MCP | Semantic browser control: click by text/role, extract tables/links, evaluate JS |
| Multi-Model Router | Auto-selects the cheapest capable model (fast vision → powerful vision → fast text → powerful text) |
| Set-of-Marks Vision | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| Long-Term Memory | ChromaDB vector store retrieves similar past tasks and proven strategies |
| Verifier Agent | Checks subtask completion and triggers recovery loops automatically |
| Human-in-the-Loop | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| Voice I/O | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| Cost Dashboard | Real-time $/task, token usage, and latency tracking |
| Session Recording | Saves every step as replayable macros with full trace export |
| Enhanced Eval | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
Architecture
    User Input (Text / Voice / File)
            |
            v
    [IntelligenceRouter] ----> Planner (JSON DAG)
            |
            v
    [Memory Retrieval] (ChromaDB)
            |
            v
    [Plan Executor]
            |
            +---> [Browser Sub-Agent] (Playwright MCP)
            +---> [Desktop Sub-Agent] (E2B + SoM Vision)
            +---> [Coder Sub-Agent]   (Code Interpreter)
            +---> [HF Hub Sub-Agent]  (Search / Upload)
            |
            v
    [Verifier] -> Retry / Alternative / Continue
            |
            v
    [Macro Saver] + Cost Report + Session Recording
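The plan-execute-verify flow in the diagram can be illustrated with a small sketch. This is not the project's implementation: the subtask names, the `run_plan` helper, and the stand-in executor and verifier are all hypothetical, and the retry policy is an assumption. It only shows how a subtask DAG might be run in dependency order, with a verifier deciding whether to retry.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: subtask id -> set of subtask ids it depends on.
plan = {
    "open_browser": set(),
    "search_flights": {"open_browser"},
    "extract_prices": {"search_flights"},
    "write_report": {"extract_prices"},
}


def run_plan(plan, execute, verify, max_retries=2):
    """Run subtasks in dependency order, retrying any that fail verification."""
    log = []
    for subtask in TopologicalSorter(plan).static_order():
        # One initial attempt plus up to max_retries retries.
        for attempt in range(1, max_retries + 2):
            result = execute(subtask, attempt)
            if verify(subtask, result):
                log.append((subtask, attempt))
                break
        else:
            raise RuntimeError(f"{subtask} failed after {max_retries + 1} attempts")
    return log


# Stand-in executor and verifier: "search_flights" fails on its first attempt.
def fake_execute(subtask, attempt):
    return "error" if (subtask == "search_flights" and attempt == 1) else "ok"


def fake_verify(subtask, result):
    return result == "ok"


log = run_plan(plan, fake_execute, fake_verify)
print(log)
```

Each log entry records a subtask and the attempt on which it passed verification, so a retried subtask shows an attempt number greater than 1.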
Quick Start
1. Secrets Setup
Go to Space Settings → Secrets and add:
| Secret Name | Value | Required? |
|---|---|---|
| `E2B_API_KEY` | Your key from e2b.dev | Yes, for desktop automation |
| `HF_TOKEN` | Your Hugging Face token | Yes, for model inference & Hub tools |
Then run a Factory Rebuild of the Space.
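Space secrets are exposed to the app as environment variables, so a quick startup check can catch a missing key early. The helper below is a minimal sketch; `missing_secrets` is a hypothetical name and not part of the project.

```python
import os

REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env=os.environ):
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets()
if missing:
    print("Missing secrets:", ", ".join(missing))
else:
    print("All required secrets are set.")
```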
2. Run a Task
- Type a task (or click the microphone icon to speak it)
- Hit "Let's go!"
- Watch the agent:
  - Generate a plan in the left panel
  - Control the sandbox desktop in real time
  - Update cost tracking live
  - Verify completion at the end
Sensitive Actions
By default, the agent pauses before:
- Payments, purchases, subscriptions
- Sending emails/messages/posts
- Deleting files or uninstalling software
- Password/credit-card fields
Enable "Auto-approve all actions" in Advanced Options to disable human-in-the-loop (HITL) approval.
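One way such a gate could work is a keyword check on the description of each proposed action. The sketch below is an assumption for illustration only: the patterns, `classify_action`, and `needs_approval` are hypothetical, and the project's actual detection logic may differ.

```python
import re

# Hypothetical keyword patterns, loosely mirroring the categories listed above.
SENSITIVE_PATTERNS = {
    "payment": r"\b(pay|purchase|buy|subscribe|checkout)\b",
    "messaging": r"\b(send|post|publish)\b.*\b(email|message|post|tweet)\b",
    "deletion": r"\b(delete|remove|uninstall)\b",
    "credentials": r"\b(password|credit card|cvv)\b",
}


def classify_action(description):
    """Return the first sensitive category matched, or None."""
    text = description.lower()
    for category, pattern in SENSITIVE_PATTERNS.items():
        if re.search(pattern, text):
            return category
    return None


def needs_approval(description, auto_approve=False):
    """True when the action should pause for the user's approval."""
    return not auto_approve and classify_action(description) is not None


for action in ("Click the Buy now button", "Scroll down the page"):
    print(action, "->", "pause" if needs_approval(action) else "proceed")
```

Passing `auto_approve=True` bypasses the check entirely, which corresponds to the option described above.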
Cost Budget
The default budget is $2.00 USD per session. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
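The downgrade behaviour can be pictured with a small sketch. Everything here is an assumption for illustration: the tier names, the prices, the `pick_model` helper, and the 4,000-token estimate are made up and do not reflect the router's real tiers or HF pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                 # hypothetical tier label
    vision: bool              # whether the tier accepts images
    strength: int             # 1 = fast, 2 = powerful
    usd_per_1k_tokens: float  # made-up price for illustration


TIERS = [
    ModelTier("fast-text", False, 1, 0.0002),
    ModelTier("fast-vision", True, 1, 0.0005),
    ModelTier("powerful-text", False, 2, 0.002),
    ModelTier("powerful-vision", True, 2, 0.004),
]


def pick_model(needs_vision, min_strength, spent_usd, budget_usd=2.00, est_tokens=4000):
    """Pick the cheapest tier that is capable and affordable.

    If no tier of the requested strength fits the remaining budget, fall back
    to the cheapest capable tier of any strength (a downgrade). Returns None
    when nothing is affordable.
    """
    remaining = budget_usd - spent_usd

    def affordable(tier):
        return tier.usd_per_1k_tokens * est_tokens / 1000 <= remaining

    capable = [t for t in TIERS if t.vision or not needs_vision]
    preferred = [t for t in capable if t.strength >= min_strength and affordable(t)]
    fallback = [t for t in capable if affordable(t)]
    candidates = preferred or fallback
    return min(candidates, key=lambda t: t.usd_per_1k_tokens) if candidates else None


print(pick_model(needs_vision=True, min_strength=2, spent_usd=0.0).name)
print(pick_model(needs_vision=True, min_strength=2, spent_usd=1.99).name)
```

With a fresh budget the sketch selects the powerful vision tier; with only about one cent left it falls back to the fast vision tier.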
Running Benchmarks
    from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
    from app import build_session_components

    # Create a harness with a factory that builds agents
    harness = EvaluationHarness(
        agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
        judge_model_call=lambda msgs: build_session_components("eval_session", "./tmp/eval")["router"](msgs).content,
    )

    # Run the full suite
    summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
    print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
    print(f"Avg score: {summary.avg_score}")
Or run a quick A/B test between two configurations:
    # make_agent_v1 and make_agent_v2 are user-defined agent factories (not defined in this snippet)
    results = harness.compare_strategies(
        strategy_a_factory=make_agent_v1,
        strategy_b_factory=make_agent_v2,
        num_runs=3,
    )
    print(f"Winner: Strategy {results['winner']}")
Voice Input
- Click the microphone icon next to the task box
- Speak your task clearly
- The transcribed text appears in the task box automatically
- Hit Run
Voice input requires `faster-whisper` (an optional dependency). If it is unavailable, the app falls back to text input.
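The optional-dependency fallback could look roughly like the sketch below. `voice_available` and `get_task_text` are hypothetical helpers and the `"base"` model size is an assumption; only the `WhisperModel` class and its `transcribe` method come from the `faster-whisper` package.

```python
import importlib.util


def voice_available():
    """True if the optional faster-whisper package is installed."""
    return importlib.util.find_spec("faster_whisper") is not None


def get_task_text(audio_path, typed_text, model_size="base"):
    """Transcribe the audio when possible; otherwise return the typed text."""
    if audio_path is None or not voice_available():
        return typed_text
    # Imported lazily so the app still starts without the optional dependency.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)


print(get_task_text(None, "Open the browser and search for flights"))
```

Loading the model inside the function keeps the example short; a real app would likely load it once and reuse it across requests.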
MCP Tools Reference
| Tool | Description |
|---|---|
| `browser_goto(url)` | Navigate browser to URL |
| `browser_click(selector, by)` | Click by CSS/text/role |
| `browser_fill(selector, text)` | Fill form fields |
| `browser_find_and_click(text)` | Click by visible text |
| `browser_extract_links()` | Get all page links as JSON |
| `browser_extract_tables()` | Get all page tables as JSON |
| `browser_evaluate_js(script)` | Run JS in browser context |
| `hf_search_models(query)` | Search HF Hub for models |
| `hf_search_datasets(query)` | Search HF Hub for datasets |
| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
| `fs_read(path)` | Read a workspace file |
| `fs_write(path, content)` | Write a workspace file |
Project Structure
    ├── app.py              # Gradio UI + event orchestration
    ├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
    ├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
    ├── voice_interface.py  # STT + TTS with WebGPU detection
    ├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
    ├── e2bqwen.py          # Original E2B vision agent (preserved)
    ├── requirements.txt
    └── README.md
Credits
- smolagents by Hugging Face
- E2B for secure sandboxed desktops
- Playwright for browser automation
- Qwen2.5-VL for vision reasoning
- Kokoro-82M for TTS
License
Apache 2.0