---
title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: "Computer agent with planner, multi-model router, MCP, memory"
---

# 🤖 Open Computer Agent v2.0

An **enhanced** universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev). It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.

## ✨ What's New in v2.0

| Feature | Description |
|---------|-------------|
| 🧠 **Hierarchical Planner** | Breaks goals into subtask DAGs using a cheap text model before execution |
| 🌐 **Playwright MCP** | Semantic browser control: click by text/role, extract tables/links, evaluate JS |
| 🎯 **Multi-Model Router** | Auto-selects the cheapest capable model (fast vision → powerful vision → fast text → powerful text) |
| 🧩 **Set-of-Marks Vision** | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| 🗃️ **Long-Term Memory** | ChromaDB vector store retrieves similar past tasks and proven strategies |
| 🔍 **Verifier Agent** | Checks subtask completion and triggers recovery loops automatically |
| 🛑 **Human-in-the-Loop** | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| 🎙️ **Voice I/O** | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| 💰 **Cost Dashboard** | Real-time $/task, token usage, and latency tracking |
| 📹 **Session Recording** | Saves every step as replayable macros with full trace export |
| 🧪 **Enhanced Eval** | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
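
To make the "subtask DAG" idea concrete, here is a minimal sketch of how a plan could be ordered for execution. The JSON shape (`id`, `depends_on`) and the subtask names are assumptions for illustration only; the planner's real schema in `core_agent.py` is not documented here. The ordering itself uses Python's standard `graphlib` module.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical plan format: the project's actual JSON schema may differ.
PLAN_JSON = """
[
  {"id": "open_site", "depends_on": []},
  {"id": "search_item", "depends_on": ["open_site"]},
  {"id": "extract_table", "depends_on": ["search_item"]},
  {"id": "save_csv", "depends_on": ["extract_table"]}
]
"""


def execution_order(plan_json: str) -> list[str]:
    """Return subtask ids in an order that respects every dependency."""
    subtasks = json.loads(plan_json)
    # Map each subtask id to the set of ids that must finish before it.
    graph = {task["id"]: set(task["depends_on"]) for task in subtasks}
    # static_order() raises graphlib.CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(PLAN_JSON))
```

Rejecting cyclic plans up front is one reason to treat the plan as a graph rather than a flat list: a cycle would otherwise surface only as a stalled run.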

## 🏗️ Architecture

```
User Input (Text / Voice / File)
        |
        v
[IntelligenceRouter] ----> Planner (JSON DAG)
        |
        v
[Memory Retrieval] (ChromaDB)
        |
        v
[Plan Executor]
        |
        +---> [Browser Sub-Agent] (Playwright MCP)
        +---> [Desktop Sub-Agent] (E2B + SoM Vision)
        +---> [Coder Sub-Agent]   (Code Interpreter)
        +---> [HF Hub Sub-Agent]  (Search / Upload)
        |
        v
[Verifier] -> Retry / Alternative / Continue
        |
        v
[Macro Saver] + Cost Report + Session Recording
```
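
The executor-to-verifier step in the diagram can be read as a retry loop. The sketch below is an illustration under stated assumptions, not the project's implementation: `run_with_verification`, `StepResult`, and the demo callbacks are hypothetical names, and the "Alternative" branch is not modeled (only "Retry" and "Continue").

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    subtask: str
    succeeded: bool
    attempts: int


def run_with_verification(
    subtasks: list[str],
    execute: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> list[StepResult]:
    """Run subtasks in order, retrying each one until the verifier accepts it."""
    results: list[StepResult] = []
    for subtask in subtasks:
        succeeded = False
        attempts = 0
        while attempts < max_attempts and not succeeded:
            attempts += 1
            output = execute(subtask, attempts)
            succeeded = verify(subtask, output)
        results.append(StepResult(subtask, succeeded, attempts))
        if not succeeded:
            break  # Later subtasks may depend on this one, so stop here.
    return results


def demo_execute(subtask: str, attempt: int) -> str:
    # Hypothetical executor: "login" only works on the second attempt.
    return "error" if subtask == "login" and attempt == 1 else "ok"


def demo_verify(subtask: str, output: str) -> bool:
    return output == "ok"


if __name__ == "__main__":
    for result in run_with_verification(["open", "login", "export"], demo_execute, demo_verify):
        print(result)
```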

## 🚀 Quick Start

### 1. Secrets Setup

Go to **Space Settings → Secrets** and add:

| Secret Name | Value | Required? |
|-------------|-------|-----------|
| `E2B_API_KEY` | Your key from [e2b.dev](https://e2b.dev) | **Yes** for desktop automation |
| `HF_TOKEN` | Your Hugging Face token | **Yes** for model inference & Hub tools |

Then **Factory Rebuild** the Space.
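
Space secrets reach the app as environment variables, so a startup check can report what is missing before the agent tries to open a sandbox. This is a minimal sketch; `missing_secrets` is a hypothetical helper, not a function from `app.py`.

```python
import os
from typing import Mapping, Optional

REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing secrets:", ", ".join(missing))
    else:
        print("All required secrets are set.")
```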

### 2. Run a Task

1. Type a task (or click 🎙️ to speak it)
2. Hit **🚀 Let's go!**
3. Watch the agent:
   - 🧠 Generate a plan in the left panel
   - 🖥️ Control the sandbox desktop in real time
   - 💰 Update cost tracking live
   - ✅ Verify completion at the end

## 🛡️ Sensitive Actions

By default, the agent pauses before:

- Payments, purchases, subscriptions
- Sending emails/messages/posts
- Deleting files or uninstalling software
- Filling password or credit-card fields

Enable **Auto-approve all actions** in ⚙️ Advanced Options to disable these human-in-the-loop (HITL) pauses.
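
One simple way to implement such a gate is keyword matching on the description of the pending action. The sketch below assumes that approach purely for illustration; the patterns, category names, and function names are hypothetical, and the project may detect sensitive actions differently.

```python
import re

# Hypothetical keyword patterns, one per category listed above.
SENSITIVE_PATTERNS = {
    "payment": re.compile(r"\b(pay|purchase|buy|checkout|subscribe)\b", re.I),
    "message": re.compile(r"\b(send|post|publish)\b", re.I),
    "delete": re.compile(r"\b(delete|remove|uninstall)\b", re.I),
    "credential": re.compile(r"\b(password|credit[- ]card|cvv)\b", re.I),
}


def sensitive_categories(action_description: str) -> list[str]:
    """Return every sensitive category whose pattern matches the description."""
    return [
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(action_description)
    ]


def needs_approval(action_description: str, auto_approve: bool = False) -> bool:
    """Pause for the user unless auto-approve is on or nothing sensitive matched."""
    if auto_approve:
        return False
    return bool(sensitive_categories(action_description))


if __name__ == "__main__":
    for action in ("Scroll down the page", "Click the Delete button for report.pdf"):
        print(action, "->", needs_approval(action))
```

Keyword matching errs toward false positives (for example, "remove the filter" would pause), which is usually the safer failure mode for an approval gate.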

## 💰 Cost Budget

The default budget is **$2.00 USD per session**. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
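
A rough sketch of how token-based cost estimation and budget-aware downgrading could fit together follows. The tier names, per-token prices, and the 50% downgrade threshold are all made-up assumptions; the router's real pricing table and policy are not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1k_input: float
    usd_per_1k_output: float


# Hypothetical tiers and prices, ordered from most to least capable.
TIERS = [
    ModelTier("powerful", 0.004, 0.012),
    ModelTier("fast", 0.0005, 0.0015),
]


def estimate_cost(tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    return (
        (input_tokens / 1000) * tier.usd_per_1k_input
        + (output_tokens / 1000) * tier.usd_per_1k_output
    )


def pick_tier(spent_usd: float, budget_usd: float = 2.00, downgrade_at: float = 0.5) -> ModelTier:
    """Use the powerful tier until `downgrade_at` of the budget is spent, then the fast one."""
    fraction_spent = spent_usd / budget_usd
    return TIERS[0] if fraction_spent < downgrade_at else TIERS[1]


if __name__ == "__main__":
    print(pick_tier(0.10).name, pick_tier(1.50).name)
    print(f"${estimate_cost(TIERS[1], 2000, 1000):.4f}")
```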

## 🧪 Running Benchmarks

```python
from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
from app import build_session_components

# Create harness with a factory that builds agents
harness = EvaluationHarness(
    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
    judge_model_call=lambda msgs: build_session_components("eval_session", "./tmp/eval")["router"](msgs).content,
)

# Run full suite
summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
print(f"Avg score: {summary.avg_score}")
```

Or run a quick A/B test between two configurations:

```python
# `harness` is the EvaluationHarness created in the previous example.
# make_agent_v1 and make_agent_v2 are agent factories you define yourself.
results = harness.compare_strategies(
    strategy_a_factory=make_agent_v1,
    strategy_b_factory=make_agent_v2,
    num_runs=3,
)
print(f"Winner: Strategy {results['winner']}")
```

## 🎙️ Voice Input

1. Click the **microphone** icon next to the task box
2. Speak your task clearly
3. The transcribed text appears in the task box automatically
4. Hit **Run**

Voice input requires the optional `faster-whisper` dependency. If it is not installed, the app falls back to text input.
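
The optional-dependency fallback can be sketched as follows. This is not the code in `voice_interface.py`: the function names are hypothetical, and the `"base"` model size with CPU `int8` inference is an assumption chosen to keep the example light. The `WhisperModel` calls follow the `faster-whisper` package's public API.

```python
import importlib.util


def stt_backend() -> str:
    """Return "faster-whisper" when the package is importable, else "text"."""
    if importlib.util.find_spec("faster_whisper") is not None:
        return "faster-whisper"
    return "text"


def transcribe(audio_path: str, typed_fallback: str = "") -> str:
    """Transcribe an audio file, or return the typed text when STT is unavailable."""
    if stt_backend() == "text":
        return typed_fallback
    # Imported lazily so the module still loads without the optional dependency.
    from faster_whisper import WhisperModel

    model = WhisperModel("base", device="cpu", compute_type="int8")  # assumed settings
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)


if __name__ == "__main__":
    print("Speech-to-text backend:", stt_backend())
```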

## 🧩 MCP Tools Reference

| Tool | Description |
|------|-------------|
| `browser_goto(url)` | Navigate browser to URL |
| `browser_click(selector, by)` | Click by CSS/text/role |
| `browser_fill(selector, text)` | Fill form fields |
| `browser_find_and_click(text)` | Click by visible text |
| `browser_extract_links()` | Get all page links as JSON |
| `browser_extract_tables()` | Get all page tables as JSON |
| `browser_evaluate_js(script)` | Run JS in browser context |
| `hf_search_models(query)` | Search HF Hub for models |
| `hf_search_datasets(query)` | Search HF Hub for datasets |
| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
| `fs_read(path)` | Read a workspace file |
| `fs_write(path, content)` | Write a workspace file |

## 📁 Project Structure

```
├── app.py              # Gradio UI + event orchestration
├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
├── voice_interface.py  # STT + TTS with WebGPU detection
├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
├── e2bqwen.py          # Original E2B vision agent (preserved)
├── requirements.txt
└── README.md
```

## 🤝 Credits

- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
- [E2B](https://e2b.dev) for secure sandboxed desktops
- [Playwright](https://playwright.dev) for browser automation
- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
- [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) for TTS

## 📄 License

Apache 2.0