title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: Computer agent with planner, multi-model router, MCP, memory

🤖 Open Computer Agent v2.0

An enhanced universal computer-use agent built on smolagents, E2B Desktop, and Playwright. It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.

✨ What's New in v2.0

| Feature | Description |
| --- | --- |
| 🧠 Hierarchical Planner | Breaks goals into subtask DAGs using a cheap text model before execution |
| 🔌 Playwright MCP | Semantic browser control: click by text/role, extract tables/links, evaluate JS |
| 🎯 Multi-Model Router | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
| 🧩 Set-of-Marks Vision | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| 🗄️ Long-Term Memory | ChromaDB vector store retrieves similar past tasks and proven strategies |
| 🔍 Verifier Agent | Checks subtask completion and triggers recovery loops automatically |
| 🛑 Human-in-the-Loop | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| 🎙️ Voice I/O | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| 💰 Cost Dashboard | Real-time $/task, token usage, and latency tracking |
| 📹 Session Recording | Saves every step as replayable macros with full trace export |
| 🧪 Enhanced Eval | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
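
To make the Set-of-Marks idea more concrete, here is a minimal sketch of how numbered marks could be resolved back to click coordinates. The `Box` type and the `assign_marks` and `click_point` helpers are hypothetical names chosen for illustration; the project's own implementation in core_agent.py may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a detected UI element, in pixels."""
    left: int
    top: int
    right: int
    bottom: int


def assign_marks(boxes):
    """Number boxes in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def click_point(marks, mark_id):
    """Resolve a mark number chosen by the model to the center of its box."""
    box = marks[mark_id]
    return ((box.left + box.right) // 2, (box.top + box.bottom) // 2)


# Hypothetical detections; a real pipeline would get these from a UI detector.
boxes = [Box(200, 10, 300, 40), Box(10, 10, 110, 40), Box(10, 100, 210, 140)]
marks = assign_marks(boxes)
print(click_point(marks, 2))
```

With this mapping the vision model only has to answer with a mark number, and the coordinates are computed from the overlay data rather than guessed from pixels.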

🏗️ Architecture

User Input (Text / Voice / File)
       |
       v
[IntelligenceRouter] ----> Planner (JSON DAG)
       |
       v
[Memory Retrieval] (ChromaDB)
       |
       v
[Plan Executor]
       |
       +---> [Browser Sub-Agent] (Playwright MCP)
       +---> [Desktop Sub-Agent] (E2B + SoM Vision)
       +---> [Coder Sub-Agent] (Code Interpreter)
       +---> [HF Hub Sub-Agent] (Search / Upload)
       |
       v
[Verifier] -> Retry / Alternative / Continue
       |
       v
[Macro Saver] + Cost Report + Session Recording
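
As an illustration of the planner stage, the sketch below shows one way a JSON subtask DAG could be turned into an execution order. The plan schema (`id`, `agent`, `depends_on`) and the `execution_order` helper are assumptions made for this example, not the format the project actually emits.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical plan; the field names are illustrative assumptions.
plan_json = """
{
  "goal": "Find a dataset and summarize it",
  "subtasks": [
    {"id": "search",   "agent": "hf_hub",  "depends_on": []},
    {"id": "download", "agent": "browser", "depends_on": ["search"]},
    {"id": "analyze",  "agent": "coder",   "depends_on": ["download"]},
    {"id": "report",   "agent": "coder",   "depends_on": ["analyze", "search"]}
  ]
}
"""


def execution_order(plan):
    """Return subtask ids so that every subtask runs after its dependencies."""
    graph = {task["id"]: set(task["depends_on"]) for task in plan["subtasks"]}
    # static_order() raises graphlib.CycleError if the plan is not a valid DAG.
    return list(TopologicalSorter(graph).static_order())


plan = json.loads(plan_json)
order = execution_order(plan)
print(order)
```

An executor could walk this order, dispatch each subtask to the sub-agent named in its `agent` field, and hand the result to the verifier before moving on.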

🚀 Quick Start

1. Secrets Setup

Go to Space Settings → Secrets and add:

| Secret Name | Value | Required? |
| --- | --- | --- |
| E2B_API_KEY | Your key from e2b.dev | Yes, for desktop automation |
| HF_TOKEN | Your Hugging Face token | Yes, for model inference and Hub tools |

Then run a Factory rebuild of the Space so the new secrets take effect.

2. Run a Task

  1. Type a task (or click 🎙️ to speak it)
  2. Hit 🚀 Let's go!
  3. Watch the agent:
    • 🧠 Generate a plan in the left panel
    • 🖥️ Control the sandbox desktop in real time
    • 💰 Update cost tracking live
    • ✅ Verify completion at the end

🛡️ Sensitive Actions

By default, the agent pauses before:

  • Payments, purchases, subscriptions
  • Sending emails/messages/posts
  • Deleting files or uninstalling software
  • Password/credit-card fields

To disable these human-in-the-loop (HITL) pauses, enable Auto-approve all actions in ⚙️ Advanced Options.
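
The gating described above can be sketched as a simple keyword check. This is only an illustration under stated assumptions: the pattern table and the `classify_sensitive` and `needs_approval` helpers are hypothetical, and the agent's real detection logic may be more elaborate.

```python
import re

# Hypothetical keyword patterns mirroring the four categories listed above.
SENSITIVE_PATTERNS = {
    "payment": r"\b(pay|purchase|buy|checkout|subscribe|subscription)\b",
    "outbound message": r"\b(send|post|publish)\b",
    "destructive change": r"\b(delete|remove|uninstall)\b",
    "credential entry": r"\b(password|credit card|cvv)\b",
}


def classify_sensitive(action_description):
    """Return the labels of every sensitive category the action matches."""
    text = action_description.lower()
    return [
        label
        for label, pattern in SENSITIVE_PATTERNS.items()
        if re.search(pattern, text)
    ]


def needs_approval(action_description, auto_approve=False):
    """Decide whether to pause for the user before running the action."""
    if auto_approve:
        return False
    return bool(classify_sensitive(action_description))


print(needs_approval("Click the Delete button for report.pdf"))
print(needs_approval("Scroll down the page"))
```

A keyword check like this errs toward pausing too often, which is usually the safer failure mode for payments and deletions.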

💰 Cost Budget

The default budget is $2.00 USD per session. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
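
The following sketch shows how token-based cost estimation and budget-driven downgrading could fit together. The tier names, prices, and thresholds are invented for illustration, and the example ignores the capability matching (vision versus text) that the real router also performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1k_input: float
    usd_per_1k_output: float


# Hypothetical tiers and prices, ordered from most to least expensive.
TIERS = [
    ModelTier("powerful-vision", 0.0050, 0.0150),
    ModelTier("fast-vision", 0.0010, 0.0030),
    ModelTier("fast-text", 0.0002, 0.0006),
]


def estimate_cost(tier, input_tokens, output_tokens):
    """Estimate the USD cost of one call from its token counts."""
    return (
        (input_tokens / 1000) * tier.usd_per_1k_input
        + (output_tokens / 1000) * tier.usd_per_1k_output
    )


def pick_tier(spent_usd, budget_usd=2.00):
    """Downgrade to cheaper tiers as the remaining budget shrinks."""
    remaining_fraction = max(0.0, 1.0 - spent_usd / budget_usd)
    if remaining_fraction > 0.5:  # assumed threshold
        return TIERS[0]
    if remaining_fraction > 0.2:  # assumed threshold
        return TIERS[1]
    return TIERS[2]


spent = 0.0
for _ in range(3):
    tier = pick_tier(spent)
    spent += estimate_cost(tier, input_tokens=2000, output_tokens=500)
print(f"{tier.name}: ${spent:.4f} spent so far")
```

Because the estimate is derived from token counts and a price table, it is only as accurate as the prices it is given, which is why the reported figures should be read as approximations.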

🧪 Running Benchmarks

from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
from app import build_session_components

# Create harness with a factory that builds agents
harness = EvaluationHarness(
    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
    judge_model_call=lambda msgs: build_session_components("eval_session", "./tmp/eval")["router"](msgs).content,
)

# Run full suite
summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
print(f"Avg score: {summary.avg_score}")

Or run a quick A/B test between two configurations:

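# make_agent_v1 and make_agent_v2 are placeholders: define your own factories,
# one per configuration you want to compare.
results = harness.compare_strategies(
    strategy_a_factory=make_agent_v1,
    strategy_b_factory=make_agent_v2,
    num_runs=3,
)
print(f"Winner: Strategy {results['winner']}")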

🎙️ Voice Input

  1. Click the microphone icon next to the task box
  2. Speak your task clearly
  3. The transcribed text appears in the task box automatically
  4. Hit 🚀 Let's go! to run the task

Voice input requires faster-whisper, an optional dependency. If it is unavailable, a text fallback is provided.

🧩 MCP Tools Reference

| Tool | Description |
| --- | --- |
| browser_goto(url) | Navigate the browser to a URL |
| browser_click(selector, by) | Click by CSS/text/role |
| browser_fill(selector, text) | Fill form fields |
| browser_find_and_click(text) | Click by visible text |
| browser_extract_links() | Get all page links as JSON |
| browser_extract_tables() | Get all page tables as JSON |
| browser_evaluate_js(script) | Run JS in the browser context |
| hf_search_models(query) | Search the HF Hub for models |
| hf_search_datasets(query) | Search the HF Hub for datasets |
| hf_upload_dataset_file(...) | Upload a file to an HF dataset |
| fs_read(path) | Read a workspace file |
| fs_write(path, content) | Write a workspace file |

📁 Project Structure

├── app.py              # Gradio UI + event orchestration
├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
├── voice_interface.py  # STT + TTS with WebGPU detection
├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
├── e2bqwen.py          # Original E2B vision agent (preserved)
├── requirements.txt
└── README.md

🀝 Credits

πŸ“„ License

Apache 2.0