computer-agent-v2 / README.md

jkorstad

v2.0-polish: tuple streaming, plan+cost display wiring, tracker sync, interrupt safety, README, eval CLI

877f588 about 1 month ago

metadata

title: Computer Agent v2.0
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
short_description: Computer agent with planner, multi-model router, MCP, memory

🤖 Open Computer Agent v2.0

An enhanced universal computer-use agent built on smolagents, E2B Desktop, and Playwright. It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.

✨ What's New in v2.0

| Feature | Description |
| --- | --- |
| 🧠 Hierarchical Planner | Breaks goals into subtask DAGs using a cheap text model before execution |
| 🔌 Playwright MCP | Semantic browser control: click by text/role, extract tables/links, evaluate JS |
| 🎯 Multi-Model Router | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
| 🧩 Set-of-Marks Vision | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
| 🗄️ Long-Term Memory | ChromaDB vector store retrieves similar past tasks and proven strategies |
| 🔍 Verifier Agent | Checks subtask completion and triggers recovery loops automatically |
| 🛑 Human-in-the-Loop | Pauses on sensitive actions (payments, emails, deletes) for user approval |
| 🎙️ Voice I/O | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
| 💰 Cost Dashboard | Real-time $/task, token usage, and latency tracking |
| 📹 Session Recording | Saves every step as replayable macros with full trace export |
| 🧪 Enhanced Eval | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |

🏗️ Architecture

User Input (Text / Voice / File)
       |
       v
[IntelligenceRouter] ----> Planner (JSON DAG)
       |
       v
[Memory Retrieval] (ChromaDB)
       |
       v
[Plan Executor]
       |
       +---> [Browser Sub-Agent] (Playwright MCP)
       +---> [Desktop Sub-Agent] (E2B + SoM Vision)
       +---> [Coder Sub-Agent] (Code Interpreter)
       +---> [HF Hub Sub-Agent] (Search / Upload)
       |
       v
[Verifier] -> Retry / Alternative / Continue
       |
       v
[Macro Saver] + Cost Report + Session Recording
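
The plan-execute-verify loop in the diagram can be sketched roughly as follows. This is a minimal illustration and not the code in `core_agent.py`: the plan format, `run_plan`, `fake_execute`, and `fake_verify` are all hypothetical names chosen for the example, and only the "Retry" branch of the verifier is modelled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_plan(plan, execute, verify, max_retries=2):
    """Run subtasks in dependency order, retrying any that fail verification.

    plan:    dict mapping subtask id -> {"agent": str, "depends_on": [ids]}
    execute: callable(task_id, spec) -> result
    verify:  callable(task_id, result) -> bool
    """
    # TopologicalSorter expects node -> predecessors, which matches "depends_on".
    order = TopologicalSorter(
        {task_id: spec["depends_on"] for task_id, spec in plan.items()}
    ).static_order()

    results = {}
    for task_id in order:
        spec = plan[task_id]
        for _attempt in range(1 + max_retries):
            result = execute(task_id, spec)
            if verify(task_id, result):
                results[task_id] = result
                break
        else:
            # Every attempt failed verification.
            raise RuntimeError(f"subtask {task_id!r} failed verification")
    return results


# Hypothetical three-step plan (assumed shape, not the planner's real JSON).
plan = {
    "open_site": {"agent": "browser", "depends_on": []},
    "extract_table": {"agent": "browser", "depends_on": ["open_site"]},
    "summarize": {"agent": "coder", "depends_on": ["extract_table"]},
}

calls = []


def fake_execute(task_id, spec):
    calls.append(task_id)
    return f"{spec['agent']} finished {task_id}"


def fake_verify(task_id, result):
    # Reject the first attempt at "extract_table" to exercise the retry path.
    return not (task_id == "extract_table" and calls.count(task_id) == 1)


results = run_plan(plan, fake_execute, fake_verify)
print(list(results))
```

Sorting the DAG first means a sub-agent never starts a subtask before its dependencies have passed verification.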

🚀 Quick Start

1. Secrets Setup

Go to Space Settings → Secrets and add:

| Secret Name | Value | Required? |
| --- | --- | --- |
| `E2B_API_KEY` | Your key from e2b.dev | Yes, for desktop automation |
| `HF_TOKEN` | Your Hugging Face token | Yes, for model inference and Hub tools |

Then Factory Rebuild the Space.
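
Space secrets are exposed to the app as environment variables, so a startup check along these lines can catch a missing key early. This is a sketch only; `missing_secrets` is a hypothetical helper and is not part of `app.py`.

```python
import os

REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets()
if missing:
    print("Missing secrets:", ", ".join(missing))
```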

2. Run a Task

1. Type a task (or click 🎙️ to speak it).
2. Hit 🚀 Let's go!
3. Watch the agent:
   - 🧠 Generate a plan in the left panel
   - 🖥️ Control the sandbox desktop in real time
   - 💰 Update cost tracking live
   - ✅ Verify completion at the end

🛡️ Sensitive Actions

By default, the agent pauses before:

- Payments, purchases, subscriptions
- Sending emails/messages/posts
- Deleting files or uninstalling software
- Password/credit-card fields

To disable these human-in-the-loop (HITL) pauses, enable Auto-approve all actions in ⚙️ Advanced Options.
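
One simple way to gate actions like those listed above is keyword matching on the action description. The sketch below assumes that approach; the patterns, category names, and the `needs_approval` helper are illustrative only and are not taken from the project's code, which may detect sensitive actions differently.

```python
import re

# Hypothetical patterns, one per category of sensitive action.
SENSITIVE_PATTERNS = {
    "payment": r"\b(pay|purchase|buy|checkout|subscribe|subscription)\b",
    "outbound_message": r"\b(send|post|publish)\b.*\b(email|message|post|tweet)\b",
    "destructive": r"\b(delete|remove|uninstall)\b",
    "credentials": r"\b(password|credit[- ]card|cvv)\b",
}


def sensitive_categories(action_description):
    """Return the categories whose pattern matches the description."""
    text = action_description.lower()
    return [
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if re.search(pattern, text)
    ]


def needs_approval(action_description, auto_approve=False):
    """True if the agent should pause and ask the user before acting."""
    if auto_approve:
        return False
    return bool(sensitive_categories(action_description))


print(needs_approval("Click the Checkout button"))
print(needs_approval("Scroll down the page"))
```

Keyword matching is cheap but coarse, so a real gate would likely combine it with checks on the target element, such as the input field type.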

💰 Cost Budget

The default budget is $2.00 USD per session. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing; actual HF Inference API costs may vary.
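
A budget-aware downgrade could look something like the sketch below. The tier names, per-token prices, and thresholds are invented for illustration and do not reflect the router's real pricing table; the sketch also ignores the capability check that the actual router applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1k_tokens: float  # hypothetical price, for illustration only


# Ordered from most to least expensive (assumed tiers).
TIERS = [
    ModelTier("powerful-vision", 0.010),
    ModelTier("fast-vision", 0.002),
    ModelTier("fast-text", 0.0005),
]


def estimate_cost(tokens, tier):
    """Estimate the USD cost of a call from its token count."""
    return tokens / 1000 * tier.usd_per_1k_tokens


def pick_tier(spent_usd, budget_usd=2.00):
    """Step down to cheaper tiers as the remaining budget shrinks."""
    remaining_fraction = max(0.0, 1 - spent_usd / budget_usd)
    if remaining_fraction > 0.5:
        return TIERS[0]
    if remaining_fraction > 0.2:
        return TIERS[1]
    return TIERS[2]


spent = 0.0
for tokens in (60_000, 60_000, 60_000):
    tier = pick_tier(spent)
    spent += estimate_cost(tokens, tier)
    print(f"{tier.name}: total spent ${spent:.2f}")
```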

🧪 Running Benchmarks

```python
from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
from app import build_session_components

# Build the judge's router once and reuse it, rather than rebuilding
# the session components on every judge call.
judge_router = build_session_components("eval_session", "./tmp/eval")["router"]

# Create harness with a factory that builds agents
harness = EvaluationHarness(
    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
    judge_model_call=lambda msgs: judge_router(msgs).content,
)

# Run full suite
summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
print(f"Avg score: {summary.avg_score}")
```

Or run a quick A/B test between two configurations:

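```python
# `harness` is the EvaluationHarness created above. make_agent_v1 and
# make_agent_v2 are agent factories you define, one per configuration.
results = harness.compare_strategies(
    strategy_a_factory=make_agent_v1,
    strategy_b_factory=make_agent_v2,
    num_runs=3,
)
print(f"Winner: Strategy {results['winner']}")
```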

🎙️ Voice Input

1. Click the microphone icon next to the task box
2. Speak your task clearly
3. The transcribed text appears in the task box automatically
4. Hit Run

Voice requires faster-whisper (optional dependency). If unavailable, a text fallback is provided.
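
An optional dependency with a text fallback is commonly handled by probing for the module before use. The sketch below shows that pattern under stated assumptions: `has_module` and `resolve_task` are hypothetical helpers, and `transcribe` stands in for whatever speech-to-text callable the app provides.

```python
import importlib.util


def has_module(name):
    """True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def resolve_task(typed_text, audio_path=None, transcribe=None, stt_ready=None):
    """Prefer the spoken task when speech-to-text is usable, else use typed text."""
    if stt_ready is None:
        stt_ready = has_module("faster_whisper")
    if stt_ready and audio_path and transcribe is not None:
        return transcribe(audio_path)
    return typed_text


# With no audio supplied, the typed text is always used.
print(resolve_task("Open the browser and search for E2B"))
```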

🧩 MCP Tools Reference

| Tool | Description |
| --- | --- |
| `browser_goto(url)` | Navigate browser to URL |
| `browser_click(selector, by)` | Click by CSS/text/role |
| `browser_fill(selector, text)` | Fill form fields |
| `browser_find_and_click(text)` | Click by visible text |
| `browser_extract_links()` | Get all page links as JSON |
| `browser_extract_tables()` | Get all page tables as JSON |
| `browser_evaluate_js(script)` | Run JS in browser context |
| `hf_search_models(query)` | Search HF Hub for models |
| `hf_search_datasets(query)` | Search HF Hub for datasets |
| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
| `fs_read(path)` | Read a workspace file |
| `fs_write(path, content)` | Write a workspace file |
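
To give a feel for what "links as JSON" might look like, here is a standalone stand-in that extracts links from an HTML string with the standard library. It is not the project's `browser_extract_links()` (which runs against a live Playwright page), and the `href`/`text` output shape is an assumption made for this example.

```python
import json
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract_links_json(html):
    """Return a JSON array of {"href", "text"} objects (assumed shape)."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return json.dumps(collector.links)


sample = '<p><a href="https://e2b.dev">E2B</a> and <a href="/docs"> Docs </a></p>'
print(extract_links_json(sample))
```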

📁 Project Structure

├── app.py              # Gradio UI + event orchestration
├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
├── voice_interface.py  # STT + TTS with WebGPU detection
├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
├── e2bqwen.py          # Original E2B vision agent (preserved)
├── requirements.txt
└── README.md

🤝 Credits

- smolagents by Hugging Face
- E2B for secure sandboxed desktops
- Playwright for browser automation
- Qwen2.5-VL for vision reasoning
- Kokoro-82M for TTS

📄 License

Apache 2.0