computer-agent-v2 / README.md

jkorstad

v2.0-polish: tuple streaming, plan+cost display wiring, tracker sync, interrupt safety, README, eval CLI

877f588 about 1 month ago

	---
	title: Computer Agent v2.0
	emoji: 🤖
	colorFrom: purple
	colorTo: blue
	sdk: gradio
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: "Computer agent with planner, multi-model router, MCP, memory"
	---

	# 🤖 Open Computer Agent v2.0

	An enhanced universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev). It plans before it acts, remembers what worked, routes tasks to the cheapest capable model, and verifies its own success.

	## ✨ What's New in v2.0

	| Feature | Description |
	|---------|-------------|
	| 🧠 Hierarchical Planner | Breaks goals into subtask DAGs using a cheap text model before execution |
	| 🔌 Playwright MCP | Semantic browser control: click by text/role, extract tables/links, evaluate JS |
	| 🎯 Multi-Model Router | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
	| 🧩 Set-of-Marks Vision | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
	| 🗄️ Long-Term Memory | ChromaDB vector store retrieves similar past tasks and proven strategies |
	| 🔍 Verifier Agent | Checks subtask completion and triggers recovery loops automatically |
	| 🛑 Human-in-the-Loop | Pauses on sensitive actions (payments, emails, deletes) for user approval |
	| 🎙️ Voice I/O | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
	| 💰 Cost Dashboard | Real-time $/task, token usage, and latency tracking |
	| 📹 Session Recording | Saves every step as replayable macros with full trace export |
	| 🧪 Enhanced Eval | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
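
The planner row above mentions subtask DAGs. As a rough illustration of what that could look like, the sketch below orders a small plan so that every subtask runs after its dependencies. The JSON shape (`id`, `description`, `depends_on`) and the `execution_order` helper are assumptions made for this example; the actual schema in `core_agent.py` may differ.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical plan format: these field names are assumptions for illustration.
PLAN_JSON = """
[
  {"id": "open_site", "description": "Open the pricing page", "depends_on": []},
  {"id": "extract", "description": "Extract the pricing table", "depends_on": ["open_site"]},
  {"id": "save", "description": "Write the table to a CSV file", "depends_on": ["extract"]}
]
"""


def execution_order(plan_json: str) -> list:
    """Return subtask ids in an order that respects every dependency."""
    subtasks = json.loads(plan_json)
    # Map each subtask id to the set of ids it depends on
    graph = {task["id"]: set(task["depends_on"]) for task in subtasks}
    # static_order() raises graphlib.CycleError if the plan is not a DAG
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PLAN_JSON)
print(order)
```

Rejecting cyclic plans up front (via `CycleError`) is one way to catch a malformed plan before any sandbox time is spent.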

	## 🏗️ Architecture

	```
	User Input (Text / Voice / File)
	        |
	        v
	[IntelligenceRouter] ----> Planner (JSON DAG)
	        |
	        v
	[Memory Retrieval] (ChromaDB)
	        |
	        v
	[Plan Executor]
	        |
	        +---> [Browser Sub-Agent] (Playwright MCP)
	        +---> [Desktop Sub-Agent] (E2B + SoM Vision)
	        +---> [Coder Sub-Agent] (Code Interpreter)
	        +---> [HF Hub Sub-Agent] (Search / Upload)
	        |
	        v
	[Verifier] -> Retry / Alternative / Continue
	        |
	        v
	[Macro Saver] + Cost Report + Session Recording
	```
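
The executor/verifier part of the diagram can be pictured as a loop that retries a subtask until the verifier accepts it. The sketch below is a simplified illustration, not the code in `core_agent.py`: `run_plan`, `StepResult`, and the `execute`/`verify` callables are hypothetical names, and only the "Retry" and "Continue" branches are modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    subtask: str
    attempts: int
    succeeded: bool


def run_plan(
    subtasks: List[str],
    execute: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> List[StepResult]:
    """Run subtasks in order, retrying each one until the verifier accepts it."""
    results = []
    for subtask in subtasks:
        attempts = 0
        succeeded = False
        while attempts < max_attempts and not succeeded:
            attempts += 1
            output = execute(subtask, attempts)
            succeeded = verify(subtask, output)
        results.append(StepResult(subtask, attempts, succeeded))
        if not succeeded:
            break  # stop rather than build on a failed step
    return results


def fake_execute(subtask: str, attempt: int) -> str:
    # Stand-in for a sub-agent: fails the first attempt at "fill form"
    if subtask == "fill form" and attempt == 1:
        return "error"
    return "ok"


def fake_verify(subtask: str, output: str) -> bool:
    # Stand-in for the verifier agent
    return output == "ok"


results = run_plan(["open page", "fill form"], fake_execute, fake_verify)
for result in results:
    print(result)
```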

	## 🚀 Quick Start

	### 1. Secrets Setup

	Go to Space Settings → Secrets and add:

	| Secret Name | Value | Required? |
	|-------------|-------|-----------|
	| `E2B_API_KEY` | Your key from [e2b.dev](https://e2b.dev) | Yes for desktop automation |
	| `HF_TOKEN` | Your Hugging Face token | Yes for model inference & Hub tools |

	Then run a Factory rebuild of the Space.
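
Spaces expose secrets to the running app as environment variables, so a startup check along these lines can flag a missing key early. This is a minimal sketch; the `missing_secrets` helper is hypothetical, and whether `app.py` performs a similar check is an assumption.

```python
import os

# Secret names taken from the table above
REQUIRED_SECRETS = ("E2B_API_KEY", "HF_TOKEN")


def missing_secrets(env=os.environ):
    """Return the required secret names that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


missing = missing_secrets()
if missing:
    print(f"Missing secrets: {', '.join(missing)}")
else:
    print("All required secrets are set.")
```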

	### 2. Run a Task

	1. Type a task (or click 🎙️ to speak it)
	2. Hit 🚀 Let's go!
	3. Watch the agent:
	   - 🧠 Generate a plan in the left panel
	   - 🖥️ Control the sandbox desktop in real time
	   - 💰 Update cost tracking live
	   - ✅ Verify completion at the end

	## 🛡️ Sensitive Actions

	By default, the agent pauses before:
	- Payments, purchases, subscriptions
	- Sending emails/messages/posts
	- Deleting files or uninstalling software
	- Password/credit-card fields

	To disable these human-in-the-loop (HITL) pauses, enable "Auto-approve all actions" under ⚙️ Advanced Options.
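
One simple way to picture such a gate is a keyword check on the description of the pending action. The sketch below is an illustration only: the patterns, category names, and the `classify_action`/`needs_approval` helpers are assumptions, and the detection logic the agent actually uses may be quite different.

```python
import re
from typing import Optional

# Hypothetical keyword patterns, one per category listed above
SENSITIVE_PATTERNS = {
    "payment": re.compile(r"\b(pay|payment|purchase|buy|subscribe|subscription)\b", re.I),
    "messaging": re.compile(r"\b(send|email|message|post)\b", re.I),
    "deletion": re.compile(r"\b(delete|uninstall)\b", re.I),
    "credentials": re.compile(r"\b(password|credit card|card number)\b", re.I),
}


def classify_action(description: str) -> Optional[str]:
    """Return the first sensitive category the action matches, or None."""
    for category, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(description):
            return category
    return None


def needs_approval(description: str, auto_approve: bool = False) -> bool:
    """Pause for the user unless auto-approve is on or the action is harmless."""
    if auto_approve:
        return False
    return classify_action(description) is not None


print(classify_action("Click the Buy now button"))
print(needs_approval("Scroll down the page"))
```

Keyword matching is cheap but prone to false positives, which is usually the safer failure mode for an approval gate.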

	## 💰 Cost Budget

	The default budget is $2.00 USD per session. The router automatically downgrades to cheaper models as the budget is consumed. Costs are estimated from token counts and model pricing, so actual HF Inference API costs may vary.
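
To make the estimate and the downgrade behavior concrete, here is a minimal sketch. The tier names, per-token prices, and budget thresholds are invented for illustration and are not the values the router uses; only the $2.00 default budget comes from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_input: float
    usd_per_million_output: float
    min_budget_fraction: float  # tier is allowed while at least this share of budget remains


# Hypothetical tiers, ordered from most to least capable (prices are made up)
TIERS = [
    ModelTier("powerful-vision", 2.00, 6.00, 0.50),
    ModelTier("fast-vision", 0.50, 1.50, 0.20),
    ModelTier("fast-text", 0.10, 0.30, 0.00),
]


def estimate_cost(tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    total = (
        input_tokens * tier.usd_per_million_input
        + output_tokens * tier.usd_per_million_output
    )
    return total / 1_000_000


def pick_tier(spent_usd: float, budget_usd: float = 2.00) -> ModelTier:
    """Pick the most capable tier still allowed by the remaining budget."""
    remaining_fraction = max(budget_usd - spent_usd, 0.0) / budget_usd
    for tier in TIERS:
        if remaining_fraction >= tier.min_budget_fraction:
            return tier
    return TIERS[-1]


print(pick_tier(0.0).name)
print(pick_tier(1.9).name)
print(estimate_cost(TIERS[0], 100_000, 10_000))
```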

	## 🧪 Running Benchmarks

	```python
	from eval_harness import EvaluationHarness, DEFAULT_BENCHMARKS
	from app import build_session_components

	# Create harness with a factory that builds agents
	harness = EvaluationHarness(
	    agent_factory=lambda: build_session_components("eval_session", "./tmp/eval")["router"],
	    judge_model_call=lambda msgs: build_session_components("eval_session", "./tmp/eval")["router"](msgs).content,
	)

	# Run full suite
	summary = harness.run_suite(DEFAULT_BENCHMARKS, num_runs=1)
	print(f"Pass rate: {summary.passed}/{summary.total_tasks}")
	print(f"Avg score: {summary.avg_score}")
	```

	Or run a quick A/B test between two configurations:

	```python
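	# Reuses the `harness` created above; make_agent_v1 and make_agent_v2 are
	# placeholders for your own agent factories.
	results = harness.compare_strategies(
	    strategy_a_factory=make_agent_v1,
	    strategy_b_factory=make_agent_v2,
	    num_runs=3,
	)
	print(f"Winner: Strategy {results['winner']}")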
	```

	## 🎙️ Voice Input

	1. Click the microphone icon next to the task box
	2. Speak your task clearly
	3. The transcribed text appears in the task box automatically
	4. Hit Run

	Voice requires `faster-whisper` (optional dependency). If unavailable, a text fallback is provided.
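
The optional-dependency pattern described above can be sketched as follows. This is not the code in `voice_interface.py`: the `resolve_task` helper, the `"base"` model size, and the CPU/int8 settings are assumptions chosen for the example.

```python
from typing import Optional

try:
    from faster_whisper import WhisperModel  # optional dependency
    STT_AVAILABLE = True
except ImportError:
    STT_AVAILABLE = False


def resolve_task(audio_path: Optional[str], typed_text: str) -> str:
    """Return the transcribed task, or the typed text when STT cannot be used."""
    if audio_path is None or not STT_AVAILABLE:
        return typed_text
    # Model size and compute settings are illustrative choices
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)


print(resolve_task(None, "open the browser"))
```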

	## 🧩 MCP Tools Reference

	| Tool | Description |
	|------|-------------|
	| `browser_goto(url)` | Navigate browser to URL |
	| `browser_click(selector, by)` | Click by CSS/text/role |
	| `browser_fill(selector, text)` | Fill form fields |
	| `browser_find_and_click(text)` | Click by visible text |
	| `browser_extract_links()` | Get all page links as JSON |
	| `browser_extract_tables()` | Get all page tables as JSON |
	| `browser_evaluate_js(script)` | Run JS in browser context |
	| `hf_search_models(query)` | Search HF Hub for models |
	| `hf_search_datasets(query)` | Search HF Hub for datasets |
	| `hf_upload_dataset_file(...)` | Upload a file to a HF dataset |
	| `fs_read(path)` | Read a workspace file |
	| `fs_write(path, content)` | Write a workspace file |
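
To give a feel for the "tables as JSON" output, the sketch below converts HTML tables into a JSON list of rows using only the Python standard library. It is a stand-in, not the real tool: `browser_extract_tables()` runs against a live Playwright page, and its exact output shape is not documented here, so the format shown is an assumption. Nested tables are not handled.

```python
import json
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html: str) -> str:
    """Return all tables found in the HTML as a JSON string."""
    parser = TableExtractor()
    parser.feed(html)
    return json.dumps(parser.tables)


sample = (
    "<table><tr><th>Model</th><th>Price</th></tr>"
    "<tr><td>A</td><td>1</td></tr></table>"
)
print(extract_tables(sample))
```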

	## 📁 Project Structure

	```
	├── app.py              # Gradio UI + event orchestration
	├── core_agent.py       # Router, Planner, Verifier, Memory, SoM, Recorder
	├── mcp_tools.py        # Playwright, CodeExec, FileSystem, HF Hub bridges
	├── voice_interface.py  # STT + TTS with WebGPU detection
	├── eval_harness.py     # Benchmarks + LLM-as-a-Judge + A/B testing
	├── e2bqwen.py          # Original E2B vision agent (preserved)
	├── requirements.txt
	└── README.md
	```

	## 🤝 Credits

	- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
	- [E2B](https://e2b.dev) for secure sandboxed desktops
	- [Playwright](https://playwright.dev) for browser automation
	- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning
	- [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) for TTS

	## 📄 License

	Apache 2.0