jkorstad
/

computer-agent-v2


	---
	title: Computer Agent v2.0
	emoji: 🤖
	colorFrom: purple
	colorTo: blue
	sdk: gradio
	sdk_version: "5.0.0"
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: "Enhanced universal computer agent with planner, MCP, memory & voice"
	---

	# 🤖 Open Computer Agent v2.0

	An enhanced universal computer-use agent built on [smolagents](https://github.com/huggingface/smolagents), [E2B Desktop](https://e2b.dev), and [Playwright](https://playwright.dev).

	## What's New in v2.0

	| Feature | Description |
	|---------|-------------|
	| 🧠 Hierarchical Planner | Breaks goals into subtasks before execution using a cheap text model |
	| 🔌 Playwright MCP | Semantic browser control (click by text/role, extract tables/links, evaluate JS) |
	| 🎯 Multi-Model Router | Auto-selects the cheapest capable model (fast vision ↔ powerful vision ↔ fast text ↔ powerful text) |
	| 🧩 Set-of-Marks Vision | Overlays numbered bounding boxes on UI elements for coordinate-free interaction |
	| 🗄️ Long-Term Memory | ChromaDB vector store retrieves similar past tasks and proven strategies |
	| 🔍 Verifier Agent | Checks subtask completion and triggers recovery loops |
	| 🛑 Human-in-the-Loop | Pauses on sensitive actions (payments, emails, deletes) for user approval |
	| 🎙️ Voice I/O | Speak tasks and hear responses via Whisper STT + Kokoro TTS |
	| 💰 Cost Dashboard | Real-time $/task, token usage, and latency tracking |
	| 📹 Session Recording | Saves every step as replayable macros with GIF/MP4 export potential |
	| 🧪 Enhanced Eval | Built-in benchmark suite with LLM-as-a-Judge grading and A/B testing |
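
To make the Set-of-Marks row concrete, here is a minimal sketch of the bookkeeping behind "coordinate-free interaction": detected UI elements get numbered marks, the model answers with a mark number, and the number is translated back into a click position. The `Box` type and the `assign_marks` and `resolve_click` helpers are hypothetical names for illustration and are not taken from this project's code; element detection and the overlay drawing are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self):
        # Integer midpoint of the box, used as the click target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes):
    """Number boxes in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def resolve_click(marks, mark_id):
    """Translate a 'click mark N' answer into pixel coordinates."""
    if mark_id not in marks:
        raise KeyError(f"unknown mark {mark_id}")
    return marks[mark_id].center()
```

With this shape, the vision model only has to name a mark such as `2`; the pixel arithmetic stays in ordinary code.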

	## Architecture

	```
	User Input (Text / Voice / File)
	        |
	        v
	[Intelligence Router] ----> Planner (JSON DAG)
	        |
	        v
	[Memory Retrieval] (ChromaDB)
	        |
	        v
	[Plan Executor]
	        |
	        +---> [Browser Sub-Agent] (Playwright MCP)
	        +---> [Desktop Sub-Agent] (E2B + SoM Vision)
	        +---> [Coder Sub-Agent] (Code Interpreter)
	        +---> [HF Hub Sub-Agent] (Search / Upload)
	        |
	        v
	[Verifier] -> Retry / Alternative / Continue
	        |
	        v
	[Macro Saver] + Cost Report + Session Recording
	```
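
The plan-execute-verify portion of the diagram can be sketched as follows. This is an illustration under stated assumptions, not the project's actual executor: the plan schema (a list of dicts with `id` and optional `depends_on`), the `run_plan` function, and the `execute` and `verify` callbacks are all hypothetical. It uses the standard library's `graphlib.TopologicalSorter` to walk the DAG in dependency order and retries a subtask when verification fails.

```python
from graphlib import TopologicalSorter


def run_plan(plan, execute, verify, max_retries=2):
    """Run subtasks in dependency order, retrying until verify() passes.

    plan    -- list of dicts like {"id": "b", "depends_on": ["a"]} (assumed schema)
    execute -- callable(task, attempt) -> output (stands in for a sub-agent)
    verify  -- callable(task, output) -> bool (stands in for the verifier)
    """
    tasks = {task["id"]: task for task in plan}
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {task["id"]: task.get("depends_on", []) for task in plan}
    results = {}
    for task_id in TopologicalSorter(graph).static_order():
        task = tasks[task_id]
        for attempt in range(1 + max_retries):
            output = execute(task, attempt)
            if verify(task, output):
                results[task_id] = output
                break
        else:
            # Every attempt failed verification; stop instead of continuing.
            raise RuntimeError(f"subtask {task_id!r} failed verification")
    return results
```

A fuller version would try an alternative strategy before giving up, matching the "Retry / Alternative / Continue" branch; the sketch only covers retry and continue.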

	## Quick Start

	1. Set `HF_TOKEN` and `E2B_API_KEY` in the Space secrets.
	2. Type a task (or speak it) and click the "🚀 Let's go!" button.
	3. Watch the agent plan, execute, verify, and report costs.
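
Space secrets are exposed to the app as environment variables, so a startup check along these lines can catch a missing key before the first task runs. This is a minimal sketch; the `missing_secrets` helper is a hypothetical name and is not part of `app.py`.

```python
import os

# The two secrets named in step 1 of the Quick Start.
REQUIRED_SECRETS = ("HF_TOKEN", "E2B_API_KEY")


def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing secrets: " + ", ".join(missing))
    else:
        print("All required secrets are set.")
```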

	## Sensitive Actions

	By default, the agent pauses before:
	- Payments, purchases, subscriptions
	- Sending emails/messages/posts
	- Deleting files or uninstalling software
	- Password/credit-card fields

	Enable "Auto-approve all actions" in Advanced Options to skip these human-in-the-loop (HITL) pauses.
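
One way such a gate could work is a keyword check on the description of the next action. The sketch below is an assumption for illustration only: the pattern lists, category names, and the `classify` and `needs_approval` functions are hypothetical, and the project may detect sensitive actions differently. The patterns deliberately over-match (for example, "post" also matches "postpone"), on the view that an unnecessary pause costs less than a missed one.

```python
import re

# Hypothetical keyword patterns, one per category listed above.
SENSITIVE_PATTERNS = {
    "payment": re.compile(r"\b(pay|purchase|buy|checkout|subscri)\w*", re.I),
    "message": re.compile(r"\b(send|email|post|publish)\w*", re.I),
    "delete": re.compile(r"\b(delete|remove|uninstall)\w*", re.I),
    "credential": re.compile(r"\b(password|credit[- ]?card|cvv)\w*", re.I),
}


def classify(action_description):
    """Return the sensitive categories that the description matches."""
    return [
        name
        for name, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(action_description)
    ]


def needs_approval(action_description, auto_approve=False):
    """True when the agent should pause and ask the user first."""
    if auto_approve:
        return False
    return bool(classify(action_description))
```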

	## Cost Budget

	The default budget is $2.00 USD per session. As spending approaches that limit, the router automatically downgrades to cheaper models.
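
A budget-aware downgrade could look roughly like the sketch below. Everything except the $2.00 default is an assumption: the tier names echo the router description, but the per-call costs, the 50% and 20% thresholds, and the `pick_tier` function are hypothetical, and the real router also weighs task type (vision versus text), which is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_call_usd: float  # hypothetical flat estimate


# Ordered from most to least capable (and most to least expensive).
TIERS = [
    ModelTier("powerful-vision", 0.05),
    ModelTier("fast-vision", 0.01),
    ModelTier("fast-text", 0.002),
]


def pick_tier(spent_usd, budget_usd=2.00, tiers=TIERS):
    """Pick a model tier from the fraction of the budget still unspent."""
    remaining = max(0.0, 1.0 - spent_usd / budget_usd)
    if remaining <= 0.0:
        return None  # budget exhausted: caller should stop or ask the user
    if remaining > 0.5:
        return tiers[0]
    if remaining > 0.2:
        return tiers[1]
    return tiers[-1]
```

Returning `None` at zero budget keeps the stop decision with the caller rather than silently running the cheapest model forever.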

	## Benchmarks

	Run the built-in eval suite:
	```python
	from eval_harness import EvaluationHarness
	# See eval_harness.py for usage
	```

	## Credits

	- [smolagents](https://github.com/huggingface/smolagents) by Hugging Face
	- [E2B](https://e2b.dev) for secure sandboxed desktops
	- [Playwright](https://playwright.dev) for browser automation
	- [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct) for vision reasoning