	---
	title: TB2 Environment Server
	emoji: "🧪"
	colorFrom: blue
	colorTo: blue
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv
	- terminal-bench-2
	- spaces
	---

	# TB2 Environment (Terminal-Bench 2)

	OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:

	| Mode | Description | Use Case |
	|------|-------------|----------|
	| Local | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
	| Docker | Runs each task in its own container | Full TB2.0 fidelity with custom task images |

	## Quick Start

	```python
	from tbench2_env import Tbench2Env, Tbench2Action

	# Connect to a running TB2 environment server
	env = Tbench2Env(base_url="http://localhost:8000")

	# Start a task and print its instruction
	result = env.reset(task_id="headless-terminal")
	print(result.observation.instruction)

	# Run a shell command and print its output
	result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
	print(result.observation.output)

	# Run the task's tests; reward is 1.0 if they pass, 0.0 otherwise
	result = env.step(Tbench2Action(action_type="evaluate"))
	print(result.reward, result.done)

	env.close()
	```

	## Building the Docker Image

	Before using the environment, build the Docker image:

	```bash
	# From project root
	docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
	```

	## Environment Details

	### Action
	`Tbench2Action`: Controls interaction with the TB2 task session

	| Field | Type | Default | Description |
	|-------|------|---------|-------------|
	| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
	| `command` | str | `""` | Shell command or input to send |
	| `session_id` | str \| None | `None` | Session ID for streaming processes |
	| `block` | bool | `True` | Whether to block until command completes |
	| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
	| `file_path` | str | `""` | File path (for `write_file` action) |
	| `content` | str | `""` | Content to write (for `write_file` action) |

	### Observation
	`Tbench2Observation`: Contains the environment response

	| Field | Type | Description |
	|-------|------|-------------|
	| `instruction` | str | Task instruction/prompt from the TB2 task |
	| `output` | str | Command output (stdout/stderr) |
	| `success` | bool | Whether the action succeeded |
	| `error` | str | Error message if action failed |
	| `task_id` | str | Current task identifier |
	| `task_path` | str | Path to the task directory |
	| `session_id` | str \| None | Session ID for streaming processes |
	| `action_type` | str | The action type that produced this observation |
	| `info` | dict | Additional metadata |

	### State
	`Tbench2State`: Server-side state for the task session

	| Field | Type | Description |
	|-------|------|-------------|
	| `task_id` | str | Current task identifier |
	| `task_path` | str | Path to the task directory |
	| `session_id` | str | Active session ID |
	| `terminal_ready` | bool | Whether the terminal is ready for commands |
	| `last_action_type` | str | Last action type executed |
	| `last_command` | str | Last command executed |
	| `last_output` | str | Output from last command |

	## Execution Modes

	### Local Mode (Default)

	Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.

	```bash
	# Default - local mode
	python -m tbench2_env.server.app

	# Or explicitly set mode
	TB2_MODE=local python -m tbench2_env.server.app
	```

	Note: Local mode ignores the Docker image specified in each task's `task.toml`, so tasks that depend on a specific runtime environment may fail.

	### Docker Mode

	Each task runs in its own Docker container, using the image specified in the task's `task.toml`:

	```bash
	# Enable Docker mode
	TB2_MODE=docker python -m tbench2_env.server.app
	```

	Requirements:
	- Docker socket mounted at `/var/run/docker.sock`
	- Sufficient disk space for container images
	- Network access to pull images if not cached

	Environment variables for Docker mode:
	- `TB2_MODE=docker` - Enable Docker-backed execution

	The Docker socket must also be accessible to the server process (for example, as a mounted volume).
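
The socket requirement above can be checked before starting the server in Docker mode. The following is a minimal sketch using only the Python standard library; `docker_socket_available` is a hypothetical helper and is not part of `tbench2_env`. It only tests that the path is a Unix socket the current user can read and write, not that a Docker daemon is actually answering on it.

```python
import os
import stat

# Socket path taken from the requirements list above.
DOCKER_SOCKET = "/var/run/docker.sock"


def docker_socket_available(path: str = DOCKER_SOCKET) -> bool:
    """Return True if `path` is a Unix socket readable and writable by this user.

    Hypothetical preflight helper; it does not contact the Docker daemon.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, permission error on a parent directory, etc.
        return False
    return stat.S_ISSOCK(mode) and os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if docker_socket_available():
        print("Docker socket looks usable; TB2_MODE=docker should be possible.")
    else:
        print("Docker socket not usable; consider the default local mode.")
```

A check like this could run in a container entrypoint to fall back to local mode when the socket is not mounted.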

	## Action Types

	| Action | Description | Required Fields |
	|--------|-------------|-----------------|
	| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
	| `write` | Send input to a running session | `session_id`, `command` |
	| `view` | Read pending output | `session_id` |
	| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
	| `kill` | Terminate a running session | `session_id` |
	| `write_file` | Write content to a file | `file_path`, `content` |
	| `evaluate` | Run pytest tests, return reward | (none) |
	| `close` | Stop and cleanup | (none) |
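
The required fields in the table above can be checked on the client side before an action is sent. This is a minimal sketch, not part of `tbench2_env`: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the mapping simply transcribes the table. It assumes that `None` and the empty-string defaults mean "not provided", so writing an intentionally empty file would be flagged.

```python
from typing import Optional

# Required fields per action type, transcribed from the table above.
# Hypothetical helper data; not part of the tbench2_env package.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "exec": ("command",),
    "write": ("session_id", "command"),
    "view": ("session_id",),
    "wait": ("session_id",),
    "kill": ("session_id",),
    "write_file": ("file_path", "content"),
    "evaluate": (),
    "close": (),
}


def missing_fields(action_type: str, **fields: Optional[object]) -> list[str]:
    """Return the required fields that are absent or empty for `action_type`."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    # Assumption: None and "" both count as "not provided".
    return [
        name
        for name in REQUIRED_FIELDS[action_type]
        if fields.get(name) in (None, "")
    ]
```

For example, `missing_fields("write", command="print(2+2)\n")` reports that `session_id` is missing, which catches the mistake before a round trip to the server.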

	## Session IDs (Streaming Processes)

	`session_id` is only required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.

	Example (Python):
	```python
	# Start a long-running process
	env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))

	# Send input to it
	env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))

	# Read its pending output
	result = env.step(Tbench2Action(action_type="view", session_id="sess1"))
	print(result.observation.output)
	```
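
Output from a non-blocking process may not be available immediately, so a client will often poll `view` until the expected text appears. The sketch below is one way to do that and is not part of `tbench2_env`: `read_until` is a hypothetical helper, and it assumes each `view` call returns only output that has not been read yet. `view_fn` is any zero-argument callable that returns a string, for instance a lambda wrapping `env.step(Tbench2Action(action_type="view", session_id="sess1")).observation.output`.

```python
import time
from typing import Callable


def read_until(
    view_fn: Callable[[], str],
    marker: str,
    timeout: float = 10.0,
    interval: float = 0.2,
) -> str:
    """Poll `view_fn` and accumulate its output until `marker` appears.

    Hypothetical helper. Assumes every call to `view_fn` returns only
    output that was not returned by a previous call.
    Raises TimeoutError if `marker` is not seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    collected = ""
    while True:
        collected += view_fn()
        if marker in collected:
            return collected
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not seen within {timeout} seconds")
        time.sleep(interval)
```

Accumulating the chunks keeps the marker detectable even when it is split across two `view` calls.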

	## Environment Variables

	| Variable | Default | Description |
	|----------|---------|-------------|
	| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
	| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
	| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
	| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
	| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |
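
The defaults in the table above, including the way `TB2_CACHE_DIR` falls back to a path under `TB2_OUTPUT_DIR`, can be illustrated with a short sketch. `resolve_tb2_settings` is a hypothetical function written for this README and may differ from how the server actually reads its configuration; in particular, rejecting unknown `TB2_MODE` values is an assumption.

```python
import os
import posixpath
from typing import Mapping, Optional


def resolve_tb2_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> dict[str, Optional[str]]:
    """Resolve TB2 settings from environment variables using the documented defaults.

    Hypothetical sketch; the real server may resolve these values differently.
    """
    env = os.environ if environ is None else environ

    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        # Assumption: anything other than the two documented modes is an error.
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache directory defaults to $TB2_OUTPUT_DIR/repo_cache.
    cache_dir = env.get("TB2_CACHE_DIR", posixpath.join(output_dir, "repo_cache"))

    return {
        "mode": mode,
        "tasks_dir": env.get("TB2_TASKS_DIR"),  # None means auto-download
        "output_dir": output_dir,
        "cache_dir": cache_dir,
        "repo_url": env.get("TB2_REPO_URL"),  # None means the built-in default URL
    }
```

Passing a plain dictionary instead of reading `os.environ` makes the resolution easy to inspect, for example `resolve_tb2_settings({"TB2_OUTPUT_DIR": "/data"})`.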

	## Reward

	Binary reward on `evaluate` action:
	- `1.0` - All pytest tests pass (exit code 0)
	- `0.0` - Tests fail (non-zero exit code)

	Intermediate steps return `reward=None`.
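
The reward rule above can be written as a small function. This is a sketch of the documented behaviour rather than the server's actual code, and `reward_for` is a hypothetical name. Because any non-zero exit code maps to `0.0`, pytest's exit code 5 (no tests collected) also yields no reward under this rule.

```python
from typing import Optional


def reward_for(action_type: str, pytest_exit_code: Optional[int] = None) -> Optional[float]:
    """Map an action and pytest exit code to the documented binary reward.

    Hypothetical sketch of the rule described above, not the server code.
    """
    if action_type != "evaluate":
        # Intermediate steps carry no reward.
        return None
    if pytest_exit_code is None:
        raise ValueError("an evaluate action needs a pytest exit code")
    # Exit code 0 means all tests passed; anything else counts as failure.
    return 1.0 if pytest_exit_code == 0 else 0.0
```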

	## Running the Server

	```bash
	# Install dependencies
	uv sync --all-extras

	# Local mode (default, for Spaces)
	python -m tbench2_env.server.app --port 8000

	# Docker mode (full TB2.0 compatibility)
	TB2_MODE=docker python -m tbench2_env.server.app --port 8000

	# With local TB2 repo
	TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
	```

	## Project Structure

	```
	tbench2_env/
	├── __init__.py                       # Module exports (Tbench2Env, Tbench2Action, etc.)
	├── README.md                         # This file
	├── client.py                         # Tbench2Env client implementation
	├── models.py                         # Tbench2Action, Tbench2Observation, Tbench2State
	├── openenv.yaml                      # OpenEnv configuration
	├── pyproject.toml                    # Package dependencies
	└── server/
	    ├── __init__.py                   # Server exports
	    ├── app.py                        # FastAPI application
	    ├── tbench2_env_environment.py    # Core environment logic
	    └── Dockerfile                    # Container image definition
	```