---
title: TB2 Environment Server
emoji: "🧪"
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- terminal-bench-2
- spaces
---
# TB2 Environment (Terminal-Bench 2)
OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Local** | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| **Docker** | Runs each task in its own container | Full TB2.0 fidelity with custom task images |
## Quick Start
```python
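from tbench2_env import Tbench2Env, Tbench2Action

# Connect to a running TB2 environment server
env = Tbench2Env(base_url="http://localhost:8000")

# Start a task and print its instruction
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)
# Run a shell command inside the task environment
result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)
# Run the task's tests; the reward is set only on this step
result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)

env.close()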
```
## Building the Docker Image
Before using the environment, build the Docker image:
```bash
# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
```
## Environment Details
### Action
**Tbench2Action**: Controls interaction with the TB2 task session
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
| `command` | str | `""` | Shell command or input to send |
| `session_id` | str \| None | `None` | Session ID for streaming processes |
| `block` | bool | `True` | Whether to block until command completes |
| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
| `file_path` | str | `""` | File path (for `write_file` action) |
| `content` | str | `""` | Content to write (for `write_file` action) |
### Observation
**Tbench2Observation**: Contains the environment response
| Field | Type | Description |
|-------|------|-------------|
| `instruction` | str | Task instruction/prompt from the TB2 task |
| `output` | str | Command output (stdout/stderr) |
| `success` | bool | Whether the action succeeded |
| `error` | str | Error message if action failed |
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str \| None | Session ID for streaming processes |
| `action_type` | str | The action type that produced this observation |
| `info` | dict | Additional metadata |
### State
**Tbench2State**: Server-side state for the task session
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str | Active session ID |
| `terminal_ready` | bool | Whether the terminal is ready for commands |
| `last_action_type` | str | Last action type executed |
| `last_command` | str | Last command executed |
| `last_output` | str | Output from last command |
## Execution Modes
### Local Mode (Default)
Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.
```bash
# Default - local mode
python -m tbench2_env.server.app
# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app
```
**Note:** Local mode ignores the Docker image specified in a task's `task.toml`. Tasks that require a specific runtime environment may fail.
### Docker Mode
Each task runs in its own Docker container, using the image specified in the task's `task.toml`:
```bash
# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app
```
**Requirements:**
- Docker socket mounted at `/var/run/docker.sock`
- Sufficient disk space for container images
- Network access to pull images if not cached
**Environment Variables for Docker Mode:**
- `TB2_MODE=docker` - Enable Docker-backed execution
- Docker socket must be accessible (mounted volume)
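
The socket requirement above can be checked before starting the server in Docker mode. The following is a minimal sketch, not part of this package: `docker_socket_available` is a hypothetical helper that only tests whether a Unix socket exists at the given path. It does not confirm that the Docker daemon is reachable or that the process has permission to use it.

```python
import os
import stat


def docker_socket_available(path: str = "/var/run/docker.sock") -> bool:
    """Return True if `path` exists and is a Unix domain socket.

    Hypothetical preflight helper; it does not contact the Docker daemon.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Path is missing or cannot be inspected
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    if not docker_socket_available():
        print("Docker socket not found; TB2_MODE=docker is unlikely to work here.")
```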
## Action Types
| Action | Description | Required Fields |
|--------|-------------|-----------------|
| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
| `write` | Send input to a running session | `session_id`, `command` |
| `view` | Read pending output | `session_id` |
| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
| `kill` | Terminate a running session | `session_id` |
| `write_file` | Write content to a file | `file_path`, `content` |
| `evaluate` | Run pytest tests, return reward | (none) |
| `close` | Stop and cleanup | (none) |
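
The table above can be turned into a small client-side check that catches incomplete actions before they are sent. This is a sketch under assumptions: it works on plain dictionaries rather than on `Tbench2Action`, the names `REQUIRED_FIELDS` and `missing_fields` are hypothetical, and it treats empty strings as missing because the documented defaults are `""`. The server may validate differently.

```python
from typing import Any, Dict, List

# Required fields per action type, transcribed from the table above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "exec": ["command"],
    "write": ["session_id", "command"],
    "view": ["session_id"],
    "wait": ["session_id"],
    "kill": ["session_id"],
    "write_file": ["file_path", "content"],
    "evaluate": [],
    "close": [],
}


def missing_fields(action: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty for this action."""
    # "exec" is the documented default action type
    action_type = action.get("action_type", "exec")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    # None and "" both count as missing (assumption; this also rejects
    # writing an empty file with write_file)
    return [
        field
        for field in REQUIRED_FIELDS[action_type]
        if action.get(field) in (None, "")
    ]
```

For example, `missing_fields({"action_type": "write", "command": "x"})` reports `session_id` as missing, while an `evaluate` action needs no fields at all.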
## Session IDs (Streaming Processes)
`session_id` is **only** required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.
Example (Python):
```python
# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))
# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))
# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
```
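
Output from a non-blocking process may not be available immediately, so a client will often poll `view` a few times. The helper below is a sketch of such a loop. It assumes that each `view` returns only output not yet read (the action is described as reading pending output) and that `observation.output` is a string. `view_until` and its parameters are hypothetical names, not part of the client.

```python
import time
from typing import Any, Callable


def view_until(
    env: Any,
    make_view_action: Callable[[], Any],
    marker: str,
    max_polls: int = 10,
    delay: float = 0.5,
) -> str:
    """Poll `view` until `marker` appears in the accumulated output.

    `env` is any object with a `step(action)` method whose result exposes
    `observation.output`; `make_view_action` builds a fresh view action.
    """
    collected = ""
    for _ in range(max_polls):
        result = env.step(make_view_action())
        collected += result.observation.output or ""
        if marker in collected:
            return collected
        time.sleep(delay)
    raise TimeoutError(f"{marker!r} not seen after {max_polls} polls")
```

With the session started above, a call could look like `view_until(env, lambda: Tbench2Action(action_type="view", session_id="sess1"), marker="4")`.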
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |
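
The defaults in the table above, including the way `TB2_CACHE_DIR` falls back to a subdirectory of `TB2_OUTPUT_DIR`, can be sketched as follows. This illustrates the documented defaults and is not the server's actual code. `resolve_tb2_config` is a hypothetical name, and rejecting unknown modes with an exception is an assumption.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_tb2_config(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Resolve TB2 settings from environment variables using the documented defaults."""
    env = os.environ if environ is None else environ

    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        # Assumption: the real server may handle unknown modes differently
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache directory defaults to a subdirectory of the output directory
    cache_dir = env.get("TB2_CACHE_DIR", f"{output_dir}/repo_cache")

    return {
        "mode": mode,
        "tasks_dir": env.get("TB2_TASKS_DIR"),  # None means auto-download
        "output_dir": output_dir,
        "cache_dir": cache_dir,
        "repo_url": env.get("TB2_REPO_URL"),  # None means the built-in default URL
    }
```

Note that overriding only `TB2_OUTPUT_DIR` also moves the cache directory, unless `TB2_CACHE_DIR` is set explicitly.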
## Reward
The `evaluate` action returns a binary reward:
- `1.0` - All pytest tests pass (exit code 0)
- `0.0` - Tests fail (non-zero exit code)
Intermediate steps return `reward=None`.
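
The reward rule above reduces to a mapping from the pytest exit code to a float. The sketch below restates that rule. `reward_for` is a hypothetical name, not a function exported by this package.

```python
from typing import Optional


def reward_for(action_type: str, pytest_exit_code: Optional[int] = None) -> Optional[float]:
    """Map an action and its pytest exit code to the documented reward."""
    if action_type != "evaluate":
        # Intermediate steps carry no reward
        return None
    if pytest_exit_code is None:
        raise ValueError("evaluate requires a pytest exit code")
    # Exit code 0 means all tests passed; anything else counts as failure
    return 1.0 if pytest_exit_code == 0 else 0.0
```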
## Running the Server
```bash
# Install dependencies
uv sync --all-extras
# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000
# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000
# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
```
## Project Structure
```
tbench2_env/
├── __init__.py          # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md            # This file
├── client.py            # Tbench2Env client implementation
├── models.py            # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml         # OpenEnv configuration
├── pyproject.toml       # Package dependencies
└── server/
    ├── __init__.py                  # Server exports
    ├── app.py                       # FastAPI application
    ├── tbench2_env_environment.py   # Core environment logic
    └── Dockerfile                   # Container image definition
```