---
title: TB2 Environment Server
emoji: "🧪"
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - terminal-bench-2
  - spaces
---

# TB2 Environment (Terminal-Bench 2)

OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| **Local** | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| **Docker** | Runs each task in its own container | Full TB2.0 fidelity with custom task images |

## Quick Start

```python
from tbench2_env import Tbench2Env, Tbench2Action

env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)

result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)

result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)

env.close()
```

## Building the Docker Image

Before using the environment, build the Docker image:

```bash
# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
```

## Environment Details

### Action

**Tbench2Action**: Controls interaction with the TB2 task session

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
| `command` | str | `""` | Shell command or input to send |
| `session_id` | str \| None | `None` | Session ID for streaming processes |
| `block` | bool | `True` | Whether to block until command completes |
| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
| `file_path` | str | `""` | File path (for `write_file` action) |
| `content` | str | `""` | Content to write (for `write_file` action) |

### Observation

**Tbench2Observation**: Contains the environment response

| Field | Type | Description |
|-------|------|-------------|
| `instruction` | str | Task instruction/prompt from the TB2 task |
| `output` | str | Command output (stdout/stderr) |
| `success` | bool | Whether the action succeeded |
| `error` | str | Error message if action failed |
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str \| None | Session ID for streaming processes |
| `action_type` | str | The action type that produced this observation |
| `info` | dict | Additional metadata |

### State

**Tbench2State**: Server-side state for the task session

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str | Active session ID |
| `terminal_ready` | bool | Whether the terminal is ready for commands |
| `last_action_type` | str | Last action type executed |
| `last_command` | str | Last command executed |
| `last_output` | str | Output from last command |

## Execution Modes

### Local Mode (Default)

Commands execute directly in the server process.
Ideal for HF Spaces where Docker-in-Docker is unavailable.

```bash
# Default - local mode
python -m tbench2_env.server.app

# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app
```

**Note:** Local mode ignores Docker images specified in `task.toml`. Tasks requiring specific runtime environments may fail.

### Docker Mode

Each task runs in its own Docker container, using the image specified in the task's `task.toml`:

```bash
# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app
```

**Requirements:**

- Docker socket mounted at `/var/run/docker.sock`
- Sufficient disk space for container images
- Network access to pull images if not cached

**Environment Variables for Docker Mode:**

- `TB2_MODE=docker` - Enable Docker-backed execution
- Docker socket must be accessible (mounted volume)

## Action Types

| Action | Description | Required Fields |
|--------|-------------|-----------------|
| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
| `write` | Send input to a running session | `session_id`, `command` |
| `view` | Read pending output | `session_id` |
| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
| `kill` | Terminate a running session | `session_id` |
| `write_file` | Write content to a file | `file_path`, `content` |
| `evaluate` | Run pytest tests, return reward | (none) |
| `close` | Stop and clean up | (none) |

## Session IDs (Streaming Processes)

`session_id` is **only** required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.

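
The required-field rules from the Action Types table, including the `session_id` rule for streaming actions, can be checked on the client before an action is sent. The following is a minimal sketch and not part of `tbench2_env`: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and treating an empty string as "not supplied" is an assumption based on the `""` defaults of `Tbench2Action`.

```python
from typing import Any, Dict, List

# Required fields per action type, transcribed from the Action Types table.
# Optional fields (block, wait_seconds, session_id for exec) are left out.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "exec": ["command"],
    "write": ["session_id", "command"],
    "view": ["session_id"],
    "wait": ["session_id"],
    "kill": ["session_id"],
    "write_file": ["file_path", "content"],
    "evaluate": [],
    "close": [],
}


def missing_fields(action_type: str, **fields: Any) -> List[str]:
    """Return the required fields that are absent or empty for this action type.

    Hypothetical helper: None and "" both count as "not supplied" here.
    """
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        name
        for name in REQUIRED_FIELDS[action_type]
        if fields.get(name) in (None, "")
    ]
```

Calling `missing_fields("write", command="print(2+2)\n")` returns `["session_id"]`, which flags the omission before a request reaches the server. Because empty strings count as missing, this check would also reject a `write_file` action that intentionally writes an empty file.
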

Example (Python):

```python
# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))

# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))

# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |

## Reward

Binary reward on `evaluate` action:

- `1.0` - All pytest tests pass (exit code 0)
- `0.0` - Tests fail (non-zero exit code)

Intermediate steps return `reward=None`.

## Running the Server

```bash
# Install dependencies
uv sync --all-extras

# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000

# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000

# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
```

## Project Structure

```
tbench2_env/
├── __init__.py                     # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md                       # This file
├── client.py                       # Tbench2Env client implementation
├── models.py                       # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml                    # OpenEnv configuration
├── pyproject.toml                  # Package dependencies
└── server/
    ├── __init__.py                 # Server exports
    ├── app.py                      # FastAPI application
    ├── tbench2_env_environment.py  # Core environment logic
    └── Dockerfile                  # Container image definition
```
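
The defaults in the Environment Variables table, in particular `TB2_CACHE_DIR` falling back to a subdirectory of `TB2_OUTPUT_DIR`, can be illustrated with a short sketch. This is not the server's actual configuration code: `Tb2Settings` and `load_settings` are hypothetical names, and rejecting unknown `TB2_MODE` values is an assumption. `TB2_TASKS_DIR` and `TB2_REPO_URL` are passed through as `None` when unset, since their defaults (auto-download, GitHub zip) are resolved by the server.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Tb2Settings:
    """Hypothetical container for the documented TB2_* variables."""

    mode: str
    tasks_dir: Optional[str]
    output_dir: str
    cache_dir: str
    repo_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Tb2Settings:
    """Resolve TB2_* variables using the defaults from the table above."""
    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        # Assumption: the real server may handle unknown modes differently.
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache directory defaults to $TB2_OUTPUT_DIR/repo_cache.
    cache_dir = env.get("TB2_CACHE_DIR", f"{output_dir}/repo_cache")

    return Tb2Settings(
        mode=mode,
        tasks_dir=env.get("TB2_TASKS_DIR"),  # None means auto-download
        output_dir=output_dir,
        cache_dir=cache_dir,
        repo_url=env.get("TB2_REPO_URL"),  # None means the server default
    )
```

Passing a plain dict instead of `os.environ` makes the resolution easy to inspect: `load_settings({"TB2_OUTPUT_DIR": "/data/runs"}).cache_dir` evaluates to `"/data/runs/repo_cache"`.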