---
title: TB2 Environment Server
emoji: "🧪"
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- terminal-bench-2
- spaces
---
# TB2 Environment (Terminal-Bench 2)
OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Local** | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| **Docker** | Runs each task in its own container | Full TB2.0 fidelity with custom task images |
## Quick Start
```python
from tbench2_env import Tbench2Env, Tbench2Action
env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)
result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)
result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)
env.close()
```
## Building the Docker Image
Before using the environment, build the Docker image:
```bash
# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
```
## Environment Details
### Action
**Tbench2Action**: Controls interaction with the TB2 task session
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
| `command` | str | `""` | Shell command or input to send |
| `session_id` | str \| None | `None` | Session ID for streaming processes |
| `block` | bool | `True` | Whether to block until command completes |
| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
| `file_path` | str | `""` | File path (for `write_file` action) |
| `content` | str | `""` | Content to write (for `write_file` action) |
### Observation
**Tbench2Observation**: Contains the environment response
| Field | Type | Description |
|-------|------|-------------|
| `instruction` | str | Task instruction/prompt from the TB2 task |
| `output` | str | Command output (stdout/stderr) |
| `success` | bool | Whether the action succeeded |
| `error` | str | Error message if action failed |
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str \| None | Session ID for streaming processes |
| `action_type` | str | The action type that produced this observation |
| `info` | dict | Additional metadata |
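The `success` and `error` fields make it possible to fail fast instead of silently continuing after a failed action. A minimal sketch, assuming only the observation fields listed in the table above; `ActionFailed` and `require_success` are hypothetical names, not part of the package:

```python
from typing import Any


class ActionFailed(RuntimeError):
    """Raised when an observation reports success=False (hypothetical helper type)."""


def require_success(observation: Any) -> str:
    """Return observation.output, or raise ActionFailed carrying observation.error.

    Works with any object exposing the documented `success`, `error`,
    `output`, and `action_type` attributes.
    """
    if not observation.success:
        detail = observation.error or "no error message provided"
        raise ActionFailed(f"{observation.action_type} failed: {detail}")
    return observation.output


# Usage with the client from the Quick Start (not executed here):
# result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
# print(require_success(result.observation))
```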
### State
**Tbench2State**: Server-side state for the task session
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str | Active session ID |
| `terminal_ready` | bool | Whether the terminal is ready for commands |
| `last_action_type` | str | Last action type executed |
| `last_command` | str | Last command executed |
| `last_output` | str | Output from last command |
## Execution Modes
### Local Mode (Default)
Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.
```bash
# Default - local mode
python -m tbench2_env.server.app
# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app
```
**Note:** Local mode ignores the Docker image specified in a task's `task.toml`, so tasks that require a specific runtime environment may fail.
### Docker Mode
Each task runs in its own Docker container, using the image specified in the task's `task.toml`:
```bash
# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app
```
**Requirements:**
- Docker socket mounted at `/var/run/docker.sock`
- Sufficient disk space for container images
- Network access to pull images if not cached
**Configuration for Docker Mode:**
- `TB2_MODE=docker` - Enable Docker-backed execution
- The Docker socket must be accessible to the server process (for example, as a mounted volume)
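Two of the Docker mode prerequisites can be checked before launching the server. The following is a minimal sketch rather than anything the server itself does: `docker_mode_preflight` is a hypothetical helper, and it relies only on the `TB2_MODE` variable and the `/var/run/docker.sock` path named above. It does not check disk space or network access.

```python
import os
import stat
from typing import List, Mapping


def docker_mode_preflight(
    environ: Mapping[str, str],
    socket_path: str = "/var/run/docker.sock",
) -> List[str]:
    """Return human-readable problems that would prevent Docker mode from working.

    Hypothetical helper for illustration; an empty list means both checks passed.
    """
    problems: List[str] = []

    # TB2_MODE defaults to "local" when unset.
    mode = environ.get("TB2_MODE", "local")
    if mode != "docker":
        problems.append(f"TB2_MODE is {mode!r}, expected 'docker'")

    # The socket path must exist and be a Unix socket.
    try:
        st = os.stat(socket_path)
    except OSError:
        problems.append(f"Docker socket not found at {socket_path}")
    else:
        if not stat.S_ISSOCK(st.st_mode):
            problems.append(f"{socket_path} exists but is not a socket")

    return problems
```

Calling `docker_mode_preflight(os.environ)` before starting the server and printing the returned list gives an early, readable failure instead of an error on the first `reset`.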
## Action Types
| Action | Description | Required Fields |
|--------|-------------|-----------------|
| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
| `write` | Send input to a running session | `session_id`, `command` |
| `view` | Read pending output | `session_id` |
| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
| `kill` | Terminate a running session | `session_id` |
| `write_file` | Write content to a file | `file_path`, `content` |
| `evaluate` | Run pytest tests, return reward | (none) |
| `close` | Stop the session and clean up | (none) |
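The required-fields column above can be turned into a client-side check that runs before an action is sent. This is a sketch under stated assumptions: `ActionSketch` is a local stand-in that mirrors the documented `Tbench2Action` fields and defaults, and `missing_fields` is a hypothetical helper. It treats an empty string as missing, which may be stricter than the server (for example, when writing an empty file).

```python
from dataclasses import dataclass
from typing import List, Optional

# Required fields per action type, transcribed from the table above.
REQUIRED_FIELDS = {
    "exec": ("command",),
    "write": ("session_id", "command"),
    "view": ("session_id",),
    "wait": ("session_id",),
    "kill": ("session_id",),
    "write_file": ("file_path", "content"),
    "evaluate": (),
    "close": (),
}


@dataclass
class ActionSketch:
    """Local stand-in mirroring the documented Tbench2Action fields and defaults."""

    action_type: str = "exec"
    command: str = ""
    session_id: Optional[str] = None
    block: bool = True
    wait_seconds: Optional[float] = None
    file_path: str = ""
    content: str = ""


def missing_fields(action: ActionSketch) -> List[str]:
    """Return the required fields that are empty or None for this action type."""
    if action.action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action.action_type!r}")
    return [
        name
        for name in REQUIRED_FIELDS[action.action_type]
        if getattr(action, name) in ("", None)
    ]
```

For instance, a `write` action built without a `session_id` yields `["session_id"]`, while a bare `evaluate` action yields an empty list.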
## Session IDs (Streaming Processes)
`session_id` is **only** required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.
Example (Python):
```python
# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))
# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))
# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
```
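Because a non-blocking process produces output asynchronously, a single `view` call may return before the expected text has arrived. One way to handle this is to poll until a condition holds. The sketch below is a hypothetical helper, not part of the client; it assumes each `view` returns only output that has not been read yet, which the "pending output" wording suggests but this README does not guarantee.

```python
import time
from typing import Callable


def poll_session(
    view: Callable[[], str],
    is_done: Callable[[str], bool],
    timeout: float = 10.0,
    interval: float = 0.5,
) -> str:
    """Call view() repeatedly, accumulating output until is_done(output) is true.

    Raises TimeoutError if the condition is still false once `timeout` seconds
    have elapsed. view() is always called at least once.
    """
    deadline = time.monotonic() + timeout
    collected = ""
    while True:
        collected += view()
        if is_done(collected):
            return collected
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"condition not met after {timeout} s; output so far: {collected!r}"
            )
        time.sleep(interval)


# Usage with the session started above (not executed here):
# view = lambda: env.step(
#     Tbench2Action(action_type="view", session_id="sess1")
# ).observation.output
# text = poll_session(view, lambda out: "4" in out)
```

Passing `view` as a callable keeps the helper independent of the client class, so it can be exercised with a fake output source in tests.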
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |
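The defaults in this table interact: `TB2_CACHE_DIR` follows `TB2_OUTPUT_DIR` unless it is set explicitly. The sketch below illustrates that resolution order. `Tb2Settings` and `load_settings` are hypothetical names, the server's actual parsing may differ, and rejecting unknown modes is an assumption. Unset `TB2_TASKS_DIR` and `TB2_REPO_URL` are represented as `None`, meaning "use the server's built-in behavior".

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Tb2Settings:
    """Resolved view of the documented TB2_* variables (hypothetical container)."""

    mode: str
    tasks_dir: Optional[str]
    output_dir: str
    cache_dir: str
    repo_url: Optional[str]


def load_settings(environ: Mapping[str, str]) -> Tb2Settings:
    """Apply the documented defaults to a mapping such as os.environ."""
    mode = environ.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        # Assumption: this sketch rejects unknown modes; the server may not.
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = environ.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache directory defaults to a subdirectory of the output directory.
    cache_dir = environ.get("TB2_CACHE_DIR", f"{output_dir}/repo_cache")

    return Tb2Settings(
        mode=mode,
        tasks_dir=environ.get("TB2_TASKS_DIR"),
        output_dir=output_dir,
        cache_dir=cache_dir,
        repo_url=environ.get("TB2_REPO_URL"),
    )
```

Taking a mapping rather than reading `os.environ` directly keeps the function easy to test; in normal use it would be called as `load_settings(os.environ)`.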
## Reward
The `evaluate` action returns a binary reward:
- `1.0` - All pytest tests pass (exit code 0)
- `0.0` - Tests fail (non-zero exit code)
Intermediate steps return `reward=None`.
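The reward rule above can be written as a small function. This is an illustrative sketch only: `reward_for_step` is a hypothetical name, and the server computes the reward internally from the pytest run rather than exposing such a function.

```python
from typing import Optional


def reward_for_step(
    action_type: str, pytest_exit_code: Optional[int] = None
) -> Optional[float]:
    """Mirror the documented reward rule.

    Non-evaluate actions yield None. For `evaluate`, pytest exit code 0
    maps to 1.0 and any other exit code maps to 0.0.
    """
    if action_type != "evaluate":
        return None
    if pytest_exit_code is None:
        raise ValueError("an evaluate step needs a pytest exit code")
    return 1.0 if pytest_exit_code == 0 else 0.0
```

On the client side, this means `result.reward` should be compared against `None` before it is used in arithmetic, since every step other than `evaluate` carries no reward.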
## Running the Server
```bash
# Install dependencies
uv sync --all-extras
# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000
# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000
# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
```
## Project Structure
```
tbench2_env/
├── __init__.py            # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md              # This file
├── client.py              # Tbench2Env client implementation
├── models.py              # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml           # OpenEnv configuration
├── pyproject.toml         # Package dependencies
└── server/
    ├── __init__.py                  # Server exports
    ├── app.py                       # FastAPI application
    ├── tbench2_env_environment.py   # Core environment logic
    └── Dockerfile                   # Container image definition
```