---
title: TB2 Environment Server
emoji: "🧪"
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - terminal-bench-2
  - spaces
---

# TB2 Environment (Terminal-Bench 2)

OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| **Local** | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| **Docker** | Runs each task in its own container | Full TB2.0 fidelity with custom task images |

## Quick Start

```python
from tbench2_env import Tbench2Env, Tbench2Action

env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)

result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)

result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)

env.close()
```

## Building the Docker Image

Before using the environment, build the Docker image:

```bash
# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
```

## Environment Details

### Action
**Tbench2Action**: Controls interaction with the TB2 task session

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
| `command` | str | `""` | Shell command or input to send |
| `session_id` | str \| None | `None` | Session ID for streaming processes |
| `block` | bool | `True` | Whether to block until command completes |
| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
| `file_path` | str | `""` | File path (for `write_file` action) |
| `content` | str | `""` | Content to write (for `write_file` action) |

### Observation
**Tbench2Observation**: Contains the environment response

| Field | Type | Description |
|-------|------|-------------|
| `instruction` | str | Task instruction/prompt from the TB2 task |
| `output` | str | Command output (stdout/stderr) |
| `success` | bool | Whether the action succeeded |
| `error` | str | Error message if action failed |
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str \| None | Session ID for streaming processes |
| `action_type` | str | The action type that produced this observation |
| `info` | dict | Additional metadata |

### State
**Tbench2State**: Server-side state for the task session

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str | Active session ID |
| `terminal_ready` | bool | Whether the terminal is ready for commands |
| `last_action_type` | str | Last action type executed |
| `last_command` | str | Last command executed |
| `last_output` | str | Output from last command |

## Execution Modes

### Local Mode (Default)

Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.

```bash
# Default - local mode
python -m tbench2_env.server.app

# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app
```

**Note:** Local mode ignores Docker images specified in `task.toml`. Tasks requiring specific runtime environments may fail.

### Docker Mode

Each task runs in its own Docker container, using the image specified in the task's `task.toml`:

```bash
# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app
```

**Requirements:**
- Docker socket mounted at `/var/run/docker.sock`
- Sufficient disk space for container images
- Network access to pull images if not cached

**Environment Variables for Docker Mode:**
- `TB2_MODE=docker` - Enable Docker-backed execution
- No other variable is needed, but the Docker socket listed under Requirements must be accessible to the server (for example, as a mounted volume)
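
As a rough pre-flight check for the socket requirement above, the sketch below tests whether a Unix socket exists at the expected path. The helper name `docker_socket_available` is hypothetical and not part of `tbench2_env`; it only confirms that the socket file is present, not that the Docker daemon is responsive or that the process has permission to use it.

```python
import os
import stat


def docker_socket_available(path: str = "/var/run/docker.sock") -> bool:
    """Return True if `path` exists and is a Unix socket (hypothetical helper)."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Path is missing or cannot be inspected.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    if not docker_socket_available():
        print("Docker socket not found; TB2_MODE=docker is unlikely to work.")
```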

## Action Types

| Action | Description | Required Fields |
|--------|-------------|-----------------|
| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
| `write` | Send input to a running session | `session_id`, `command` |
| `view` | Read pending output | `session_id` |
| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
| `kill` | Terminate a running session | `session_id` |
| `write_file` | Write content to a file | `file_path`, `content` |
| `evaluate` | Run pytest tests, return reward | (none) |
| `close` | Stop and clean up | (none) |
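
The table above can be mirrored in a small client-side check that runs before an action is sent. This is a minimal sketch: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, not part of `tbench2_env`, and the server may validate differently. A field counts as missing here only when it is absent or `None`, so an empty `content` string is still accepted for `write_file`.

```python
# Hypothetical mapping, transcribed from the "Required Fields" column above.
REQUIRED_FIELDS = {
    "exec": ("command",),
    "write": ("session_id", "command"),
    "view": ("session_id",),
    "wait": ("session_id",),
    "kill": ("session_id",),
    "write_file": ("file_path", "content"),
    "evaluate": (),
    "close": (),
}


def missing_fields(action_type: str, fields: dict) -> list[str]:
    """Return the required field names that are absent or None for this action type."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if fields.get(name) is None]


if __name__ == "__main__":
    # A `write` without a session ID is flagged before it reaches the server.
    print(missing_fields("write", {"command": "print(2+2)\n"}))
```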

## Session IDs (Streaming Processes)

`session_id` is **only** required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.

Example (Python):
```python
# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))

# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))

# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |
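
To illustrate how the defaults in this table relate to each other, the sketch below resolves the variables from a mapping. The function `resolve_settings` is hypothetical and may differ from what the server does internally. In particular, rejecting unknown `TB2_MODE` values is an assumption, and the default repo URL is left as `None` because the table does not spell it out.

```python
import os
from typing import Mapping, Optional


def resolve_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve TB2_* variables using the defaults from the table (hypothetical helper)."""
    env = os.environ if environ is None else environ

    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        # Assumption: treat any other value as a configuration error.
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    return {
        "mode": mode,
        "tasks_dir": env.get("TB2_TASKS_DIR"),  # None means auto-download
        "output_dir": output_dir,
        # The cache directory defaults to a subdirectory of the output directory.
        "cache_dir": env.get("TB2_CACHE_DIR", f"{output_dir}/repo_cache"),
        "repo_url": env.get("TB2_REPO_URL"),  # None means the built-in default URL
    }
```

Because `TB2_CACHE_DIR` is derived from `TB2_OUTPUT_DIR`, overriding only the output directory also moves the cache in this sketch.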

## Reward

The `evaluate` action returns a binary reward:
- `1.0` - All pytest tests pass (exit code 0)
- `0.0` - The pytest run fails (non-zero exit code)

Intermediate steps return `reward=None`.
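
The reward rule can be written as a small function. This is only a sketch of the behavior described above: `reward_for_step` is a hypothetical name, not the environment's actual implementation.

```python
from typing import Optional


def reward_for_step(action_type: str, pytest_exit_code: Optional[int] = None) -> Optional[float]:
    """Map a step to a reward: None unless evaluating, then 1.0 only on exit code 0."""
    if action_type != "evaluate":
        # Intermediate steps carry no reward.
        return None
    return 1.0 if pytest_exit_code == 0 else 0.0
```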

## Running the Server

```bash
# Install dependencies
uv sync --all-extras

# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000

# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000

# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
```

## Project Structure

```
tbench2_env/
├── __init__.py              # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md                # This file
├── client.py                # Tbench2Env client implementation
├── models.py                # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml             # OpenEnv configuration
├── pyproject.toml           # Package dependencies
└── server/
    ├── __init__.py          # Server exports
    ├── app.py               # FastAPI application
    ├── tbench2_env_environment.py  # Core environment logic
    └── Dockerfile           # Container image definition
```