---
title: TB2 Environment Server
emoji: "🧪"
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - terminal-bench-2
  - spaces
---

# TB2 Environment (Terminal-Bench 2)

OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| **Local** | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
| **Docker** | Runs each task in its own container | Full TB2.0 fidelity with custom task images |

## Quick Start

```python
from tbench2_env import Tbench2Env, Tbench2Action

env = Tbench2Env(base_url="http://localhost:8000")
result = env.reset(task_id="headless-terminal")
print(result.observation.instruction)

result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
print(result.observation.output)

result = env.step(Tbench2Action(action_type="evaluate"))
print(result.reward, result.done)

env.close()
```

## Building the Docker Image

Before using the environment, build the Docker image:

```bash
# From project root
docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
```

## Environment Details

### Action
**Tbench2Action**: Controls interaction with the TB2 task session

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
| `command` | str | `""` | Shell command or input to send |
| `session_id` | str \| None | `None` | Session ID for streaming processes |
| `block` | bool | `True` | Whether to block until command completes |
| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
| `file_path` | str | `""` | File path (for `write_file` action) |
| `content` | str | `""` | Content to write (for `write_file` action) |

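To make the defaults in the table above concrete, here is a minimal sketch of the same fields as a plain dataclass. `ActionSketch` is a hypothetical stand-in written for illustration; the real `Tbench2Action` lives in `models.py` and may be implemented differently (for example as a Pydantic model).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionSketch:
    """Hypothetical stand-in mirroring the documented Tbench2Action fields."""

    action_type: str = "exec"
    command: str = ""
    session_id: Optional[str] = None
    block: bool = True
    wait_seconds: Optional[float] = None
    file_path: str = ""
    content: str = ""


# A write_file action sets only the fields it needs; the rest keep their defaults.
write_action = ActionSketch(
    action_type="write_file",
    file_path="notes.txt",
    content="hello\n",
)
```

Because every field has a default, an action can be built from just the fields relevant to its `action_type`.
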
### Observation
**Tbench2Observation**: Contains the environment response

| Field | Type | Description |
|-------|------|-------------|
| `instruction` | str | Task instruction/prompt from the TB2 task |
| `output` | str | Command output (stdout/stderr) |
| `success` | bool | Whether the action succeeded |
| `error` | str | Error message if action failed |
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str \| None | Session ID for streaming processes |
| `action_type` | str | The action type that produced this observation |
| `info` | dict | Additional metadata |

### State
**Tbench2State**: Server-side state for the task session

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Current task identifier |
| `task_path` | str | Path to the task directory |
| `session_id` | str | Active session ID |
| `terminal_ready` | bool | Whether the terminal is ready for commands |
| `last_action_type` | str | Last action type executed |
| `last_command` | str | Last command executed |
| `last_output` | str | Output from last command |

## Execution Modes

### Local Mode (Default)

Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.

```bash
# Default - local mode
python -m tbench2_env.server.app

# Or explicitly set mode
TB2_MODE=local python -m tbench2_env.server.app
```

**Note:** Local mode ignores Docker images specified in task.toml. Tasks requiring specific runtime environments may fail.

### Docker Mode

Each task runs in its own Docker container, using the image specified in the task's `task.toml`:

```bash
# Enable Docker mode
TB2_MODE=docker python -m tbench2_env.server.app
```

**Requirements:**
- Docker socket mounted at `/var/run/docker.sock`
- Sufficient disk space for container images
- Network access to pull images if not cached

**Environment Variables for Docker Mode:**
- `TB2_MODE=docker` - Enable Docker-backed execution
- Docker socket must be accessible (mounted volume)

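The socket requirement above can be checked before setting `TB2_MODE=docker`. The following is a minimal preflight sketch; `docker_socket_available` and `choose_mode` are hypothetical helpers written for this README, not part of `tbench2_env`. The check only confirms that a Unix socket exists at the path. It does not verify that the current user may connect to it or that the Docker daemon is responding.

```python
import os
import stat


def docker_socket_available(socket_path: str = "/var/run/docker.sock") -> bool:
    """Return True if socket_path exists and is a Unix socket."""
    try:
        mode = os.stat(socket_path).st_mode
    except OSError:
        # Missing path, permission error, etc.: treat as unavailable.
        return False
    return stat.S_ISSOCK(mode)


def choose_mode(socket_path: str = "/var/run/docker.sock") -> str:
    """Pick a TB2_MODE value: 'docker' if the socket is present, else 'local'."""
    return "docker" if docker_socket_available(socket_path) else "local"


if __name__ == "__main__":
    print(f"TB2_MODE={choose_mode()}")
```

Falling back to `local` here is a design choice for the sketch; a deployment that depends on custom task images may prefer to fail loudly instead.
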
## Action Types

| Action | Description | Required Fields |
|--------|-------------|-----------------|
| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
| `write` | Send input to a running session | `session_id`, `command` |
| `view` | Read pending output | `session_id` |
| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
| `kill` | Terminate a running session | `session_id` |
| `write_file` | Write content to a file | `file_path`, `content` |
| `evaluate` | Run pytest tests, return reward | (none) |
| `close` | Stop and cleanup | (none) |

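The required-fields column above can be encoded as a small client-side check that runs before an action is sent. This is a sketch only: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the server may validate differently. In particular, the sketch treats an empty string as missing, which is stricter than needed if writing an empty file is intended.

```python
from typing import Dict, List, Tuple

# Required fields per action type, transcribed from the table above.
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "exec": ("command",),
    "write": ("session_id", "command"),
    "view": ("session_id",),
    "wait": ("session_id",),
    "kill": ("session_id",),
    "write_file": ("file_path", "content"),
    "evaluate": (),
    "close": (),
}


def missing_fields(action_type: str, **fields: object) -> List[str]:
    """Return the required field names that are absent, None, or empty."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [
        name
        for name in REQUIRED_FIELDS[action_type]
        if fields.get(name) in (None, "")
    ]
```

For example, `missing_fields("write", command="print(2+2)\n")` reports `session_id` as missing, while `missing_fields("evaluate")` returns an empty list.
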
## Session IDs (Streaming Processes)

`session_id` is **only** required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.

Example (Python):
```python
# Start a long-running process
env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))

# Send input to it
env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))

# Read its output
env.step(Tbench2Action(action_type="view", session_id="sess1"))
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |

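The defaults in the table above chain together: `TB2_CACHE_DIR` falls back to a subdirectory of whatever `TB2_OUTPUT_DIR` resolves to. A minimal sketch of that resolution follows. `resolve_config` is a hypothetical helper, not the server's actual code, and rejecting unknown `TB2_MODE` values is an assumption of the sketch.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_config(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Resolve TB2_* settings using the documented defaults."""
    env = os.environ if environ is None else environ

    mode = env.get("TB2_MODE", "local")
    if mode not in ("local", "docker"):
        # Assumption: the sketch rejects values the table does not list.
        raise ValueError(f"TB2_MODE must be 'local' or 'docker', got {mode!r}")

    output_dir = env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs")
    # The cache default depends on the resolved output directory.
    cache_dir = env.get("TB2_CACHE_DIR", os.path.join(output_dir, "repo_cache"))

    return {
        "mode": mode,
        "tasks_dir": env.get("TB2_TASKS_DIR"),  # None: auto-download applies
        "output_dir": output_dir,
        "cache_dir": cache_dir,
        "repo_url": env.get("TB2_REPO_URL"),  # None: built-in default applies
    }
```

Passing a plain dict instead of reading `os.environ` keeps the function easy to test, for example `resolve_config({"TB2_OUTPUT_DIR": "/data/runs"})`.
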
## Reward

Binary reward on `evaluate` action:
- `1.0` - All pytest tests pass (exit code 0)
- `0.0` - Tests fail (non-zero exit code)

Intermediate steps return `reward=None`.

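The reward rule above amounts to a small mapping from action type and pytest exit code to a reward value. The function below is an illustrative sketch with a hypothetical name (`reward_for`), not the environment's actual implementation.

```python
from typing import Optional


def reward_for(
    action_type: str, pytest_exit_code: Optional[int] = None
) -> Optional[float]:
    """Map an action and pytest exit code to the documented reward."""
    if action_type != "evaluate":
        # Intermediate steps carry no reward.
        return None
    if pytest_exit_code is None:
        raise ValueError("evaluate requires a pytest exit code")
    # Exit code 0 means every test passed; anything else counts as failure.
    return 1.0 if pytest_exit_code == 0 else 0.0
```

Note that under this rule any non-zero exit code yields `0.0`, so a pytest collection error scores the same as a failing test.
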
## Running the Server

```bash
# Install dependencies
uv sync --all-extras

# Local mode (default, for Spaces)
python -m tbench2_env.server.app --port 8000

# Docker mode (full TB2.0 compatibility)
TB2_MODE=docker python -m tbench2_env.server.app --port 8000

# With local TB2 repo
TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
```

## Project Structure

```
tbench2_env/
├── __init__.py                     # Module exports (Tbench2Env, Tbench2Action, etc.)
├── README.md                       # This file
├── client.py                       # Tbench2Env client implementation
├── models.py                       # Tbench2Action, Tbench2Observation, Tbench2State
├── openenv.yaml                    # OpenEnv configuration
├── pyproject.toml                  # Package dependencies
└── server/
    ├── __init__.py                 # Server exports
    ├── app.py                      # FastAPI application
    ├── tbench2_env_environment.py  # Core environment logic
    └── Dockerfile                  # Container image definition
```