Spaces: sergiopaniego/tbench2

sergiopaniego HF Staff committed on Jan 21

Commit 6b2d0fd (verified), 1 Parent(s): f056a5e

Upload folder using huggingface_hub

Files changed (10)

Dockerfile +95 -0
README.md +201 -4
__init__.py +18 -0
client.py +75 -0
models.py +58 -0
openenv.yaml +7 -0
pyproject.toml +46 -0
server/__init__.py +12 -0
server/app.py +104 -0
server/tbench2_env_environment.py +724 -0

Dockerfile ADDED

	@@ -0,0 +1,95 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=tbench2_env
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Install git and git-lfs for cloning HuggingFace datasets
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git git-lfs && \
+    rm -rf /var/lib/apt/lists/* && \
+    git lfs install
+# Clone SETA dataset from HuggingFace
+# Source: https://huggingface.co/datasets/camel-ai/seta-env
+RUN git clone --depth 1 https://huggingface.co/datasets/camel-ai/seta-env /app/seta-env
+# Set TB2_TASKS_DIR to point to SETA tasks
+# Tasks are in /app/seta-env/Dataset/ with numeric IDs (1, 2, 3, etc.)
+ENV TB2_TASKS_DIR="/app/seta-env/Dataset"
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+ENV ENABLE_WEB_INTERFACE=true
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
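
The `HEALTHCHECK` above probes `http://localhost:8000/health` with `curl -f`, using a 30 s interval, a 3 s timeout, and 3 retries. A client script that needs to wait for the container could approximate the same check in Python. This is only a sketch: `http_probe` and `wait_until_healthy` are hypothetical helper names and are not part of OpenEnv or this commit.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 3.0) -> bool:
    """Return True if GET `url` answers with a 2xx status (roughly `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    retries: int = 3,
    interval_s: float = 30.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` up to `retries` times, sleeping `interval_s` between attempts."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            sleep(interval_s)
    return False
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a running server.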

README.md CHANGED

@@ -1,10 +1,207 @@
 ---
-title: Tbench2
-emoji: 🏃
+title: TB2 Environment Server
+emoji: "🧪"
 colorFrom: blue
-colorTo: pink
+colorTo: blue
 sdk: docker
 pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+  - terminal-bench-2
+  - spaces
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# TB2 Environment (Terminal-Bench 2)
+OpenEnv wrapper for [Terminal-Bench 2](https://github.com/laude-institute/terminal-bench-2) tasks. Supports two execution modes:
+| Mode | Description | Use Case |
+|------|-------------|----------|
+| **Local** | Runs commands in the server process (no Docker) | Hugging Face Spaces, environments without Docker access |
+| **Docker** | Runs each task in its own container | Full TB2.0 fidelity with custom task images |
+## Quick Start
+```python
+from tbench2_env import Tbench2Env, Tbench2Action
+env = Tbench2Env(base_url="http://localhost:8000")
+result = env.reset(task_id="headless-terminal")
+print(result.observation.instruction)
+result = env.step(Tbench2Action(action_type="exec", command="ls -la"))
+print(result.observation.output)
+result = env.step(Tbench2Action(action_type="evaluate"))
+print(result.reward, result.done)
+env.close()
+```
+## Building the Docker Image
+Before using the environment, build the Docker image:
+```bash
+# From project root
+docker build -t tbench2-env:latest -f envs/tbench2_env/server/Dockerfile .
+```
+## Environment Details
+### Action
+**Tbench2Action**: Controls interaction with the TB2 task session
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `action_type` | str | `"exec"` | Action to perform (`exec`, `write`, `view`, `wait`, `kill`, `write_file`, `evaluate`, `close`) |
+| `command` | str | `""` | Shell command or input to send |
+| `session_id` | str \| None | `None` | Session ID for streaming processes |
+| `block` | bool | `True` | Whether to block until command completes |
+| `wait_seconds` | float \| None | `None` | Time to wait (for `wait` action) |
+| `file_path` | str | `""` | File path (for `write_file` action) |
+| `content` | str | `""` | Content to write (for `write_file` action) |
+### Observation
+**Tbench2Observation**: Contains the environment response
+| Field | Type | Description |
+|-------|------|-------------|
+| `instruction` | str | Task instruction/prompt from the TB2 task |
+| `output` | str | Command output (stdout/stderr) |
+| `success` | bool | Whether the action succeeded |
+| `error` | str | Error message if action failed |
+| `task_id` | str | Current task identifier |
+| `task_path` | str | Path to the task directory |
+| `session_id` | str \| None | Session ID for streaming processes |
+| `action_type` | str | The action type that produced this observation |
+| `info` | dict | Additional metadata |
+### State
+**Tbench2State**: Server-side state for the task session
+| Field | Type | Description |
+|-------|------|-------------|
+| `task_id` | str | Current task identifier |
+| `task_path` | str | Path to the task directory |
+| `session_id` | str | Active session ID |
+| `terminal_ready` | bool | Whether the terminal is ready for commands |
+| `last_action_type` | str | Last action type executed |
+| `last_command` | str | Last command executed |
+| `last_output` | str | Output from last command |
+## Execution Modes
+### Local Mode (Default)
+Commands execute directly in the server process. Ideal for HF Spaces where Docker-in-Docker is unavailable.
+```bash
+# Default - local mode
+python -m tbench2_env.server.app
+# Or explicitly set mode
+TB2_MODE=local python -m tbench2_env.server.app
+```
+**Note:** Local mode ignores Docker images specified in task.toml. Tasks requiring specific runtime environments may fail.
+### Docker Mode
+Each task runs in its own Docker container, using the image specified in the task's `task.toml`:
+```bash
+# Enable Docker mode
+TB2_MODE=docker python -m tbench2_env.server.app
+```
+**Requirements:**
+- Docker socket mounted at `/var/run/docker.sock`
+- Sufficient disk space for container images
+- Network access to pull images if not cached
+**Environment Variables for Docker Mode:**
+- `TB2_MODE=docker` - Enable Docker-backed execution
+- Docker socket must be accessible (mounted volume)
+## Action Types
+| Action | Description | Required Fields |
+|--------|-------------|-----------------|
+| `exec` | Run a shell command | `command`, optionally `block`, `session_id` |
+| `write` | Send input to a running session | `session_id`, `command` |
+| `view` | Read pending output | `session_id` |
+| `wait` | Wait for output | `session_id`, optionally `wait_seconds` |
+| `kill` | Terminate a running session | `session_id` |
+| `write_file` | Write content to a file | `file_path`, `content` |
+| `evaluate` | Run the task's tests (`run-tests.sh` if present, otherwise pytest on `tests/`), return reward | (none) |
+| `close` | Stop and cleanup | (none) |
+## Session IDs (Streaming Processes)
+`session_id` is **only** required when you start a non-blocking process and want to interact with it (`write`, `view`, `wait`, `kill`). For plain `exec` commands, you can omit it.
+Example (Python):
+```python
+# Start a long-running process
+env.step(Tbench2Action(action_type="exec", command="python -i", block=False, session_id="sess1"))
+# Send input to it
+env.step(Tbench2Action(action_type="write", session_id="sess1", command="print(2+2)\n"))
+# Read its output
+env.step(Tbench2Action(action_type="view", session_id="sess1"))
+```
+## Environment Variables
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `TB2_MODE` | `local` | Execution mode: `local` or `docker` |
+| `TB2_TASKS_DIR` | (auto-download) | Path to local Terminal-Bench-2 repo checkout |
+| `TB2_OUTPUT_DIR` | `/tmp/tbench2_env_runs` | Directory for session logs and cache |
+| `TB2_CACHE_DIR` | `$TB2_OUTPUT_DIR/repo_cache` | Where to extract TB2 repo |
+| `TB2_REPO_URL` | (GitHub main.zip) | Repo zip URL for auto-download |
+## Reward
+Binary reward on `evaluate` action:
+- `1.0` - All pytest tests pass (exit code 0)
+- `0.0` - Tests fail (non-zero exit code)
+Intermediate steps return `reward=None`.
+## Running the Server
+```bash
+# Install dependencies
+uv sync --all-extras
+# Local mode (default, for Spaces)
+python -m tbench2_env.server.app --port 8000
+# Docker mode (full TB2.0 compatibility)
+TB2_MODE=docker python -m tbench2_env.server.app --port 8000
+# With local TB2 repo
+TB2_TASKS_DIR=/path/to/terminal-bench-2 python -m tbench2_env.server.app
+```
+## Project Structure
+```
+tbench2_env/
+├── __init__.py              # Module exports (Tbench2Env, Tbench2Action, etc.)
+├── README.md                # This file
+├── client.py                # Tbench2Env client implementation
+├── models.py                # Tbench2Action, Tbench2Observation, Tbench2State
+├── openenv.yaml             # OpenEnv configuration
+├── pyproject.toml           # Package dependencies
+└── server/
+    ├── __init__.py          # Server exports
+    ├── app.py               # FastAPI application
+    ├── tbench2_env_environment.py  # Core environment logic
+    └── Dockerfile           # Container image definition
+```
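
The environment variables table in the README implies a small resolution order: `TB2_CACHE_DIR` defaults to a `repo_cache` folder under whatever `TB2_OUTPUT_DIR` resolves to, and an empty `TB2_TASKS_DIR` means auto-download. A minimal sketch of that resolution, assuming the defaults listed in the table; `Tb2Settings` and `resolve_settings` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Tb2Settings:  # hypothetical container, not defined in this commit
    mode: str
    tasks_dir: str
    output_dir: Path
    cache_dir: Path


def resolve_settings(env: Mapping[str, str]) -> Tb2Settings:
    """Resolve TB2_* variables from a mapping such as os.environ."""
    output_dir = Path(env.get("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs"))
    # The cache default depends on the resolved output directory.
    cache_dir = Path(env.get("TB2_CACHE_DIR", str(output_dir / "repo_cache")))
    return Tb2Settings(
        mode=env.get("TB2_MODE", "local").lower(),
        tasks_dir=env.get("TB2_TASKS_DIR", ""),  # empty string means auto-download
        output_dir=output_dir,
        cache_dir=cache_dir,
    )
```

Passing `os.environ` as `env` would give the effective settings for the current process.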

__init__.py ADDED

	@@ -0,0 +1,18 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Tbench2 Env Environment."""
+from .client import Tbench2Env
+from .models import Tbench2Action, Tbench2Observation, Tbench2State
+__all__ = [
+    "Tbench2Action",
+    "Tbench2Observation",
+    "Tbench2Env",
+    "Tbench2State",
+]

client.py ADDED

	@@ -0,0 +1,75 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""TB2 Environment Client."""
+from __future__ import annotations
+from typing import Any
+# Support both in-repo and standalone imports
+try:
+    # In-repo imports (when running from OpenEnv repository)
+    from openenv.core.client_types import StepResult
+    from openenv.core.env_client import EnvClient
+    from .models import Tbench2Action, Tbench2Observation, Tbench2State
+except ImportError:
+    # Standalone imports (when environment is standalone with openenv from pip)
+    from openenv.core.client_types import StepResult
+    from openenv.core.env_client import EnvClient
+    from models import Tbench2Action, Tbench2Observation, Tbench2State
+class Tbench2Env(EnvClient[Tbench2Action, Tbench2Observation, Tbench2State]):
+    """HTTP client for the TB2 environment."""
+    def _step_payload(self, action: Tbench2Action) -> dict[str, Any]:
+        return {
+            "action_type": action.action_type,
+            "command": action.command,
+            "session_id": action.session_id,
+            "block": action.block,
+            "wait_seconds": action.wait_seconds,
+            "file_path": action.file_path,
+            "content": action.content,
+        }
+    def _parse_result(self, payload: dict[str, Any]) -> StepResult[Tbench2Observation]:
+        obs_data = payload.get("observation", {})
+        observation = Tbench2Observation(
+            instruction=obs_data.get("instruction", ""),
+            output=obs_data.get("output", ""),
+            success=obs_data.get("success", True),
+            error=obs_data.get("error", ""),
+            task_id=obs_data.get("task_id", ""),
+            task_path=obs_data.get("task_path", ""),
+            session_id=obs_data.get("session_id"),
+            action_type=obs_data.get("action_type", ""),
+            info=obs_data.get("info", {}),
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: dict[str, Any]) -> Tbench2State:
+        return Tbench2State(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+            task_id=payload.get("task_id", ""),
+            task_path=payload.get("task_path", ""),
+            session_id=payload.get("session_id") or "",
+            terminal_ready=payload.get("terminal_ready", False),
+            last_action_type=payload.get("last_action_type", ""),
+            last_command=payload.get("last_command", ""),
+            last_output=payload.get("last_output", ""),
+        )
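
`_parse_result` above fills every missing key with a default, so a sparse or even empty payload still produces an observation. One consequence worth noting is that `success` defaults to `True` when the server omits it. The sketch below reproduces that defaulting with a plain dataclass so it runs without OpenEnv installed; `Obs` and `parse_result` are stand-ins for `Tbench2Observation` and `Tbench2Env._parse_result`, and only a subset of fields is shown.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Obs:  # simplified stand-in for Tbench2Observation
    output: str = ""
    success: bool = True
    error: str = ""
    session_id: str | None = None
    info: dict[str, Any] = field(default_factory=dict)
    reward: float | None = None
    done: bool = False


def parse_result(payload: dict[str, Any]) -> Obs:
    """Build an observation, defaulting any key the server left out."""
    obs_data = payload.get("observation", {})
    return Obs(
        output=obs_data.get("output", ""),
        success=obs_data.get("success", True),
        error=obs_data.get("error", ""),
        session_id=obs_data.get("session_id"),
        info=obs_data.get("info", {}),
        # reward and done live at the top level of the payload.
        reward=payload.get("reward"),
        done=payload.get("done", False),
    )
```

Callers that need to distinguish "succeeded" from "field missing" would have to inspect the raw payload rather than the parsed observation.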

models.py ADDED

	@@ -0,0 +1,58 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+Data models for the TB2 environment.
+"""
+from pydantic import Field
+# Support both in-repo and standalone imports
+try:
+    # In-repo imports (when running from OpenEnv repository)
+    from openenv.core.env_server.types import Action, Observation, State
+except ImportError:
+    # Standalone imports (when environment is standalone with openenv from pip)
+    from openenv.core.env_server.types import Action, Observation, State
+class Tbench2Action(Action):
+    """Action for interacting with a TB2 task session."""
+    action_type: str = Field(default="exec")
+    command: str = Field(default="")
+    session_id: str | None = Field(default=None)
+    block: bool = Field(default=True)
+    wait_seconds: float | None = Field(default=None)
+    file_path: str = Field(default="")
+    content: str = Field(default="")
+class Tbench2Observation(Observation):
+    """Observation returned from the TB2 environment."""
+    instruction: str = Field(default="")
+    output: str = Field(default="")
+    success: bool = Field(default=True)
+    error: str = Field(default="")
+    task_id: str = Field(default="")
+    task_path: str = Field(default="")
+    session_id: str | None = Field(default=None)
+    action_type: str = Field(default="")
+    info: dict = Field(default_factory=dict)
+class Tbench2State(State):
+    """Server-side state for a TB2 task."""
+    task_id: str = Field(default="")
+    task_path: str = Field(default="")
+    session_id: str = Field(default="")
+    terminal_ready: bool = Field(default=False)
+    last_action_type: str = Field(default="")
+    last_command: str = Field(default="")
+    last_output: str = Field(default="")
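
Every field on `Tbench2Action` has a default, so the model itself does not enforce the per-action requirements listed in the README's action types table. A client could pre-check an action before sending it. The following is a sketch under that assumption: `REQUIRED_FIELDS` and `missing_fields` are hypothetical, and treating empty strings as missing is a simplification (an intentionally empty `content` for `write_file` would be flagged).

```python
from typing import Any

# Required fields per action type, transcribed from the README table.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "exec": ("command",),
    "write": ("session_id", "command"),
    "view": ("session_id",),
    "wait": ("session_id",),
    "kill": ("session_id",),
    "write_file": ("file_path", "content"),
    "evaluate": (),
    "close": (),
}


def missing_fields(action: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or empty for this action."""
    action_type = action.get("action_type", "exec")  # same default as the model
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unsupported action_type: {action_type}")
    return [name for name in REQUIRED_FIELDS[action_type] if not action.get(name)]
```

For example, `missing_fields({"action_type": "write", "command": "x"})` reports that `session_id` is missing before any request is made.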

openenv.yaml ADDED

	@@ -0,0 +1,7 @@

+spec_version: 1
+name: tbench2
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000

pyproject.toml ADDED

	@@ -0,0 +1,46 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-tbench2_env"
+version = "0.1.0"
+description = "Tbench2 Env environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    "openenv-core @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "pytest>=8.4.0",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    "camel-ai",
+    # Docker-backed mode (optional, for full TB2.0 compatibility)
+    "docker>=7.0.0",
+    # TOML parsing (tomllib for Python 3.11+, tomli for older versions)
+    "tomli>=2.0.0; python_version < '3.11'",
+    # YAML parsing (for SETA dataset task.yaml format)
+    "pyyaml>=6.0.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m tbench2_env.server.app
+server = "tbench2_env.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["tbench2_env", "tbench2_env.server"]
+package-dir = { "tbench2_env" = ".", "tbench2_env.server" = "server" }
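
The `tomli` dependency marker above pairs with a version-gated import: `tomllib` ships with Python 3.11+, while 3.10 needs the `tomli` backport, which exposes the same `loads` function. A minimal sketch of reading a verifier timeout this way, assuming a `[verifier]` table with a `timeout_sec` key; `read_verifier_timeout` is a hypothetical helper that takes TOML text rather than a task directory:

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib
else:  # Python 3.10 needs the `tomli` backport installed
    import tomli as tomllib


def read_verifier_timeout(toml_text: str, fallback: float) -> float:
    """Return verifier.timeout_sec from TOML text, or `fallback` if unavailable."""
    try:
        data = tomllib.loads(toml_text)
    except tomllib.TOMLDecodeError:
        # Malformed TOML: fall back instead of failing the episode.
        return fallback
    return float(data.get("verifier", {}).get("timeout_sec", fallback))
```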

server/__init__.py ADDED

	@@ -0,0 +1,12 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Tbench2 Env environment server components."""
+from .tbench2_env_environment import Tbench2DockerEnvironment, Tbench2Environment
+__all__ = ["Tbench2Environment", "Tbench2DockerEnvironment"]

server/app.py ADDED

	@@ -0,0 +1,104 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+FastAPI application for the Tbench2 Env Environment.
+This module creates an HTTP server that exposes the Tbench2Environment
+over HTTP and WebSocket endpoints, compatible with EnvClient.
+Endpoints:
+    - POST /reset: Reset the environment
+    - POST /step: Execute an action
+    - GET /state: Get current environment state
+    - GET /schema: Get action/observation schemas
+    - WS /ws: WebSocket endpoint for persistent sessions
+Usage:
+    # Development (with auto-reload):
+    uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+    # Production:
+    uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+    # Or run directly:
+    python -m server.app
+"""
+import os
+try:
+    from openenv.core.env_server.http_server import create_app
+    # In-repo imports
+    from tbench2_env.models import Tbench2Action, Tbench2Observation
+    from .tbench2_env_environment import Tbench2DockerEnvironment, Tbench2Environment
+except Exception as e:  # pragma: no cover
+    # Standalone imports (when environment is standalone with openenv from pip)
+    from openenv.core.env_server.http_server import create_app
+    from server.tbench2_env_environment import Tbench2DockerEnvironment, Tbench2Environment
+    from models import Tbench2Action, Tbench2Observation
+    _IMPORT_ERROR = e
+# Determine which environment class to use based on TB2_MODE
+_TB2_MODE = os.getenv("TB2_MODE", "local").lower()
+if _TB2_MODE == "docker":
+    _DEFAULT_ENVIRONMENT = Tbench2DockerEnvironment
+    _ENV_SUFFIX = " (Docker mode)"
+elif _TB2_MODE == "auto":
+    # "auto" is accepted, but no Docker detection happens here: it currently selects the local environment
+    _DEFAULT_ENVIRONMENT = Tbench2Environment
+    _ENV_SUFFIX = " (auto-detect mode)"
+else:
+    _DEFAULT_ENVIRONMENT = Tbench2Environment
+    _ENV_SUFFIX = " (local mode)"
+# Create the app with web interface and README integration
+app = create_app(
+    _DEFAULT_ENVIRONMENT,
+    Tbench2Action,
+    Tbench2Observation,
+    env_name="tbench2_env" + _ENV_SUFFIX,
+    max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+)
+def main(host: str = "0.0.0.0", port: int = 8000):
+    """
+    Entry point for direct execution via uv run or python -m.
+    This function enables running the server without Docker:
+        uv run --project . server
+        uv run --project . server --port 8001
+        python -m tbench2_env.server.app
+    Args:
+        host: Host address to bind to (default: "0.0.0.0")
+        port: Port number to listen on (default: 8000)
+    For production deployments, consider using uvicorn directly with
+    multiple workers:
+        uvicorn tbench2_env.server.app:app --workers 4
+    """
+    import uvicorn
+    uvicorn.run(app, host=host, port=port)
+if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8000)
+    args = parser.parse_args()
+    main(port=args.port)
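
The `TB2_MODE` branch in `server/app.py` reduces to a three-way mapping in which only `docker` selects the Docker-backed class; `auto` and any unrecognized value fall through to the local environment. The sketch below restates that mapping as a pure function so the behavior is easy to check. `select_environment` is a hypothetical name, and it returns class names as strings so the example runs without the package installed.

```python
from __future__ import annotations


def select_environment(mode: str | None) -> tuple[str, str]:
    """Map a TB2_MODE value to (environment class name, env_name suffix)."""
    normalized = (mode or "local").lower()
    if normalized == "docker":
        return "Tbench2DockerEnvironment", " (Docker mode)"
    if normalized == "auto":
        # Same class as local mode in this commit; only the suffix differs.
        return "Tbench2Environment", " (auto-detect mode)"
    return "Tbench2Environment", " (local mode)"
```

Note that a typo such as `TB2_MODE=dokcer` silently yields local mode rather than an error.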

server/tbench2_env_environment.py ADDED

	@@ -0,0 +1,724 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""TB2 environment server implementation (Spaces-compatible local mode)."""
+from __future__ import annotations
+import logging
+import os
+import sys
+import urllib.request
+import zipfile
+from pathlib import Path
+from typing import Any
+from uuid import uuid4
+if sys.version_info >= (3, 11):
+    import tomllib
+else:
+    import tomli as tomllib
+from openenv.core.env_server.interfaces import Environment
+# Support both in-repo and standalone imports
+try:
+    # In-repo imports (when running from OpenEnv repository)
+    from tbench2_env.models import Tbench2Action, Tbench2Observation, Tbench2State
+except ImportError:
+    # Standalone imports (when environment is standalone with openenv from pip)
+    from models import Tbench2Action, Tbench2Observation, Tbench2State
+_CAMEL_IMPORT_ERROR: Exception | None = None
+def _require_terminal_toolkit() -> Any:
+    global _CAMEL_IMPORT_ERROR
+    if _CAMEL_IMPORT_ERROR is not None:
+        raise RuntimeError(
+            "camel-ai (TerminalToolkit) is required for TB2. Install from PyPI or from the CAMEL repo."
+        ) from _CAMEL_IMPORT_ERROR
+    try:
+        from camel.toolkits import TerminalToolkit
+    except Exception as exc:  # pragma: no cover
+        _CAMEL_IMPORT_ERROR = exc
+        raise RuntimeError(
+            "camel-ai (TerminalToolkit) is required for TB2. Install from PyPI or from the CAMEL repo."
+        ) from exc
+    return TerminalToolkit
+def _download_tb2_repo(cache_dir: Path) -> Path:
+    repo_url = os.getenv(
+        "TB2_REPO_URL",
+        "https://github.com/laude-institute/terminal-bench-2/archive/refs/heads/main.zip",
+    )
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    archive_path = cache_dir / "terminal-bench-2.zip"
+    if not archive_path.exists():
+        urllib.request.urlretrieve(repo_url, archive_path)
+    with zipfile.ZipFile(archive_path) as zf:
+        root = zf.namelist()[0].split("/")[0]
+        extract_dir = cache_dir / root
+        if not extract_dir.exists():
+            zf.extractall(cache_dir)
+    return extract_dir
+def _read_instruction(task_dir: Path) -> str:
+    """Read task instruction from instruction.md or task.yaml (SETA format)."""
+    # Try instruction.md first (Terminal-Bench-2 format)
+    instruction_path = task_dir / "instruction.md"
+    if instruction_path.exists():
+        return instruction_path.read_text(encoding="utf-8")
+    # Try task.yaml (SETA dataset format)
+    # Source: https://huggingface.co/datasets/camel-ai/seta-env
+    task_yaml_path = task_dir / "task.yaml"
+    if task_yaml_path.exists():
+        try:
+            import yaml
+            data = yaml.safe_load(task_yaml_path.read_text(encoding="utf-8"))
+            if isinstance(data, dict) and "instruction" in data:
+                return data["instruction"]
+        except Exception:
+            pass
+    return ""
+def _read_timeout(task_dir: Path, fallback: float) -> float:
+    task_toml = task_dir / "task.toml"
+    if not task_toml.exists():
+        return fallback
+    try:
+        data = tomllib.loads(task_toml.read_text(encoding="utf-8"))
+    except Exception:
+        return fallback
+    verifier = data.get("verifier", {})
+    return float(verifier.get("timeout_sec", fallback))
+class Tbench2Environment(Environment[Tbench2Action, Tbench2Observation, Tbench2State]):
+    """OpenEnv wrapper around Terminal-Bench 2 tasks (local execution)."""
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(
+        self,
+        tasks_dir: str | None = None,
+        output_dir: str | None = None,
+        command_timeout_s: float = 20.0,
+        safe_mode: bool = False,
+    ) -> None:
+        super().__init__()
+        self.tasks_dir = tasks_dir or os.getenv("TB2_TASKS_DIR", "")
+        self.output_dir = Path(output_dir or os.getenv("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs"))
+        self.command_timeout_s = command_timeout_s
+        self.safe_mode = safe_mode
+        self._state = Tbench2State()
+        self._task_dir: Path | None = None
+        self._terminal_toolkit = None
+        self._instruction = ""
+    def reset(
+        self,
+        seed: int | None = None,
+        episode_id: str | None = None,
+        **kwargs: Any,
+    ) -> Tbench2Observation:
+        del seed
+        TerminalToolkit = _require_terminal_toolkit()
+        task_id = kwargs.get("task_id") or kwargs.get("task_name")
+        task_path = kwargs.get("task_path") or kwargs.get("path")
+        task_dir = self._resolve_task_path(task_id, task_path)
+        resolved_task_id = task_id or task_dir.name
+        self._instruction = _read_instruction(task_dir)
+        self._task_dir = task_dir
+        trial_name = f"{resolved_task_id}.{episode_id or uuid4().hex}"
+        session_logs_dir = self.output_dir / trial_name / "terminal_toolkit_session_logs"
+        session_logs_dir.mkdir(parents=True, exist_ok=True)
+        self._terminal_toolkit = TerminalToolkit(
+            timeout=self.command_timeout_s,
+            working_directory=str(task_dir),
+            use_docker_backend=False,
+            session_logs_dir=session_logs_dir,
+            safe_mode=self.safe_mode,
+        )
+        self._state = Tbench2State(
+            episode_id=episode_id or str(uuid4()),
+            step_count=0,
+            task_id=resolved_task_id,
+            task_path=str(task_dir),
+            terminal_ready=True,
+        )
+        return Tbench2Observation(
+            instruction=self._instruction,
+            output="",
+            success=True,
+            error="",
+            task_id=resolved_task_id,
+            task_path=str(task_dir),
+            session_id=None,
+            action_type="reset",
+            info={},
+            reward=0.0,
+            done=False,
+        )
+    def step(
+        self,
+        action: Tbench2Action,
+        timeout_s: float | None = None,
+        **kwargs: Any,
+    ) -> Tbench2Observation:
+        del timeout_s, kwargs
+        if not isinstance(action, Tbench2Action):
+            raise TypeError(f"Expected Tbench2Action, got {type(action)}")
+        if self._terminal_toolkit is None or self._task_dir is None:
+            raise RuntimeError("TB2 environment not initialized. Call reset() first.")
+        self._state.step_count += 1
+        self._state.last_action_type = action.action_type
+        self._state.last_command = action.command
+        output = ""
+        error = ""
+        success = True
+        reward = None
+        done = False
+        info: dict[str, Any] = {}
+        session_id = action.session_id or "tb2-session"
+        try:
+            if action.action_type == "exec":
+                output = self._terminal_toolkit.shell_exec(
+                    command=action.command,
+                    block=action.block,
+                    id=session_id,
+                )
+            elif action.action_type == "write":
+                self._ensure_session_id(action.session_id, action.action_type)
+                output = self._terminal_toolkit.shell_write_to_process(
+                    id=action.session_id,
+                    command=action.command,
+                )
+            elif action.action_type == "view":
+                self._ensure_session_id(action.session_id, action.action_type)
+                output = self._terminal_toolkit.shell_view(id=action.session_id)
+            elif action.action_type == "wait":
+                self._ensure_session_id(action.session_id, action.action_type)
+                wait_seconds = action.wait_seconds or 0.0
+                output = self._terminal_toolkit.shell_wait(
+                    id=action.session_id,
+                    wait_seconds=wait_seconds,
+                )
+            elif action.action_type == "kill":
+                self._ensure_session_id(action.session_id, action.action_type)
+                self._terminal_toolkit.shell_kill_process(id=action.session_id)
+                output = f"Killed session {action.session_id}"
+            elif action.action_type == "write_file":
+                self._terminal_toolkit.shell_write_content_to_file(
+                    content=action.content,
+                    file_path=action.file_path,
+                )
+                output = f"Wrote content to {action.file_path}"
+            elif action.action_type == "evaluate":
+                output, reward, info = self._evaluate_task()
+                done = True
+            elif action.action_type == "close":
+                self.close()
+                output = "Closed TB2 environment."
+                done = True
+            else:
+                raise ValueError(f"Unsupported action_type: {action.action_type}")
+        except Exception as exc:  # pragma: no cover
+            success = False
+            error = str(exc)
+        self._state.last_output = output
+        self._state.session_id = session_id or ""
+        return Tbench2Observation(
+            instruction=self._instruction,
+            output=output,
+            success=success,
+            error=error,
+            task_id=self._state.task_id,
+            task_path=self._state.task_path,
+            session_id=session_id or "",
+            action_type=action.action_type,
+            info=info,
+            reward=reward,
+            done=done,
+        )
+    @property
+    def state(self) -> Tbench2State:
+        return self._state
+    def close(self) -> None:
+        self._terminal_toolkit = None
+        self._task_dir = None
+        self._instruction = ""
+    def _resolve_task_path(self, task_id: str | None, task_path: str | None) -> Path:
+        if task_path:
+            resolved = Path(task_path).expanduser().resolve()
+            if not resolved.exists():
+                raise FileNotFoundError(f"Task path not found: {resolved}")
+            return resolved
+        if not task_id:
+            raise ValueError("Provide task_id or task_path to reset TB2 environment.")
+        if not self.tasks_dir:
+            cache_dir = Path(os.getenv("TB2_CACHE_DIR", str(self.output_dir / "repo_cache")))
+            repo_dir = _download_tb2_repo(cache_dir)
+            resolved = repo_dir / task_id
+        else:
+            resolved = Path(self.tasks_dir).expanduser().resolve() / task_id
+        if not resolved.exists():
+            raise FileNotFoundError(f"Task path not found: {resolved}")
+        return resolved
+    def _ensure_session_id(self, session_id: str | None, action_type: str) -> None:
+        if not session_id:
+            raise ValueError(f"session_id is required for action_type='{action_type}'")
+    def _evaluate_task(self) -> tuple[str, float, dict[str, Any]]:
+        if self._task_dir is None:
+            raise RuntimeError("TB2 environment not initialized. Call reset() first.")
+        if self._terminal_toolkit is None:
+            raise RuntimeError("Terminal toolkit not initialized.")
+        _read_timeout(self._task_dir, fallback=900.0)  # Validate timeout config
+        # Determine evaluation method based on task format
+        run_tests_sh = self._task_dir / "run-tests.sh"
+        tests_dir = self._task_dir / "tests"
+        if run_tests_sh.exists():
+            # SETA format: use run-tests.sh
+            # Source: https://huggingface.co/datasets/camel-ai/seta-env
+            cmd = f"cd {self._task_dir} && bash run-tests.sh; echo __TB2_EXIT_CODE__:$?"
+        elif tests_dir.exists():
+            # Terminal-Bench-2 format: use pytest
+            cmd = f"cd {self._task_dir} && python -m pytest -q {tests_dir} -rA; echo __TB2_EXIT_CODE__:$?"
+        else:
+            # No tests found
+            return "No tests found (neither run-tests.sh nor tests/ directory)", 0.0, {"tests_passed": False, "exit_code": -1}
+        output = self._terminal_toolkit.shell_exec(
+            id="tb2-tests",
+            command=cmd,
+            block=True,
+        )
+        exit_code = 1
+        marker = "__TB2_EXIT_CODE__"
+        for line in output.splitlines()[::-1]:
+            if marker in line:
+                try:
+                    exit_code = int(line.split(":", 1)[1].strip())
+                except Exception:
+                    exit_code = 1
+                break
+        reward = 1.0 if exit_code == 0 else 0.0
+        info = {"tests_passed": exit_code == 0, "exit_code": exit_code}
+        return output, reward, info
+class Tbench2DockerEnvironment(Environment[Tbench2Action, Tbench2Observation, Tbench2State]):
+    """OpenEnv wrapper around Terminal-Bench 2 tasks with Docker isolation.
+
+    This environment runs each task in its own Docker container, reading
+    the image specification from task.toml's [environment] section.
+
+    Requires:
+    - Docker socket mounted (/var/run/docker.sock)
+    - Sufficient disk space for container images
+    """
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(
+        self,
+        tasks_dir: str | None = None,
+        output_dir: str | None = None,
+        command_timeout_s: float = 300.0,
+        safe_mode: bool = True,
+    ) -> None:
+        super().__init__()
+        self.tasks_dir = tasks_dir or os.getenv("TB2_TASKS_DIR", "")
+        self.output_dir = Path(output_dir or os.getenv("TB2_OUTPUT_DIR", "/tmp/tbench2_env_runs"))
+        self.command_timeout_s = command_timeout_s
+        self.safe_mode = safe_mode
+        self._state = Tbench2State()
+        self._task_dir: Path | None = None
+        self._docker_client = None
+        self._container = None
+        self._instruction = ""
+        self._task_image = ""
+        self._task_config: dict[str, Any] = {}
+    def _get_docker_client(self) -> Any:
+        """Lazy initialization of Docker client."""
+        if self._docker_client is None:
+            try:
+                import docker
+                self._docker_client = docker.from_env()
+            except Exception as exc:
+                raise RuntimeError(
+                    f"Docker client not available. Ensure Docker socket is mounted. Error: {exc}"
+                ) from exc
+        return self._docker_client
+    def reset(
+        self,
+        seed: int | None = None,
+        episode_id: str | None = None,
+        **kwargs: Any,
+    ) -> Tbench2Observation:
+        del seed
+        task_id = kwargs.get("task_id") or kwargs.get("task_name")
+        task_path = kwargs.get("task_path") or kwargs.get("path")
+        task_dir = self._resolve_task_path(task_id, task_path)
+        resolved_task_id = task_id or task_dir.name
+        # Read task configuration including Docker image
+        task_toml_path = task_dir / "task.toml"
+        if task_toml_path.exists():
+            self._task_config = tomllib.loads(task_toml_path.read_text(encoding="utf-8"))
+            self._task_image = self._task_config.get("environment", {}).get("docker_image", "")
+        else:
+            self._task_image = ""
+            self._task_config = {}
+        self._instruction = _read_instruction(task_dir)
+        self._task_dir = task_dir
+        # Create trial directory for logs; use one episode id for both the directory and the state
+        resolved_episode_id = episode_id or uuid4().hex
+        trial_name = f"{resolved_task_id}.{resolved_episode_id}"
+        trial_dir = self.output_dir / trial_name
+        trial_dir.mkdir(parents=True, exist_ok=True)
+        # Start Docker container if image is specified
+        if self._task_image:
+            self._start_container(task_dir, trial_dir)
+            # _start_container builds a fresh state; align it with this episode
+            self._state.episode_id = resolved_episode_id
+            self._state.task_id = resolved_task_id
+        else:
+            # Fallback to local mode if no image specified
+            self._state = Tbench2State(
+                episode_id=resolved_episode_id,
+                step_count=0,
+                task_id=resolved_task_id,
+                task_path=str(task_dir),
+                terminal_ready=True,  # No container needed in local mode
+            )
+        return Tbench2Observation(
+            instruction=self._instruction,
+            output="",
+            success=True,
+            error="",
+            task_id=resolved_task_id,
+            task_path=str(task_dir),
+            session_id=None,
+            action_type="reset",
+            info={"docker_image": self._task_image} if self._task_image else {},
+            reward=0.0,
+            done=False,
+        )
+    def _start_container(self, task_dir: Path, trial_dir: Path) -> None:
+        """Start a Docker container for the task.
+
+        Uses file copying instead of bind mounts to support Docker-in-Docker
+        scenarios where the server runs inside a container. Bind mounts reference
+        host paths, which don't exist when the server is containerized.
+        """
+        docker = self._get_docker_client()
+        try:
+            # Pull image if needed
+            try:
+                docker.images.get(self._task_image)
+            except Exception:
+                logging.info(f"Pulling image {self._task_image}...")
+                docker.images.pull(self._task_image)
+            # Start container WITHOUT bind mounts (for DinD compatibility)
+            self._container = docker.containers.run(
+                image=self._task_image,
+                command="sleep infinity",
+                detach=True,
+                network_mode="host",
+                working_dir="/task",
+                remove=False,
+            )
+            # Copy task files into container using tar archive
+            # This works in Docker-in-Docker because we read files from our
+            # filesystem and stream them to the container via the Docker API
+            self._copy_dir_to_container(task_dir, "/task")
+            self._state = Tbench2State(
+                episode_id=str(uuid4()),
+                step_count=0,
+                task_id=task_dir.name,
+                task_path=str(task_dir),
+                terminal_ready=True,
+            )
+        except Exception as exc:
+            raise RuntimeError(f"Failed to start container: {exc}") from exc
+    def _copy_dir_to_container(self, src_dir: Path, dest_path: str) -> None:
+        """Copy a directory into the container using tar archive.
+
+        This method streams files via the Docker API, avoiding bind mount
+        issues in Docker-in-Docker scenarios.
+        """
+        import io
+        import tarfile
+        if self._container is None:
+            raise RuntimeError("Container not started")
+        # Create tar archive in memory
+        tar_stream = io.BytesIO()
+        with tarfile.open(fileobj=tar_stream, mode="w") as tar:
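+            # rglob already yields every nested entry, so add each one non-recursively;
+            # otherwise tar.add() would archive directory contents more than once
+            for item in src_dir.rglob("*"):
+                arcname = str(item.relative_to(src_dir))
+                tar.add(str(item), arcname=arcname, recursive=False)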
+        tar_stream.seek(0)
+        # Copy to container
+        self._container.put_archive(dest_path, tar_stream.getvalue())
+    def _exec_in_container(self, command: str, workdir: str = "/task") -> tuple[int, str]:
+        """Execute a command inside the container."""
+        if self._container is None:
+            raise RuntimeError("Container not started. Call reset() first.")
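+        # Pass the command as an argv list so quotes inside `command` reach bash
+        # unmodified, and honor the caller-supplied workdir
+        exit_code, output = self._container.exec_run(
+            cmd=["bash", "-c", command],
+            workdir=workdir,
+            stdout=True,
+            stderr=True,
+        )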
+        return exit_code, output.decode("utf-8", errors="replace")
+    def step(
+        self,
+        action: Tbench2Action,
+        timeout_s: float | None = None,
+        **kwargs: Any,
+    ) -> Tbench2Observation:
+        del timeout_s, kwargs
+        if not isinstance(action, Tbench2Action):
+            raise TypeError(f"Expected Tbench2Action, got {type(action)}")
+        if self._task_dir is None:
+            raise RuntimeError("TB2 environment not initialized. Call reset() first.")
+        self._state.step_count += 1
+        self._state.last_action_type = action.action_type
+        self._state.last_command = action.command
+        output = ""
+        error = ""
+        success = True
+        reward = None
+        done = False
+        info: dict[str, Any] = {}
+        session_id = action.session_id or "tb2-session"
+        try:
+            if action.action_type == "exec":
+                if self._container:
+                    exit_code, output = self._exec_in_container(action.command)
+                    success = exit_code == 0
+                else:
+                    # Fallback to local execution
+                    import subprocess
+                    result = subprocess.run(
+                        action.command,
+                        shell=True,
+                        capture_output=True,
+                        text=True,
+                        timeout=self.command_timeout_s,
+                        cwd=str(self._task_dir),  # Mirror container mode, which runs in /task
+                    )
+                    output = result.stdout + result.stderr
+                    success = result.returncode == 0
+            elif action.action_type == "write_file":
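+                if self._container:
+                    # Write to container. Quote path and content so quotes or heredoc
+                    # delimiters inside them cannot break the shell command. Like the
+                    # local branch, this writes the content without a trailing newline.
+                    import shlex
+                    write_cmd = f"printf '%s' {shlex.quote(action.content)} > {shlex.quote(action.file_path)}"
+                    exit_code, _ = self._container.exec_run(cmd=["bash", "-c", write_cmd], workdir="/task")
+                    success = exit_code == 0
+                    output = f"Wrote to {action.file_path}"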
+                else:
+                    # Local write
+                    Path(action.file_path).write_text(action.content)
+                    output = f"Wrote to {action.file_path}"
+            elif action.action_type == "evaluate":
+                if self._container:
+                    output, reward, info = self._evaluate_docker()
+                else:
+                    output, reward, info = self._evaluate_local()
+                done = True
+            elif action.action_type == "close":
+                self.close()
+                output = "Closed TB2 environment."
+                done = True
+            else:
+                raise ValueError(f"Unsupported action_type in Docker mode: {action.action_type}")
+        except Exception as exc:
+            success = False
+            error = str(exc)
+        self._state.last_output = output
+        self._state.session_id = session_id or ""
+        return Tbench2Observation(
+            instruction=self._instruction,
+            output=output,
+            success=success,
+            error=error,
+            task_id=self._state.task_id,
+            task_path=self._state.task_path,
+            session_id=session_id or "",
+            action_type=action.action_type,
+            info=info,
+            reward=reward,
+            done=done,
+        )
+    def _evaluate_docker(self) -> tuple[str, float, dict[str, Any]]:
+        """Evaluate task inside Docker container."""
+        if self._container is None:
+            raise RuntimeError("Container not started.")
+        assert self._task_dir is not None, "Task directory not set"
+        # Run pytest in the container's /task directory
+        # Use exit code marker for consistency with local mode
+        cmd = "cd /task && python -m pytest -q tests/ -rA; echo __TB2_EXIT_CODE__:$?"
+        exit_code, output = self._container.exec_run(
+            cmd=f"bash -c '{cmd}'",
+            workdir="/task",
+            stdout=True,
+            stderr=True,
+        )
+        output_str = output.decode("utf-8", errors="replace")
+        # Parse exit code from marker (same logic as local mode)
+        ec = 1
+        marker = "__TB2_EXIT_CODE__"
+        for line in output_str.splitlines()[::-1]:
+            if marker in line:
+                try:
+                    ec = int(line.split(":", 1)[1].strip())
+                except Exception:
+                    ec = 1
+                break
+        reward = 1.0 if ec == 0 else 0.0
+        info = {"tests_passed": ec == 0, "exit_code": ec}
+        return output_str, reward, info
+    def _evaluate_local(self) -> tuple[str, float, dict[str, Any]]:
+        """Evaluate task locally (fallback)."""
+        if self._task_dir is None:
+            raise RuntimeError("Task not initialized.")
+        tests_dir = self._task_dir / "tests"
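+        # No exit-code marker here: a trailing `echo` would make returncode always 0,
+        # so every local evaluation would be scored as passing
+        cmd = f"cd {self._task_dir} && python -m pytest -q {tests_dir} -rA"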
+        import subprocess
+        result = subprocess.run(
+            cmd,
+            shell=True,
+            capture_output=True,
+            text=True,
+            timeout=900.0,
+        )
+        output = result.stdout + result.stderr
+        exit_code = result.returncode
+        reward = 1.0 if exit_code == 0 else 0.0
+        info = {"tests_passed": exit_code == 0, "exit_code": exit_code}
+        return output, reward, info
+    @property
+    def state(self) -> Tbench2State:
+        return self._state
+    def close(self) -> None:
+        if self._container:
+            try:
+                self._container.stop(timeout=10)
+                self._container.remove(force=True)
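+            except Exception as exc:
+                # Best-effort cleanup: log the failure instead of discarding it silently
+                logging.warning("Failed to stop/remove container: %s", exc)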
+            self._container = None
+        self._task_dir = None
+        self._instruction = ""
+    def _resolve_task_path(self, task_id: str | None, task_path: str | None) -> Path:
+        if task_path:
+            resolved = Path(task_path).expanduser().resolve()
+            if not resolved.exists():
+                raise FileNotFoundError(f"Task path not found: {resolved}")
+            return resolved
+        if not task_id:
+            raise ValueError("Provide task_id or task_path to reset TB2 environment.")
+        if not self.tasks_dir:
+            cache_dir = Path(os.getenv("TB2_CACHE_DIR", str(self.output_dir / "repo_cache")))
+            repo_dir = _download_tb2_repo(cache_dir)
+            resolved = repo_dir / task_id
+        else:
+            resolved = Path(self.tasks_dir).expanduser().resolve() / task_id
+        if not resolved.exists():
+            raise FileNotFoundError(f"Task path not found: {resolved}")
+        return resolved
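Review note: `_evaluate_task` and `_evaluate_docker` both append `echo __TB2_EXIT_CODE__:$?` to the test command and then scan the output backwards for that marker. A minimal sketch of that parsing as a shared helper follows. The names `parse_exit_code` and `score` are hypothetical and not part of this commit; the default of 1 mirrors the fallback the diff uses when the marker is missing or malformed.

```python
from typing import Any

MARKER = "__TB2_EXIT_CODE__"  # same marker string the diff echoes after the tests


def parse_exit_code(output: str, marker: str = MARKER, default: int = 1) -> int:
    """Return the exit code from the last line containing the marker, else `default`."""
    for line in reversed(output.splitlines()):
        if marker in line:
            try:
                # Everything after the first ':' on the marker line is the exit code
                return int(line.split(":", 1)[1].strip())
            except (IndexError, ValueError):
                return default
    return default


def score(output: str) -> tuple[float, dict[str, Any]]:
    """Map captured test output to the (reward, info) pair used by the evaluate action."""
    exit_code = parse_exit_code(output)
    passed = exit_code == 0
    return (1.0 if passed else 0.0), {"tests_passed": passed, "exit_code": exit_code}
```

With a helper like this, the local, toolkit and Docker evaluation paths could share one parser instead of each carrying its own copy of the loop.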