title: TemporalBenchEnv MCQ Server
emoji: πŸ₯
colorFrom: yellow
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

TemporalBenchEnv

OpenEnv environment for multi-step multiple-choice time-series reasoning. Each episode samples nine questions from pre-built JSON banks (per-dataset files or merged JSONL in TSQuestion shape). Rewards combine per-step correctness and an episode bonus (see env/reward.py).

Question bank layout

Point the server at a directory containing PSML_questions.json, freshretailnet_questions.json, MIMIC_questions.json, and causal_chambers_questions.json (each file is a JSON array of TSQuestion records), or set TEMPORALBENCH_QUESTION_BANK_DIR to that path. If unset, the server uses tests/fixtures/banks when present (for local smoke runs).
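
The lookup order described above can be sketched as follows. This is a minimal illustration, not the server's actual loader: the helper names resolve_bank_dir and load_banks are hypothetical, and silently skipping a missing bank file is an assumption. Only the file names, the environment variable, and the fallback path come from this README.

```python
import json
import os
from pathlib import Path

# File names and fallback path are taken from this README.
BANK_FILES = (
    "PSML_questions.json",
    "freshretailnet_questions.json",
    "MIMIC_questions.json",
    "causal_chambers_questions.json",
)
FALLBACK_DIR = Path("tests/fixtures/banks")


def resolve_bank_dir(env=None):
    """Return the configured bank directory, else the fixture directory if it exists."""
    env = os.environ if env is None else env
    configured = env.get("TEMPORALBENCH_QUESTION_BANK_DIR")
    if configured:
        return Path(configured)
    return FALLBACK_DIR if FALLBACK_DIR.is_dir() else None


def load_banks(bank_dir):
    """Load every per-dataset bank file found in bank_dir into one list of records."""
    records = []
    for name in BANK_FILES:
        path = Path(bank_dir) / name
        if not path.is_file():
            continue  # Assumption: a missing file is skipped rather than treated as an error.
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        if not isinstance(data, list):
            raise ValueError(f"{name} must contain a JSON array")
        records.extend(data)
    return records
```

Calling load_banks(resolve_bank_dir()) in a checkout that has tests/fixtures/banks is a quick way to check that the bank files parse before starting the server.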

Each record must include at least: question_id, dataset, task_type (T1U | T3 | T2_MCQ), prompt, options (length ≥ 2), answer, plus optional family, capability_tags, difficulty, metadata.
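
As a rough illustration of these requirements, the sketch below checks one record against the required fields. The validate_record helper and every value in the example record are made up for demonstration; the real schema lives in data/ and may enforce more than this.

```python
REQUIRED_FIELDS = ("question_id", "dataset", "task_type", "prompt", "options", "answer")
TASK_TYPES = {"T1U", "T3", "T2_MCQ"}


def validate_record(record):
    """Return a list of problems found in one TSQuestion-shaped dict (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems
    if record["task_type"] not in TASK_TYPES:
        problems.append(f"unknown task_type: {record['task_type']!r}")
    options = record["options"]
    if not isinstance(options, list) or len(options) < 2:
        problems.append("options must be a list with at least 2 entries")
    return problems


# Illustrative record: all values below are invented, not taken from a real bank.
example = {
    "question_id": "demo-0001",
    "dataset": "PSML",
    "task_type": "T2_MCQ",
    "prompt": "Which option best describes the trend?",
    "options": ["A", "B", "C", "D"],
    "answer": "B",
}
print(validate_record(example))  # []
```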

Quick Start

Use the typed client (TemporalBenchEnvClient; alias TemporalbenchenvEnv):

from client import TemporalBenchAction, TemporalBenchEnvClient

# Start the container before the try block so `env` is always bound when `finally` runs.
env = TemporalBenchEnvClient.from_docker_image("temporalbenchenv-env:latest")
try:
    out = env.reset()
    while not out.done:
        q = out.observation
        # Placeholder policy: always answer with the first option.
        # A real agent would pick q.options[i] or an equivalent label string.
        out = env.step(TemporalBenchAction(answer=q.options[0]))
finally:
    env.close()

TemporalBenchEnvClient.from_docker_image() handles:

  • Starting the Docker container
  • Waiting for the server to be ready
  • Connecting to the environment
  • Container cleanup when you call close()

Building the Docker Image

Before using the environment, you need to build the Docker image:

# From project root
docker build -t temporalbenchenv-env:latest -f server/Dockerfile .

Deploying to Hugging Face Spaces

Deploy your OpenEnv environment to Hugging Face Spaces with the openenv push command:

# From the environment directory (where openenv.yaml is located)
openenv push

# Or specify options
openenv push --namespace my-org --private

The openenv push command will:

  1. Validate that the directory is an OpenEnv environment (checks for openenv.yaml)
  2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
  3. Upload to Hugging Face (after checking that you are logged in)

Prerequisites

  • Authenticate with Hugging Face: the command prompts for login if you are not already authenticated

Options

  • --directory, -d: Directory containing the OpenEnv environment (defaults to current directory)
  • --repo-id, -r: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
  • --base-image, -b: Base Docker image to use (overrides Dockerfile FROM)
  • --private: Deploy the space as private (default: public)

Examples

# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
openenv push

# Push to a specific repository
openenv push --repo-id my-org/my-env

# Push with a custom base image
openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest

# Push as a private space
openenv push --private

# Combine options
openenv push --repo-id my-org/my-env --base-image custom-base:latest --private

After deployment, your space will be available at: https://huggingface.co/spaces/<repo-id>

The deployed space includes:

  • Web Interface at /web - Interactive UI for exploring the environment
  • API Documentation at /docs - Full OpenAPI/Swagger interface
  • Health Check at /health - Container health monitoring
  • WebSocket at /ws - Persistent session endpoint for low-latency interactions
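
The endpoints listed above can be probed with a few lines of standard-library Python. This is a sketch under assumptions: the helper names are hypothetical, the base URL is a local server on the documented port, and treating HTTP 200 from /health as "healthy" is an assumption about the server's response.

```python
import urllib.request
from urllib.parse import urljoin


def endpoint_urls(base_url):
    """Build the /web, /docs, /health and /ws URLs for a server base URL."""
    base = base_url.rstrip("/") + "/"
    urls = {name: urljoin(base, name) for name in ("web", "docs", "health")}
    # WebSocket URLs use ws:// for http:// and wss:// for https://.
    ws_scheme = "wss" if base.startswith("https://") else "ws"
    urls["ws"] = ws_scheme + base[base.index("://"):] + "ws"
    return urls


def is_healthy(base_url, timeout=5.0):
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(endpoint_urls(base_url)["health"], timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # Covers connection errors, timeouts, and HTTP error statuses.
        return False
```

For example, is_healthy("http://localhost:8000") can be polled in a loop before the first reset() call.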

Environment Details

Action (TemporalBenchAction)

  • answer (str): MCQ label (must match ground truth after optional normalization)
  • confidence, reasoning: optional
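
What "optional normalization" means in practice is defined by the grading code in env/. The sketch below shows one plausible scheme (trim, collapse whitespace, ignore case) purely as an assumption. Both function names are hypothetical.

```python
def normalize_label(text):
    """Hypothetical normalization: trim, collapse inner whitespace, ignore case."""
    return " ".join(text.split()).casefold()


def is_correct(submitted, ground_truth):
    """Compare a submitted answer with the ground truth after normalization."""
    return normalize_label(submitted) == normalize_label(ground_truth)


print(is_correct("  b ", "B"))  # True
print(is_correct("A", "B"))     # False
```

If the real grader is stricter than this, submitting the option string exactly as it appears in observation.options is the safest choice.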

Observation (TemporalBenchObservation)

  • question, options, task_type, dataset, history, accuracy_so_far
  • step_idx, steps_remaining, max_steps, done, reward, metadata

Reward

  • Per step: alpha * correctness (correctness 0 or 1).
  • On the final step, adds episode bonus: lambda_ep * (total_correct / N) * coverage_multiplier (1.0 if every dataset in the episode has at least one correct answer, else 0.8).
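
The two rules above can be worked through in a few lines. This is an illustrative re-implementation, not the code in env/reward.py: the function names are hypothetical, and alpha = 1.0 and lambda_ep = 1.0 are assumed values chosen only to make the arithmetic easy to follow.

```python
def step_reward(correct, alpha):
    """Per-step reward: alpha times correctness (0 or 1)."""
    return alpha * (1.0 if correct else 0.0)


def episode_bonus(results, lambda_ep):
    """Episode bonus for a list of (dataset, correct) pairs, one per step."""
    if not results:
        raise ValueError("an episode needs at least one step")
    total_correct = sum(1 for _, ok in results if ok)
    datasets = {ds for ds, _ in results}
    covered = {ds for ds, ok in results if ok}
    # 1.0 only if every dataset in the episode has at least one correct answer.
    coverage_multiplier = 1.0 if covered == datasets else 0.8
    return lambda_ep * (total_correct / len(results)) * coverage_multiplier


def episode_return(results, alpha, lambda_ep):
    """Sum of per-step rewards plus the bonus added on the final step."""
    return sum(step_reward(ok, alpha) for _, ok in results) + episode_bonus(results, lambda_ep)


# Nine steps, five correct, no correct answer on MIMIC, so the multiplier is 0.8.
results = [
    ("PSML", True), ("PSML", True), ("PSML", False),
    ("freshretailnet", True), ("freshretailnet", False),
    ("MIMIC", False), ("MIMIC", False),
    ("causal_chambers", True), ("causal_chambers", True),
]
print(round(episode_return(results, alpha=1.0, lambda_ep=1.0), 3))  # 5.444
```

Here the per-step rewards sum to 5.0 and the bonus is 1.0 * (5 / 9) * 0.8 ≈ 0.444. One correct MIMIC answer would raise the bonus to 1.0 * (6 / 9) * 1.0 ≈ 0.667.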

Advanced Usage

Connecting to an Existing Server

If you already have a TemporalBenchEnv server running, connect with:

from client import TemporalBenchAction, TemporalBenchEnvClient

with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    r = env.reset()
    r = env.step(TemporalBenchAction(answer=r.observation.options[0]))

Note: close() does not stop a remote server you attached to with base_url=....

Using the Context Manager

The client supports context manager usage for automatic connection management:

from client import TemporalBenchAction, TemporalBenchEnvClient

with TemporalBenchEnvClient(base_url="http://localhost:8000") as env:
    result = env.reset()
    while not result.done:
        ans = result.observation.options[0]
        result = env.step(TemporalBenchAction(answer=ans))

The client uses WebSocket connections for:

  • Lower latency: No HTTP connection overhead per request
  • Persistent session: Server maintains your environment state
  • Efficient for episodes: Better for many sequential steps

Concurrent WebSocket Sessions

The server uses factory mode (create_app(_env_factory, ...)) so each WebSocket session gets a fresh TemporalBenchEnvironment. Tune max_concurrent_envs in server/app.py as needed.
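
The idea behind factory mode can be shown without any server code. In this toy sketch, ToyEnvironment and SessionRegistry are hypothetical stand-ins, and raising an error at the limit is an assumption. How create_app actually enforces max_concurrent_envs is not specified here.

```python
class ToyEnvironment:
    """Stand-in for TemporalBenchEnvironment; holds per-session state only."""

    def __init__(self):
        self.step_idx = 0

    def step(self):
        self.step_idx += 1
        return self.step_idx


class SessionRegistry:
    """Creates one environment per session id, up to max_concurrent_envs."""

    def __init__(self, env_factory, max_concurrent_envs):
        self._factory = env_factory
        self._max = max_concurrent_envs
        self._envs = {}

    def open(self, session_id):
        if len(self._envs) >= self._max:
            raise RuntimeError("max_concurrent_envs reached")
        # Calling the factory per session is what keeps episode state isolated.
        self._envs[session_id] = self._factory()
        return self._envs[session_id]

    def close(self, session_id):
        self._envs.pop(session_id, None)


registry = SessionRegistry(ToyEnvironment, max_concurrent_envs=2)
a, b = registry.open("a"), registry.open("b")
a.step()
a.step()
print(a.step_idx, b.step_idx)  # 2 0
```

Because each session owns its own instance, one client stepping through an episode never advances another client's step counter.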

Development & Testing

Direct environment testing

uv sync --extra dev
uv run pytest tests/

Running Locally

Run the server locally for development:

uvicorn server.app:app --reload

Project Structure

TemporalBenchEnv/
├── .dockerignore          # Docker build exclusions
├── __init__.py            # Module exports
├── README.md              # This file
├── openenv.yaml           # OpenEnv manifest
├── pyproject.toml         # Project metadata and dependencies
├── uv.lock                # Locked dependencies (generated)
├── client.py              # TemporalBenchEnvClient (alias TemporalbenchenvEnv)
├── models.py              # Action / observation / state re-exports
├── env/                   # Environment, sampler, grading, rewards
├── data/                  # TSQuestion schema + JSON/JSONL loaders
└── server/
    ├── __init__.py        # Server module exports
    ├── app.py             # FastAPI application (HTTP + WebSocket endpoints)
    └── Dockerfile         # Container image definition