| title: TemporalBenchEnv MCQ Server | |
| emoji: π₯ | |
| colorFrom: yellow | |
| colorTo: indigo | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # TemporalBenchEnv | |
| OpenEnv environment for **multi-step multiple-choice** time-series reasoning. Each episode samples nine questions from pre-built JSON banks (per-dataset files or merged JSONL in `TSQuestion` shape). Rewards combine per-step correctness and an episode bonus (see `env/reward.py`). | |
| ## Question bank layout | |
| Point the server at a directory containing `PSML_questions.json`, `freshretailnet_questions.json`, `MIMIC_questions.json`, and `causal_chambers_questions.json` (each file is a JSON array of `TSQuestion` records), or set **`TEMPORALBENCH_QUESTION_BANK_DIR`** to that path. If unset, the server uses `tests/fixtures/banks` when present (for local smoke runs). | |
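The lookup order described above could be sketched roughly as follows. This is a minimal illustration, not the server's actual code: `resolve_bank_dir` and `missing_bank_files` are hypothetical helper names, and the precedence of an explicit path over the environment variable is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Bank file names as listed in this README.
BANK_FILES = (
    "PSML_questions.json",
    "freshretailnet_questions.json",
    "MIMIC_questions.json",
    "causal_chambers_questions.json",
)


def resolve_bank_dir(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    fallback: str = "tests/fixtures/banks",
) -> Optional[Path]:
    """Hypothetical helper: explicit path, then env var, then fixtures fallback.

    The precedence of `explicit` over the environment variable is an assumption.
    """
    env = os.environ if env is None else env
    candidate = explicit or env.get("TEMPORALBENCH_QUESTION_BANK_DIR")
    if candidate:
        return Path(candidate)
    fallback_dir = Path(fallback)
    # The fixtures directory is only used when it actually exists.
    return fallback_dir if fallback_dir.is_dir() else None


def missing_bank_files(bank_dir: Path) -> list:
    """Return the expected per-dataset files that are absent from bank_dir."""
    return [name for name in BANK_FILES if not (bank_dir / name).is_file()]


if __name__ == "__main__":
    bank_dir = resolve_bank_dir()
    if bank_dir is None:
        print("No question bank directory configured")
    else:
        print(bank_dir, "missing:", missing_bank_files(bank_dir))
```

Running a check like this before starting the server can surface a misconfigured path early, instead of at the first `reset()`.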
| Each record must include at least: `question_id`, `dataset`, `task_type` (`T1U` | `T3` | `T2_MCQ`), `prompt`, `options` (length ≥ 2), `answer`, plus optional `family`, `capability_tags`, `difficulty`, `metadata`. | |
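As an illustration of the record shape, the sketch below checks the required fields listed above. The `validate_record` helper and every value in `example` are hypothetical, made up for this example; the authoritative schema is the `TSQuestion` definition under `data/`.

```python
# Required fields and task types as listed in this README.
REQUIRED_FIELDS = ("question_id", "dataset", "task_type", "prompt", "options", "answer")
TASK_TYPES = {"T1U", "T3", "T2_MCQ"}


def validate_record(record: dict) -> list:
    """Hypothetical validator: return a list of problems (empty if none found)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "task_type" in record and record["task_type"] not in TASK_TYPES:
        errors.append(f"unknown task_type: {record['task_type']!r}")
    if "options" in record:
        options = record["options"]
        if not isinstance(options, list) or len(options) < 2:
            errors.append("options must be a list with at least 2 entries")
    return errors


# Illustrative record; all values are invented for this sketch.
example = {
    "question_id": "demo-0001",
    "dataset": "PSML",
    "task_type": "T2_MCQ",
    "prompt": "Which option best describes the trend in the series?",
    "options": ["increasing", "decreasing", "flat"],
    "answer": "increasing",
    "difficulty": "easy",  # optional field
}

if __name__ == "__main__":
    print(validate_record(example))
```

A bank file is then a JSON array of such records, so the same check can be applied to each element after `json.load`.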
| ## Quick Start | |
| Use the typed client (`TemporalBenchEnvClient`; alias `TemporalbenchenvEnv`): | |
| ```python | |
| from client import TemporalBenchAction, TemporalBenchEnvClient | |
| # Start a container from the locally built image (Docker image names must be lowercase) | |
| env = TemporalBenchEnvClient.from_docker_image("temporalbenchenv-env:latest") | |
| try: | |
|     out = env.reset() | |
|     while not out.done: | |
|         q = out.observation | |
|         # A real agent picks one of q.options (or the equivalent label string); | |
|         # this placeholder always answers with the first option. | |
|         out = env.step(TemporalBenchAction(answer=q.options[0])) | |
| finally: | |
|     env.close()  # cleans up the container | |
| ``` | |
| `TemporalBenchEnvClient.from_docker_image()` handles: | |
| - Starting the Docker container | |
| - Waiting for the server to be ready | |
| - Connecting to the environment | |
| - Container cleanup when you call `close()` | |
| ## Building the Docker Image | |
| Build the Docker image before using the environment. Docker requires lowercase repository names, so the tag is written in lowercase: | |
| ```bash | |
| # From project root | |
| docker build -t temporalbenchenv-env:latest -f server/Dockerfile . | |
| ``` | |
| ## Deploying to Hugging Face Spaces | |
| Deploy the environment to Hugging Face Spaces with the `openenv push` command: | |
| ```bash | |
| # From the environment directory (where openenv.yaml is located) | |
| openenv push | |
| # Or specify options | |
| openenv push --namespace my-org --private | |
| ``` | |
| The `openenv push` command will: | |
| 1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`) | |
| 2. Prepare a custom build for Hugging Face Docker space (enables web interface) | |
| 3. Upload to Hugging Face (ensuring you're logged in) | |
| ### Prerequisites | |
| - Authenticate with Hugging Face: The command will prompt for login if not already authenticated | |
| ### Options | |
| - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory) | |
| - `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml) | |
| - `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM) | |
| - `--private`: Deploy the space as private (default: public) | |
| ### Examples | |
| ```bash | |
| # Push to your personal namespace (defaults to username/env-name from openenv.yaml) | |
| openenv push | |
| # Push to a specific repository | |
| openenv push --repo-id my-org/my-env | |
| # Push with a custom base image | |
| openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest | |
| # Push as a private space | |
| openenv push --private | |
| # Combine options | |
| openenv push --repo-id my-org/my-env --base-image custom-base:latest --private | |
| ``` | |
| After deployment, your space will be available at: | |
| `https://huggingface.co/spaces/<repo-id>` | |
| The deployed space includes: | |
| - **Web Interface** at `/web` - Interactive UI for exploring the environment | |
| - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface | |
| - **Health Check** at `/health` - Container health monitoring | |
| - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions | |
| ## Environment Details | |
| ### Action (`TemporalBenchAction`) | |
| - `answer` (str): MCQ label (must match ground truth after optional normalization) | |
| - `confidence`, `reasoning`: optional | |
| ### Observation (`TemporalBenchObservation`) | |
| - `question`, `options`, `task_type`, `dataset`, `history`, `accuracy_so_far` | |
| - `step_idx`, `steps_remaining`, `max_steps`, `done`, `reward`, `metadata` | |
| ### Reward | |
| - Per step: `alpha * correctness` (correctness 0 or 1). | |
| - On the final step, adds episode bonus: `lambda_ep * (total_correct / N) * coverage_multiplier` (1.0 if every dataset in the episode has at least one correct answer, else 0.8). | |
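The reward rule above can be sketched as below. This is a minimal illustration and not the code in `env/reward.py`: the function name `episode_rewards`, the `(dataset, is_correct)` input shape, and the default values of `alpha` and `lambda_ep` are assumptions, and `N` is taken to be the number of questions in the episode.

```python
from typing import List, Sequence, Tuple


def episode_rewards(
    steps: Sequence[Tuple[str, bool]],
    alpha: float = 1.0,      # assumed default, not taken from env/reward.py
    lambda_ep: float = 1.0,  # assumed default, not taken from env/reward.py
) -> List[float]:
    """Hypothetical helper: per-step rewards for one episode.

    `steps` holds one (dataset, is_correct) pair per answered question.
    """
    if not steps:
        return []
    # Per step: alpha * correctness, with correctness 0 or 1.
    rewards = [alpha * (1.0 if ok else 0.0) for _, ok in steps]

    n = len(steps)  # N, assumed to be the number of questions in the episode
    total_correct = sum(1 for _, ok in steps if ok)
    datasets_seen = {dataset for dataset, _ in steps}
    datasets_with_correct = {dataset for dataset, ok in steps if ok}
    # 1.0 if every dataset in the episode has at least one correct answer, else 0.8.
    coverage_multiplier = 1.0 if datasets_with_correct == datasets_seen else 0.8

    # The episode bonus is added to the final step only.
    rewards[-1] += lambda_ep * (total_correct / n) * coverage_multiplier
    return rewards


if __name__ == "__main__":
    demo = [("PSML", True), ("MIMIC", False), ("MIMIC", True)]
    print(episode_rewards(demo))
```

With these assumed defaults, an episode that misses every question from one dataset has its bonus scaled by 0.8, while the per-step rewards are unaffected.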
| ## Advanced Usage | |
| ### Connecting to an Existing Server | |
| If you already have a TemporalBenchEnv server running, connect with: | |
| ```python | |
| from client import TemporalBenchAction, TemporalBenchEnvClient | |
| with TemporalBenchEnvClient(base_url="http://localhost:8000") as env: | |
|     r = env.reset() | |
|     r = env.step(TemporalBenchAction(answer=r.observation.options[0])) | |
| ``` | |
| Note: `close()` does not stop a remote server you attached to with `base_url=...`. | |
| ### Using the Context Manager | |
| The client supports context manager usage for automatic connection management: | |
| ```python | |
| from client import TemporalBenchAction, TemporalBenchEnvClient | |
| with TemporalBenchEnvClient(base_url="http://localhost:8000") as env: | |
|     result = env.reset() | |
|     while not result.done: | |
|         ans = result.observation.options[0] | |
|         result = env.step(TemporalBenchAction(answer=ans)) | |
| ``` | |
| The client uses WebSocket connections for: | |
| - **Lower latency**: No HTTP connection overhead per request | |
| - **Persistent session**: Server maintains your environment state | |
| - **Efficient for episodes**: Better for many sequential steps | |
| ### Concurrent WebSocket Sessions | |
| The server uses **factory mode** (`create_app(_env_factory, ...)`) so each WebSocket session gets a fresh `TemporalBenchEnvironment`. Tune `max_concurrent_envs` in `server/app.py` as needed. | |
| ## Development & Testing | |
| ### Direct Environment Testing | |
| ```bash | |
| uv sync --extra dev | |
| uv run pytest tests/ | |
| ``` | |
| ### Running Locally | |
| Run the server locally for development: | |
| ```bash | |
| uvicorn server.app:app --reload | |
| ``` | |
| ## Project Structure | |
| ``` | |
| TemporalBenchEnv/ | |
| ├── .dockerignore      # Docker build exclusions | |
| ├── __init__.py        # Module exports | |
| ├── README.md          # This file | |
| ├── openenv.yaml       # OpenEnv manifest | |
| ├── pyproject.toml     # Project metadata and dependencies | |
| ├── uv.lock            # Locked dependencies (generated) | |
| ├── client.py          # TemporalBenchEnvClient (alias TemporalbenchenvEnv) | |
| ├── models.py          # Action / observation / state re-exports | |
| ├── env/               # Environment, sampler, grading, rewards | |
| ├── data/              # TSQuestion schema + JSON/JSONL loaders | |
| └── server/ | |
|     ├── __init__.py    # Server module exports | |
|     ├── app.py         # FastAPI application (HTTP + WebSocket endpoints) | |
|     └── Dockerfile     # Container image definition | |
| ``` | |