| title: 911 Dispatch Supervisor | |
| emoji: "🚨" | |
| colorFrom: red | |
| colorTo: gray | |
| sdk: docker | |
| app_port: 7860 | |
| tags: | |
| - openenv | |
| pinned: false | |
| # 🚨 911 Dispatch Supervisor | |
| > **A city-wide emergency dispatch RL environment**: train and evaluate LLM agents to manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints. | |
| --- | |
| ## Why This Matters | |
| 911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision (which unit to send, in what order, with what priority) directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%. | |
| The **911 Dispatch Supervisor** is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance, all simultaneously. | |
| This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability. | |
| ## Overview | |
| At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate under time pressure, limited resources, and competing priorities across a 100×100 city grid. | |
| This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that: | |
| - Requires **triage**: prioritizing life-threatening incidents over property damage | |
| - Demands **coverage awareness**: keeping geographic zones protected | |
| - Rewards **correct unit-type matching**: sending a MEDIC vs. an ENGINE | |
| - Punishes **delays** that cause Priority-1 incidents to escalate | |
| - Scores **dispatch phraseology**: realistic radio communication language | |
| --- | |
| ## Environment Architecture | |
| ``` | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │                    OpenEnv Interface                    │ | |
| │            reset() · step(action) · state()             │ | |
| └──────────────────────┬──────────────────────────────────┘ | |
|                        │ | |
| ┌──────────────────────▼──────────────────────────────────┐ | |
| │                  DispatchStateMachine                   │ | |
| │  • Validates actions via DispatchProtocolValidator      │ | |
| │  • Moves units toward incidents (Manhattan physics)     │ | |
| │  • Advances incident status: PENDING → RESPONDING →     │ | |
| │    ON_SCENE → RESOLVED (or ESCALATED if timeout)        │ | |
| │  • Spawns incident waves at configured step offsets     │ | |
| └──────────────────────┬──────────────────────────────────┘ | |
|                        │ | |
| ┌──────────────────────▼──────────────────────────────────┐ | |
| │                    RewardCalculator                     │ | |
| │  • response_time (30%) · triage (25%) · survival (25%)  │ | |
| │  • coverage (12%) · protocol (8%)                       │ | |
| │  • Safety gate: P1 failure → score capped at 0.2        │ | |
| └──────────────────────┬──────────────────────────────────┘ | |
|                        │ | |
| ┌──────────────────────▼──────────────────────────────────┐ | |
| │              Task-Specific Episode Graders              │ | |
| │  single_incident · multi_incident · mass_casualty ·     │ | |
| │  shift_surge                                            │ | |
| └─────────────────────────────────────────────────────────┘ | |
| ``` | |
| --- | |
| ## Action Space | |
| Actions are structured Pydantic models, so no free-text parsing is required. | |
| **`src.models.Action`** | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `action_type` | `DispatchAction` | One of: `DISPATCH`, `CANCEL`, `REASSIGN`, `STAGE`, `MUTUAL_AID`, `UPGRADE`, `DOWNGRADE` | | |
| | `unit_id` | `str` | Unit identifier, e.g. `MED-1`, `ENG-2` | | |
| | `incident_id` | `str` | Incident identifier, e.g. `INC-001` | | |
| | `notes` | `str \| None` | Optional phraseology text for protocol scoring bonus | | |
| | `priority_override` | `IncidentSeverity \| None` | Required for `UPGRADE`/`DOWNGRADE` actions | | |
| **Action Types** | |
| | Action | Description | Protocol Rule | | |
| |---|---|---| | |
| | `DISPATCH` | Send an available unit to an incident | Unit must be `AVAILABLE`; incident must not be `RESOLVED` | | |
| | `CANCEL` | Release a unit from its current assignment | Unit must be assigned to the specified incident | | |
| | `REASSIGN` | Redirect an assigned unit to a different incident | Unit must be `DISPATCHED`, `ON_SCENE`, or `TRANSPORTING` | | |
| | `STAGE` | Pre-position a unit near an incident without committing | Unit must be `AVAILABLE`; incident must be `PENDING` | | |
| | `MUTUAL_AID` | Request external unit of a given type | Only allowed when all local units of that type are busy | | |
| | `UPGRADE` | Increase incident severity | New severity must be strictly higher than current | | |
| | `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current | | |
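The protocol rules in the table above can be read as simple predicates over unit and incident state. The following is a minimal sketch of two of them (`DISPATCH` and `UPGRADE`), not the project's `DispatchProtocolValidator`. The helper names `check_dispatch` and `check_upgrade` are hypothetical, and encoding severity as an integer where 1 (P1) is the most severe is an assumption made for illustration.

```python
from typing import List


def check_dispatch(unit_status: str, incident_status: str) -> List[str]:
    """Return rule violations for a DISPATCH action (an empty list means legal)."""
    issues = []
    if unit_status != "AVAILABLE":
        issues.append("unit must be AVAILABLE")
    if incident_status == "RESOLVED":
        issues.append("incident must not be RESOLVED")
    return issues


def check_upgrade(current_priority: int, new_priority: int) -> List[str]:
    """UPGRADE must strictly raise severity.

    Assumption: priorities are integers and P1 (1) is more severe than P3 (3),
    so a "higher" severity is a smaller number.
    """
    if new_priority >= current_priority:
        return ["new severity must be strictly higher than current"]
    return []


print(check_dispatch("AVAILABLE", "PENDING"))  # legal: no issues
print(check_dispatch("ON_SCENE", "RESOLVED"))  # violates both rules
print(check_upgrade(2, 1))                     # P2 -> P1 is a legal upgrade
```

Returning a list of issues rather than a boolean mirrors the `issues` field of the observation, which reports warnings or error codes from the validator.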
| #### Dispatch Phraseology (bonus scoring) | |
| The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to an 8% bonus on their protocol score. | |
| | Action | Example notes value | | |
| |---|---| | |
| | Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` | | |
| | Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` | | |
| | Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` | | |
| | Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` | | |
| --- | |
| ## Observation Space | |
| **`src.models.Observation`** | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `result` | `str` | Human-readable result of the last action | | |
| | `score` | `float` | Episode score in `[0.0, 1.0]` (task-level grade) | | |
| | `protocol_ok` | `bool` | Whether the action passed protocol validation | | |
| | `issues` | `list[str]` | Warnings or error codes from the validator | | |
| | `reward_breakdown` | `dict[str, float] \| None` | Per-component reward scores for dashboard display | | |
| **Full State (`src.models.State`)** | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `units` | `dict[str, UnitState]` | All units with type, status, location, ETA | | |
| | `incidents` | `dict[str, IncidentState]` | All incidents with type, severity, status, assigned units | | |
| | `episode_id` | `str` | Unique episode identifier | | |
| | `step_count` | `int` | Current step number | | |
| | `task_id` | `str` | Active task identifier | | |
| | `city_time` | `float` | Simulated city clock in seconds (30s per step) | | |
| | `metadata` | `dict` | Schema info, districts, seeds, wave configs, bookkeeping | | |
| **Unit Status Transitions** | |
| ``` | |
| AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE | |
|     ↓ | |
| OUT_OF_SERVICE (shift_surge only) | |
| ``` | |
| **Incident Status Transitions** | |
| ``` | |
| PENDING → RESPONDING → ON_SCENE → RESOLVED | |
|    ↓          ↓ | |
| ESCALATED  ESCALATED (survival clock expires) | |
| ``` | |
| --- | |
| ## Reward Function | |
| The step-level reward is a weighted combination of five components: | |
| | Component | Weight | Description | | |
| |---|---|---| | |
| | `response_time` | **30%** | How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240s, P2: 480s, P3: 900s) | | |
| | `triage` | **25%** | Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST) | | |
| | `survival` | **25%** | Fraction of Priority-1 incidents resolved before the survival clock expires | | |
| | `coverage` | **12%** | Geographic distribution of available units across city districts | | |
| | `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` | | |
| > **⚠️ Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at **0.2** regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable, exactly as real dispatch protocol requires. | |
| **Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty. | |
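To make the weighting and the safety gate concrete, here is a minimal sketch of how the five components could be combined. It is not the project's `RewardCalculator`: the function names `step_reward` and `apply_safety_gate` are hypothetical, and the sketch assumes every component score already lies in `[0, 1]`.

```python
# Component weights from the table above; they sum to 1.0.
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}
P1_FAILURE_CAP = 0.2


def step_reward(components: dict) -> float:
    """Weighted sum of the five component scores (each assumed in [0, 1])."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def apply_safety_gate(episode_reward: float, p1_survival_scores: list) -> float:
    """Cap the episode reward at 0.2 if any P1 incident has zero survival score."""
    if any(score == 0.0 for score in p1_survival_scores):
        return min(episode_reward, P1_FAILURE_CAP)
    return episode_reward


# A non-DISPATCH action: response_time and triage take the neutral 0.5.
r = step_reward({
    "response_time": 0.5,
    "triage": 0.5,
    "survival": 1.0,
    "coverage": 0.8,
    "protocol": 1.0,
})
print(round(r, 3))                         # 0.701
print(apply_safety_gate(0.9, [1.0, 0.0]))  # 0.2
```

The second call shows the gate in action: an otherwise strong episode is capped at 0.2 because one Priority-1 incident had a survival score of zero.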
| --- | |
| ## Tasks | |
| ### Task Difficulty Overview | |
| | Task | Difficulty | Max Steps | Key Challenge | | |
| |---|---|---|---| | |
| | `single_incident` | 🟢 Easy | 20 | Dispatch the right unit type quickly | | |
| | `multi_incident` | 🟡 Medium | 40 | Triage 3 simultaneous incidents, protect P1s | | |
| | `mass_casualty` | 🔴 Hard | 60 | Manage wave-based surge with limited resources | | |
| | `shift_surge` | 🔴 Hard | 60 | Adapt as units fail and incidents stream continuously | | |
| --- | |
| ### 🟢 Task 1: `single_incident` - Basic Dispatch (Easy) | |
| **Scenario**: One active incident (`CARDIAC_ARREST`, Priority-1) in a small city. A MEDIC, ENGINE, and PATROL are all available. | |
| **Objective**: Dispatch the correct unit type (MEDIC) to the incident as fast as possible. | |
| **Grader Logic**: | |
| ``` | |
| score = 0.0 | |
| if incident RESOLVED: score += 0.50 | |
| if MEDIC dispatched correctly: score += 0.30 | |
| if resolved within 10 steps: score += 0.20 | |
| ``` | |
| **Why it's easy**: One incident, one correct action, small state space. | |
| **What a good agent does**: Immediately dispatches `MED-1 → INC-001`. | |
| **Scoring:** 50% resolution + 30% correct unit type used + 20% response speed. | |
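The grader pseudocode above translates almost directly into Python. This is a minimal sketch rather than the code in `src/tasks/single_incident.py`: the function name and its three arguments are hypothetical, and treating "within 10 steps" as `steps_to_resolve <= 10` is an assumption.

```python
from typing import Optional


def grade_single_incident(
    resolved: bool,
    medic_dispatched: bool,
    steps_to_resolve: Optional[int],
) -> float:
    """Score the easy task: 50% resolution, 30% correct unit type, 20% speed."""
    score = 0.0
    if resolved:
        score += 0.50
    if medic_dispatched:
        score += 0.30
    # The speed bonus only makes sense for an incident that was actually resolved.
    if resolved and steps_to_resolve is not None and steps_to_resolve <= 10:
        score += 0.20
    return score


print(grade_single_incident(True, True, 6))      # all three criteria met
print(grade_single_incident(True, False, 15))    # resolved, wrong unit, slow
print(grade_single_incident(False, True, None))  # right unit, never resolved
```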
| --- | |
| ### 🟡 Task 2: `multi_incident` - Simultaneous Triage (Medium) | |
| **Scenario**: Three concurrent incidents at episode start β a structure fire (P2), a cardiac arrest (P1), and a shooting (P1) β with 6 available units. | |
| **Objective**: Respond to all incidents with the right unit types, prioritizing P1s. | |
| **Grader Logic**: | |
| ``` | |
| score = 0.5 × p1_resolution_rate | |
|       + 0.3 × overall_resolution_rate | |
|       - 0.2 × escalation_penalty | |
| ``` | |
| **Why it's medium**: Multiple incidents compete for units; wrong type dispatch wastes coverage; P1s must be addressed before P2. | |
| **What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER. | |
| **Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty. | |
| --- | |
| ### 🔴 Task 3: `mass_casualty` - Wave-Based Surge (Hard) | |
| **Scenario**: One critical incident (`BUILDING_COLLAPSE`, P1) at step 0. New waves arrive at steps 5 (structure fire) and 12 (two simultaneous cardiac arrests). | |
| **Objective**: Maximize P1 survival across all waves despite resource conflicts. | |
| **Grader Logic**: | |
| ``` | |
| score = 0.6 × p1_survival_rate | |
|       + 0.3 × mean_step_reward | |
| - failure_penalty | |
| ``` | |
| **Why it's hard**: Resources are exhausted when waves arrive. Agents must decide whether to reassign mid-scene or request mutual aid (at a 120s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed. | |
| **What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves. | |
| **Scoring:** 60% P1 survival + 30% mean step reward − a failure penalty if the building collapse receives no response. | |
| --- | |
| ### 🔴 Task 4: `shift_surge` - Long-Horizon Degradation (Hard) | |
| **Scenario**: 5 units start available, but 3 go `OUT_OF_SERVICE` by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode. | |
| **Objective**: Maintain city-wide throughput and P1 survival despite progressive resource degradation. | |
| **Grader Logic**: | |
| ``` | |
| score = 0.35 × resolution_ratio | |
|       + 0.25 × p1_survival | |
|       + 0.15 × coverage | |
|       + 0.15 × (1 - backlog_ratio) | |
|       + 0.10 × mean_reward | |
|       - 0.25 × escalation_ratio | |
| ``` | |
| **Why it's hard**: There is no single optimal strategy: agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows. | |
| **Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty. | |
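The `shift_surge` formula can be sketched as a single function. This is an illustration only, not the grader in `src/tasks/shift_surge.py`: the function name is hypothetical, all inputs are assumed to be ratios in `[0, 1]`, and the final clamp to `[0.0, 1.0]` is an assumption based on the documented range of `Observation.score`.

```python
def grade_shift_surge(
    resolution_ratio: float,
    p1_survival: float,
    coverage: float,
    backlog_ratio: float,
    mean_reward: float,
    escalation_ratio: float,
) -> float:
    """Weighted shift_surge score; the positive weights sum to 1.0."""
    raw = (
        0.35 * resolution_ratio
        + 0.25 * p1_survival
        + 0.15 * coverage
        + 0.15 * (1.0 - backlog_ratio)
        + 0.10 * mean_reward
        - 0.25 * escalation_ratio
    )
    # Assumption: clamp so the escalation penalty cannot push the score below 0.
    return max(0.0, min(1.0, raw))


# Best case: everything resolved, no backlog, no escalations.
print(grade_shift_surge(1.0, 1.0, 1.0, 0.0, 1.0, 0.0))
# Worst case: the raw value is -0.25, which the clamp raises to 0.0.
print(grade_shift_surge(0.0, 0.0, 0.0, 1.0, 0.0, 1.0))
```

Because the positive weights already sum to 1.0, the escalation term acts purely as a penalty: an agent cannot compensate for escalations by over-performing elsewhere.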
| --- | |
| ## Unit Types | |
| | Unit | Code | Speed | Primary Use | | |
| |---|---|---|---| | |
| | Engine | `ENGINE` | 0.8 bl/s | Structure fires, hazmat support | | |
| | Ladder | `LADDER` | 0.6 bl/s | Multi-story fires, rescues | | |
| | Medic | `MEDIC` | 1.0 bl/s | Medical emergencies, trauma | | |
| | Patrol | `PATROL` | 1.2 bl/s | Shootings, MVAs, crowd control | | |
| | Hazmat | `HAZMAT` | 0.5 bl/s | Chemical/biological spills | | |
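The speeds above combine with the Manhattan movement model to give a travel-time estimate. The sketch below is not the code in `src/physics.py`. It assumes that `bl/s` means grid blocks per second, that one grid cell equals one block, and that positions are `(x, y)` integer tuples; the helper names are hypothetical. The 30 seconds per step comes from the `city_time` description.

```python
import math

# Speeds from the Unit Types table, in blocks per second (assumed unit).
SPEED_BLOCKS_PER_SEC = {
    "ENGINE": 0.8,
    "LADDER": 0.6,
    "MEDIC": 1.0,
    "PATROL": 1.2,
    "HAZMAT": 0.5,
}
SECONDS_PER_STEP = 30.0  # city_time advances 30 s per step


def manhattan(a: tuple, b: tuple) -> int:
    """Manhattan (grid) distance between two (x, y) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def eta_seconds(unit_type: str, unit_pos: tuple, incident_pos: tuple) -> float:
    """Travel time in seconds at the unit type's constant speed."""
    return manhattan(unit_pos, incident_pos) / SPEED_BLOCKS_PER_SEC[unit_type]


def eta_steps(unit_type: str, unit_pos: tuple, incident_pos: tuple) -> int:
    """Travel time rounded up to whole environment steps."""
    return math.ceil(eta_seconds(unit_type, unit_pos, incident_pos) / SECONDS_PER_STEP)


# A MEDIC at (10, 10) responding to an incident at (45, 72): 97 blocks away.
print(eta_seconds("MEDIC", (10, 10), (45, 72)))  # 97.0
print(eta_steps("MEDIC", (10, 10), (45, 72)))    # 4
```

Under these assumptions, the same trip takes a slower unit noticeably longer, which is why unit-type choice interacts with the `response_time` benchmarks (P1: 240s).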
| ## Incident Types | |
| | Incident | Recommended Units | Default Severity | | |
| |---|---|---| | |
| | `CARDIAC_ARREST` | MEDIC | P1 | | |
| | `STRUCTURE_FIRE` | ENGINE × 2, LADDER | P2 | | |
| | `SHOOTING` | MEDIC, PATROL × 2 | P1 | | |
| | `MULTI_VEHICLE_ACCIDENT` | MEDIC, PATROL | P2 | | |
| | `BUILDING_COLLAPSE` | ENGINE, LADDER, MEDIC × 2 | P1 | | |
| | `HAZMAT_SPILL` | HAZMAT, ENGINE | P2 | | |
| | `OVERDOSE` | MEDIC | P2 | | |
| | `MISSING_PERSON` | PATROL | P3 | | |
| --- | |
| ## OpenEnv Interface | |
| ```python | |
| import asyncio | |
| from src.openenv_environment import OpenEnvEnvironment | |
| from src.models import Action, DispatchAction | |
| async def main(): | |
|     env = OpenEnvEnvironment(task_id="multi_incident", seed=42) | |
|     # Reset to initial state | |
|     obs = await env.reset() | |
|     print(obs.result) # "dispatch center online" | |
|     # Get legal actions (protocol-validated) | |
|     legal = env.legal_actions() | |
|     # Take a step | |
|     action = legal[0] | |
|     obs, reward, done = await env.step(action) | |
|     print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}") | |
|     # Inspect full state | |
|     state = env.state() | |
|     print(f"step={state.step_count}, city_time={state.city_time}s") | |
| asyncio.run(main()) | |
| ``` | |
| --- | |
| ## API Endpoints | |
| | Endpoint | Method | Description | | |
| |---|---|---| | |
| | `/health` | GET | Health check; returns `{"status": "healthy"}` | | |
| | `/reset` | POST | Reset environment; body: `{"task_id": "...", "seed": 42}` (both optional) | | |
| | `/step` | POST | Execute an action; body: `{"action": {...}}` | | |
| | `/state` | GET | Current full environment state | | |
| | `/tasks` | GET | List all available tasks with metadata | | |
| | `/dashboard/state` | GET | Extended state for live HTML dashboard | | |
| | `/schema` | GET | JSON schemas for Action, Observation, State | | |
| | `/metadata` | GET | Environment name, version, description | | |
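The endpoints can also be called from Python with only the standard library. The sketch below builds the `/reset` and `/step` requests described in the table. It assumes the server is running locally on port 7860, and the JSON encoding of the action (in particular `"action_type": "DISPATCH"` as a plain string) is an assumption; check `GET /schema` for the authoritative format. The helper names `build_post` and `call` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: server started locally


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given endpoint path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_post("/reset", {"task_id": "multi_incident", "seed": 42})
step_req = build_post(
    "/step",
    {
        "action": {
            "action_type": "DISPATCH",  # assumed serialization of DispatchAction
            "unit_id": "MED-1",
            "incident_id": "INC-001",
        }
    },
)
print(reset_req.get_method(), reset_req.full_url)

# With the server running, send the requests:
# print(call(reset_req))
# print(call(step_req))
```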
| --- | |
| ## Quick Start | |
| ```bash | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Run the demo (non-interactive, no LLM required) | |
| python demo.py | |
| # Start the API server | |
| python -m src.server.app | |
| # Run random agent baseline (no API key required) | |
| USE_RANDOM=true python inference.py | |
| # Run LLM agent | |
| API_BASE_URL=https://router.huggingface.co/v1 \ | |
| MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \ | |
| HF_TOKEN=your_token \ | |
| python inference.py | |
| # Run full test suite | |
| pytest tests/ -v | |
| ``` | |
| --- | |
| ## Docker | |
| ### Build & Run | |
| ```bash | |
| # Build image | |
| docker build -t citywide-dispatch-supervisor . | |
| # Run on port 7860 (required for HF Spaces) | |
| docker run -p 7860:7860 citywide-dispatch-supervisor | |
| # Health check | |
| curl http://localhost:7860/health | |
| # Reset to a specific task | |
| curl -X POST http://localhost:7860/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "multi_incident", "seed": 42}' | |
| ``` | |
| --- | |
| ## Hugging Face Spaces Deployment | |
| This repository is deployed as a Docker-based HF Space. | |
| 1. Create a new HF Space and select **Docker** as the SDK | |
| 2. Push this repository to the Space | |
| 3. The server reads `PORT` from the environment (HF sets `PORT=7860`) | |
| 4. Once running, the following endpoints are publicly available: | |
| - `GET /health` | |
| - `POST /reset` | |
| - `POST /step` | |
| - `GET /state` | |
| Validate your deployment with the prevalidation script: | |
| ```bash | |
| bash samplematerial/prevalidation.sh https://your-space.hf.space . | |
| ``` | |
| --- | |
| ## Environment Variables | |
| | Variable | Description | Default | | |
| |---|---|---| | |
| | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` | | |
| | `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-8B-Instruct` | | |
| | `HF_TOKEN` | Hugging Face API key | (none) | | |
| | `USE_RANDOM` | Set to `true` to run the deterministic random baseline | `false` | | |
| | `PORT` | Server port | `7860` | | |
| --- | |
| ## Baseline Scores | |
| Scores are normalized to `[0.0, 1.0]` using `sum(rewards) / max_steps`. | |
| Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic). | |
| | Task | Difficulty | Max Steps | Random Agent Score | | |
| |---|---|---|---| | |
| | `single_incident` | Easy | 20 | 0.2000 | | |
| | `multi_incident` | Medium | 40 | 0.3117 | | |
| | `mass_casualty` | Hard | 60 | 0.4645 | | |
| | `shift_surge` | Hard | 60 | 0.3183 | | |
| > **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`. | |
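The normalization used for the table is small enough to state in code. This is a minimal sketch of the formula as written, not the implementation in `inference.py`; the function name `normalized_score` is hypothetical.

```python
def normalized_score(step_rewards: list, max_steps: int) -> float:
    """sum(step_rewards) / max_steps, clamped to [0.0, 1.0]."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    return max(0.0, min(1.0, sum(step_rewards) / max_steps))


# Eight steps at reward 0.5 in a task with max_steps=20.
print(normalized_score([0.5] * 8, 20))  # 0.2
```

As written, the formula divides by `max_steps` rather than by the number of steps actually taken, so an episode that ends early contributes fewer rewards to the same denominator.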
| ### What the scores mean | |
| A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable: there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps. | |
| A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55–0.75 on single_incident** and **0.30–0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability. | |
| LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types. | |
| Run the baseline matrix (random + LLM reruns) and emit a JSON report: | |
| ```bash | |
| API_BASE_URL=https://router.huggingface.co/v1 \ | |
| MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \ | |
| HF_TOKEN=your_token \ | |
| python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json | |
| ``` | |
| Windows PowerShell shortcut: | |
| ```powershell | |
| $env:HF_TOKEN="your_token" | |
| powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3 | |
| ``` | |
| --- | |
| ## Project Structure | |
| ``` | |
| . | |
| ├── src/ | |
| │   ├── models.py                 # Pydantic typed contracts (Action, Observation, State) | |
| │   ├── protocol.py               # Dispatch protocol validator | |
| │   ├── physics.py                # City-grid movement / ETA helpers | |
| │   ├── city_schema.py            # City topology + unit configuration loader | |
| │   ├── state_machine.py          # Core dispatch state machine | |
| │   ├── rewards.py                # Reward engine + episode graders | |
| │   ├── phraseology.py            # Dispatch phraseology renderer/judge | |
| │   ├── api.py                    # REST API client wrapper | |
| │   ├── grading.py                # Centralized episode grading router | |
| │   ├── benchmark.py              # Benchmark runner (list/run all tasks) | |
| │   ├── openenv_environment.py    # OpenEnv-compatible environment wrapper | |
| │   ├── tasks/ | |
| │   │   ├── registry.py           # Task registry + deterministic scenario fixtures | |
| │   │   ├── single_incident.py    # Easy task + grader | |
| │   │   ├── multi_incident.py     # Medium task + grader | |
| │   │   ├── mass_casualty.py      # Hard task + grader | |
| │   │   └── shift_surge.py        # Hard task + grader | |
| │   ├── server/ | |
| │   │   ├── app.py                # FastAPI server (reset/step/state endpoints) | |
| │   │   ├── requirements.txt | |
| │   │   └── Dockerfile | |
| │   └── visualizer/ | |
| │       └── viewer.py             # Read-only 2D Matplotlib visualizer | |
| ├── data/ | |
| │   ├── metro_city.json           # Large city schema (default) | |
| │   └── city_small.json           # Small city schema (testing) | |
| ├── tests/                        # TDD test suite (~20 test modules) | |
| ├── samplematerial/ | |
| │   └── prevalidation.sh          # HF Space + Docker validation script | |
| ├── demo.py                       # Non-interactive demo (no LLM required) | |
| ├── inference.py                  # Competition inference script | |
| ├── live_dashboard.html           # Browser-based live dashboard | |
| ├── validate_local.py             # Local pre-submission validation | |
| ├── openenv.yaml                  # OpenEnv specification | |
| ├── pyproject.toml                # uv project config | |
| ├── requirements.txt              # pip dependencies | |
| └── Dockerfile                    # Root Docker build | |
| ``` | |
| --- | |
| ## Live Dashboard | |
| After starting the server and calling `/reset`, open `live_dashboard.html` in a browser: | |
| ```bash | |
| # Terminal 1: start server | |
| python -m src.server.app | |
| # Terminal 2: reset to a task | |
| curl -X POST http://localhost:7860/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "multi_incident"}' | |
| # Browser: open live_dashboard.html | |
| ``` | |
| The dashboard polls `/dashboard/state` every 500ms and renders: | |
| - Unit cards (status, ETA, assignment, location) | |
| - Incident cards (type, severity, status, assigned units) | |
| - City map (2D grid with unit and incident markers) | |
| - Per-step reward component bars | |
| --- | |
| ## 2D Visualizer (Programmatic) | |
| ```python | |
| import asyncio | |
| from src.openenv_environment import OpenEnvEnvironment | |
| from src.visualizer.viewer import Viewer2D | |
| async def main(): | |
|     env = OpenEnvEnvironment(task_id="multi_incident", seed=42) | |
|     await env.reset() | |
|     Viewer2D().render_to_file("frame.png", env.state()) | |
|     env.close() | |
| asyncio.run(main()) | |
| ``` | |
| --- | |
| ## Determinism | |
| All scenarios are deterministic under a fixed seed: | |
| ```python | |
| from src.openenv_environment import OpenEnvEnvironment | |
| env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42) | |
| env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42) | |
| # env1 and env2 produce identical episodes | |
| ``` | |
| Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible. | |
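One common way to get perturbations that are both "random" and reproducible is to draw them from a generator seeded per episode. The sketch below illustrates that idea and is not the project's implementation: the function name `perturbed_position`, the offset range, and the assumption that grid coordinates run from 0 to 99 are all hypothetical.

```python
import random

GRID_MAX = 99  # assumption: a 100×100 grid indexed 0..99


def perturbed_position(base: tuple, seed: int, max_offset: int = 2) -> tuple:
    """Shift a base (x, y) position by a small offset drawn from a seeded RNG."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    dx = rng.randint(-max_offset, max_offset)
    dy = rng.randint(-max_offset, max_offset)
    x = min(GRID_MAX, max(0, base[0] + dx))
    y = min(GRID_MAX, max(0, base[1] + dy))
    return (x, y)


a = perturbed_position((45, 72), seed=42)
b = perturbed_position((45, 72), seed=42)
print(a == b)  # True: the same seed always yields the same position
```

Using a dedicated `random.Random(seed)` instance instead of the module-level functions keeps the result independent of any other random draws made elsewhere in the process.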
| --- | |
| ## Running Tests | |
| ```bash | |
| # Full test suite | |
| pytest tests/ -v | |
| # Individual modules | |
| pytest tests/test_state_machine.py -v | |
| pytest tests/test_rewards.py -v | |
| pytest tests/test_openenv_integration.py -v | |
| pytest tests/test_inference.py -v | |
| ``` | |
| --- | |
| ## Pre-Submission Validation | |
| ```bash | |
| # Full local validation (tests + inference + docker + benchmark scores) | |
| python validate_local.py | |
| # OpenEnv spec validation | |
| openenv validate | |
| # HF Space validation (requires deployed space) | |
| bash samplematerial/prevalidation.sh https://your-space.hf.space . | |
| # Windows (explicit Git Bash) | |
| "C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space . | |
| ``` | |
| --- | |
| ## License | |
| MIT License | |