# SchedulingOptEnv
## A Markov Decision Environment for Training Autonomous Scheduling Optimisation Agents
**Meta × Scaler OpenEnv Hackathon Submission**
**Team Vector:** Vittal Mukunda · Nikhilesh Nuthalapati · Stavan Rahul Khobare (Lead)
---
## Table of Contents
1. [Abstract](#1-abstract)
2. [Introduction](#2-introduction)
3. [Background and Motivation](#3-background-and-motivation)
4. [System Architecture](#4-system-architecture)
5. [Methodology](#5-methodology)
6. [The Scheduling Instance Corpus](#6-the-scheduling-instance-corpus)
7. [Task Definitions and Grading Functions](#7-task-definitions-and-grading-functions)
8. [Reward Design Philosophy](#8-reward-design-philosophy)
9. [Data Models](#9-data-models)
10. [API Specification](#10-api-specification)
11. [Inference and Baseline](#11-inference-and-baseline)
12. [Setup and Installation](#12-setup-and-installation)
13. [How to Get Results](#13-how-to-get-results)
14. [End-to-End Walkthrough](#14-end-to-end-walkthrough)
15. [Evaluation and Scoring](#15-evaluation-and-scoring)
16. [Project Structure](#16-project-structure)
17. [Dependencies](#17-dependencies)
18. [Glossary](#18-glossary)
---
## 1. Abstract
We present **SchedulingOptEnv**, a real-world AI agent training environment built on the OpenEnv framework for the Meta × Scaler OpenEnv Hackathon. The environment frames combinatorial scheduling optimisation as a sequential decision problem, a Markov Decision Process (MDP), and exposes agents to three progressively harder sub-tasks: binary feasibility determination, multi-class constraint-violation classification, and full schedule repair.
Each task is paired with a structured reward function that provides dense, partial-progress signals rather than sparse binary outcomes. This design gives an agent a meaningful learning signal at every step, even when its answer is wrong, which is intended to speed up convergence during training.
The environment ships with:
- A **12-instance scheduling corpus** covering five distinct constraint-violation classes and two fully feasible baseline schedules
- A **FastAPI HTTP inference server** with 7 endpoints, exposing the full OpenEnv API contract
- A **standalone inference script** (`inference.py`) using the OpenAI client with configurable `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` environment variables
- A **GPT-4o-mini baseline** with oracle mock fallback for offline verification
- A **Docker container** deployable to Hugging Face Spaces with a single command
The oracle baseline achieves a perfect score (1.0) on all three tasks. The environment is fully spec-compliant with the OpenEnv standard and passes all automated pre-submission validation checks.
---
## 2. Introduction
### 2.1 What Is This Project?
In plain terms: **SchedulingOptEnv is a training gym where AI learns to detect, classify, and fix broken work schedules.**
Imagine a factory that needs to schedule 5 machines and 20 jobs. Some jobs can only run after other jobs finish. Some machines can only handle 2 jobs at once. Some machines are offline for maintenance from 3pm to 6pm. If someone hands the factory manager a schedule that violates any of these rules, the schedule is broken and production will fail.
This project creates a structured training playground where an AI agent can:
1. Look at a proposed schedule
2. Decide whether it is valid or broken
3. If broken, identify which type of rule was broken
4. Produce a corrected schedule that satisfies all constraints and minimises total time
The agent learns by trial and error: it receives a graded reward when it responds, and over thousands of practice rounds, it improves.
### 2.2 Why Scheduling?
Scheduling is one of the most practically important and computationally hard problems in computer science. It appears across nearly every industry:
| Industry | Scheduling Problem |
|----------|--------------------|
| Manufacturing | Assigning jobs to machines, managing shift constraints |
| Healthcare | Booking operating rooms, allocating staff and equipment |
| Aviation | Scheduling flights, crew rotations, gate assignments |
| Cloud computing | Allocating compute tasks to servers with capacity limits |
| Construction | Sequencing tasks with dependencies and resource limits |
| Logistics | Routing vehicles with time-window constraints |
Despite its industrial importance, prior benchmarks for evaluating AI agents on scheduling tasks were either:
- **Offline only:** agents see a problem once and produce a single answer, with no iterative improvement
- **Narrowly scoped:** focused purely on optimisation, not on the constraint-satisfaction and repair workflow that human planners actually use day-to-day
SchedulingOptEnv fills this gap. It creates an interactive, multi-step environment where an agent can reason, receive feedback, and refine its answers, mirroring the real cognitive workflow of a scheduling expert.
### 2.3 What Is OpenEnv?
OpenEnv is a framework specification by Meta and Hugging Face for building standardised, interactive AI agent training environments. It defines a common API contract (`step()`, `reset()`, `state()`) that any AI training system can connect to. Because SchedulingOptEnv is built on OpenEnv, any compatible AI agent or training loop can start learning from the environment without custom integration work.
---
## 3. Background and Motivation
### 3.1 The Scheduling Problem Formally
A scheduling instance consists of:
- A set of **jobs** `J = {J1, J2, ..., Jn}`, each with:
  - `duration`: how long the job takes to run
  - `deadline`: the latest time by which the job must be completed
  - `dependencies`: a list of job IDs that must complete before this job starts
  - `resource_req`: how much capacity this job consumes on its assigned machine
- A set of **machines** `M = {M1, M2, ..., Mm}`, each with:
  - `capacity`: how many jobs can run concurrently
  - `available_start`: the earliest time the machine is operational
  - `available_end`: the latest time the machine is operational
- A **proposed schedule**: a set of assignments `{(job_id, machine_id, start_time)}` specifying when and where each job runs
A schedule is **feasible** if and only if it satisfies all four constraint categories simultaneously:
| Constraint | Formal Definition |
|-----------|-------------------|
| **Capacity** | At every time `t`, the number of jobs concurrently running on machine `m` does not exceed `m.capacity` |
| **Deadline** | For every job `j`, `start_time(j) + duration(j) ≤ deadline(j)` |
| **Precedence** | For every job `j` and every predecessor `p ∈ dependencies(j)`: `start_time(j) ≥ start_time(p) + duration(p)` |
| **Availability** | For every job `j` on machine `m`: `start_time(j) ≥ m.available_start` and `start_time(j) + duration(j) ≤ m.available_end` |
A schedule is **infeasible** if it violates any one of these.
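The four checks in the table can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual checker: the function name `check_constraints`, the `id` key on jobs and machines, and the half-open interval convention `[start, end)` are assumptions made for the example, and the deadlines in the sample data are invented. Capacity is counted in concurrent jobs, following the table's definition, so `resource_req` is not used here.

```python
from typing import Dict, List


def check_constraints(instance: dict, assignments: List[dict]) -> Dict[str, bool]:
    """Return one pass/fail flag per constraint category (illustrative sketch)."""
    # Assumption: jobs and machines are keyed by an "id" field.
    jobs = {j["id"]: j for j in instance["jobs"]}
    machines = {m["id"]: m for m in instance["machines"]}
    start = {a["job_id"]: a["start_time"] for a in assignments}
    end = {jid: start[jid] + jobs[jid]["duration"] for jid in start}

    # Deadline: start_time + duration <= deadline for every job.
    deadlines_ok = all(end[j] <= jobs[j]["deadline"] for j in start)

    # Precedence: a job may start only once every predecessor has finished.
    precedence_ok = all(
        p in end and start[j] >= end[p]
        for j in start
        for p in jobs[j].get("dependencies", [])
    )

    # Availability: the whole run must fit inside the machine's window.
    availability_ok = all(
        start[a["job_id"]] >= machines[a["machine_id"]]["available_start"]
        and end[a["job_id"]] <= machines[a["machine_id"]]["available_end"]
        for a in assignments
    )

    # Capacity: concurrency can only rise at a job's start time, so it is
    # enough to count the jobs running on the machine at each start instant.
    capacity_ok = True
    for a in assignments:
        t = a["start_time"]
        running = sum(
            1
            for b in assignments
            if b["machine_id"] == a["machine_id"]
            and b["start_time"] <= t < end[b["job_id"]]
        )
        if running > machines[a["machine_id"]]["capacity"]:
            capacity_ok = False

    return {
        "capacity_ok": capacity_ok,
        "deadlines_ok": deadlines_ok,
        "precedence_ok": precedence_ok,
        "availability_ok": availability_ok,
    }


# Toy instance shaped like P01: J1 [0,4) and J2 [2,5) overlap on a capacity-1 machine.
instance = {
    "jobs": [
        {"id": "J1", "duration": 4, "deadline": 10, "dependencies": []},
        {"id": "J2", "duration": 3, "deadline": 10, "dependencies": []},
    ],
    "machines": [{"id": "M1", "capacity": 1, "available_start": 0, "available_end": 20}],
}
broken = [
    {"job_id": "J1", "machine_id": "M1", "start_time": 0},
    {"job_id": "J2", "machine_id": "M1", "start_time": 2},
]
repaired = [
    {"job_id": "J1", "machine_id": "M1", "start_time": 0},
    {"job_id": "J2", "machine_id": "M1", "start_time": 4},
]
print(check_constraints(instance, broken)["capacity_ok"])    # False
print(check_constraints(instance, repaired)["capacity_ok"])  # True
```

A schedule is feasible exactly when all four flags are `True`.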
### 3.2 Why This Is Hard
Even checking whether a schedule is feasible is non-trivial for large instances. Repairing a broken schedule to be both feasible and optimal is NP-hard in general. Human planners develop intuition over years of experience. This project asks: can we create an environment rich enough for an AI to develop similar intuition through reinforcement learning?
### 3.3 The Gap in Existing Benchmarks
Existing scheduling benchmarks (e.g., JSP, FJSP) focus on finding optimal schedules from scratch. They do not model:
- The iterative detect → classify → repair workflow
- Partial-credit rewards for nearly-correct repairs
- Multi-step episodes where agents can refine answers
- Diverse constraint violation types as a classification target
SchedulingOptEnv addresses all of these gaps.
---
## 4. System Architecture
The system is organised into five logical layers, each building on the one below it.
```
┌────────────────────────────────────────────────────────────────────────┐
│ LAYER 5: EXTERNAL CLIENTS                                              │
│ AI agents, training loops, researchers, the inference script           │
│ Communicate via HTTP JSON requests to port 7860                        │
└───────────────────────────────────┬────────────────────────────────────┘
                                    │ HTTP / JSON
┌───────────────────────────────────▼────────────────────────────────────┐
│ LAYER 4: API SERVER (server.py)                                        │
│ FastAPI application with 7 REST endpoints                              │
│ Validates requests, routes to environment, serialises responses        │
│ Raises HTTP 400/422 for malformed input                                │
└───────────────────────────────────┬────────────────────────────────────┘
                                    │ Python function calls
┌───────────────────────────────────▼────────────────────────────────────┐
│ LAYER 3: ENVIRONMENT (environment.py)                                  │
│ SchedulingOptEnv class: the core "game engine"                         │
│ Manages: episode lifecycle, instance selection, step counting,         │
│ termination logic, cumulative reward tracking, action history          │
│ Holds: INSTANCE_BANK (12 scheduling problems), grader singletons       │
└─────────────────┬────────────────────────────────────┬─────────────────┘
                  │                                    │
  ┌───────────────▼───────────────┐   ┌────────────────▼────────────────┐
  │ LAYER 2a: TASKS               │   │ LAYER 2b: GRADERS               │
  │ (tasks/ folder)               │   │ (graders/ folder)               │
  │                               │   │                                 │
  │ task1_easy.py                 │   │ grader_detection.py             │
  │ task2_medium.py               │   │   FeasibilityGrader             │
  │ task3_hard.py                 │   │   (binary, synonym-aware)       │
  │                               │   │                                 │
  │ Each task exposes:            │   │ grader_classification.py        │
  │ - episode runner              │   │   ConflictGrader                │
  │ - instance accessor           │   │   (family-aware partial)        │
  │ - task metadata               │   │                                 │
  └───────────────┬───────────────┘   │ grader_fix.py                   │
                  │                   │   RepairGrader                  │
                  │                   │   (4-component additive)        │
                  │                   └────────────────┬────────────────┘
                  │                                    │
┌─────────────────▼────────────────────────────────────▼─────────────────┐
│ LAYER 1: DATA MODELS (models.py)                                       │
│ Pydantic v2 schemas: Observation, Action, Reward                       │
│ The shared type contract all layers use to communicate                 │
│ Enforces field types, constraints (score ∈ [0.0, 1.0]), and docs       │
└────────────────────────────────────────────────────────────────────────┘
Additional components (alongside Layer 4):
  inference.py  → OpenAI-client inference script (API_BASE_URL / MODEL_NAME / HF_TOKEN)
  baseline.py   → GPT-4o-mini evaluation with oracle mock fallback
  openenv.yaml  → OpenEnv metadata manifest (machine-readable spec declaration)
  Dockerfile    → Container definition (python:3.11-slim, port 7860)
```
### 4.1 Component Responsibilities
**`models.py`**: Defines the three Pydantic v2 schemas that every other component imports:
- `Observation`: what the agent sees at each step (the schedule, task ID, instructions, step number)
- `Action`: what the agent submits (its text response + task ID)
- `Reward`: what the grader returns (float score + human-readable feedback string)
**`environment.py`**: The central engine. Contains:
- `INSTANCE_BANK`: the list of 12 scheduling instances with ground-truth answers
- `SchedulingOptEnv` class with `reset()`, `step()`, and `state()` methods
- Task-aware instance routing (Tasks 2 and 3 only see infeasible instances)
- Round-robin instance cycling to ensure all instances are covered during training
- Per-step contextual hints embedded in the `context` field of each Observation
**`server.py`**: FastAPI web application. Creates one shared `SchedulingOptEnv` instance and exposes it over 7 HTTP endpoints. Handles input validation and error responses.
**`graders/`**: Three independent grader classes, each implementing a `grade(action, ground_truth) → float` method. Graders keep no episode state between calls; the only thing they retain is the most recent grading breakdown, stored in `last_breakdown`, which the environment surfaces in the `info` dict.
**`tasks/`**: Thin wrappers that expose episode-running logic and instance pools for each task. These are used both by the environment and directly in the inference script.
**`inference.py`**: Standalone evaluation script. Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables. Creates an OpenAI client pointed at the configured endpoint. Runs all 32 episodes (12 + 10 + 10) and emits `[START]`, `[STEP]`, and `[END]` structured JSON logs to stdout.
---
## 5. Methodology
### 5.1 MDP Formulation
The environment is formalised as a Markov Decision Process with the following components:
| MDP Component | Definition |
|--------------|-----------|
| **State** `S` | Current scheduling instance + task type + step count + episode history |
| **Observation** `O` | `{schedule_instance: str (JSON), task_id: str, context: str, step_number: int}` |
| **Action** `A` | `{response: str, task_id: str}` |
| **Reward** `R` | Float ∈ [0.0, 1.0] from the task-specific grader |
| **Horizon** `T` | 3 steps (Task 1) / 5 steps (Task 2) / 8 steps (Task 3) |
| **Terminal** | `done = True` when T is reached OR reward ≥ 0.95 |
The observation is always a JSON-encoded scheduling instance paired with a context string that instructs the agent on what to do. The action is always a text string: the agent's natural-language or structured-JSON response.
### 5.2 Episode Lifecycle
Every episode follows this exact sequence:
```
1. reset(task_id) is called
   ├── Validate task_id ∈ {feasibility_check, conflict_classification, schedule_repair}
   ├── Select instance pool for this task
   ├── Pick next instance via round-robin counter
   ├── Set step = 0, done = False, cumulative_reward = 0.0
   ├── Build initial context string (task instructions)
   └── Return Observation
2. Agent reads the Observation and formulates a response
3. step(action) is called
   ├── Increment step counter
   ├── Select correct grader based on task_id
   ├── Call grader.grade(action, instance_ground_truth)
   ├── Clamp reward to [0.0, 1.0]
   ├── Accumulate cumulative_reward
   ├── Log (step, action, reward) to history
   ├── Check termination: done = (step ≥ max_steps OR reward ≥ 0.95)
   └── Return (Observation, reward, done, info)
4. If done = False → repeat from step 2
   If done = True  → episode ends; call reset() to start the next episode
```
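The step loop and termination rule above can be condensed into a small driver. This is a sketch under stated assumptions, not the code in `environment.py`: `run_episode`, `policy`, and `grade` are hypothetical names, and the grader is reduced to a plain callable so the example runs without the server. The horizons and the 0.95 threshold come from the MDP table.

```python
from typing import Callable, List, Tuple

# Horizons per task, as listed in the MDP table.
MAX_STEPS = {"feasibility_check": 3, "conflict_classification": 5, "schedule_repair": 8}
SUCCESS_THRESHOLD = 0.95

History = List[Tuple[int, str, float]]


def run_episode(
    task_id: str,
    policy: Callable[[int, History], str],  # hypothetical: maps (step, history) to a response
    grade: Callable[[str], float],          # hypothetical stand-in for the task grader
) -> Tuple[float, History]:
    """Run one episode and return (cumulative_reward, history)."""
    if task_id not in MAX_STEPS:
        raise ValueError(f"unknown task_id: {task_id!r}")
    step, cumulative, done = 0, 0.0, False
    history: History = []
    while not done:
        step += 1
        response = policy(step, history)
        reward = min(1.0, max(0.0, grade(response)))  # clamp to [0.0, 1.0]
        cumulative += reward
        history.append((step, response, reward))
        # Episode ends at the horizon or as soon as a near-perfect reward is seen.
        done = step >= MAX_STEPS[task_id] or reward >= SUCCESS_THRESHOLD
    return cumulative, history


def toy_grade(response: str) -> float:
    return 1.0 if response == "infeasible" else 0.1


def second_try(step: int, history: History) -> str:
    # Wrong on the first step, right on the second.
    return "feasible" if step == 1 else "infeasible"


total, history = run_episode("feasibility_check", second_try, toy_grade)
print(len(history), round(total, 2))  # 2 1.1
```

The episode stops after two steps because the second reward crosses the threshold, even though the Task 1 horizon allows three.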
### 5.3 Task-Aware Instance Routing
Not all 12 instances are appropriate for every task:
- **Task 1 (Feasibility Check):** Uses all 12 instances (10 infeasible + 2 feasible), because the agent must learn to recognise both valid and invalid schedules
- **Task 2 (Conflict Classification):** Uses only the 10 infeasible instances, because the task presupposes the schedule is broken
- **Task 3 (Schedule Repair):** Uses only the 10 infeasible instances with known optimal repairs
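A minimal sketch of this routing, combined with round-robin selection, might look as follows. The class name `RoundRobinRouter` and the choice of one counter per task are assumptions for illustration; the source only states that a round-robin counter is used.

```python
from collections import defaultdict
from typing import Dict, List


def build_pools(instance_bank: List[dict]) -> Dict[str, List[dict]]:
    """Task 1 sees every instance; Tasks 2 and 3 see only the infeasible ones."""
    infeasible = [inst for inst in instance_bank if not inst["is_feasible"]]
    return {
        "feasibility_check": list(instance_bank),
        "conflict_classification": infeasible,
        "schedule_repair": infeasible,
    }


class RoundRobinRouter:
    """Hypothetical helper: cycles through each task's pool in order."""

    def __init__(self, instance_bank: List[dict]) -> None:
        self.pools = build_pools(instance_bank)
        self.counters = defaultdict(int)  # assumption: one counter per task

    def next_instance(self, task_id: str) -> dict:
        pool = self.pools[task_id]
        instance = pool[self.counters[task_id] % len(pool)]
        self.counters[task_id] += 1
        return instance


# Stand-in bank with the same shape as the corpus: P01..P10 infeasible, P11..P12 feasible.
bank = [{"problem_id": f"P{i:02d}", "is_feasible": i > 10} for i in range(1, 13)]
router = RoundRobinRouter(bank)
print(len(router.pools["feasibility_check"]), len(router.pools["schedule_repair"]))  # 12 10
```

Cycling with a modulo counter means every instance in a pool is visited once before any repeats.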
### 5.4 Progressive Difficulty Design
The three tasks form a deliberate cognitive ladder:
```
Task 1: Feasibility Check (EASY)
  │  Binary yes/no decision
  │  No structural reasoning required beyond spotting a single violation
  │  Target accuracy: ~90%
Task 2: Conflict Classification (MEDIUM)
  │  Must identify WHICH of 5 violation categories is present
  │  Requires understanding of all four constraint types
  │  Target accuracy: ~60%
Task 3: Schedule Repair (HARD)
  │  Must produce a syntactically valid, structurally correct, constraint-satisfying,
  │  near-optimal repaired schedule as JSON
  │  Requires full combinatorial reasoning
  │  Target accuracy: ~30%
```
This design is intended to let agents make measurable progress through the curriculum while keeping the hardest task challenging for frontier models.
---
## 6. The Scheduling Instance Corpus
The environment ships with 12 hand-crafted scheduling instances. Each instance is a Python dictionary with the following structure:
```python
{
    "instance": {                                   # exposed to the agent
        "problem_id": "P01",
        "jobs": [...],
        "machines": [...],
        "proposed_schedule": {"assignments": [...]}
    },
    "is_feasible": False,                           # ground truth for Task 1
    "violation_type": "resource_overload",          # ground truth for Task 2
    "optimal_schedule": {"assignments": [...]},     # ground truth for Task 3
    "optimal_makespan": 7,                          # used for optimality scoring in Task 3
    "description": "..."                            # human-readable summary
}
```
### 6.1 Full Instance Catalogue
| # | Problem ID | Feasible | Violation Class | Key Constraint Broken |
|---|-----------|---------|----------------|----------------------|
| 0 | P01 | No | `resource_overload` | J1[0,4) and J2[2,5) overlap on M1 (capacity=1) |
| 1 | P02 | No | `deadline_violation` | J1 starts at t=5, finishes at t=10 > deadline=8 |
| 2 | P03 | No | `precedence_violation` | J2 starts at t=0 but depends on J1 which ends at t=8 |
| 3 | P04 | No | `availability_conflict` | J1 starts at t=5, before M1's window opens at t=8 |
| 4 | P05 | No | `capacity_exceeded` | 3 concurrent jobs on M1 with capacity=2 |
| 5 | P06 | No | `resource_overload` | J1 and J2 overlap on capacity-1 M1 (second instance) |
| 6 | P07 | No | `deadline_violation` | Precedence chain forces J3 past its hard deadline |
| 7 | P08 | No | `precedence_violation` | J3 starts before both J1 and J2 complete |
| 8 | P09 | No | `availability_conflict` | J1 extends into M1's maintenance window |
| 9 | P10 | No | `capacity_exceeded` | 4 concurrent jobs on M1 with capacity=3 |
| 10 | P11 | Yes | none | Fully feasible 3-job, 2-machine schedule |
| 11 | P12 | Yes | none | Fully feasible 5-job, 3-machine schedule with precedence |
### 6.2 Violation Class Distribution
| Class | Count | Description |
|-------|-------|-------------|
| `resource_overload` | 2 | Two jobs time-overlap on a single-capacity machine |
| `deadline_violation` | 2 | A job finishes after its required deadline |
| `precedence_violation` | 2 | A job starts before its predecessor completes |
| `availability_conflict` | 2 | A job is scheduled outside a machine's availability window |
| `capacity_exceeded` | 2 | Concurrent jobs exceed a machine's capacity limit |
| Feasible (no violation) | 2 | Used only in Task 1 |
Each violation class appears exactly twice, so the five classes are evenly represented during training.
---
## 7. Task Definitions and Grading Functions
### 7.1 Task 1: Feasibility Check (Easy)
**Objective:** Determine whether a proposed schedule satisfies all four constraint categories.
**Action space:** `{"feasible", "infeasible"}`
**Episode horizon:** 3 steps
**Grader: `FeasibilityGrader`**
The grader normalises common synonyms before scoring, so agents that reply with natural language equivalents still receive full credit:
| Synonyms treated as "feasible" | Synonyms treated as "infeasible" |
|-------------------------------|----------------------------------|
| feasible, valid, correct, satisfiable, yes, ok, pass | infeasible, invalid, incorrect, unsatisfiable, no, violated, conflict, fail, impossible, broken |
**Scoring function:**
```
R(action, ground_truth) =
    1.0   if normalise(action.response) == ground_truth.is_feasible
    0.1   if response is non-empty but cannot be parsed OR is wrong
    0.0   if response is empty
```
The `0.1` score for a wrong-but-present answer is deliberate. It separates an attempt from silence: even a completely wrong answer scores above an empty one, which tells the training system "the agent tried; adjust its policy."
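The synonym table and scoring rule can be sketched as below. This is an illustrative approximation rather than the real `FeasibilityGrader`: matching on whole lowercase words and treating a response that contains both kinds of synonym as unparseable are assumptions, and negations such as "not feasible" are not handled.

```python
import re
from typing import Optional

# Synonym sets copied from the table above.
FEASIBLE = {"feasible", "valid", "correct", "satisfiable", "yes", "ok", "pass"}
INFEASIBLE = {
    "infeasible", "invalid", "incorrect", "unsatisfiable", "no",
    "violated", "conflict", "fail", "impossible", "broken",
}


def normalise(response: str) -> Optional[bool]:
    """Map a free-text response to True (feasible), False (infeasible), or None."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    says_feasible = bool(words & FEASIBLE)
    says_infeasible = bool(words & INFEASIBLE)
    if says_feasible == says_infeasible:  # neither, or contradictory
        return None
    return says_feasible


def grade_feasibility(response: str, is_feasible: bool) -> float:
    if not response.strip():
        return 0.0  # empty response
    # None never equals True or False, so unparseable text falls through to 0.1.
    return 1.0 if normalise(response) == is_feasible else 0.1


print(grade_feasibility("Infeasible", False))  # 1.0
print(grade_feasibility("yes", False))         # 0.1
print(grade_feasibility("", True))             # 0.0
```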
---
### 7.2 Task 2: Conflict Classification (Medium)
**Objective:** Identify the constraint violation type present in an infeasible schedule from the closed vocabulary of five classes.
**Action space:** `{resource_overload, deadline_violation, precedence_violation, availability_conflict, capacity_exceeded}`
**Episode horizon:** 5 steps
**Grader: `ConflictGrader`**
The grader is aware of semantic "constraint families" β groups of violation types that are conceptually related. This enables partial credit for answers that are wrong but not completely off-base.
**Constraint families:**
- **Resource-limit family:** `resource_overload` and `capacity_exceeded`; both concern concurrent job count on a machine
- **Temporal-ordering family:** `deadline_violation` and `precedence_violation`; both concern job timing and sequencing
- **Standalone:** `availability_conflict`, which concerns machine operational windows
**Scoring function:**
```
R(action, ground_truth) =
    1.0   if action.response == ground_truth.violation_type       (exact match)
    0.5   if action.response is in the same family as ground_truth (partial credit)
    0.1   if action.response is a valid category but wrong family  (attempted)
    0.0   if action.response is not a valid category               (unparseable)
```
The grader also normalises spacing and dashes to underscores, so `"deadline violation"` and `"deadline-violation"` both map to `"deadline_violation"` before scoring.
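The family-aware scoring and label normalisation can be sketched as follows. This is a hedged illustration, not the actual `ConflictGrader`: the function names are hypothetical, and lowercasing the response before matching is an assumption beyond the spacing and dash handling described above.

```python
import re

# Constraint families as listed above.
FAMILIES = [
    {"resource_overload", "capacity_exceeded"},       # resource-limit family
    {"deadline_violation", "precedence_violation"},   # temporal-ordering family
    {"availability_conflict"},                        # standalone
]
VALID_LABELS = set().union(*FAMILIES)


def normalise_label(response: str) -> str:
    """Collapse runs of whitespace and dashes into single underscores."""
    return re.sub(r"[\s\-]+", "_", response.strip().lower())


def grade_conflict(response: str, violation_type: str) -> float:
    label = normalise_label(response)
    if label not in VALID_LABELS:
        return 0.0  # not a valid category
    if label == violation_type:
        return 1.0  # exact match
    same_family = any(label in fam and violation_type in fam for fam in FAMILIES)
    return 0.5 if same_family else 0.1


print(grade_conflict("deadline violation", "deadline_violation"))      # 1.0
print(grade_conflict("capacity-exceeded", "resource_overload"))        # 0.5
print(grade_conflict("availability_conflict", "deadline_violation"))   # 0.1
```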
---
### 7.3 Task 3: Schedule Repair (Hard)
**Objective:** Return a corrected schedule as a JSON object that resolves all constraint violations and minimises total makespan.
**Required output format:**
```json
{
  "assignments": [
    {"job_id": "J1", "machine_id": "M1", "start_time": 0},
    {"job_id": "J2", "machine_id": "M1", "start_time": 4},
    {"job_id": "J3", "machine_id": "M2", "start_time": 0}
  ]
}
```
**Episode horizon:** 8 steps
**Grader: `RepairGrader`**
The grader evaluates four independent components and sums them additively to a maximum of 1.0:
**Component 1: JSON Parseability (20% of score)**
```
0.20 if the response parses as a valid JSON object
0.00 if the response is not valid JSON (no partial credit at this level)
```
The parser uses three strategies in sequence:
1. Direct `json.loads()`: handles pure JSON responses
2. Strip markdown code fences (```` ``` ````) then parse: handles LLM wrapping
3. Brace-counting to extract the outermost `{...}` block: handles prose-wrapped JSON
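The three strategies could be chained roughly as shown here. This is a sketch under assumptions, not the grader's real parser: the helper names are hypothetical, and the brace counter deliberately ignores braces that appear inside JSON strings, which a production parser would need to handle.

```python
import json
from typing import Optional

FENCE = "`" * 3  # a markdown code fence, built this way to avoid nesting fences


def _loads_object(text: str) -> Optional[dict]:
    """Parse text as JSON and accept it only if the top level is an object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def _strip_fences(text: str) -> str:
    """Remove one surrounding code fence and an optional language tag."""
    text = text.strip()
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[3:-3]
        first_line, newline, rest = body.partition("\n")
        if newline and first_line.strip().isalpha():
            body = rest  # drop a tag such as "json" on the opening fence line
        return body.strip()
    return text


def _outermost_braces(text: str) -> Optional[str]:
    """Return the first balanced {...} block, found by counting braces."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None


def parse_response(response: str) -> Optional[dict]:
    """Try direct parsing, then fence stripping, then brace extraction."""
    for candidate in (response, _strip_fences(response), _outermost_braces(response)):
        if candidate is None:
            continue
        parsed = _loads_object(candidate)
        if parsed is not None:
            return parsed
    return None


print(parse_response('Here is the fix: {"assignments": []} Done.'))  # {'assignments': []}
```

Ordering the strategies from strictest to most permissive means a clean JSON response is never altered by the fallback heuristics.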
**Component 2: Schema Validity (20% of score)**
```
0.20 if the JSON contains "assignments" list, every assignment has
     {job_id, machine_id, start_time}, start_time ≥ 0, and every job
     from the instance appears exactly once
0.00 otherwise
```
**Component 3: Constraint Satisfaction (40% of score)**
Checks all four constraint categories independently. Each is worth 10% of the total score (0.10 points):
```
capacity_ok      → no machine has more concurrent jobs than its capacity
deadlines_ok     → every job finishes at or before its deadline
precedence_ok    → every job starts after all its predecessors finish
availability_ok  → every job runs within its machine's operational window
score += 0.40 × (number of satisfied categories / 4)
```
**Component 4: Makespan Optimality (20% of score)**
```
0.20 if makespan(response) ≤ optimal_makespan × 1.30 (within 30% of optimal)
0.10 if makespan(response) ≤ optimal_makespan × 1.60 (within 60% of optimal)
0.00 if makespan(response) > optimal_makespan × 1.60 (too slow)
```
Here `makespan` is the maximum finish time across all jobs: `max over j of (start_time(j) + duration(j))`.
**Full scoring formula:**
```
R = 0.20 × parseable_json(response)
  + 0.20 × valid_schema(response)
  + 0.40 × (satisfied_constraints / 4)
  + 0.20 × optimality_score(makespan, optimal_makespan)
```
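The formula can be written out as a short function. This is a sketch, not the `RepairGrader` implementation: here `optimality_score` returns the fraction of the 0.20 component that is earned (1.0, 0.5, or 0.0), the caller is assumed to pass `None` for the makespan when the response could not be evaluated, and rounding to four decimals is added only to keep floating-point noise out of the result.

```python
from typing import Dict, List, Optional


def makespan(assignments: List[dict], durations: Dict[str, int]) -> int:
    """Maximum finish time across all jobs."""
    return max(a["start_time"] + durations[a["job_id"]] for a in assignments)


def optimality_score(ms: float, optimal: float) -> float:
    """Fraction of the optimality component earned: 1.0, 0.5, or 0.0."""
    if ms <= optimal * 1.30:
        return 1.0
    if ms <= optimal * 1.60:
        return 0.5
    return 0.0


def repair_reward(
    parseable: bool,
    schema_ok: bool,
    satisfied_constraints: int,  # number of the four categories satisfied
    ms: Optional[float],         # None if no makespan could be computed (assumption)
    optimal: float,
) -> float:
    optimality = optimality_score(ms, optimal) if ms is not None else 0.0
    reward = (
        0.20 * float(parseable)
        + 0.20 * float(schema_ok)
        + 0.40 * (satisfied_constraints / 4)
        + 0.20 * optimality
    )
    return round(min(1.0, max(0.0, reward)), 4)


durations = {"J1": 4, "J2": 3}
fixed = [
    {"job_id": "J1", "machine_id": "M1", "start_time": 0},
    {"job_id": "J2", "machine_id": "M1", "start_time": 4},
]
print(makespan(fixed, durations))                 # 7
print(repair_reward(True, True, 3, 10.5, 7))      # 0.8
```

The second call is a response with valid JSON and schema, three of four constraints satisfied, and a makespan 50% above an optimum of 7.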
---
## 8. Reward Design Philosophy
### 8.1 Dense vs Sparse Rewards
A key design principle in this environment is the use of **dense rewards**: reward signals that give partial credit throughout the episode, not just at the end.
**Sparse reward (bad for learning):**
```
score = 1.0 if the answer is perfect
score = 0.0 otherwise
```
This gives the agent almost no information. If the answer is wrong, the agent doesn't know if it was slightly wrong or completely wrong.
**Dense reward (what this environment uses):**
```
score = sum of partial credits for each correct sub-component
```
Together with the per-component breakdown returned in the `info` dict, this tells the agent which parts of its answer were right and which were wrong, enabling it to make targeted improvements in subsequent steps.
### 8.2 Why Partial Credit Matters
Consider a Schedule Repair attempt that produces syntactically valid JSON, covers all jobs, and fixes 3 out of 4 constraints, but is 50% slower than optimal:
```
score = 0.20   (JSON valid)
      + 0.20   (schema valid)
      + 0.30   (3/4 constraints: 0.40 × 0.75)
      + 0.10   (makespan within 60% of optimal)
      = 0.80
```
The agent knows it is close. It knows it missed one constraint and its solution is a bit slow. It can target those specific issues in the next step.
### 8.3 The 0.1 Floor for Task 1 and Task 2
Even an incorrect answer receives a non-zero score (0.1) as long as the agent makes a recognisable attempt: in Task 1 any non-empty response qualifies, and in Task 2 the response must name one of the five valid categories. This is intentional: it prevents the agent from learning to produce empty responses as a way to avoid penalties, and it keeps wrong-but-present answers distinguishable from no answer at all.
---
## 9. Data Models
All inter-component communication uses three Pydantic v2 schemas defined in `models.py`:
### Observation
```python
from pydantic import BaseModel

class Observation(BaseModel):
    schedule_instance: str  # JSON-encoded scheduling problem
    task_id: str            # "feasibility_check" | "conflict_classification" | "schedule_repair"
    context: str            # Natural-language instructions for the current step
    step_number: int        # Current step (0-indexed, ge=0)
```
### Action
```python
class Action(BaseModel):
    response: str  # Agent's text answer (e.g., "infeasible", "resource_overload", or JSON)
    task_id: str   # Must match the current episode's task_id
```
### Reward
```python
class Reward(BaseModel):
    score: float   # Grading result, enforced ∈ [0.0, 1.0]
    feedback: str  # Human-readable explanation of the score
```
---
## 10. API Specification
The environment is exposed as a RESTful HTTP API via FastAPI, running on port **7860** (the Hugging Face Spaces default).
### Endpoints
| Method | Path | Description | Request Body | Response |
|--------|------|-------------|-------------|----------|
| `GET` | `/health` | Liveness probe | None | `{"status": "ok"}` |
| `POST` | `/reset` | Start new episode | `{"task_id": "..."}` | `Observation` |
| `POST` | `/step` | Submit action | `{"response": "...", "task_id": "..."}` | `StepResponse` |
| `GET` | `/state` | Internal state snapshot | None | Full state dict |
| `GET` | `/tasks` | Task catalogue | None | Array of task descriptions |
| `POST` | `/grader` | Direct grader invocation | `{action, ground_truth}` | `{"score": float}` |
| `GET` | `/baseline` | Run GPT-4o-mini baseline | None | Per-task scores |
### StepResponse Schema
```json
{
  "observation": { "schedule_instance": "...", "task_id": "...", "context": "...", "step_number": 1 },
  "reward": 1.0,
  "done": true,
  "info": {
    "step_reward": 1.0,
    "cumulative_reward": 1.0,
    "steps_remaining": 2,
    "instance_description": "...",
    "grading_breakdown": { ... }
  }
}
```
The `info.grading_breakdown` dict exposes the full internal grading decision (predicted vs expected, per-constraint pass/fail flags, makespan ratio), so training loops can inspect the decision directly rather than inferring it from the scalar reward.
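A client for the `/reset` and `/step` endpoints can be sketched with the standard library alone. This is a minimal illustration that assumes the server is running locally on port 7860; `build_request`, `call`, and `play_one_step` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # assumption: server started locally


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[dict] = None) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def play_one_step(task_id: str, response: str) -> dict:
    """Reset an episode, submit a single action, and return the step reply."""
    call("/reset", {"task_id": task_id})
    return call("/step", {"response": response, "task_id": task_id})
```

With the server up, `play_one_step("feasibility_check", "infeasible")` would return a dict shaped like the StepResponse schema, from which `reward`, `done`, and `info["grading_breakdown"]` can be read.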
---
## 11. Inference and Baseline
### 11.1 inference.py
The primary evaluation script. Uses three environment variables:
| Variable | Purpose | Example |
|----------|---------|---------|
| `API_BASE_URL` | Base URL for the OpenAI-compatible API | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model identifier | `gpt-4o-mini` |
| `HF_TOKEN` | API key or Hugging Face token | `hf_...` or `sk-...` |
**Log format:**
Every episode emits structured log lines to stdout, each consisting of a `[START]`, `[STEP]`, or `[END]` tag followed by a JSON payload:
```
[START] {"task_id": "feasibility_check", "instance_id": 0}
[STEP] {"task_id": "feasibility_check", "instance_id": 0, "step": 1, "action": "infeasible", "reward": 1.0, "done": true, "feedback": "Correct."}
[END] {"task_id": "feasibility_check", "instance_id": 0, "final_reward": 1.0}
```
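Because each line is a tag followed by a JSON payload, the logs are easy to post-process. The sketch below collects final rewards from `[END]` lines; the helper names are hypothetical, and the field names are taken from the sample lines above.

```python
import json
import re
from typing import Dict, Iterable, Optional, Tuple

# A tag in square brackets, whitespace, then a JSON object to the end of the line.
LOG_LINE = re.compile(r"^\[(START|STEP|END)\]\s+(\{.*\})\s*$")


def parse_log_line(line: str) -> Optional[Tuple[str, dict]]:
    """Return (tag, payload) for a structured log line, or None for anything else."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))


def final_rewards(lines: Iterable[str]) -> Dict[Tuple[str, int], float]:
    """Map (task_id, instance_id) to the final reward reported on [END] lines."""
    rewards: Dict[Tuple[str, int], float] = {}
    for line in lines:
        parsed = parse_log_line(line)
        if parsed and parsed[0] == "END":
            payload = parsed[1]
            rewards[(payload["task_id"], payload["instance_id"])] = payload["final_reward"]
    return rewards


sample = [
    '[START] {"task_id": "feasibility_check", "instance_id": 0}',
    '[STEP] {"task_id": "feasibility_check", "instance_id": 0, "step": 1, '
    '"action": "infeasible", "reward": 1.0, "done": true, "feedback": "Correct."}',
    '[END] {"task_id": "feasibility_check", "instance_id": 0, "final_reward": 1.0}',
]
print(final_rewards(sample))  # {('feasibility_check', 0): 1.0}
```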
Final summary line:
```json
{"event": "eval_end", "summary": {"feasibility_check": {"average_score": 1.0, "num_instances": 12}, "conflict_classification": {"average_score": 1.0, "num_instances": 10}, "schedule_repair": {"average_score": 1.0, "num_instances": 10}, "overall_average": 1.0}}
```
**Oracle fallback:** When `HF_TOKEN` is not set, the script falls back to deterministic mock responses (the ground-truth answers) and scores 1.0 on all tasks. This enables offline verification of the grading pipeline without any API access.
### 11.2 baseline.py
An alternative evaluation script (`baseline.py`) that calls the graders directly (without HTTP) and supports the same GPT-4o-mini / oracle-mock pattern. Useful for rapid local testing.
### 11.3 Baseline Scores
| Mode | Task 1 | Task 2 | Task 3 | Overall |
|------|--------|--------|--------|---------|
| Oracle mock (no API key) | 1.000 | 1.000 | 1.000 | 1.000 |
| GPT-4o-mini (with API key) | ~0.90 | ~0.60 | ~0.30 | ~0.60 |
The mock oracle achieves perfect scores by design: it is used to verify the grading pipeline, not to claim AI performance. The GPT-4o-mini scores are expectations based on the designed difficulty levels.
---
## 12. Setup and Installation
### 12.1 Prerequisites
| Requirement | Version |
|-------------|---------|
| Python | ≥ 3.11 |
| pip | ≥ 22.0 |
| Docker Desktop *(for container testing)* | ≥ 20.10 |
| Git | ≥ 2.30 |
### 12.2 Local Installation
```bash
# Step 1: Clone the repository
git clone https://github.com/Vittal-Mukunda/OpenEnv-Hackathon-Meta-x-Scaler.git
cd OpenEnv-Hackathon-Meta-x-Scaler
# Step 2: Create a virtual environment (strongly recommended)
python -m venv .venv
# Activate it (run only the line for your platform):
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
# .venv\Scripts\Activate.ps1
# On Windows (CMD):
# .venv\Scripts\activate.bat
# Step 3: Install dependencies
pip install -r requirements.txt
# Step 4: Start the server
uvicorn server:app --host 0.0.0.0 --port 7860
# Step 5: Verify the server is alive (new terminal)
curl http://localhost:7860/health
# Expected: {"status":"ok"}
```
### 12.3 Docker Deployment
```bash
# Step 1: Open Docker Desktop and wait for it to fully start
# Step 2: Build the image (takes 1-2 minutes first time)
docker build -t scheduling-env .
# Step 3: Run the container
docker run -p 7860:7860 scheduling-env
# Step 4: Verify in a new terminal
curl http://localhost:7860/health
# Expected: {"status":"ok"}
# Step 5: Stop the container
# Press Ctrl+C in the terminal running Docker
```
### 12.4 Hugging Face Spaces Deployment
1. Create a new Hugging Face Space at huggingface.co/new-space
2. Select **Docker** as the SDK
3. Push this repository to the Space:
```bash
git remote add space https://huggingface.co/spaces/<your-username>/<space-name>
git push space master
```
4. The Space will auto-detect the Dockerfile, build, and deploy. Port 7860 is exposed automatically.
---
## 13. How to Get Results
### 13.1 Quick Test: Oracle Mock (No API Key Needed)
This runs the full evaluation pipeline with deterministic oracle answers. Use this to verify everything is working correctly.
```bash
python inference.py
```
**Expected output:**
```
{"event": "eval_start", "mode": "oracle mock", "model": "gpt-4o-mini"}
[START] {"task_id": "feasibility_check", "instance_id": 0}
[STEP] {"task_id": "feasibility_check", "instance_id": 0, "step": 1, "action": "infeasible", "reward": 1.0, "done": true, "feedback": "Correct."}
[END] {"task_id": "feasibility_check", "instance_id": 0, "final_reward": 1.0}
... (32 episodes total) ...
{"event": "eval_end", "summary": {"feasibility_check": {"average_score": 1.0, "num_instances": 12}, "conflict_classification": {"average_score": 1.0, "num_instances": 10}, "schedule_repair": {"average_score": 1.0, "num_instances": 10}, "overall_average": 1.0}}
```
**What the output means:**
- Each `[START]` line: a new episode begins on a specific instance
- Each `[STEP]` line: the agent submitted an answer; shows what it said and what score it received
- Each `[END]` line: the episode is over; shows the final reward
- The last line: aggregated scores per task and overall
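
To cross-check the final summary against the per-episode lines, the `[END]` records can be averaged per task. The sketch below assumes only the `[END] {json}` format shown above; the function name `summarize` is hypothetical, and it does not claim to reproduce how `inference.py` computes `overall_average`.

```python
import json
from collections import defaultdict


def summarize(log_lines):
    """Average final_reward per task, using only the [END] lines."""
    rewards = defaultdict(list)
    prefix = "[END] "
    for line in log_lines:
        if line.startswith(prefix):
            record = json.loads(line[len(prefix):])
            rewards[record["task_id"]].append(record["final_reward"])
    return {
        task: {"average_score": sum(values) / len(values), "num_instances": len(values)}
        for task, values in rewards.items()
    }


lines = [
    '[START] {"task_id": "feasibility_check", "instance_id": 0}',
    '[END] {"task_id": "feasibility_check", "instance_id": 0, "final_reward": 1.0}',
    '[END] {"task_id": "feasibility_check", "instance_id": 1, "final_reward": 0.0}',
]
print(summarize(lines))
# {'feasibility_check': {'average_score': 0.5, 'num_instances': 2}}
```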
### 13.2 Real LLM Evaluation (With API Key)
```bash
# Using OpenAI
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=sk-your-openai-key-here
python inference.py
# Using Hugging Face Inference API
export API_BASE_URL=https://api-inference.huggingface.co/v1
export MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct
export HF_TOKEN=hf_your-token-here
python inference.py
```
### 13.3 Testing Individual Endpoints
With the server running (`uvicorn server:app --port 7860`):
```bash
# Health check
curl http://localhost:7860/health
# Start a feasibility episode
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "feasibility_check"}'
# Submit an answer
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response": "infeasible", "task_id": "feasibility_check"}'
# Start a classification episode
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "conflict_classification"}'
# Submit classification
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response": "resource_overload", "task_id": "conflict_classification"}'
# Start a repair episode
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "schedule_repair"}'
# Submit a repaired schedule
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response": "{\"assignments\": [{\"job_id\": \"J1\", \"machine_id\": \"M1\", \"start_time\": 0}, {\"job_id\": \"J2\", \"machine_id\": \"M1\", \"start_time\": 4}]}", "task_id": "schedule_repair"}'
# Check full internal state
curl http://localhost:7860/state
# Grade an action directly (without going through an episode)
curl -X POST http://localhost:7860/grader \
-H "Content-Type: application/json" \
-d '{"action": {"response": "deadline_violation", "task_id": "conflict_classification"}, "ground_truth": {"violation_type": "deadline_violation"}}'
```
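The same requests can be issued from Python using only the standard library. This is a sketch under the assumption that the server is reachable at `http://localhost:7860`; the helper names `build_post`, `send`, and `repair_payload` are hypothetical. Note that for `schedule_repair` the schedule is itself serialised to a JSON string inside the `response` field, which is why the curl example above escapes its quotes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumed local server address


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def repair_payload(assignments):
    """Wrap a list of assignments as a /step payload for schedule_repair."""
    return {
        "response": json.dumps({"assignments": assignments}),
        "task_id": "schedule_repair",
    }


def run_feasibility_episode():
    """Reset, then submit one answer. Requires the server to be running."""
    send(build_post("/reset", {"task_id": "feasibility_check"}))
    return send(build_post("/step", {"response": "infeasible", "task_id": "feasibility_check"}))
```

Calling `run_feasibility_episode()` with the server up should mirror the first two curl commands above.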
### 13.4 Interpreting the Score
| Score Range | Meaning |
|------------|---------|
| 1.0 | Perfect answer |
| 0.8–0.99 | Excellent: nearly correct, minor issue |
| 0.5–0.79 | Good: correct constraint family or partial repair |
| 0.2–0.49 | Partial: JSON parsed, some constraints fixed |
| 0.1–0.19 | Attempted: something was submitted, but mostly wrong |
| 0.0 | Empty response or completely unparseable |
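
For reporting, the table can be turned into a lookup. The sketch below uses the lower bound of each row as a threshold; because the table leaves small gaps (for example between 0.99 and 1.0, or between 0.0 and 0.1), folding those values into the band below them is an assumption. The name `score_band` is hypothetical.

```python
def score_band(score):
    """Map a reward in [0.0, 1.0] to the label used in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 1.0:
        return "Perfect"
    if score >= 0.8:
        return "Excellent"
    if score >= 0.5:
        return "Good"
    if score >= 0.2:
        return "Partial"
    if score >= 0.1:
        return "Attempted"
    # Assumption: values in (0.0, 0.1) are grouped with 0.0.
    return "Empty or unparseable"


for value in (1.0, 0.9, 0.2, 0.0):
    print(value, score_band(value))
```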
---
## 14. End-to-End Walkthrough
### Full Example: Task 3 Schedule Repair
This walkthrough shows exactly what happens when an agent attempts to repair instance P01 (resource_overload: J1[0,4) and J2[2,5) overlap on M1 capacity=1).
**Step 1: Reset**
```
POST /reset
{"task_id": "schedule_repair"}
```
The environment:
1. Validates task_id = "schedule_repair"
2. Selects the infeasible instance pool (10 instances)
3. Picks instance 0 (P01) via round-robin
4. Sets step=0, done=False
5. Returns Observation:
```json
{
"schedule_instance": "{\"problem_id\": \"P01\", \"jobs\": [{\"id\": \"J1\", \"duration\": 4, ...}, ...], \"proposed_schedule\": {\"assignments\": [{\"job_id\": \"J1\", \"start_time\": 0}, {\"job_id\": \"J2\", \"start_time\": 2}]}}",
"task_id": "schedule_repair",
"context": "The proposed_schedule is infeasible. Return ONLY a JSON object with key \"assignments\"...",
"step_number": 0
}
```
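Because `schedule_instance` is a JSON string nested inside the JSON observation, a client has to decode twice. The sketch below illustrates this with a reduced, hypothetical observation that contains only the fields visible above (the elided parts of the real instance are left out rather than guessed).

```python
import json

# Reduced stand-in for the observation above; the real instance has more fields.
raw_observation = json.dumps({
    "schedule_instance": json.dumps({
        "problem_id": "P01",
        "proposed_schedule": {"assignments": [
            {"job_id": "J1", "start_time": 0},
            {"job_id": "J2", "start_time": 2},
        ]},
    }),
    "task_id": "schedule_repair",
    "step_number": 0,
})


def decode_observation(raw):
    """Decode the observation, then the JSON string stored in schedule_instance."""
    observation = json.loads(raw)
    instance = json.loads(observation["schedule_instance"])
    return observation, instance


observation, instance = decode_observation(raw_observation)
starts = {a["job_id"]: a["start_time"] for a in instance["proposed_schedule"]["assignments"]}
print(instance["problem_id"], starts)  # P01 {'J1': 0, 'J2': 2}
```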
**Step 2: Agent submits broken JSON**
```
POST /step
{"response": "Here is my repair: {bad json", "task_id": "schedule_repair"}
```
RepairGrader tries all 3 parse strategies. All fail. Returns:
- `json_parseable: false` → score = 0.0
- `done: false` (step 1/8)
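
The text above does not say which three parse strategies RepairGrader uses. As an illustration only, the sketch below assumes three common ones: parse the whole response, parse the contents of a fenced code block, and parse the substring from the first `{` to the last `}`. All names here are hypothetical, and the real grader may differ.

```python
import json
import re

TICKS = "`" * 3  # built at runtime to avoid a literal fence inside this example
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def _whole(text):
    return text


def _fenced(text):
    match = FENCE.search(text)
    return match.group(1) if match else None


def _braces(text):
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if 0 <= start < end else None


def parse_repair(text):
    """Try each assumed strategy in turn; return a dict or None if all fail."""
    for strategy in (_whole, _fenced, _braces):
        candidate = strategy(text)
        if candidate is None:
            continue
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


print(parse_repair("Here is my repair: {bad json"))          # None
print(parse_repair('Sure: {"assignments": []} hope it helps'))  # {'assignments': []}
```

Under these assumptions the response from Step 2 fails all three attempts, which matches the `json_parseable: false` outcome.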
**Step 3: Agent submits valid JSON, wrong schema**
```
POST /step
{"response": "{\"jobs\": [...]}", "task_id": "schedule_repair"}
```
RepairGrader: JSON parses (0.20), but no "assignments" key (schema fails). Returns:
- score = 0.20
- `done: false` (step 2/8)
**Step 4: Agent submits correct schema, partial constraints**
```json
{"assignments": [{"job_id": "J1", "machine_id": "M1", "start_time": 0}, {"job_id": "J2", "machine_id": "M1", "start_time": 2}, {"job_id": "J3", "machine_id": "M2", "start_time": 0}]}
```
Grader checks:
- JSON: ✅ (0.20)
- Schema: ✅ all jobs present (0.20)
- Capacity: ❌ J1[0,4) and J2[2,5) still overlap on M1
- Deadlines: ✅
- Precedence: ✅
- Availability: ✅
- Constraints: 3/4 = 0.30
- Makespan: 7 vs optimal 7 → ratio 1.0 → 0.20
Score = 0.20 + 0.20 + 0.30 + 0.20 = **0.90**
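
The capacity failure in this step comes down to a half-open interval overlap test. The sketch below covers only the capacity-1 case from this walkthrough and is not the grader's implementation; `capacity_conflicts` is a hypothetical name.

```python
from itertools import combinations


def capacity_conflicts(intervals):
    """Find overlapping job pairs on one machine with capacity 1.

    intervals: list of (job_id, start, end) using half-open [start, end).
    """
    conflicts = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(intervals, 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            conflicts.append((a, b))
    return conflicts


print(capacity_conflicts([("J1", 0, 4), ("J2", 2, 5)]))  # [('J1', 'J2')]
print(capacity_conflicts([("J1", 0, 4), ("J2", 4, 7)]))  # []
```

With half-open intervals, J2 starting at exactly 4 does not collide with J1 ending at 4, which is why moving J2 to `start_time: 4` resolves the overload.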
**Step 5: Agent submits optimal repair**
```json
{"assignments": [{"job_id": "J1", "machine_id": "M1", "start_time": 0}, {"job_id": "J2", "machine_id": "M1", "start_time": 4}, {"job_id": "J3", "machine_id": "M2", "start_time": 0}]}
```
Grader checks:
- JSON: ✅ (0.20)
- Schema: ✅ (0.20)
- Capacity: ✅ J1[0,4), J2[4,7) no overlap
- Deadlines: ✅
- Precedence: ✅
- Availability: ✅
- Constraints: 4/4 = 0.40
- Makespan: 7 vs optimal 7 → ratio 1.0 → 0.20
Score = **1.00**. `done: true`. Episode ends.
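
The arithmetic across Steps 2 to 5 can be reproduced with a small additive function. The component weights (0.20, 0.20, 0.40, 0.20) come from the walkthrough; how the makespan component scales when the ratio is below 1.0 is not stated here, so the linear `optimal / makespan` scaling below is an assumption, as is the function name `repair_score`.

```python
def repair_score(json_ok, schema_ok, constraints_passed, makespan=None, optimal=None):
    """Additive repair score: JSON + schema + constraints + makespan."""
    if not json_ok:
        return 0.0
    score = 0.20                      # JSON parsed
    if not schema_ok:
        return score
    score += 0.20                     # schema valid, all jobs present
    score += 0.40 * constraints_passed / 4   # four constraint categories
    if makespan and optimal:
        # Assumption: linear credit, capped at the full 0.20.
        score += 0.20 * min(1.0, optimal / makespan)
    return round(score, 2)


print(repair_score(False, False, 0))        # 0.0  (Step 2)
print(repair_score(True, False, 0))         # 0.2  (Step 3)
print(repair_score(True, True, 3, 7, 7))    # 0.9  (Step 4)
print(repair_score(True, True, 4, 7, 7))    # 1.0  (Step 5)
```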
---
## 15. Evaluation and Scoring
### 15.1 Pre-Submission Checklist
| Check | Pass Condition |
|-------|---------------|
| HF Space deploys | `GET /health` returns 200 and `{"status":"ok"}` |
| OpenEnv spec compliance | `openenv.yaml` valid; typed models; `step()`/`reset()`/`state()` respond correctly |
| Dockerfile builds | `docker build` completes without errors |
| Baseline reproduces | `python inference.py` completes without error and produces scores |
| 3+ tasks with graders | All three tasks return scores ∈ [0.0, 1.0] |
### 15.2 Hackathon Scoring Breakdown
| Criterion | Weight | This Project |
|-----------|--------|-------------|
| Real-world utility | 30% | Industrial scheduling (manufacturing, cloud, healthcare) |
| Task & grader quality | 25% | 3 tasks, difficulty range easy→hard, deterministic graders |
| Environment design | 20% | Dense rewards, clean state, sensible episode boundaries |
| Code quality & spec compliance | 15% | Typed models, documented, Docker works, spec compliant |
| Creativity & novelty | 10% | Scheduling is rare in OpenEnv; multi-component repair grader is novel |
---
## 16. Project Structure
```
.
├── inference.py                  # Primary inference script (API_BASE_URL/MODEL_NAME/HF_TOKEN)
├── baseline.py                   # GPT-4o-mini baseline with oracle fallback
├── server.py                     # FastAPI HTTP server (7 endpoints, port 7860)
├── environment.py                # SchedulingOptEnv core + INSTANCE_BANK (12 problems)
├── models.py                     # Pydantic v2 data models (Observation, Action, Reward)
├── openenv.yaml                  # OpenEnv metadata manifest
├── Dockerfile                    # Container definition (python:3.11-slim, port 7860)
├── requirements.txt              # Python dependencies
├── tasks/
│   ├── __init__.py               # Task module exports
│   ├── task1_easy.py             # Feasibility check: episode runner + instance accessor
│   ├── task2_medium.py           # Conflict classification: episode runner + instance accessor
│   └── task3_hard.py             # Schedule repair: episode runner + instance accessor
└── graders/
    ├── __init__.py               # Grader exports
    ├── grader_detection.py       # FeasibilityGrader (binary, synonym-aware)
    ├── grader_classification.py  # ConflictGrader (family-aware partial credit)
    └── grader_fix.py             # RepairGrader (4-component additive reward)
```
---
## 17. Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| `fastapi` | ≥ 0.104 | HTTP server framework |
| `uvicorn` | ≥ 0.24 | ASGI server (runs FastAPI) |
| `pydantic` | ≥ 2.5 | Data validation and serialisation |
| `openai` | ≥ 1.6 | LLM inference (used in inference.py and baseline.py) |
| `pyyaml` | ≥ 6.0 | YAML manifest parsing (openenv.yaml) |
| `httpx` | ≥ 0.25 | HTTP client (required by the `openai` client and by FastAPI's test client) |
All dependencies are intentionally lightweight and install on `python:3.11-slim` with no additional system packages required.
---
## 18. Glossary
| Term | Definition |
|------|-----------|
| **MDP** | Markov Decision Process: a formal framework for sequential decision problems |
| **Episode** | One complete round from `reset()` to terminal state |
| **Horizon** | Maximum number of steps allowed per episode |
| **Terminal** | State where `done = True`; the episode must end |
| **Dense reward** | A reward signal that provides partial credit at every step, not just at the end |
| **Sparse reward** | A reward signal that is non-zero only at the final step (bad for learning) |
| **Feasible** | A schedule that satisfies all four constraint categories |
| **Infeasible** | A schedule that violates at least one constraint |
| **Makespan** | The total time from the start of the first job to the finish of the last job |
| **Optimal makespan** | The minimum achievable makespan for a given set of jobs and machines |
| **Instance** | One specific scheduling problem (jobs + machines + proposed schedule + ground truth) |
| **Grader** | A Python class that scores an agent's action against ground truth |
| **OpenEnv** | The framework specification by Meta/HuggingFace that this environment implements |
| **Constraint family** | A group of semantically related violation types used for partial-credit scoring |
| **Oracle** | A mock agent that always produces the ground-truth answer; used to verify the grading pipeline |
---
*Built for the Meta × Scaler OpenEnv Hackathon · April 2026 · MIT License*