Eshit commited on
Commit
66a57c6
Β·
1 Parent(s): 65c3b5c

Privatize internal notes; sync openenv.yaml action enum; split training requirements

Browse files
.gitignore CHANGED
@@ -68,3 +68,9 @@ Thumbs.db
68
  desktop.ini
69
  .idea/
70
  .vscode/
 
 
 
 
 
 
 
68
  desktop.ini
69
  .idea/
70
  .vscode/
71
+
72
+ # Private project notes (not for judges / public)
73
+ HACKATHON_ALIGNMENT.md
74
+ colab_prompts.md
75
+ private_notes/
76
+ .private/
HACKATHON_ALIGNMENT.md DELETED
@@ -1,410 +0,0 @@
1
- # Hackathon Alignment β€” Wildfire Containment Simulator
2
-
3
- This document walks through every topic in the organizers' **Hackathon Self-Serve Guide** PDF and describes, topic by topic, what our project currently does, where the gaps are, and concrete changes that would strengthen our submission. It is written to be directly actionable during the final sprint β€” each "Approach" passage reflects the code on disk today (not aspiration), and each "Potential issues / improvements" list points at specific files, specific lines of behavior, and specific hackathon judging criteria.
4
-
5
- Stack reminder (from `pyproject.toml`, `training/grpo_colab.ipynb`, `server/app.py`):
6
-
7
- - **Environment:** OpenEnv-style `WildfireEnv` in `env/wildfire_env.py` with Pydantic-typed `Action`/`Observation`/`StepResult`.
8
- - **Trainer:** TRL `GRPOTrainer` + Unsloth 4-bit LoRA on `unsloth/Qwen2.5-1.5B-Instruct`, 50 GRPO steps, 8 generations per prompt.
9
- - **Deployment:** FastAPI at port 7860 (`server/app.py`), Dockerized, deployable as a Hugging Face Space.
10
- - **Baselines:** `RandomAgent` and `HeuristicAgent` in `agents/`, scored by `graders/grader_{easy,medium,hard}.py`.
11
-
12
- ---
13
-
14
- ## 0) What you are building
15
-
16
- ### Approach
17
- We are building exactly the system the guide describes end-to-end: an OpenEnv-compliant RL environment (`env/`) with verifier/reward functions (`env/reward.py`, graders), a TRL `GRPOTrainer` loop (`training/grpo_colab.ipynb`), Unsloth 4-bit quantization + LoRA for efficiency, and a FastAPI/Docker deployment suitable for a Hugging Face Space (`server/app.py`, `Dockerfile`, `openenv.yaml`). The task is a long-horizon disaster-response decision problem β€” an LLM acts as an Incident Commander dispatching crews, tankers, firebreaks and recon flights over 80–300 steps per episode. Every piece in the "Environment β†’ verifier β†’ TRL β†’ Unsloth β†’ OpenEnv/Spaces" pipeline the PDF specifies has a real implementation in this repo.
18
-
19
- ### Potential issues / improvements
20
- - **Pipeline is technically complete but not yet empirically closed.** `README.md` still has `{TBD}` placeholders for the trained-model numbers, and `scripts/results.json` only contains random/heuristic baselines. Judges will discount a project whose headline claim (trained LLM beats heuristic) is not demonstrated. Highest-leverage action: run the training notebook's Section 6 evaluation, paste the numbers into README, and back them with `demos/` GIFs.
21
- - **No "value model is verifier" narrative is explicit in the repo.** Write a one-paragraph "Why GRPO + RLVR fits this env" blurb into the README so the judges can see, in 30 seconds, that we understood the intended stack.
22
-
23
- ---
24
-
25
- ## 1) Start with the right project idea
26
-
27
- ### Approach
28
- The task satisfies all three properties in the guide:
29
-
30
- 1. **Step-by-step action** β€” `env/wildfire_env.py:step()` executes exactly one `Action` per tick, cycling through 11 deterministic sub-steps (validate β†’ execute β†’ spread β†’ suppress β†’ weather β†’ moisture β†’ smoke β†’ cooldowns β†’ reveal expiry β†’ hard-tier events β†’ reward).
31
- 2. **Programmatic verification** β€” The grader family (`graders/grader_easy.py` and siblings) computes `total_reward`, `containment_pct`, `pop_saved_pct`, `crew_casualty` from the ground-truth `env.state()` β€” no human judgment required.
32
- 3. **Difficulty calibrated so success probability > 0** β€” The heuristic baseline currently scores +7.0 Β± 0.26 on easy, +3.93 on medium, +5.32 on hard (`scripts/results.json`); the `tier_scale` in `env/fire_spread.py` (1.0 / 0.7 / 0.55) is explicitly tuned so that rollouts routinely produce non-zero reward.
33
-
34
- ### Potential issues / improvements
35
- - **Medium-tier variance is huge** β€” heuristic gets `[-0.85, 7.09, 7.08, 0.01, 6.31]` across seeds 42–46, std 3.57. That bimodal distribution (either total save or near-total loss) is exactly the pattern the PDF warns about under "so hard that the model never succeeds." If the model sees mostly the bad mode early in training, learning will stall. **Fix:** inspect why seeds 42 and 45 fail for the heuristic β€” likely the two ignition points spawn on opposite sides of a populated cluster β€” and tighten `_find_ignition_candidate` in `env/wildfire_env.py` to guarantee at least one winning crew-deployment plan exists.
36
- - **Hard tier scores look suspiciously uniform at 6.7** β€” four of five seeds returning the identical value 6.7 strongly suggests the episode terminates at the same early exit condition (probably "all population lost" or "fire self-extinguishes before staggered ignition"). The variance should be investigated before training on hard.
37
-
38
- ---
39
-
40
- ## 2) Understand the minimum RL loop before you build
41
-
42
- ### Approach
43
- The 5-step RL loop is implemented cleanly and discoverable inside `training/grpo_colab.ipynb` cell `code-rollout` and `code-grpo-setup`:
44
-
45
- 1. **Prompt** β€” `serialize_observation(obs, step, max_steps)` in `env/serialization.py` formats the observation into a structured LLM prompt (SITUATION / GRID SUMMARY / RESOURCES / RECENT EVENTS / Available actions).
46
- 2. **Generation** β€” model.generate inside `collect_rollout` and inside `reward_fn`.
47
- 3. **Execute** β€” `parse_action()` β†’ `env.step(action)` returns a `StepResult`.
48
- 4. **Reward** β€” decomposed step reward plus terminal spike, produced inside `env/reward.py` and assembled in `wildfire_env.step()`.
49
- 5. **Update** β€” `GRPOTrainer.train()` does the gradient step; 8 generations per prompt, 50 steps, lr 5e-6.
50
-
51
- ### Potential issues / improvements
52
- - **The loop inside `reward_fn` is not the same as the loop inside `collect_rollout`.** `reward_fn` uses the candidate completion only at step 0 and then runs the **heuristic** for 14 more steps. That is a legitimate variance-reduction trick (rollouts dominated by the heuristic have less noise), but it means the gradient signal mostly measures "how good is this single first action followed by a scripted policy" β€” not "how good is this model's long-horizon plan." Consider a hybrid: sample 50% of rewards from heuristic-continuation and 50% from pure-model continuation.
53
- - **`collect_rollout` is defined in the notebook but never actually called during training.** It's essentially dead code. Either delete it or wire it into an evaluation loop so it earns its keep.
54
-
55
- ---
56
-
57
- ## 3) Decide whether you need SFT first
58
-
59
- ### Approach
60
- We are following the guide's "usually RL from a capable base" path: start from `unsloth/Qwen2.5-1.5B-Instruct` (already instruction-tuned on general chat), add no SFT warm-up, and rely on the base model's JSON-formatting ability plus our tolerant 3-layer parser (`env/action_parser.py`) to get non-zero reward on the very first rollout. The easy tier's high reward ceiling (~+8) and the heuristic continuation inside `reward_fn` both substantially raise the probability that the first few rollouts see positive reward, which is the precondition the PDF calls out.
61
-
62
- ### Potential issues / improvements
63
- - **No evidence we measured what the pre-RL model actually outputs.** We should run 10 rollouts of the un-trained Qwen-2.5-1.5B through `collect_rollout`, count: (a) JSON parse success rate, (b) semantically-valid action rate, (c) mean episode reward. If JSON success is below ~70%, do a tiny SFT pass (even 50 heuristic-generated examples) purely for format priming β€” that is exactly the "light SFT first" pattern the guide endorses for hackathons.
64
- - **Heuristic trajectory harvesting is cheap and we already own the heuristic.** A simple script that runs `HeuristicAgent` over seeds 0–199 and logs `(prompt, action_json)` pairs would yield ~10k-30k supervised examples for a warm-up pass. This is optional but high-return insurance.
65
- - **The system prompt in the notebook is minimal** (`'Respond with ONLY a valid JSON action object and nothing else.'`). A richer system prompt that includes the full action schema (as `inference.py` already has in `SYSTEM_PROMPT`) would cut early format failures.
66
-
67
- ---
68
-
69
- ## 4) Design the environment before you design the trainer
70
-
71
- ### Approach
72
- The environment is the first-class artifact in this repo: `env/` has 13 modules, the trainer is a single notebook. The `WildfireEnv` class in `env/wildfire_env.py` exposes the four methods the guide requires:
73
-
74
- - `reset(task_id, seed)` β†’ `Observation` (deterministic from seed via `np.random.default_rng(seed)` passed to every sub-system).
75
- - `step(action)` β†’ `StepResult` (11-step tick pipeline, never crashes on bad input).
76
- - `state()` β†’ full ground truth dict (used only by graders, documented as "NOT for agent use").
77
- - Reward is computed inside `step()` via `RewardCalculator.compute_step_reward` + `compute_terminal_reward`.
78
-
79
- The five design questions the guide poses are each answered explicitly:
80
-
81
- - **What does the agent observe?** β€” `Observation` in `env/models.py:298` (grid, weather, resources, stats, recent_events, briefing).
82
- - **What actions can it take?** β€” `ActionType` enum (7 types) with Pydantic per-type field validation.
83
- - **What ends an episode?** β€” `_check_termination` in `wildfire_env.py:470` β€” time limit, fire extinguished (with staggered-ignition protection), or total population lost.
84
- - **Reward?** β€” documented in the README and `openenv.yaml`.
85
- - **Abuse/infinite-loop prevention?** β€” `episode_length` hard cap, `_validate_action` returns safe messages without exceptions, `parse_action` has a 3-layer fallback that can never return a non-Pydantic-valid `Action`.
86
-
87
- ### Potential issues / improvements
88
- - **Observation is enormous.** On hard tier (40Γ—40), the grid alone is 1600 `CellObservation` objects, which `serialize_observation` then summarizes via BFS clustering in `env/serialization.py`. Verify the prompt token count is comfortably under `MAX_SEQ_LENGTH=2048`; on hard tier with many fire clusters it may be tight. Add a `len(tokenizer(prompt).input_ids)` assertion at the top of `reward_fn` for the first few calls.
89
- - **Sensor noise is asymmetric.** Wind speed/direction get Β±5 km/h / Β±20Β° noise (`env/weather.py`), but moisture, smoke, fire intensity are exact. That is fine, but we should document it so judges don't assume we forgot.
90
- - **`state()` can be accessed at any time** β€” there is no enforcement that agents only see `Observation`. A malicious agent author could just call `env.state()` during the grader loop. For competition integrity, lock down `state()` when the caller is an `Agent` interface, or at minimum document that it is a grading-only hook.
91
-
92
- ---
93
-
94
- ## 5) Build the environment using OpenEnv
95
-
96
- ### Approach
97
- The project is structured as a Python package exposing the OpenEnv contract:
98
-
99
- - `action` / `observation` / `state` dataclasses live in `env/models.py` as Pydantic models (stricter than dataclasses β€” the guide's recommendation is satisfied and exceeded).
100
- - `WildfireEnv.reset`/`.step`/`.state` implement the methods.
101
- - `server/app.py` wraps the env in a FastAPI app with `/reset`, `/step`, `/state`, `/health`, `/` (HTML landing), `/docs` (Swagger).
102
- - `openenv.yaml` declares the environment class, action space, observation space, reward range (-8 to +8), and three tasks (`easy`/`medium`/`hard`).
103
- - A root `app.py` shim and `Dockerfile` publish it as a Space on port 7860.
104
-
105
- The separation the guide calls for β€” "environment handles world dynamics and scoring, trainer handles optimization, model just learns to act" β€” is honored: `env/` has no trainer dependency, and the trainer notebook only imports from `env/`, `agents/`, `graders/` via the public surface.
106
-
107
- ### Potential issues / improvements
108
- - **`_env` is a module-level singleton in `server/app.py`.** Concurrent `/reset` calls from different clients will clobber each other's episode state. Fine for a demo, but a judge running two browser tabs will see garbled behavior. Either switch to a per-request env factory, or document the single-tenant assumption on the HTML landing page.
109
- - **`openenv.yaml` is slightly out of sync with `env/models.py`.** The YAML lists six action types (`deploy_crew, move_crew, drop_retardant, build_firebreak, recon_flight, idle`) but `ActionType` defines seven β€” `ORDER_CREW_OBJECTIVE` is missing from the YAML. Add it before pushing to Space so the env manifest matches reality.
110
- - **No `openenv init` scaffold in tree** β€” we hand-built the package. That is fine, but run `openenv push` (or the equivalent `git push` to the Space repo) *now*, not the night before the deadline, so any manifest-mismatch surprises surface early. This is the guide's Topic 13 point restated.
111
-
112
- ---
113
-
114
- ## 6) Keep the task simple at first
115
-
116
- ### Approach
117
- The three-tier curriculum is a literal implementation of the "easy β†’ medium β†’ hard" progression the guide describes, and `env/curriculum.py`'s `CurriculumController` auto-promotes the trainer from easy β†’ medium when a rolling 10-episode average crosses 4.0, and medium β†’ hard at 3.5. It also auto-demotes if average falls below 50% of the prior threshold. The training notebook wires this into `reward_fn` so every batch updates the tier. Heuristic scores confirm success is possible at every tier (means of 7.0 / 3.93 / 5.32).
118
-
119
- ### Potential issues / improvements
120
- - **Curriculum promotion happens inside `reward_fn`, but the training dataset is frozen at `build_prompt_dataset(50)` *before* `trainer.train()` is called.** Concretely: even if the controller promotes `easy β†’ medium` at step 10, the prompts being scored from step 10 onward are still the **easy** prompts generated up front. The `tier` column in each dataset row is the tier that was active at dataset-build time, not the current tier. **This is a real bug** and it partially explains why `training_stats.json` shows the model spending steps 0-9 on `easy`, then the `tier` field flips to `medium` at step 10 β€” but every rollout is still running on easy-generated prompts. Fix: rebuild the dataset (or use a dataset-generating callback) whenever the controller returns a promotion.
121
- - **Curriculum thresholds are hard-coded and have not been validated.** 4.0 and 3.5 were picked to match the heuristic's scores, but the *initial* model scores before RL starts are unknown. If Qwen-2.5-1.5B starts at e.g. 5.5 on easy, it will promote on step 1 β€” too fast. Log the first 20 rewards before enabling promotion.
122
- - **No rollouts ever happen on `hard` during the first 20 steps** according to `training_stats.json` β€” only at step 20+ does `hard` appear. Given the 50-step budget, only ~30 of the 50 GRPO steps ever see hard-tier gradients. If hard is the theme's centerpiece (long-horizon planning), that is too few.
123
-
124
- ---
125
-
126
- ## 7) Design rewards carefully
127
-
128
- ### Approach
129
- Our reward was intentionally restructured during "Prompt 2" (see `Summary.txt` and `prompts.md`) to match exactly the guide's multi-component advice:
130
-
131
- - **Dense step reward** (`compute_step_reward`) β€” `0.4Β·Ξ” containment + 0.4Β·Ξ” population_safety βˆ’ 0.1Β·redundant_action_flag`.
132
- - **Sparse terminal reward** (`compute_terminal_reward`) β€” `+5.0` for zero population lost, an efficiency bonus up to `+2.0` for finishing early, `βˆ’3.0Β·loss_pct` for partial loss, `βˆ’2.0` for any crew casualty, `+1.0` briefing-adherence bonus if all priority zones survive, and `βˆ’0.01Β·invalid_action_count` capped at `βˆ’0.2`.
133
-
134
- Reward range is ~`βˆ’8` to `+8`, documented in `openenv.yaml:100`. This produces meaningfully-separated advantages for GRPO (the guide's whole justification for wide rewards).
135
-
136
- ### Potential issues / improvements
137
- - **`containment_pct` is reported as an integer percentage (0–100) in `ClusterStats` but as a fraction (0–1) inside `_snapshot_state` / `compute_step_reward`.** Verify we aren't accidentally multiplying by 100 somewhere β€” a single unit error here means the delta-containment term dominates or vanishes entirely.
138
- - **Only two delta components drive the dense reward.** The guide stresses "multiple independent reward functions." Good candidates that we already compute but don't reward: resource efficiency (wasted vs. total actions), area-saved ratio, briefing-compliance (already terminal-only β€” promote it to a per-step signal).
139
- - **Briefing adherence stuffs the raw `Grid` object into `terminal_state["_grid_ref"]` inside `wildfire_env.py:229`.** That leaks a mutable handle into the reward calculator. If the grader ever serializes the state dict (e.g. for logging), it will explode on the non-JSON-serializable Grid. Cleaner: compute the priority-zone survival boolean inline in `wildfire_env.py` and pass only a bool.
140
- - **Redundant-action detection is too shallow.** `_is_redundant` only compares `action_type + target_row + target_col` of the immediately prior action. A model that alternates `DEPLOY_CREW(0,0) / MOVE_CREW(crew_0, N)` in a loop gets no penalty. Either widen the window or add a format-compliance signal (next bullet).
141
- - **Missing reward component: `action_validity`.** Right now an invalid action silently costs `0.02Β·count` inside the legacy reward, and `0.01Β·count` (capped 0.2) inside the terminal reward. For GRPO this is too subtle. Add a per-step `-0.05 if not action_was_valid` signal so that producing syntactically valid JSON is itself rewarded β€” this is the single highest-leverage fix for the early training regime where most LLM outputs are malformed.
142
- - **Missing reward component: `parse_status` bonus.** `parse_action` returns one of `json_success`, `regex_fallback`, `safe_idle`. Reward `json_success` with a small bonus (`+0.02`) so the model learns clean JSON, not just barely-parseable regex output. This is genuinely an independent signal and directly matches the guide's "reward format compliance" example.
143
-
144
- ---
145
-
146
- ## 8) Protect yourself against reward hacking
147
-
148
- ### Approach
149
- Several anti-hacking defenses are already in place:
150
-
151
- - **Action validation is enforced twice** β€” Pydantic validates field presence/types at construction time, and `_validate_action` in `wildfire_env.py:407` enforces bounds. A malformed action never crashes the env; it just returns a penalty.
152
- - **Hard timeouts** β€” each tier has a fixed `episode_length`, and `_check_termination` guarantees termination.
153
- - **No unrestricted global state** β€” the env is fully seeded via `np.random.default_rng(seed)`. The agent has no way to mutate the RNG or grid from within the LLM's output channel.
154
- - **No arbitrary code execution** β€” the parser takes strings and produces a constrained Pydantic `Action`; it does not `eval` anything.
155
- - **Fog-of-war and smoke occlusion are computed on the server side** β€” the agent cannot read hidden cells through the observation API.
156
- - **Grader is separate from the env** β€” `graders/grader_*.py` calls `env.reset/step/state`; the agent never touches the grader, so it cannot alter how it's being scored.
157
-
158
- ### Potential issues / improvements
159
- - **The biggest hacking surface we haven't closed: `parse_action` silently downgrades to IDLE.** A model that outputs garbage constantly will get IDLE-reward (mostly 0 + whatever deltas the environment produces on its own), which on easy tier can still exceed `+5` because the heuristic-scale fire spread is mild. Confirm this with a controlled experiment: run 50 episodes with a "pure garbage agent" (returns random strings) and see what the episode reward is. If it's > +2, that is a free-reward exploit and we need a per-step "safe_idle fallback" penalty of at least `-0.2`.
160
- - **`reward_fn` continues with the heuristic for 14 steps after the model's one action.** Under adversarial framing, *the model doesn't even need to act well* β€” the heuristic will recover most episodes. The gradient signal will be dominated by the heuristic's rollout, not the model's policy. Add at least some pure-model rollouts (say, 2 of every 8 generations) so the model is directly scored on its full trajectory.
161
- - **The `redundant_action` penalty is bypassable** by inserting an IDLE action between two identical real actions β€” the comparison is only against `_prev_action`. Fix by tracking a short sliding window of recent actions.
162
- - **Human inspection is not wired in.** The training notebook logs `mean_reward` per step to `training_stats.json` and prints to stdout, but completions are never sampled to disk. Add a `if step % 10 == 0: print(first_completion)` block so we can eyeball what the model is actually generating and catch reward hacking the moment it starts.
163
- - **No seed diversity audit** β€” `SEED_POOL = list(range(100))` in the notebook. If 100 seeds are cycled through 50 steps Γ— 8 generations = 400 rollouts, some seeds repeat ~4Γ—. That's fine, but a model that memorizes seed-specific fire patterns would look good in training and fall apart on eval seeds 42-46. Use a larger pool or sample seeds without replacement per batch.
164
- - **`_ignite_initial_fires` is seed-deterministic.** That is great for reproducibility, bad for generalization. Evaluate on held-out seeds (say 200-250) to confirm no over-fitting.
165
-
166
- ---
167
-
168
- ## 9) Use process-aware feedback when you can
169
-
170
- ### Approach
171
- Our reward is primarily outcome-based (delta containment, terminal survival), but the guide's "lightweight process checks" category has footholds already:
172
-
173
- - The 3-layer parser (`json_success` / `regex_fallback` / `safe_idle`) is a ready-made process signal β€” we just aren't using it as a reward component yet.
174
- - `recent_events` in the observation shows the agent what happened immediately after its last action, giving the LLM in-context process feedback even if the gradient signal doesn't encode it.
175
- - `info["reward_breakdown"]` on every step (see `wildfire_env.py:244`) already decomposes containment / population / efficiency / speed / area / invalid_actions β€” a perfect vector for process-aware per-step rewards.
176
-
177
- ### Potential issues / improvements
178
- - **`reward_breakdown` is computed every step but never used for gradients.** Promote it to a multi-head reward list that `GRPOTrainer` consumes β€” TRL's `reward_funcs` parameter accepts a **list** of callables. Splitting our current scalar into `[containment_reward, population_reward, efficiency_reward, format_reward]` would give us the "multiple independent reward functions" the guide explicitly recommends, with near-zero code cost.
179
- - **No LLM-as-a-judge anywhere.** The guide says "lightweight" process checks are the hackathon sweet spot β€” we are on the right side of that warning. Do not add judges under deadline pressure.
180
- - **Briefing adherence is only terminal.** Convert it to per-step: at each step, count priority zones currently safe vs. burning, and reward the delta. That gives the model sub-episode feedback on instruction-following, which is explicitly Theme 2's whole point.
181
-
182
- ---
183
-
184
- ## 10) Pick the right training stack
185
-
186
- ### Approach
187
- We are running the exact stack the guide recommends:
188
-
189
- - **TRL `GRPOTrainer`** pinned to `0.12.1` in the install cell (chosen to avoid the mergekit/llm_blender eager-import issues in newer TRL releases β€” see notebook cell `code-install`).
190
- - **Unsloth** with `FastLanguageModel.from_pretrained(..., load_in_4bit=True)` and `get_peft_model(..., r=16, lora_alpha=32)` on `q/k/v/o_proj`.
191
- - **OpenEnv** shape β€” `reset/step/state` with FastAPI wrapper per Topic 5.
192
-
193
- ### Potential issues / improvements
194
- - **LoRA target modules are minimal.** `['q_proj', 'k_proj', 'v_proj', 'o_proj']` covers attention only. For Qwen-2.5 you'd typically also adapt `gate_proj`, `up_proj`, `down_proj` (the MLP) to meaningfully shift behavior. That bumps trainable parameter count ~2Γ— but is well within T4 memory for a 1.5B model. Strongly recommend adding these β€” our model is probably under-expressive right now.
195
- - **`num_generations=8, batch=1, grad_accum=4`** means each optimizer step sees 32 trajectories of gradient signal. Fine, but `max_completion_length=128` is tight β€” actions with `reason` strings or edge-case JSON formatting may get truncated. Bump to 192 if latency allows.
196
- - **TRL version pin is a ticking clock.** 0.12.1 is several releases behind. Document exactly *why* in a comment (we already do) so a reviewer doesn't assume carelessness.
197
- - **Unsloth install line is heavy.** `unsloth[colab-new] @ git+https://...` pulls the latest, which may break on the day of judging. If we can, pin to a specific commit.
198
-
199
- ---
200
-
201
- ## 11) Prefer GRPO / RLVR style training for verifiable tasks
202
-
203
- ### Approach
204
- Our task is verifiable end-to-end β€” `env.state()` returns fully objective population-lost / cells-burned / containment-pct numbers, and our reward is computed from those numbers plus deterministic action-history counters. No learned reward model is used anywhere. We are using TRL's GRPO (`GRPOTrainer`), which the guide specifically endorses over older PPO setups. The verifier (`env/reward.py`) was built before the training notebook β€” the right order.
205
-
206
- ### Potential issues / improvements
207
- - **The "verifier" is de facto the env itself, not a separate module.** That is fine but it means if the env has a bug, the verifier has the same bug. Add a small `tests/test_reward_is_deterministic.py` that runs a fixed heuristic-agent rollout twice on the same seed and asserts the reward sequence is bitwise identical β€” guards against accidentally-stochastic reward paths.
208
- - **No unit test asserts that `compute_terminal_reward` lies in `[-8, +8]`.** Add one. The advertised range in `openenv.yaml` is a judging talking point; violating it quietly would be embarrassing.
209
- - **GRPO reference model is implicit.** TRL's `GRPOTrainer` uses the pre-training model as the KL reference. On Unsloth 4-bit that KL-reference setup is sometimes subtly broken if the `ref_model` path isn't configured. Confirm the `kl_coef` and reference model are actually contributing to the loss by logging the `kl` column; if it's exactly 0 throughout training, GRPO has degenerated to REINFORCE.
210
-
211
- ---
212
-
213
- ## 12) Keep inference fast
214
-
215
- ### Approach
216
- Several efficiency choices align with this guidance:
217
-
218
- - **Unsloth 4-bit** roughly halves memory vs. a standard 8-bit load and gives ~2Γ— generation speedup on T4.
219
- - **`max_completion_length=128`** caps generation time per rollout at a few hundred ms.
220
- - **Heuristic continuation in `reward_fn`** β€” the single most effective speed trick in the notebook. Only the first of 15 rollout steps runs the LLM; the remaining 14 run the hand-coded heuristic at zero GPU cost. That directly addresses the guide's "inference dominates runtime" warning.
221
- - **Vectorized grid ops** β€” `env/grid.py` and `env/fire_spread.py` use NumPy for all per-cell loops, not Python.
222
- - **Observation serialization clusters cells** into bounding boxes before sending to the LLM, so the grid-summary string is O(regions), not O(cells).
223
-
224
- ### Potential issues / improvements
225
- - **The LoRA model is loaded without `FastLanguageModel.for_inference()` in some call paths.** Double-check: `collect_rollout` does the switch, but `reward_fn` does *not* β€” it calls `model.generate` in training mode. This can be 2-4Γ— slower than inference mode on Unsloth. Either (a) move the `FastLanguageModel.for_inference` toggle into `reward_fn` and restore `for_training` before the optimizer step, or (b) use `with torch.no_grad():` around the generate call if mode switches are awkward.
226
- - **No batched generation.** `reward_fn` iterates over `completions` list serially. If TRL passes them as a batch, great β€” but if it sends one at a time and we could otherwise batch them, we're leaving throughput on the floor.
227
- - **`model.device` is called but the notebook doesn't assert GPU.** On a T4 it will always be CUDA, but on a reviewer's CPU-only env it will silently run and take hours. Add `assert torch.cuda.is_available()` at the top of Section 1 to fail fast.
228
- - **`serialize_observation` on hard tier is O(rowsΒ·cols) for every rollout step**. For 40Γ—40Γ—300 steps per episode Γ— 8 generations Γ— 50 training steps that's 192M cell touches. Profile it β€” if it's > 5% of wall time, cache the previous serialization and diff.
229
-
230
- ---
231
-
232
- ## 13) Deploy your environment early
233
-
234
- ### Approach
235
- We have the deployment artifacts ready: `server/app.py` (FastAPI on port 7860), `Dockerfile` (`python server/app.py`), `openenv.yaml` manifest, and a Hugging Face Space reference in the README (`Eshit/Wildfire-Containment-Simulator`). The README claims it is live. The root `app.py` is a shim that forwards to `server.app:main` so both `docker run` and `python app.py` start the same server. Endpoints provided: `/`, `/health`, `/reset`, `/step`, `/state`, `/docs` (auto-generated Swagger).
236
-
237
- ### Potential issues / improvements
238
- - **Verify the Space is actually live and up-to-date.** If the repo has drifted since the last push (e.g., new actions in `models.py`), the Space will 500. Before the demo, run `curl https://<space-url>/health`, then `curl -X POST .../reset?task_id=easy&seed=42`, then step. A working remote demo is a judging multiplier.
239
- - **No automated deploy pipeline.** GitHub Actions is wired for CI tests (per the README badge) but not for pushing to the Space. At minimum, document the push command in `training/README.md` or `AGENTS.md` so any teammate can redeploy in 30 seconds.
240
- - **`_env` singleton means we can't easily show a second demo tab in parallel.** For the demo video this may be fine; for a live judge interaction it could embarrass. If time allows, switch `server/app.py` to a `dict[session_id -> WildfireEnv]` keyed on a cookie or header.
241
- - **The HTML landing page uses `&#128293;` instead of UTF-8 πŸ”₯** β€” that is fine but looks dated. Polish up the `/` endpoint HTML β€” a sharper landing page is free product polish.
242
-
243
- ---
244
-
245
- ## 14) Scale only after the environment is stable
246
-
247
- ### Approach
248
- The repo shows we followed this order. `prompts.md` confirms a sequence: (1) "Repo Cleanup & Test Scaffolding" with smoke tests β†’ (2) "Reward Restructuring" with reward tests β†’ (3) "Observation-to-Text Serializer" β†’ (4) "LLM Action Parser" β†’ (5) "Replay / GIF Renderer" β†’ ... *then* training. The test suite (`tests/test_smoke.py`, `test_reward.py`, `test_serialization.py`, `test_action_parser.py`, `test_rendering.py`, `test_briefing.py`, `test_curriculum.py`, `test_graders.py`, `test_dashboard.py`, `test_eval_compare.py`) verifies reset works, step works, rewards are sensible, the parser never crashes, graders run to completion, and renderings are non-empty β€” exactly the "before you scale" checklist the guide prescribes. Logs are visible via `recent_events` on every observation and via `info["events"]` in the StepResult.
249
-
250
- ### Potential issues / improvements
251
- - **Batch size was not stepped up after stabilization.** We are still at `per_device_train_batch_size=1, grad_accum=4` β€” the guide's "only then, increase batch sizes" step hasn't happened. On T4 we can probably go to `batch=2, grad_accum=2` (same effective batch, faster wall clock) without running out of VRAM. Try it.
252
- - **Prompt dataset is 50 rows and never resampled.** After the environment stabilized we should have diversified prompts β€” e.g., starting each prompt from a random step offset, not always step 0. Right now every training prompt is a *fresh reset* β€” so the model never learns mid-episode state recognition.
253
- - **No throughput benchmark is checked in.** Add a `scripts/bench_rollout.py` that times 10 full rollouts and prints steps/sec. That number going into judging ("our env runs 480 steps/sec on T4") is an objective achievement to cite.
254
-
255
- ---
256
-
257
- ## 15) Monitor the right things during training
258
-
259
- ### Approach
260
- `training_stats.json` already logs `(step, tier, mean_reward)` per GRPO step, and `scripts/plot_dashboard.py` renders training curves with tier-promotion markers. The GRPO trainer's own `logging_steps=1` means we see reward, loss, and KL every step in stdout. `scripts/eval_compare.py` produces a multi-agent comparison table against saved baselines.
261
-
262
- ### Potential issues / improvements
263
- - **We only log `mean_reward`** β€” the guide explicitly warns against watching a single scalar. We should also log: (a) `json_success_rate`, (b) `regex_fallback_rate`, (c) `safe_idle_rate`, (d) `invalid_action_count`, (e) `pop_lost_rate`, (f) `crew_casualty_rate`, (g) `mean_episode_length`. All of these are already computed in `info["reward_breakdown"]` on every step β€” we just need to aggregate them.
264
- - **No generation sampling to disk.** The guide's last bullet under this topic: "inspect actual generations during training." Right now we never save any completion strings. Add a step where every 10 training steps, we save the first completion of each of the 8 generations to `training/samples/step_{n}.txt`. That alone would let us catch reward hacking within 10 steps instead of at epoch end.
265
- - **No TensorBoard / W&B hook.** TRL supports both out of the box; 3 lines in `GRPOConfig(report_to="tensorboard")`. Worth it for the judge-demo screenshot alone.
266
-
267
- ---
268
-
269
- ## 16) Save models correctly
270
-
271
- ### Approach
272
- We are doing the right thing according to the guide's warning: `model.save_pretrained('checkpoints/final')` on the **LoRA-adapted 4-bit model**, followed by an explicit verification step that reloads via `FastLanguageModel.from_pretrained(final_ckpt, load_in_4bit=True)` and prints success. We do not attempt a 4-bit β†’ 16-bit upcast and merge. The `checkpints-140/` directory in the repo shows the checkpoint format is the HuggingFace adapter-only layout (`adapter_config.json`, `adapter_model.safetensors`) β€” exactly the "adapters directly" path the guide recommends.
273
-
274
- ### Potential issues / improvements
275
- - **Directory name is misspelled: `checkpints-140` (missing an 'o')**. Not a bug, but if any download script or README link uses the correct spelling it will 404. Rename to `checkpoints-140` or at least document the typo.
276
- - **Post-training inference is only tested inside the notebook, not via the server path.** After saving, the next logical step is: (a) bake the adapter into the `Dockerfile` so the Space serves the trained model, or (b) leave the server as a pure env and let the LLM live elsewhere. We have implicitly chosen (b) β€” confirm that decision in the README so judges understand the architecture.
277
- - **No test that the checkpoint actually improves behavior.** Save and reload succeed even if the adapter is all zeros. Add an assertion: `assert trained_mean_easy > untrained_mean_easy + 0.5` inside Section 6 of the notebook.
278
- - **Adapter export as a downloadable zip is documented** but not scripted. A single `scripts/package_adapter.py` would be more reliable than a Colab-specific recipe in the README.
279
-
280
- ---
281
-
282
- ## 17) How to structure your team over the hackathon
283
-
284
- ### Approach
285
- The four roles in the guide all have owners' fingerprints in the repo:
286
-
287
- - **Person A (Environment)** β€” `env/` (13 modules), `server/app.py`, `Dockerfile`, `openenv.yaml`. Every component is separated cleanly.
288
- - **Person B (Verifier / Rewards)** β€” `env/reward.py`, `graders/` (one per tier), `env/action_parser.py` (anti-corruption layer between LLM and env).
289
- - **Person C (Training)** β€” `training/grpo_colab.ipynb`, `training/README.md`, `training_stats.json`, `scripts/plot_dashboard.py`, `scripts/eval_compare.py`.
290
- - **Person D (Demo / Product)** β€” `scripts/run_demo.py`, `scripts/find_demo_seed.py`, `scripts/replay.py`, `env/rendering.py`, the HTML landing page in `server/app.py`, README narrative.
291
-
292
- ### Potential issues / improvements
293
- - **The demo person has the thinnest deliverables right now.** `demos/` GIFs and the `{TBD}` rows in the README are the weakest part of the current submission. Carve out explicit time for: (a) a 60-second demo video with heuristic-vs-trained side-by-side, (b) final benchmark numbers, (c) a polished Space landing page.
294
- - **Verifier role is sharing code with the environment role.** `env/reward.py` is the verifier β€” that is fine in a small team, but when iterating on reward we should version-tag the reward file (e.g., a `REWARD_VERSION = "v2_decomposed"` constant) so we can tell from a checkpoint which reward it was trained against.
295
-
296
- ---
297
-
298
- ## 18) A practical 1-day execution plan
299
-
300
- ### Approach
301
- Mapping our current state to the guide's 9 phases:
302
-
303
- - **Phase 1 (narrow task):** Done β€” easy tier is the narrow task, hard is the stretch goal.
304
- - **Phase 2 (build the env):** Done β€” `env/` is feature-complete.
305
- - **Phase 3 (build rewards):** Done β€” decomposed step+terminal; 4+ reward components.
306
- - **Phase 4 (deploy):** Partially done β€” Docker + FastAPI work locally; Space reference exists but needs verification.
307
- - **Phase 5 (train small):** Done once β€” `training_stats.json` shows 50 GRPO steps completed, final checkpoint in `checkpints-140/`.
308
- - **Phase 6 (inspect for hacking):** **Not done** β€” no completions have been saved to disk during training.
309
- - **Phase 7 (add curriculum):** Done β€” `CurriculumController` with known caveat (dataset is frozen, see Topic 6).
310
- - **Phase 8 (train bigger):** **Not done** β€” no second training run with larger batch / more steps / diversified prompts.
311
- - **Phase 9 (save and demo):** Partially done β€” checkpoint saved; demo video and eval-table numbers outstanding.
312
-
313
- ### Potential issues / improvements
314
- - **Priority for the remaining time, in order:**
315
- 1. Fix the frozen-dataset bug (Topic 6) and run a second training pass with a *live* curriculum.
316
- 2. Generate training-time completion samples and eyeball them (Topic 8 / 15).
317
- 3. Populate the `{TBD}` rows in the README with real numbers (Topic 19).
318
- 4. Record the demo video and push the Space.
319
- 5. (If time remains) Expand LoRA target modules and bump `max_steps` to 150.
320
- - **Do not start any new feature.** Every incomplete feature at submission time is a judge-question risk.
321
-
322
- ---
323
-
324
- ## 19) What judges or reviewers will likely find compelling
325
-
326
- ### Approach
327
- We have five of the six compelling-project elements in place:
328
-
329
- - **Clear environment design** β€” `env/` with separated subsystems, documented Pydantic models, `openenv.yaml` manifest.
330
- - **Objective reward functions** β€” verifiable from `env.state()`, no LLM judge.
331
- - **Evidence of model improvement** β€” `training_stats.json` shows ~+4 to +5 across training (noisy but present).
332
- - **Prevention against reward hacking** β€” typed actions, 3-layer parser, episode timeouts.
333
- - **Reproducible deployment story** β€” Dockerfile + openenv.yaml + Space reference.
334
-
335
- The one missing element is **a sharp demo**. The 5-beat demo format the guide recommends (baseline attempt β†’ verifier output β†’ trained attempt β†’ measurable improvement β†’ safeguards) maps naturally onto our `scripts/run_demo.py` + `scripts/replay.py` + README narrative. The GIF renderer (`env/rendering.py`) is ready; we just need to produce the three clips.
336
-
337
- ### Potential issues / improvements
338
- - **The README's banner claim uses `{TBD}%` for both heuristic and trained numbers.** This is the first thing judges see. Replace it with real numbers or soften the phrasing.
339
- - **Nothing visually distinguishes a trained run from an untrained run** in the demo assets today. A side-by-side GIF (two panels, same seed) would be a 10x force multiplier for our 60-second pitch.
340
- - **The "safeguards" story isn't spelled out anywhere user-facing.** Turn this document's Topic 8 section into two bullet points on the README so judges can see we thought about reward hacking.
341
- - **We have a unique selling point the guide doesn't: Theme 2 (long-horizon + instruction following).** The `OperationalBriefing` is a genuinely novel element β€” we should call it out explicitly in the pitch. "Most RL agents don't follow instructions. Ours reads a briefing, identifies priority zones, and gets rewarded for obeying the commander's intent."
342
-
343
- ---
344
-
345
- ## 20) Suggested problem statement theme directions
346
-
347
- ### Approach
348
- The README declares **Theme 2: Long-Horizon Planning & Instruction Following**, and every design decision flows from that choice:
349
-
350
- - **Long-horizon:** 300-step hard episodes, sparse terminal reward (+5 only on full survival), rewarding recovery from staggered ignition and crew loss.
351
- - **Instruction following:** `OperationalBriefing` on reset, explicit per-episode priority zones, briefing-adherence reward term.
352
-
353
- Both pillars are first-class features of the environment, not after-thoughts.
354
-
355
- ### Potential issues / improvements
356
- - **The briefing currently contains two priority zones and some infrastructure, but only the priority zones contribute to the adherence reward.** If "instruction following" is our pitch, the briefing should have *more* followable directives that the reward tracks β€” e.g., "maintain Corridor X open" β†’ reward if no fire ever crosses the corridor; "conserve recon for mid-episode" β†’ reward if recon is used after step 50.
357
- - **The "long horizon" claim is only hard-tier.** On easy tier, 80 steps is not "long horizon" by RL standards. Be precise in the pitch: mention that easy is a proving ground and the headline number is hard.
358
-
359
- ---
360
-
361
- ## 21) Common mistakes to avoid
362
-
363
- ### Approach
364
- Evaluating our project against the guide's blacklist:
365
-
366
- | Mistake | Our status |
367
- |---------|-----------|
368
- | Task so hard success is zero | βœ… Avoided β€” heuristic routinely scores positive on all tiers. |
369
- | Using only one reward function | ⚠️ Partially β€” we have multiple components but one combined scalar. |
370
- | Not checking for reward hacking | ⚠️ Partially β€” structural defenses in place, but no completion inspection loop. |
371
- | Training before env is stable | βœ… Avoided β€” see `prompts.md` ordering. |
372
- | Relying only on average reward | ❌ This is what we are currently doing. |
373
- | Forgetting timeouts / sandbox | βœ… Avoided β€” `episode_length` cap, Pydantic validation, parser fallback. |
374
- | Saving LoRA/QLoRA models incorrectly | βœ… Avoided β€” adapter-only save, explicit reload test. |
375
-
376
- ### Potential issues / improvements
377
- - **The two partial-credit items (multiple reward functions; completion inspection) are the cheapest wins left.** Both can be added in under an hour:
378
- - Split the reward scalar into a list of callables for TRL β€” see Topic 9.
379
- - Dump 1-2 completions per 10 training steps β€” see Topic 15.
380
- - **Relying only on average reward is the worst of the three issues.** Fix this before the final training run. Grep `training_stats.json`: the current file has exactly `mean_reward` and nothing else. This is the guide's single most-warned-against failure mode and we're committing it directly.
381
-
382
- ---
383
-
384
- ## 22) Learning Resources
385
-
386
- ### Approach
387
- The 5 video modules in the guide are aligned with code we've already written:
388
-
389
- - **Module 1 (Why OpenEnv?):** Our env implements the Gymnasium-like `reset/step/state` contract and is Dockerized β€” matches Sanyam's argument for a universal interface.
390
- - **Module 2 (Using existing envs):** Not directly applicable (we are *producing* an env, not consuming one), but Ben's three Space interfaces (server / repo / registry) are all reachable from our Space.
391
- - **Module 3 (Deploying envs):** `openenv init`-style scaffold exists (we hand-built it), local Uvicorn works (`python server/app.py`), Docker run works.
392
- - **Module 4 (Building your own):** `env/wildfire_env.py` + `env/models.py` are the business logic + models files Ben demonstrates; our parser and serializer are the "client" glue.
393
- - **Module 5 (Training + TRL / Wordle GRPO walkthrough):** Our `training/grpo_colab.ipynb` is the direct parallel β€” `reward_fn` is our `rollout function`, reward shaping is in `env/reward.py`, `GRPOTrainer` is used the same way.
394
-
395
- ### Potential issues / improvements
396
- - **We have not confirmed alignment with the Wordle walkthrough's exact `reward_fn` signature.** TRL changed the callback convention twice between 0.10 and 0.12; verify the `(completions, prompts, **kwargs)` form we use is the one 0.12.1 expects.
397
- - **No one on the team has watched Module 5 recently.** Put a 15-minute rewatch on the schedule tonight β€” Lewis's Wordle GRPO walkthrough was the direct blueprint for what we're doing and will expose any pattern we diverged from.
398
- - **Share this document with all four role-owners** before the final push. Every improvement listed above has an owner in Topic 17's table; making ownership explicit reduces coordination cost in the last 24 hours.
399
-
400
- ---
401
-
402
- ## Final summary β€” the three highest-leverage changes
403
-
404
- If we make exactly three code changes in the remaining time, they should be:
405
-
406
- 1. **Fix the frozen-dataset-vs-live-curriculum bug** (Topic 6). Regenerate the prompt dataset each time the `CurriculumController` returns a promotion. Without this, the model never actually trains on medium or hard prompts.
407
- 2. **Split the reward scalar into a list of reward functions and sample completions to disk every 10 steps** (Topics 7, 9, 15, 21). Cheap, directly addresses the guide's single most-repeated advice, and gives us the "multiple independent reward functions + human inspection" talking points for judging.
408
- 3. **Populate `{TBD}` rows in the README with real trained-model numbers and produce a side-by-side demo GIF** (Topics 19, 20). The narrative collapse without this β€” the whole submission relies on a measurable improvement claim that we have not yet measured.
409
-
410
- Everything else in this document is polish; those three are the difference between "we have the right architecture" and "we demonstrably won."
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
openenv.yaml CHANGED
@@ -36,14 +36,21 @@ environment:
36
 
37
  action_space:
38
  type: object
39
- description: "One action per step. Six action types with typed parameters."
40
  properties:
41
  action_type:
42
  type: string
43
- enum: [deploy_crew, move_crew, drop_retardant, build_firebreak, recon_flight, idle]
 
 
 
 
 
 
 
44
  crew_id:
45
  type: string
46
- description: "Required for deploy_crew, move_crew, build_firebreak"
47
  tanker_id:
48
  type: string
49
  description: "Required for drop_retardant"
@@ -57,6 +64,17 @@ action_space:
57
  type: string
58
  enum: [N, S, E, W, NE, NW, SE, SW]
59
  description: "Required for move_crew, build_firebreak"
 
 
 
 
 
 
 
 
 
 
 
60
  reason:
61
  type: string
62
  description: "Optional reason string for idle action"
 
36
 
37
  action_space:
38
  type: object
39
+ description: "One action per step. Seven action types with typed parameters."
40
  properties:
41
  action_type:
42
  type: string
43
+ enum:
44
+ - deploy_crew
45
+ - move_crew
46
+ - order_crew_objective
47
+ - drop_retardant
48
+ - build_firebreak
49
+ - recon_flight
50
+ - idle
51
  crew_id:
52
  type: string
53
+ description: "Required for deploy_crew, move_crew, build_firebreak, order_crew_objective"
54
  tanker_id:
55
  type: string
56
  description: "Required for drop_retardant"
 
64
  type: string
65
  enum: [N, S, E, W, NE, NW, SE, SW]
66
  description: "Required for move_crew, build_firebreak"
67
+ objective:
68
+ type: string
69
+ enum:
70
+ - hold
71
+ - advance
72
+ - retreat
73
+ - prioritize_north
74
+ - prioritize_south
75
+ - prioritize_east
76
+ - prioritize_west
77
+ description: "Required for order_crew_objective. Persistent directive that biases the crew's local policy until changed."
78
  reason:
79
  type: string
80
  description: "Optional reason string for idle action"
requirements.txt CHANGED
@@ -1,9 +1,31 @@
 
1
  pydantic>=2.0
2
  numpy>=1.24
3
- openai>=1.0
 
4
  fastapi>=0.100.0
5
  uvicorn>=0.23.0
 
 
 
 
 
 
6
  matplotlib>=3.7
7
  imageio>=2.28
 
 
 
8
  pytest>=7.0
9
  pytest-cov>=4.0
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ── Core environment runtime ────────────────────────────────────────────────
2
  pydantic>=2.0
3
  numpy>=1.24
4
+
5
+ # ── HTTP server (Hugging Face Space, OpenEnv contract) ──────────────────────
6
  fastapi>=0.100.0
7
  uvicorn>=0.23.0
8
+ python-multipart>=0.0.6
9
+
10
+ # ── LLM client (used by inference.py for OpenAI-compatible API rollouts) ────
11
+ openai>=1.0
12
+
13
+ # ── Replay rendering / dashboards ───────────────────────────────────────────
14
  matplotlib>=3.7
15
  imageio>=2.28
16
+ Pillow>=9.0
17
+
18
+ # ── Test suite ──────────────────────────────────────────────────────────────
19
  pytest>=7.0
20
  pytest-cov>=4.0
21
+
22
+ # ── Optional extras for evaluating a trained adapter locally ────────────────
23
+ # (Training itself is run on Colab/HF Space JupyterLab; see training/grpo_v2_colab.ipynb,
24
+ # which installs unsloth + trl + datasets inline. These pins here are purely so
25
+ # `python scripts/eval_trained_model.py` works in a stock Python venv.)
26
+ # torch>=2.1
27
+ # transformers>=4.45
28
+ # accelerate>=0.30
29
+ # peft>=0.10
30
+ # safetensors>=0.4
31
+
training/requirements.txt ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Training dependencies for SFT + GRPO
2
+ #
3
+ # These are the packages the Colab/HF-Space-JupyterLab notebooks install at
4
+ # runtime (see Section 1 of training/grpo_v2_colab.ipynb and training/sft_colab.ipynb).
5
+ # Pinned to versions that are known to work together with Unsloth 4-bit + Qwen2.5.
6
+ #
7
+ # Usage:
8
+ # pip install -r training/requirements.txt
9
+ #
10
+ # Note: Unsloth must be installed from its GitHub source for current Colab CUDA wheels.
11
+
12
+ # Project runtime (env, agents, graders)
13
+ -r ../requirements.txt
14
+
15
+ # RL framework
16
+ trl==0.15.2
17
+
18
+ # Datasets / tokenizers
19
+ datasets==3.4.1
20
+
21
+ # Quantized fine-tuning
22
+ unsloth @ git+https://github.com/unslothai/unsloth.git
23
+ peft>=0.11
24
+ bitsandbytes>=0.43
25
+
26
+ # Model + acceleration backbone
27
+ torch>=2.1
28
+ transformers>=4.45
29
+ accelerate>=0.30
30
+ safetensors>=0.4
31
+
32
+ # Logging
33
+ wandb>=0.17
34
+
35
+ # Misc utilities used by the notebooks
36
+ huggingface_hub>=0.24
37
+ sentencepiece>=0.2