MathisW78 committed on Apr 9

Commit

f748552

verified ·

1 Parent(s): c540401

Demo notebook payload (source + checkpoint + assets)


Files changed (48)

.gitattributes +3 -0
README.md +739 -0
ablation_assets/diagnosis_decision_tree.png +0 -0
ablation_assets/grad_alignment.png +3 -0
ablation_assets/gradient_conflict_map.png +0 -0
ablation_assets/group_comparison.png +0 -0
ablation_assets/group_summary.csv +5 -0
ablation_assets/hypothesis_verdicts.csv +22 -0
ablation_assets/main_results.csv +22 -0
ablation_assets/per_env_delta.png +3 -0
ablation_assets/per_env_win_rates.csv +22 -0
ablation_assets/repr_drift.png +0 -0
ablation_assets/results.json +0 -0
ablation_assets/score_comparison.png +3 -0
ablation_assets/score_delta.png +0 -0
checkpoint_inference.pth +3 -0
configs/defaults.yaml +242 -0
configs/final_qmul_gpu.yaml +176 -0
configs/final_ucl_gpu.yaml +158 -0
configs/smoke.yaml +16 -0
configs/ucl_gpu_bigger_model.yaml +103 -0
configs/ucl_gpu_learning_behaviour.yaml +103 -0
environments/.gitkeep +0 -0
main.py +255 -0
pyproject.toml +22 -0
src/__init__.py +0 -0
src/buffer.py +268 -0
src/config.py +164 -0
src/curriculum.py +143 -0
src/diffusion/__init__.py +0 -0
src/diffusion/forward.py +50 -0
src/diffusion/loss.py +162 -0
src/diffusion/sampling.py +398 -0
src/diffusion/schedules.py +88 -0
src/envs/__init__.py +0 -0
src/envs/discovery.py +166 -0
src/envs/minihack_env.py +454 -0
src/models/__init__.py +0 -0
src/models/denoiser.py +415 -0
src/planners/__init__.py +0 -0
src/planners/baselines.py +1247 -0
src/planners/collect.py +588 -0
src/planners/collect_oracle.py +185 -0
src/planners/inference.py +360 -0
src/planners/logging.py +291 -0
src/planners/offline.py +727 -0
src/planners/online.py +721 -0
src/planners/smoke.py +63 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+ablation_assets/grad_alignment.png filter=lfs diff=lfs merge=lfs -text
+ablation_assets/per_env_delta.png filter=lfs diff=lfs merge=lfs -text
+ablation_assets/score_comparison.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,739 @@

+# ReMDM Planner for MiniHack
+PyTorch implementation of **ReMDM** (Remasking Discrete Diffusion Model) for action-sequence planning in [MiniHack](https://github.com/facebookresearch/minihack) navigation environments. A dual-stream transformer generates 64-step action plans by iteratively denoising masked token sequences, conditioned on a 9x9 local crop and the full 21x79 dungeon map.
+> The primary training method is **DAgger** with BFS oracle supervision: the model is trained from scratch, with the buffer seeded by pure expert trajectories on the first iteration. A standalone **offline BC** mode is also available as an independent baseline trained on pre-collected datasets. The paper compares both methods head-to-head; neither depends on the other. An offline BC checkpoint can optionally warm-start DAgger, but this is not used in the paper. The trained planner generalises **zero-shot** from 4 in-distribution environments to 3 out-of-distribution environments.
+---
+## Pipeline
+```
+[Primary]  DAgger online training          main.py --mode dagger
+               |  (seed buffer with oracle demos on iter 0,
+               |   collect with model, label with oracle,
+               |   efficiency filter, curriculum sampling)
+               v  checkpoint
+[Evaluate] ID + OOD evaluation             main.py --mode inference --checkpoint iter8000.pth
+```
+**Other modes:**
+```
+[Collect]     Collect oracle demonstrations main.py --mode collect
+[Offline BC]  Train on pre-collected data   main.py --mode offline --data dataset.pt
+[Smoke test]  Quick end-to-end check        main.py --mode smoke
+```
+DAgger trains from scratch and is the recommended pipeline. Offline BC (`--mode collect` + `--mode offline`) is an independent training method compared against DAgger in the paper. An offline BC checkpoint can optionally warm-start DAgger via `--checkpoint`, but this was not used in the paper results.
+---
+## Environments
+**In-distribution (training):**
+| Environment | Description |
+|---|---|
+| `MiniHack-Room-Random-5x5-v0` | Small random room |
+| `MiniHack-Room-Random-15x15-v0` | Large random room |
+| `MiniHack-Corridor-R2-v0` | Two-room corridor |
+| `MiniHack-MazeWalk-9x9-v0` | Small maze |
+**Out-of-distribution (zero-shot evaluation):**
+| Environment | Description |
+|---|---|
+| `MiniHack-Room-Dark-15x15-v0` | Dark room (limited visibility) |
+| `MiniHack-Corridor-R5-v0` | Five-room corridor |
+| `MiniHack-MazeWalk-45x19-v0` | Large maze |
+---
+## Installation
+### Prerequisites
+**Python 3.12+** is required.
+**macOS (arm64):** Install cmake via Homebrew (needed to compile `nle` from source):
+```bash
+brew install cmake
+```
+**Linux (x86_64):** Pre-built wheels are available, but if building from source:
+```bash
+sudo apt-get install build-essential cmake bison flex libbz2-dev
+```
+### Setup
+```bash
+uv sync
+```
+This installs all dependencies from the lockfile, including `nle>=1.2.0` (from the maintained [NetHack-LE](https://github.com/NetHack-LE/nle) fork), `minihack`, `torch>=2.11.0`, `wandb`, `polars`, `orjson`, and `scipy`.
+### GPU support (optional)
+By default PyTorch runs on CPU. For NVIDIA CUDA 12:
+```bash
+uv pip install torch --index-url https://download.pytorch.org/whl/cu121
+```
+Verify GPU is detected:
+```bash
+uv run python -c "import torch; print(torch.cuda.is_available())"
+```
+---
+## Usage
+All modes share a single entry point. Defaults load from `configs/defaults.yaml`; any value can be overridden via `key=value` pairs.
+```bash
+python main.py --mode <MODE> [--config PATH] [key=value ...]
+```
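
A minimal sketch of how `key=value` overrides could be layered on top of the YAML defaults. `apply_overrides` is a hypothetical helper written for illustration, not the loader in `src/config.py`, and the value-parsing rules (Python literals, with a plain-string fallback) are an assumption.

```python
import ast


def apply_overrides(defaults: dict, overrides: list[str]) -> dict:
    """Return a copy of `defaults` with `key=value` strings applied on top."""
    cfg = dict(defaults)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        try:
            # Numbers, None, True/False and other Python literals are parsed.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (paths, names, lowercase true/false) stays a string.
            value = raw
        cfg[key] = value
    return cfg


defaults = {"total_timesteps": 2_000_000, "dagger_lr": 0.00003, "seed": None}
cfg = apply_overrides(defaults, ["total_timesteps=1000000", "dagger_lr=0.0001"])
print(cfg["total_timesteps"], cfg["dagger_lr"])
```

Keys that are not overridden keep their default, so `cfg["seed"]` is still `None` here.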
+### Smoke test
+Collects a few oracle trajectories, trains under a tiny 5k env-step budget, and prints ID evaluation results.
+```bash
+python main.py --mode smoke
+```
+### Collect oracle demonstrations
+Run the BFS oracle across all 4 ID environments and save the trajectories as a `.pt` dataset for offline BC training. Uses multiprocessing for parallelism.
+```bash
+# Default: 5000 episodes per env, output to data/dataset.pt
+python main.py --mode collect
+# Custom episode count and output
+python main.py --mode collect collect_episodes_per_env=2000 \
+    collect_output=data/small_dataset.pt
+# Fewer workers (default: 8)
+python main.py --mode collect collect_num_workers=4
+# Reproducible with fixed seed
+python main.py --mode collect seed=42
+```
+The output `.pt` file is directly consumable by `--mode offline`:
+```bash
+python main.py --mode collect
+python main.py --mode offline --data data/dataset.pt
+```
+### Offline BC (optional)
+Train the diffusion model on pre-collected oracle demonstrations. The run length
+is controlled by `total_timesteps` — each env-step of the unified budget
+corresponds to one dataset sample, so total gradient steps =
+`total_timesteps // offline_batch_size`.
+Periodic ID + OOD evaluation runs during training on the cadence defined by
+`id_eval_every_timesteps` / `ood_eval_every_timesteps` (env-step units,
+converted internally to grad-step deltas via `// offline_batch_size`),
+mirroring the DAgger eval pattern. Results are logged to `eval_id/` and
+`eval_ood/` W&B namespaces.
+```bash
+python main.py --mode offline --data path/to/dataset.pt
+# Shorter / longer run (the same knob the DAgger and SB3 baselines use):
+python main.py --mode offline --data dataset.pt total_timesteps=500000
+# Resume from a step-level checkpoint (restores optimizer, scheduler,
+# step counter, and W&B run)
+python main.py --mode offline --data path/to/dataset.pt \
+    --checkpoint checkpoints/offline_step2000.pth
+```
+Step-level checkpoints are written every `checkpoint_every_timesteps` env-step
+equivalents (converted internally to grad steps via `// offline_batch_size`).
+Set to `0` to disable:
+```bash
+python main.py --mode offline --data dataset.pt checkpoint_every_timesteps=0
+```
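
As a worked example of the env-step to grad-step conversion described above, the snippet below applies the `// offline_batch_size` rule to the default values listed in this README. `to_grad_steps` is an illustrative name, not a function from the repository.

```python
def to_grad_steps(env_steps: int, batch_size: int) -> int:
    """Convert an env-step quantity to gradient steps by floor division."""
    return env_steps // batch_size


offline_batch_size = 3584  # README default

total_grad_steps = to_grad_steps(2_000_000, offline_batch_size)  # total_timesteps
eval_every = to_grad_steps(25_000, offline_batch_size)           # id/ood eval cadence
checkpoint_every = to_grad_steps(125_000, offline_batch_size)    # checkpoint cadence

print(total_grad_steps, eval_every, checkpoint_every)  # 558 6 34
```

With the defaults, an offline run is therefore only a few hundred optimizer updates long, which is the motivation for the grad-step overrides in the next subsection.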
+#### Compute-match overrides (paper-fair BC vs DAgger)
+For research comparisons against a specific DAgger checkpoint, four optional
+offline-only overrides bypass the env-step budget derivation. The
+sample-to-grad-step ratio between the two modes (~50×) makes a single shared
+`total_timesteps` budget unfair to one side; these knobs pin offline metrics
+in grad-step units instead. All default to `null` (backwards compatible).
+| Key | Purpose |
+|---|---|
+| `offline_total_grad_steps` | Pin gradient budget. Overrides `total_timesteps // offline_batch_size`. Use to match a DAgger iteration count (e.g. `60000` = 600 iters × 100 grad_steps_per_iter). |
+| `offline_eval_every_grad_steps` | ID/OOD eval cadence in grad-step units. Without this, env-step cadence applied to BC's dense per-sample budget yields hundreds of evals. |
+| `offline_checkpoint_every_grad_steps` | Checkpoint cadence in grad-step units. Same motivation. |
+| `offline_buffer_capacity` | Distinct from `buffer_capacity` (sized for DAgger's small FIFO). The full BC dataset has ~500k–1M sliding windows; using DAgger's cap silently truncates. |
+Example: train a fair offline BC baseline matched to DAgger@iter600
+(60k AdamW updates × 2048 batch):
+```bash
+python main.py --mode offline --data data/oracle_bc_qmul.pt \
+    --config configs/final_qmul_gpu.yaml
+```
+The `final_qmul_gpu.yaml` and `final_ucl_gpu.yaml` configs both ship with
+these overrides pre-set and with cross-cluster-identical training
+hyperparameters (only collection-worker counts and output paths differ).
+### DAgger online training
+Full DAgger loop: seed buffer with oracle data, collect with model, label with BFS oracle, filter by efficiency, train on buffer.
+```bash
+# From scratch (seeds buffer with oracle data automatically)
+python main.py --mode dagger
+# Resume from local checkpoint
+python main.py --mode dagger --checkpoint checkpoints/iter3000.pth
+# Resume from a W&B artifact
+python main.py --mode dagger \
+    --wandb-artifact entity/project/checkpoint-iter3000:latest
+# Skip warm-start from checkpoint (reinitialise model, keep config)
+python main.py --mode dagger --checkpoint checkpoints/iter3000.pth --no-warm-start
+# Override hyperparameters (total_timesteps is the unified run-length knob)
+python main.py --mode dagger total_timesteps=1000000 dagger_lr=0.0001
+# Use a GPU-optimised config (paper run, QMUL H200)
+python main.py --mode dagger --config configs/final_qmul_gpu.yaml
+```
+### Inference
+Evaluate a checkpoint on specified environments. Accepts either `--checkpoint` (local path) or `--wandb-artifact` (W&B artifact reference).
+```bash
+# All ID + OOD environments
+python main.py --mode inference --checkpoint checkpoints/iter8000.pth
+# From a W&B artifact
+python main.py --mode inference \
+    --wandb-artifact entity/project/checkpoint-iter8000:latest
+# Specific environments, save JSON
+python main.py --mode inference \
+    --checkpoint checkpoints/iter8000.pth \
+    --envs MiniHack-Room-Random-5x5-v0 MiniHack-MazeWalk-45x19-v0 \
+    --episodes 100 \
+    --output results.json
+# Custom .des scenario files
+python main.py --mode inference \
+    --checkpoint checkpoints/iter8000.pth \
+    --des environments/custom_level.des
+# Local-only ablation (zero out global map)
+python main.py --mode inference \
+    --checkpoint checkpoints/iter8000.pth --blind-global
+# Use training weights instead of EMA
+python main.py --mode inference --checkpoint iter8000.pth --no-ema
+```
+### Baselines (SB3 + Decision Transformer)
+Train and evaluate the head-to-head baselines used in the paper comparison.
+Six algorithms are wired in: standard discrete-action RL via Stable-Baselines3
+(`ppo`, `a2c`, `dqn`, `ppo-rnn`), Behavioural Cloning (`bc`) on oracle
+demonstrations, and a causal Decision Transformer (`dt`) with target-return
+conditioning. All six share the unified `cfg.total_timesteps` budget so the
+numbers are directly comparable to DAgger and offline BC.
+Hyperparameters live under the `baselines_*` namespace in `configs/defaults.yaml`
+(BC epochs / batch / LR, DT context length / depth / width, oracle episodes per
+env, eval cadence, DQN replay buffer, parallel SubprocVecEnv count, etc.). The
+runner writes per-seed checkpoints, SB3 logs, and an aggregated results JSON
+under `cfg.baselines_output_dir` (default `outputs/baselines/`); W&B runs land
+in a separate project (`cfg.baselines_wandb_project`, default `remdm-baselines`)
+so they don't pollute the main training leaderboards.
+```bash
+# PPO on the 4 ID maps for the unified env-step budget, 1 seed
+python main.py --mode baselines --algo ppo
+# DQN with a custom budget and 3 seeds
+python main.py --mode baselines --algo dqn \
+    --seeds 0 1 2 \
+    total_timesteps=1000000
+# Behavioural Cloning baseline (oracle demos -> SB3 ActorCriticPolicy)
+python main.py --mode baselines --algo bc --n-seeds 3
+# Decision Transformer (causal R/s/a transformer with target-return)
+python main.py --mode baselines --algo dt --seeds 0 1 2
+# Override the aggregated-results JSON destination
+python main.py --mode baselines --algo ppo --output results/ppo_smoke.json
+# Paper-fair comparison against the ReMDM online budget (~5.65M env-steps)
+python main.py --mode baselines --algo ppo total_timesteps=5650000
+```
+The BC and DT defaults (50 epochs, 5000 oracle trajectories per ID env, 64-token
+DT context, 256-D DT embedding) are tuned to match the data and compute scale of
+the offline BC and ReMDM runs reported in the paper.
+### CLI flags
+| Flag | Description |
+|---|---|
+| `--mode` | Required. One of `smoke`, `collect`, `offline`, `dagger`, `inference`, `baselines` |
+| `--config PATH` | Config file (default: `configs/defaults.yaml`) |
+| `--algo NAME` | Baseline algorithm (`ppo`, `a2c`, `dqn`, `ppo-rnn`, `bc`, `dt`); required with `--mode baselines` |
+| `--seeds N [N ...]` | Explicit seed list for `--mode baselines` |
+| `--n-seeds N` | Number of seeds starting from 0 (alternative to `--seeds`) |
+| `--data PATH` | Dataset `.pt` file (offline mode) |
+| `--checkpoint PATH` | Checkpoint `.pth` file |
+| `--wandb-artifact REF` | W&B artifact reference (e.g. `entity/project/name:latest`) |
+| `--no-warm-start` | Skip model warm-start from checkpoint (DAgger) |
+| `--no-ema` | Use training weights instead of EMA for inference |
+| `--envs ENV [ENV ...]` | Override evaluation environments |
+| `--des PATH [PATH ...]` | Custom `.des` scenario files for evaluation |
+| `--episodes N` | Episodes per environment (default: 50) |
+| `--output PATH` | Save evaluation results / aggregated baselines JSON |
+| `--blind-global` | Zero out global map observations (local-only ablation) |
+---
+## Architecture
+**`LocalDiffusionPlannerWithGlobal`** (~5.2M parameters):
+```
+Local stream:   9x9 glyphs -> Embedding(6000,64) -> CNN(64->32->64) -> Linear -> 1 token
+Global stream:  21x79 glyphs -> Embedding(6000,32) -> CNN(32->32->64) -> Pool(2,4) -> 8 tokens
+                Goal head: mean(global) -> MLP -> [B,2] staircase coords (aux loss)
+                Gate: sigmoid(learnable scalar, init=-3.0) * global_tokens
+Action stream:  Embedding(14, 256) + timestep_emb(100, 256) + position_emb(64, 256)
+Transformer:    concat [1 + 8 + 64 = 73 tokens] -> 4-layer encoder (256D, 4 heads, pre-norm)
+Output head:    last 64 tokens -> Linear(256, 12) -> action logits
+```
+The model takes `(local_obs, global_obs, noisy_action_seq, t_discrete)` and returns `{"actions": [B,64,12], "goal_pred": [B,2]}`.
+A `LocalDiffusionPlanner` variant (no global stream, no goal head) is also available for ablation studies.
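
Two numbers implied by the diagram can be checked by hand: the transformer sequence length and the initial value of the global gate. The snippet only does that arithmetic and does not reproduce the model.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


gate_at_init = sigmoid(-3.0)   # global_gate_init default
n_tokens = 1 + 8 + 64          # local token + global tokens + action tokens

print(round(gate_at_init, 4), n_tokens)  # 0.0474 73
```

With the default init, the global tokens are scaled by roughly 0.05 at the start of training, so the global stream starts almost closed and is opened only as the gate logit is learned.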
+---
+## Diffusion
+**Forward process (MDLM):** Each action token is independently replaced with `MASK` (token 12) with probability `1 - alpha(t)`, where `alpha(t)` follows a linear or cosine schedule. PAD tokens (13) are never masked.
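
A minimal pure-Python sketch of the forward masking step. The token ids come from the text above; the closed forms used for `alpha(t)` (linear `1 - t`, cosine `cos(pi t / 2)`) are assumptions for illustration and may differ from `src/diffusion/schedules.py`.

```python
import math
import random

MASK, PAD = 12, 13  # token ids from the README


def alpha(t: float, schedule: str = "linear") -> float:
    """Keep-probability at noise level t in [0, 1] (assumed functional forms)."""
    if schedule == "linear":
        return 1.0 - t
    if schedule == "cosine":
        return math.cos(0.5 * math.pi * t)
    raise ValueError(f"unknown schedule: {schedule}")


def forward_mask(actions: list[int], t: float, rng: random.Random,
                 schedule: str = "linear") -> list[int]:
    """Independently replace each non-PAD token with MASK w.p. 1 - alpha(t)."""
    p_mask = 1.0 - alpha(t, schedule)
    return [
        tok if tok == PAD or rng.random() >= p_mask else MASK
        for tok in actions
    ]


rng = random.Random(0)
x0 = [0, 3, 3, 1, PAD, PAD]
print(forward_mask(x0, t=0.0, rng=rng))  # nothing is masked at t = 0
print(forward_mask(x0, t=1.0, rng=rng))  # every non-PAD token is masked at t = 1
```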
+**Loss:** Cross-entropy on masked positions only, averaged globally across the batch. By default uses a flat average (matching the reference implementation). Optional SUBS importance weighting `w(t) = -alpha'(t) / (1 - alpha(t))`, clipped to `[0, 1000]`, can be enabled via `use_importance_weighting: true`. Optional label smoothing via `label_smoothing` (default 0.0).
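
The loss described above can be sketched for a single sequence without any framework. This is an illustration under stated assumptions, not the code in `src/diffusion/loss.py`: one noise level `t` is used for the whole sequence, and `subs_weight` assumes the linear schedule `alpha(t) = 1 - t`, for which `w(t)` reduces to `1 / t`.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - lse for v in logits]


def subs_weight(t: float, clip: float = 1000.0) -> float:
    """w(t) = -alpha'(t) / (1 - alpha(t)), clipped to [0, clip]; equals 1/t for linear alpha."""
    if t <= 0.0:
        return clip
    return min(max(1.0 / t, 0.0), clip)


def masked_ce(logits, targets, is_masked, t=None, use_importance_weighting=False):
    """Cross-entropy over masked positions only, as a flat average."""
    total, count = 0.0, 0
    for row, target, masked in zip(logits, targets, is_masked):
        if masked:
            total += -log_softmax(row)[target]
            count += 1
    if count == 0:
        return 0.0  # no masked positions: return 0.0 rather than NaN
    loss = total / count
    if use_importance_weighting:
        loss *= subs_weight(t)
    return loss


uniform = [[0.0] * 12, [0.0] * 12]          # uniform logits over 12 actions
print(masked_ce(uniform, [3, 5], [True, False]))   # ln(12), about 2.485
print(masked_ce(uniform, [3, 5], [False, False]))  # 0.0
```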
+**Reverse sampling (ReMDM):** Over `K` denoising steps (default 10):
+1. Model predicts logits; apply temperature scaling and top-k filtering (the `top_k` setting, unrelated to the step count `K`).
+2. Sample predictions; compute per-token confidence.
+3. **MaskGIT unmask:** commit the `n_unmask` highest-confidence masked positions.
+4. **ReMDM remask:** stochastically re-mask committed positions to allow refinement.
+5. Final step: commit all remaining positions.
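
The five steps above can be sketched as a small loop over a toy denoiser. Several pieces are assumptions made for the example: the `n_unmask` rule (an even split of the remaining masked positions), the linearly decaying `sigma_max`, and the `toy_model` stand-in. The remask probability uses the `conf` formula from the table in the next subsection.

```python
import math
import random

MASK, N_ACTIONS = 12, 12  # MASK id and action-vocabulary size from the README


def filtered_probs(logits, temperature, top_k):
    """Temperature-scaled softmax restricted to the top_k most likely actions."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    keep = set(sorted(range(len(exps)), key=exps.__getitem__, reverse=True)[:top_k])
    norm = sum(exps[i] for i in keep)
    return [exps[i] / norm if i in keep else 0.0 for i in range(len(exps))]


def remdm_sample(model, seq_len=64, steps=10, temperature=0.5, top_k=4,
                 eta=0.15, rng=None):
    rng = rng or random.Random()
    seq = [MASK] * seq_len
    conf = [0.0] * seq_len
    for step in range(steps):
        last = step == steps - 1
        logits = model(seq)                                   # 1. predict logits
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        proposals = {}
        for i in masked:                                      # 2. sample + confidence
            probs = filtered_probs(logits[i], temperature, top_k)
            action = rng.choices(range(N_ACTIONS), weights=probs)[0]
            proposals[i] = (action, probs[action])
        # 3. MaskGIT unmask: commit the most confident masked positions
        #    (even split over remaining steps is an assumed schedule).
        n_unmask = len(masked) if last else math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_unmask]:
            seq[i], conf[i] = proposals[i]
        if last:
            break                                             # 5. all positions committed
        # 4. ReMDM remask (`conf` strategy); sigma_max is an assumed bound.
        sigma_max = 1.0 - (step + 1) / steps
        for i, tok in enumerate(seq):
            if tok != MASK and rng.random() < eta * sigma_max * (1.0 - conf[i]):
                seq[i] = MASK
    return seq


def toy_model(seq):
    # Hypothetical stand-in for the denoiser: prefers action (position mod 4).
    return [[2.0 if a == i % 4 else 0.0 for a in range(N_ACTIONS)]
            for i in range(len(seq))]


plan = remdm_sample(toy_model, rng=random.Random(0))
print(len(plan), MASK in plan)
```

Because the final step commits every remaining masked position, the returned plan never contains `MASK`, regardless of how many positions were remasked along the way.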
+**Greedy sampling:** Used during DAgger data collection for deterministic rollouts. Same MaskGIT progressive unmasking loop but with argmax decoding (no temperature, no top-K, no remasking). Uses fewer denoising steps (`diffusion_steps_collect: 5`) for faster collection.
+### Remasking strategies
+| Strategy | Formula | Description |
+|---|---|---|
+| `rescale` | `p = eta * sigma_max` | Proportional to noise level |
+| `cap` | `p = min(eta, sigma_max)` | Fixed upper bound |
+| `conf` | `p = eta * sigma_max * (1 - confidence)` | Low-confidence tokens remasked more |
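
The three formulas in the table translate directly into one function. `remask_prob` is an illustrative name; how `sigma_max` and `confidence` are computed is left to the caller.

```python
def remask_prob(strategy: str, eta: float, sigma_max: float,
                confidence: float = 0.0) -> float:
    """Per-token remasking probability for the strategies in the table."""
    if strategy == "rescale":
        return eta * sigma_max
    if strategy == "cap":
        return min(eta, sigma_max)
    if strategy == "conf":
        return eta * sigma_max * (1.0 - confidence)
    raise ValueError(f"unknown remask strategy: {strategy}")


# Default eta = 0.15 at an example noise bound of 0.5:
for name in ("rescale", "cap", "conf"):
    print(name, remask_prob(name, eta=0.15, sigma_max=0.5, confidence=0.9))
```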
+---
+## Configuration
+### Key hyperparameters
+**Model**
+| Parameter | Default | Description |
+|---|---|---|
+| `n_embd` | 256 | Transformer hidden dimension |
+| `n_head` | 4 | Attention heads |
+| `n_layer` | 4 | Transformer blocks |
+| `n_global_tokens` | 8 | Global stream context tokens |
+| `seq_len` | 64 | Action plan length |
+| `dropout` | 0.0 | Transformer dropout (0.0 -- forward masking regularises) |
+| `ema_decay` | 0.999 | EMA smoothing for inference weights |
+| `global_gate_init` | -3.0 | Initial value for global gate logit |
+**Diffusion**
+| Parameter | Default | Description |
+|---|---|---|
+| `noise_schedule` | `linear` | `linear` or `cosine` |
+| `num_diffusion_steps` | 100 | Discrete timestep resolution |
+| `diffusion_steps_eval` | 10 | Denoising iterations at inference |
+| `diffusion_steps_collect` | 5 | Denoising iterations during DAgger collection |
+| `remask_strategy` | `conf` | `rescale`, `cap`, or `conf` |
+| `eta` | 0.15 | Remasking strength |
+| `temperature` | 0.5 | Sampling temperature |
+| `top_k` | 4 | Top-K filtering |
+| `replan_every` | 16 | Env steps before replanning |
+| `loss_weight_clip` | 1000.0 | SUBS importance weight clip bound |
+| `label_smoothing` | 0.0 | Label smoothing for cross-entropy |
+| `use_importance_weighting` | false | SUBS w(t) in loss (off = flat average) |
+| `physics_aware_sampling` | false | Penalise hazardous actions at inference |
+**Training budget (unified)**
+Offline BC, DAgger, and the SB3 baselines all share a single env-step budget
+expressed in `total_timesteps` (matching the SB3 convention). This is the only
+knob that should change to scale a run up or down.
+| Parameter | Default | Description |
+|---|---|---|
+| `total_timesteps` | 2,000,000 | Env-step budget shared across offline / DAgger / SB3 |
+| `id_eval_every_timesteps` | 25,000 | ID eval cadence (env-steps) |
+| `ood_eval_every_timesteps` | 25,000 | OOD eval cadence (env-steps) |
+| `checkpoint_every_timesteps` | 125,000 | Checkpoint cadence (env-steps) |
+- **Offline BC:** each dataset sample is one env.step() equivalent, so total
+  gradient steps = `total_timesteps // offline_batch_size`. The cosine LR
+  schedule's `T_max` derives from the same quantity, so runs of different
+  lengths still decay to the 10% floor at their end.
+- **DAgger:** the training loop tracks cumulative `env.step()` calls (model +
+  oracle rollouts combined) and halts when the running total reaches
+  `total_timesteps`. `episodes_per_iteration` and `grad_steps_per_iteration`
+  control the collect/train ratio but **must not** scale with the budget.
+- **Fairness caveat (`ema_decay`):** this is an absolute-update-count constant
+  (averaging window ~ `1 / (1 − decay)` updates, half-life ~ `0.69 / (1 − decay)`).
+  If `total_timesteps` shifts by more than ~2× from the default, the fraction of
+  training covered by the EMA window changes. For very short or very long runs,
+  consider setting a matching decay manually.
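
To make the EMA time scale concrete, the sketch below uses the standard exponential-moving-average update (assumed here; `ModelEMA` in `src/models/denoiser.py` may differ in detail) and computes both the averaging window `1 / (1 - decay)` and the exact half-life `ln(0.5) / ln(decay)` for the default decay.

```python
import math


def ema_update(ema: float, param: float, decay: float = 0.999) -> float:
    """Standard EMA update for a single scalar weight."""
    return decay * ema + (1.0 - decay) * param


def ema_window(decay: float) -> float:
    """Effective averaging window, in optimizer updates."""
    return 1.0 / (1.0 - decay)


def ema_half_life(decay: float) -> float:
    """Number of updates after which an old value's weight has halved."""
    return math.log(0.5) / math.log(decay)


print(round(ema_window(0.999)), round(ema_half_life(0.999)))  # 1000 693
```

At 100 gradient steps per DAgger iteration, a window of about 1000 updates spans roughly 10 iterations.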
+**Training**
+| Parameter | Default | Description |
+|---|---|---|
+| `offline_lr` | 0.0003 | BC learning rate (cosine-decayed to 10% over `total_grad_steps`) |
+| `dagger_lr` | 0.00003 | DAgger learning rate (constant) |
+| `offline_batch_size` | 3584 | Offline BC batch size |
+| `dagger_batch_size` | 3584 | DAgger batch size |
+| `offline_grad_clip` | 1.0 | Gradient norm clip (offline) |
+| `dagger_grad_clip` | 1.0 | Gradient norm clip (DAgger) |
+| `weight_decay` | 0.0001 | AdamW weight decay (both optimizers) |
+| `grad_steps_per_iteration` | 100 | Gradient steps per DAgger iteration |
+| `episodes_per_iteration` | 30 | Episodes collected per DAgger iteration |
+| `aux_loss_weight` | 0.5 | Weight for auxiliary goal loss |
+| `buffer_capacity` | 10000 | Replay buffer size (windows) |
+| `efficiency_multiplier` | 1.5 | DAgger efficiency filter threshold |
+| `curriculum_preseed` | true | Pre-seed curriculum with 50/50 prior |
+| `curriculum_queue_size` | 100 | Curriculum window size per environment |
+**Data Collection**
+| Parameter | Default | Description |
+|---|---|---|
+| `collect_episodes_per_env` | 5000 | Oracle episodes per ID environment |
+| `collect_num_workers` | 8 | Parallel process workers for collection |
+| `collect_output` | `data/dataset.pt` | Output path for collected dataset |
+**Evaluation**
+| Parameter | Default | Description |
+|---|---|---|
+| `eval_episodes_per_env` | 50 | Episodes per environment at eval time |
+| `checkpoint_eval_episodes` | 50 | Episodes per env at checkpoint eval |
+(Eval and checkpoint *cadences* are expressed in env-steps under
+**Training budget (unified)** above.)
+**Performance**
+| Parameter | Default | Description |
+|---|---|---|
+| `use_amp` | false | Mixed-precision (FP16) training via `torch.amp` |
+| `torch_compile` | false | `torch.compile` the model for fused kernels |
+| `num_collection_workers` | 8 | Parallel workers for DAgger episode collection |
+**Logging**
+| Parameter | Default | Description |
+|---|---|---|
+| `use_wandb` | true | Enable W&B logging |
+| `wandb_project` | `remdm-minihack` | W&B project name |
+| `wandb_resume_id` | null | W&B run ID for resumption |
+| `offline_log_every` | 10 | Stdout/W&B log frequency (offline steps) |
+| `seed` | null | RNG seed (null = random) |
+### Config presets
+| File | Purpose |
+|---|---|
+| `configs/defaults.yaml` | Base defaults for all modes |
+| `configs/smoke.yaml` | Fast smoke test (`total_timesteps=5000`, small buffer, W&B off) |
+| `configs/ucl_gpu_bigger_model.yaml` | UCL GPU exploration with a larger model (384D, 6 heads) |
+| `configs/ucl_gpu_learning_behaviour.yaml` | UCL GPU learning-behaviour study (eta=0.18, B=6144) |
+| `configs/final_qmul_gpu.yaml` | **Paper run, QMUL H200.** Drives both `--mode dagger` (reproduces the iter600 checkpoint) and `--mode offline` (compute-matched fair BC baseline: 60k grad steps × B=2048). AMP + torch.compile + 32 collection workers. |
+| `configs/final_ucl_gpu.yaml` | **Paper run, UCL 3090 Ti 24 GB.** Identical training hyperparams to the QMUL config for cross-cluster fairness; only `num_collection_workers` (8 instead of 32) and output paths differ. |
+---
+## DAgger Training Loop
+Each DAgger iteration:
+1. **Curriculum sampling:** Select an environment weighted by difficulty (low win-rate environments sampled more).
+2. **Model rollout:** Generate plans with the EMA model using greedy sampling; execute with replanning every 16 steps. Collects `episodes_per_iteration` (default 30) episodes per iteration.
+3. **Oracle rollout:** Run the BFS oracle on the **same seed** for comparison.
+4. **Efficiency filter:** Add the oracle trajectory to the buffer if the model failed or took >1.5x the oracle's steps.
+5. **Budget accounting:** Advance `env_steps_total += model_steps + oracle_steps`. The training loop halts when the running total reaches `total_timesteps`.
+6. **Training:** Sample from the replay buffer; run `grad_steps_per_iteration` gradient steps, updating EMA weights after each gradient step.
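
Steps 1, 4 and 5 can be sketched in a few lines. The efficiency rule and budget accounting follow the list above; the curriculum weighting (`1 - win_rate` plus a small floor) is an assumed rule for illustration, and the function names are not taken from `src/curriculum.py`.

```python
def efficiency_filter(model_won: bool, model_steps: int, oracle_steps: int,
                      multiplier: float = 1.5) -> bool:
    """True when the oracle trajectory should be added to the buffer."""
    return (not model_won) or model_steps > multiplier * oracle_steps


def curriculum_weights(win_rates: dict[str, float],
                       floor: float = 0.05) -> dict[str, float]:
    """Sampling weights favouring low win-rate environments (assumed rule)."""
    raw = {env: (1.0 - wr) + floor for env, wr in win_rates.items()}
    total = sum(raw.values())
    return {env: w / total for env, w in raw.items()}


# Toy episodes: (model_won, model_steps, oracle_steps)
episodes = [(True, 20, 18), (False, 64, 30), (True, 50, 25)]
env_steps_total, total_timesteps, added = 0, 200, 0
for won, model_steps, oracle_steps in episodes:
    if env_steps_total >= total_timesteps:
        break  # budget exhausted
    added += efficiency_filter(won, model_steps, oracle_steps)
    env_steps_total += model_steps + oracle_steps  # model + oracle rollouts

print(added, env_steps_total)  # 2 207
```

The first episode is skipped (won within 1.5x of the oracle), the second is added because the model failed, and the third is added because 50 steps exceeds 1.5 x 25.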
+Collection uses GPU-batched rollouts when on CUDA with `episodes_per_iteration > 1`, falling back to threaded CPU collection or sequential collection as appropriate.
+The BFS oracle uses a 5-tier priority: (1) kick adjacent doors, (2) BFS to staircase, (3) BFS to frontier, (4) BFS to farthest tile, (5) random cardinal.
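
Tier (2), BFS to the staircase, is a plain shortest-path search. The sketch below runs on a toy ASCII grid with 4-connected moves, `#` as walls and `>` as the staircase; these are simplifying assumptions, and the oracle in `src/envs/minihack_env.py` works on MiniHack observations instead.

```python
from collections import deque


def bfs_path(grid: list[str], start: tuple[int, int], goal_char: str = ">"):
    """Shortest 4-connected path from start to the nearest goal tile, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == goal_char:
            # Walk parent links back to the start, then reverse.
            path, node = [], (r, c)
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] != "#" and nxt not in parent):
                parent[nxt] = (r, c)
                queue.append(nxt)
    return None


grid = ["....#",
        ".##.#",
        "...>#"]
path = bfs_path(grid, start=(0, 0))
print(len(path) - 1)  # 5 moves to reach the staircase
```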
+---
+## Reward Shaping
+The environment wrapper applies shaped rewards to guide learning:
+| Component | Value | Condition |
+|---|---|---|
+| Win bonus | +20.0 | Episode won |
+| BFS progress | +0.5 * (prev_dist - curr_dist) | Closer to staircase |
+| Exploration | +0.05 | New tile visited |
+| Step penalty | -0.01 | Every step |
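
The table can be read as a single per-step function. This is a sketch with an illustrative signature; applying the BFS-progress term only when the distance decreases is an assumption based on the "Closer to staircase" condition.

```python
def shaped_reward(won: bool, prev_dist: int, curr_dist: int,
                  new_tile: bool) -> float:
    """Per-step shaped reward assembled from the components in the table."""
    reward = -0.01                                  # step penalty, every step
    if won:
        reward += 20.0                              # win bonus
    if curr_dist < prev_dist:                       # assumed: only when closer
        reward += 0.5 * (prev_dist - curr_dist)     # BFS progress
    if new_tile:
        reward += 0.05                              # exploration
    return reward


print(shaped_reward(won=False, prev_dist=5, curr_dist=4, new_tile=True))
```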
+---
+## Project Structure
+```
+minihack-ReMDM-planner/
+├── configs/
+│   ├── defaults.yaml                   Base hyperparameters
+│   ├── smoke.yaml                      Smoke test overrides
+│   ├── ucl_gpu_bigger_model.yaml       UCL GPU (larger model: 384D, 6 heads)
+│   ├── ucl_gpu_learning_behaviour.yaml UCL GPU learning-behaviour study
+│   ├── final_qmul_gpu.yaml             Paper run: DAgger + fair offline BC (QMUL H200)
+│   └── final_ucl_gpu.yaml              Paper run: DAgger + fair offline BC (UCL 3090 Ti)
+├── environments/                      Custom .des scenario files
+├── src/
+│   ├── config.py                      YAML config loader with CLI overrides
+│   ├── buffer.py                      ReplayBuffer with offline-protected FIFO
+│   ├── curriculum.py                  DynamicCurriculum + efficiency_filter
+│   ├── diffusion/
+│   │   ├── schedules.py               Linear and cosine noise schedules
+│   │   ├── forward.py                 Forward masking process q(z_t | x_0)
+│   │   ├── loss.py                    MDLM ELBO + auxiliary goal loss
+│   │   └── sampling.py                ReMDM reverse sampling with remasking
+│   ├── models/
+│   │   └── denoiser.py                LocalDiffusionPlannerWithGlobal + ModelEMA
+│   ├── envs/
+│   │   ├── minihack_env.py            AdvancedObservationEnv + BFS oracle
+│   │   └── discovery.py               Env registry scanner + inference benchmark
+│   └── planners/
+│       ├── collect.py                 run_model_episode + DataCollector
+│       ├── collect_oracle.py          Standalone oracle data collection
+│       ├── offline.py                 Offline BC trainer
+│       ├── online.py                  DAgger Trainer + checkpointing
+│       ├── inference.py               Evaluator + result formatting
+│       ├── baselines.py               SB3 + Decision Transformer baselines
+│       ├── smoke.py                   Smoke-test runner
+│       └── logging.py                 Centralised W&B + stdout logging
+├── experiments/
+│   └── rl_finetuning/                 RL fine-tuning ablation suite
+│       ├── run_ablations.py           CLI entry point
+│       ├── configs/                   Ablation config files
+│       ├── ablations/                 Loss, optimizer, registry, training
+│       ├── diagnostics/               Gradient, representation, timestep metrics
+│       └── analysis/                  Plots, tables, reports
+├── scripts/
+│   ├── hf_upload.py                   HuggingFace Hub upload utility
+│   └── profile_dagger.py             DAgger iteration profiler
+├── main.py                            CLI entry point (smoke/collect/offline/dagger/inference/baselines)
+├── pyproject.toml                     PEP 621 project metadata + dependencies
+├── uv.lock                            Deterministic lockfile
+└── README.md
+```
+---
+## W&B Metric Namespaces
+| Namespace | Contents |
+|---|---|
+| `diffusion/` | `loss`, `loss_diff`, `loss_aux` |
+| `train/` | `buffer_size`, `buffer_online_frac`, `model_won`, `added_to_buffer`, `episodes_collected`, `model_steps`, `oracle_steps`, `efficiency_ratio`, `lr`, `grad_norm`, `global_gate`, `env_steps`, `progress` |
+| `speed/` | `iter_time_sec`, `collect_time_sec`, `train_step_time_sec`, `samples_per_sec`, `env_steps_per_sec`, `gpu_memory_mb` |
+| `perf/` | `iter_time_s`, `collect_time_s`, `train_time_s`, `grad_steps_per_sec` (legacy compat) |
+| `model/` | `param_norm`, `param_drift_from_init`, `ema_gate_value` (every 10 iters) |
+| `eval_id/{env}/` | Per-environment win rate, avg steps, avg reward (in-distribution) |
+| `eval_ood/{env}/` | Per-environment win rate, avg steps, avg reward (out-of-distribution) |
+| `eval_id/` | `mean_win_rate` |
+| `eval_ood/` | `mean_win_rate` |
+| `curriculum/{env}/` | `win_rate` per training environment |
+| `ckpt_eval_id/`, `ckpt_eval_ood/` | Per-env metrics at checkpoint time |
+| `ckpt_eval/` | `id_winrate`, `ood_winrate` |
+| `offline/` | `final_loss`, `total_steps`, `total_timesteps` (summary only) |
+Both DAgger and offline BC emit to `eval_id/` and `eval_ood/` namespaces.
+Offline mode reuses the same `Evaluator` and EMA-weight evaluation path as
+DAgger, so curves are directly comparable across modes.
+---
+## Checkpoint Format
+**DAgger checkpoint:**
+```python
+{
+    "model_state_dict":     ...,
+    "ema_state_dict":       ...,
+    "optimizer_state_dict": ...,
+    "scheduler_state_dict": ...,
+    "curriculum_state":     {...},
+    "iteration":            int,
+    "env_steps":            int,   # cumulative env.step() calls so far
+    "wandb_run_id":         str | None,
+    "rng_states":           {"torch", "numpy", "python"},
+}
+```
+**Offline BC checkpoint** (step-level, file `offline_step{N}.pth`, saved when
+`checkpoint_every_timesteps > 0`):
+```python
+{
+    "model_state_dict":     ...,
+    "ema_state_dict":       ...,
+    "optimizer_state_dict": ...,
+    "scheduler_state_dict": ...,
+    "step":                 int,
+    "env_steps":            int,   # step * offline_batch_size
+    "wandb_run_id":         str | None,
+}
+```
+**Offline final checkpoint** (saved at the end of offline training):
+```python
+{
+    "model_state_dict":     ...,
+    "ema_state_dict":       ...,
+    "wandb_run_id":         str | None,
+}
+```
+Inference uses EMA weights by default. Pass `--no-ema` to use training weights.
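
A small sketch of how a loaded checkpoint dictionary could be inspected. It works on plain dicts so it runs without PyTorch; in practice the dict would come from something like `torch.load(path, map_location="cpu")`. Both helper names are hypothetical.

```python
def select_state_dict(checkpoint: dict, use_ema: bool = True) -> dict:
    """Pick EMA weights by default, training weights when use_ema is False."""
    key = "ema_state_dict" if use_ema else "model_state_dict"
    if key not in checkpoint:
        raise KeyError(f"checkpoint has no {key!r} entry")
    return checkpoint[key]


def is_resumable(checkpoint: dict) -> bool:
    """Training can resume only if optimizer and scheduler state were saved."""
    return all(k in checkpoint
               for k in ("optimizer_state_dict", "scheduler_state_dict"))


# Toy stand-ins for the formats listed above (real values are tensors).
final_ckpt = {"model_state_dict": {"w": 1.0}, "ema_state_dict": {"w": 0.9},
              "wandb_run_id": None}
step_ckpt = dict(final_ckpt, optimizer_state_dict={}, scheduler_state_dict={},
                 step=2000, env_steps=2000 * 3584)

print(select_state_dict(final_ckpt), is_resumable(final_ckpt), is_resumable(step_ckpt))
```

The offline final checkpoint is enough for inference but, lacking optimizer and scheduler state, not for resuming training.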
+### W&B Artifacts
+Checkpoints are automatically uploaded as versioned W&B artifacts (type `"model"`) at each checkpoint save. Each artifact contains the `.pth` weights and a `config.yaml` snapshot of all hyperparameters used.
+To resume from an artifact:
+```bash
+# DAgger resume
+python main.py --mode dagger \
+    --wandb-artifact entity/project/checkpoint-iter3000:latest
+# Inference
+python main.py --mode inference \
+    --wandb-artifact entity/project/checkpoint-iter8000:v2
+```
+The artifact reference format is `entity/project/artifact-name:version` where version is `latest`, `v0`, `v1`, etc.
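
The reference format can be validated before any network call. `parse_artifact_ref` is a hypothetical helper for illustration and is not part of the W&B client or this repository.

```python
from typing import NamedTuple


class ArtifactRef(NamedTuple):
    entity: str
    project: str
    name: str
    version: str


def parse_artifact_ref(ref: str) -> ArtifactRef:
    """Split `entity/project/artifact-name:version` into its four parts."""
    path, sep, version = ref.rpartition(":")
    parts = path.split("/")
    if not sep or not version or len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected entity/project/artifact-name:version, got {ref!r}")
    return ArtifactRef(*parts, version)


ref = parse_artifact_ref("entity/project/checkpoint-iter3000:latest")
print(ref.name, ref.version)  # checkpoint-iter3000 latest
```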
+### W&B Run Resumption
+All training loops save the W&B run ID in their checkpoints. When resuming from a checkpoint, the run ID is automatically extracted and passed to `wandb.init(resume="must")`, so metrics continue on the same W&B curves with no gaps.
+```bash
+# DAgger: automatic -- run ID is read from the checkpoint
+python main.py --mode dagger --checkpoint checkpoints/iter2000.pth
+# Offline BC: automatic
+python main.py --mode offline --data dataset.pt \
+    --checkpoint checkpoints/offline_step2000.pth
+# Manual override (e.g. checkpoint saved before this feature was added):
+python main.py --mode dagger --checkpoint old_checkpoint.pth \
+    wandb_resume_id=abc123xyz
+# Ablation suite:
+python experiments/rl_finetuning/run_ablations.py \
+    --checkpoint path/to/ckpt.pth --all --use_wandb \
+    --wandb_resume_id abc123xyz
+```
+The run ID is visible in the W&B dashboard URL: `wandb.ai/.../runs/<run-id>`.
+---
+## Performance Tuning
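+Three config keys control performance optimisations. `use_amp` and `torch_compile` default to `false` in `defaults.yaml` and are switched on in the GPU-specific `final_*_gpu.yaml` configs; `num_collection_workers` defaults to `8`.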
+### Mixed precision (`use_amp: true`)
+Wraps training forward/backward in `torch.amp.autocast("cuda")` with `GradScaler`. Active in both offline BC and DAgger training.
+- **Measured speedup:** 2.2x on gradient steps, 1.7x on full smoke test wall-clock
+- **Memory:** peak GPU memory stays at ~16 GB at B=3584 (the same as FP32, because the model is embedding-heavy)
+- **Correctness:** loss trajectory and win rates statistically equivalent to FP32
+- **When to use:** always on GPU. No effect on CPU (autocast is a no-op)
+- **Default:** `false` in `defaults.yaml`; enabled in GPU-specific configs
+### torch.compile (`torch_compile: true`)
+Applies `torch.compile(model, mode="default")` before training. Falls back gracefully if no C compiler is found (common on managed GPU nodes).
+- **Measured speedup:** none beyond AMP alone. Not recommended for primary training.
+- **Default:** `false` in `defaults.yaml`; opt in via the `final_*_gpu.yaml` configs.
+- **When to use:** experimental only. May help on future PyTorch versions with better dynamic shape support.
+### Parallel collection (`num_collection_workers: N`)
+DAgger episode collection supports three strategies (auto-selected):
+1. **GPU-batched** (default on CUDA with `episodes_per_iteration > 1`): all envs in lockstep
+2. **Threaded CPU** (fallback when `num_collection_workers > 0`): `ThreadPoolExecutor` with CPU model copies
+3. **Sequential** (reference behaviour): one episode at a time
+- **Default:** `8` workers in `defaults.yaml`
+- **When to use:** GPU-batched is preferred; workers primarily affect the CPU fallback path
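The auto-selection among the three strategies can be sketched as a simple precedence rule. This is an illustration only: the function name and return labels are hypothetical, and the precedence order is assumed to follow the numbered list above.

```python
def choose_collection_strategy(
    on_cuda: bool, episodes_per_iteration: int, num_collection_workers: int
) -> str:
    """Return the collection strategy implied by the documented rules."""
    if on_cuda and episodes_per_iteration > 1:
        # All environments stepped in lockstep with batched model calls.
        return "gpu_batched"
    if num_collection_workers > 0:
        # Thread pool with CPU model copies.
        return "threaded_cpu"
    # Reference behaviour: one episode at a time.
    return "sequential"


default_gpu = choose_collection_strategy(True, 30, 8)
cpu_with_workers = choose_collection_strategy(False, 30, 8)
smoke_test = choose_collection_strategy(False, 2, 0)
```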
+### Profiling
+Run `python scripts/profile_dagger.py [key=value ...]` to profile DAgger iteration components. Supports all config overrides (e.g., `use_amp=true`).
+---
+## Implementation Notes
+- **MDLM loss** returns `0.0` (not NaN) when no masked positions exist in the batch. Uses global averaging by default; SUBS importance weighting is opt-in via `use_importance_weighting: true`.
+- **PAD tokens** are never masked during the forward process and are excluded from the loss.
+- **Sampling paths:** Evaluation uses stochastic ReMDM sampling (temperature, top-K, remasking) with `diffusion_steps_eval` (default 10) steps. DAgger collection uses greedy argmax sampling (deterministic, no remasking) with `diffusion_steps_collect` (default 5) steps for faster rollouts.
+- **`remdm_sample`** guarantees a fully committed output (no MASK tokens) via a final-step commit and an assertion check. A min-keep 10% safety net prevents degenerate all-masked states.
+- **EMA** shadow weights are updated after every gradient step (not per iteration). The `DataCollector` syncs the latest EMA weights before each rollout.
+- **Curriculum** initialises with a 50/50 prior per environment (configurable via `curriculum_preseed`) and uses bucket-based weights over the rolling win-rate: low `[0, 0.15)` → 0.2, medium `[0.15, 0.85)` → 1.0, high `[0.85, 1.0]` → 0.1.
+- **Replay buffer** pins offline data at the front; only online samples are FIFO-evicted. Returns `None` on empty buffer (callers handle gracefully).
+- **Global gate** initialises at `sigmoid(-3.0) ≈ 0.047`, starting nearly closed to prevent the global stream from destabilising early training.
+- **Dropout** is set to 0.0 by default. The discrete diffusion forward masking already regularises; dropout on top is redundant.
+- **DAgger warm-start:** On iteration 0, the buffer is seeded with 3 oracle trajectories per ID environment (12 total), giving the curriculum and training loop data to work with immediately.
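The curriculum bucket rule from the notes above can be sketched as follows. The bucket boundaries and weights are taken from the text; the function names and the normalisation of weights into sampling probabilities are assumptions made for illustration.

```python
from typing import Dict


def bucket_weight(win_rate: float) -> float:
    """Map a rolling win rate to its curriculum weight."""
    if win_rate < 0.15:
        return 0.2  # low bucket [0, 0.15): mostly failing
    if win_rate < 0.85:
        return 1.0  # medium bucket [0.15, 0.85): most useful to train on
    return 0.1  # high bucket [0.85, 1.0]: mostly solved


def sampling_probs(win_rates: Dict[str, float]) -> Dict[str, float]:
    """Normalise bucket weights into per-environment probabilities (assumed)."""
    weights = {env: bucket_weight(wr) for env, wr in win_rates.items()}
    total = sum(weights.values())
    return {env: w / total for env, w in weights.items()}


# Illustrative win rates for the four ID environments.
probs = sampling_probs(
    {
        "MiniHack-Room-Random-5x5-v0": 1.0,
        "MiniHack-Room-Random-15x15-v0": 0.7,
        "MiniHack-Corridor-R2-v0": 0.3,
        "MiniHack-MazeWalk-9x9-v0": 0.1,
    }
)
```

Under this rule the 50/50 prior places every environment in the medium bucket, so early sampling is uniform until real win rates accumulate.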

ablation_assets/diagnosis_decision_tree.png ADDED Viewed

ablation_assets/grad_alignment.png ADDED Viewed

Git LFS Details

SHA256: 4f1186a56694e83030a800c6e098f302228018159c170f4bb4fc3203a9e23e6e
Pointer size: 131 Bytes
Size of remote file: 320 kB

ablation_assets/gradient_conflict_map.png ADDED Viewed

ablation_assets/group_comparison.png ADDED Viewed

ablation_assets/group_summary.csv ADDED Viewed

	@@ -0,0 +1,5 @@

+Group,N,Mean,Best,Worst,StdDev
+Baseline,1,0.5625,0.5625,0.5625,0.0
+A,6,0.6021,0.6667,0.5583,0.0358
+B,7,0.4988,0.6542,0.0625,0.1834
+C,7,0.6125,0.6458,0.5833,0.0184

ablation_assets/hypothesis_verdicts.csv ADDED Viewed

	@@ -0,0 +1,22 @@

+Method,Group,Score,Delta_Baseline,Verdict,Hypothesis
+advantage_clip,B,0.4958,-0.0667,NEUTRAL,If clipping helps: large advantage magnitudes destabilise training
+attention_only,C,0.6167,0.0542,IMPROVEMENT,"If attention-only works: model needs routing updates, not feature updates"
+baseline_rl,Baseline,0.5625,0.0,NEUTRAL,Diagnoses whether the RL signal alone causes collapse
+bc_wins,B,0.5708,0.0083,NEUTRAL,If BC on wins helps: the return weighting is the specific cause
+entropy_bonus,B,0.5708,0.0083,NEUTRAL,If entropy bonus helps: collapse is mode-collapse; not a gradient problem
+ewc,A,0.6667,0.1042,IMPROVEMENT,If EWC helps: forgetting pretrained representations is the proximate cause
+ffn_only,C,0.6083,0.0458,NEUTRAL,If FFN-only works: stored knowledge (FFN as memory) needs updating; not attention
+frozen_backbone,C,0.6167,0.0542,IMPROVEMENT,If frozen backbone helps: deep gradient flow into backbone causes collapse
+gradient_surgery,B,0.6542,0.0917,IMPROVEMENT,If PCGrad helps: gradients are conflicting and resolvable by projection
+head_only,C,0.5958,0.0333,NEUTRAL,If head-only works: backbone representations are fine; only decision boundary needs updating
+kl_penalty,A,0.5583,-0.0042,NEUTRAL,If this helps: catastrophic forgetting is the primary cause; soft regularisation suffices
+layer_ablation_top1,C,0.6208,0.0583,IMPROVEMENT,Minimal unfrozen depth needed; collapse depth correlates with gradient flow depth
+layer_ablation_top2,C,0.6458,0.0833,IMPROVEMENT,Minimal unfrozen depth needed; collapse depth correlates with gradient flow depth
+layer_ablation_top3,C,0.5833,0.0208,NEUTRAL,Minimal unfrozen depth needed; collapse depth correlates with gradient flow depth
+llrd,A,0.625,0.0625,IMPROVEMENT,If LLRD helps: deep gradient flow into early layers corrupts representations
+lora,A,0.6042,0.0417,NEUTRAL,If LoRA works: too many unconstrained degrees of freedom cause collapse
+low_t,B,0.55,-0.0125,NEUTRAL,If low-t helps: high-t (coarse-structure) gradients are biased
+mixed_replay,A,0.5833,0.0208,NEUTRAL,If mixed replay helps: online data distribution alone is too corrupted
+normalized_adv,B,0.0625,-0.5,COLLAPSE,If std normalisation helps: simple mean normalisation is too loose
+t_curriculum,B,0.5875,0.025,NEUTRAL,If curriculum helps: ordering of learning signals matters
+trust_region_kl,A,0.575,0.0125,NEUTRAL,If hard constraint helps: soft KL is insufficient -- a hard boundary is needed
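As a cross-check, the per-group figures in `group_summary.csv` can be recomputed from the scores in `hypothesis_verdicts.csv`. The sketch below copies the scores by hand (baseline row omitted) and assumes the `StdDev` column is a population standard deviation, which is what reproduces the published values.

```python
from collections import defaultdict
from statistics import fmean, pstdev

# (method, group, score) copied from hypothesis_verdicts.csv.
ROWS = [
    ("advantage_clip", "B", 0.4958), ("attention_only", "C", 0.6167),
    ("bc_wins", "B", 0.5708), ("entropy_bonus", "B", 0.5708),
    ("ewc", "A", 0.6667), ("ffn_only", "C", 0.6083),
    ("frozen_backbone", "C", 0.6167), ("gradient_surgery", "B", 0.6542),
    ("head_only", "C", 0.5958), ("kl_penalty", "A", 0.5583),
    ("layer_ablation_top1", "C", 0.6208), ("layer_ablation_top2", "C", 0.6458),
    ("layer_ablation_top3", "C", 0.5833), ("llrd", "A", 0.625),
    ("lora", "A", 0.6042), ("low_t", "B", 0.55),
    ("mixed_replay", "A", 0.5833), ("normalized_adv", "B", 0.0625),
    ("t_curriculum", "B", 0.5875), ("trust_region_kl", "A", 0.575),
]


def summarise(rows):
    """Group scores and compute N, mean, best, worst, population std."""
    by_group = defaultdict(list)
    for _method, group, score in rows:
        by_group[group].append(score)
    return {
        group: {
            "n": len(scores),
            "mean": round(fmean(scores), 4),
            "best": max(scores),
            "worst": min(scores),
            "std": round(pstdev(scores), 4),  # ddof=0
        }
        for group, scores in by_group.items()
    }


summary = summarise(ROWS)
```

Group B's large spread comes almost entirely from the single `normalized_adv` collapse at 0.0625.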

ablation_assets/main_results.csv ADDED Viewed

	@@ -0,0 +1,22 @@

+Method,Group,Score,Delta_Pretrained,Delta_Baseline,Verdict
+ewc,A,0.6667,0.0792,0.1042,IMPROVEMENT
+gradient_surgery,B,0.6542,0.0667,0.0917,IMPROVEMENT
+layer_ablation_top2,C,0.6458,0.0583,0.0833,IMPROVEMENT
+llrd,A,0.625,0.0375,0.0625,IMPROVEMENT
+layer_ablation_top1,C,0.6208,0.0333,0.0583,IMPROVEMENT
+frozen_backbone,C,0.6167,0.0292,0.0542,IMPROVEMENT
+attention_only,C,0.6167,0.0292,0.0542,IMPROVEMENT
+ffn_only,C,0.6083,0.0208,0.0458,NEUTRAL
+lora,A,0.6042,0.0167,0.0417,NEUTRAL
+head_only,C,0.5958,0.0083,0.0333,NEUTRAL
+t_curriculum,B,0.5875,-0.0,0.025,NEUTRAL
+mixed_replay,A,0.5833,-0.0042,0.0208,NEUTRAL
+layer_ablation_top3,C,0.5833,-0.0042,0.0208,NEUTRAL
+trust_region_kl,A,0.575,-0.0125,0.0125,NEUTRAL
+entropy_bonus,B,0.5708,-0.0167,0.0083,NEUTRAL
+bc_wins,B,0.5708,-0.0167,0.0083,NEUTRAL
+baseline_rl,Baseline,0.5625,-0.025,0.0,NEUTRAL
+kl_penalty,A,0.5583,-0.0292,-0.0042,NEUTRAL
+low_t,B,0.55,-0.0375,-0.0125,NEUTRAL
+advantage_clip,B,0.4958,-0.0917,-0.0667,NEUTRAL
+normalized_adv,B,0.0625,-0.525,-0.5,COLLAPSE
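The two delta columns can be reproduced from two reference scores. The baseline value 0.5625 is the `baseline_rl` row; the pretrained value 0.5875 is not stated anywhere in the table and is inferred here from `Score - Delta_Pretrained`, which gives the same number in every row. Treat it as an inference, not a reported figure.

```python
BASELINE_RL_SCORE = 0.5625  # from the baseline_rl row
PRETRAINED_SCORE = 0.5875  # inferred from Score - Delta_Pretrained


def deltas(score: float) -> tuple:
    """Return (Delta_Pretrained, Delta_Baseline) rounded to 4 decimals."""
    return (
        round(score - PRETRAINED_SCORE, 4),
        round(score - BASELINE_RL_SCORE, 4),
    )


ewc_deltas = deltas(0.6667)
collapse_deltas = deltas(0.0625)
```

Read this way, plain RL fine-tuning (`baseline_rl`) sits 0.025 below the inferred pretrained score, so several NEUTRAL methods are in fact slightly worse than not fine-tuning at all.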

ablation_assets/per_env_delta.png ADDED Viewed

Git LFS Details

SHA256: 8a205ec378a19466551c384d990d43f644008b711ecf43c0c014c97d4517b78d
Pointer size: 131 Bytes
Size of remote file: 120 kB

ablation_assets/per_env_win_rates.csv ADDED Viewed

	@@ -0,0 +1,22 @@

+Method,MiniHack-Room-Random-5x5-v0,MiniHack-Room-Random-15x15-v0,MiniHack-Corridor-R2-v0,MiniHack-MazeWalk-9x9-v0
+advantage_clip,0.9,0.95,0.25,0.15
+attention_only,1.0,0.9,0.5,0.35
+baseline_rl,1.0,0.7,0.3,0.1
+bc_wins,0.9,0.7,0.4,0.1
+entropy_bonus,0.9,0.4,0.45,0.15
+ewc,1.0,0.85,0.6,0.3
+ffn_only,1.0,1.0,0.35,0.3
+frozen_backbone,0.95,0.9,0.5,0.4
+gradient_surgery,1.0,0.9,0.45,0.25
+head_only,1.0,0.8,0.3,0.3
+kl_penalty,0.9,1.0,0.2,0.45
+layer_ablation_top1,0.9,0.75,0.2,0.2
+layer_ablation_top2,0.95,0.9,0.35,0.25
+layer_ablation_top3,0.95,0.75,0.4,0.45
+llrd,0.8,0.9,0.4,0.25
+lora,1.0,0.75,0.2,0.2
+low_t,1.0,0.6,0.45,0.15
+mixed_replay,0.95,0.75,0.45,0.2
+normalized_adv,0.1,0.0,0.1,0.1
+t_curriculum,1.0,0.8,0.3,0.15
+trust_region_kl,0.95,0.75,0.45,0.25

ablation_assets/repr_drift.png ADDED Viewed

ablation_assets/results.json ADDED Viewed

The diff for this file is too large to render. See raw diff

ablation_assets/score_comparison.png ADDED Viewed

Git LFS Details

SHA256: 93ee968f4869c03806b3243b8a95db5dab7f48c963151b5186693bd37330cd02
Pointer size: 131 Bytes
Size of remote file: 116 kB

ablation_assets/score_delta.png ADDED Viewed

checkpoint_inference.pth ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4619870dd5fcdb2f1575c4a458e128f3da31f9a75a73562d9d316f60f288df20
+size 20991233

configs/defaults.yaml ADDED Viewed

	@@ -0,0 +1,242 @@

+# ── Environments ──────────────────────────────────────────────────────
+id_envs:
+  - MiniHack-Room-Random-5x5-v0
+  - MiniHack-Room-Random-15x15-v0
+  - MiniHack-Corridor-R2-v0
+  - MiniHack-MazeWalk-9x9-v0
+ood_envs:
+  - MiniHack-Room-Dark-15x15-v0
+  - MiniHack-Corridor-R5-v0
+  - MiniHack-MazeWalk-45x19-v0
+crop_size: 9
+map_h: 21
+map_w: 79
+action_dim: 12
+mask_token: 12
+pad_token: 13
+# ── Model ─────────────────────────────────────────────────────────────
+n_embd: 256
+n_head: 4
+n_layer: 4
+n_global_tokens: 8
+seq_len: 64
+global_gate_init: -3.0
+# Transformer dropout. 0.0 is deliberate — discrete diffusion forward masking
+# already regularises; dropout on top is redundant.
+dropout: 0.0
+ema_decay: 0.999
+# ── Diffusion (MDLM) ─────────────────────────────────────────────────
+noise_schedule: linear
+num_diffusion_steps: 100
+loss_weight_clip: 1000.0
+label_smoothing: 0.0
+# Use SUBS importance weighting w(t) in loss. Off by default (flat average
+# matching reference). Enable for MDLM ELBO experiments.
+use_importance_weighting: false
+# ReMDM stochastic remask base fraction
+eta: 0.15
+# Remasking strategy: rescale | cap | conf
+remask_strategy: conf
+# ── Inference ─────────────────────────────────────────────────────────
+# Number of reverse denoising steps at inference.
+# Reference uses 5 (aggressive). Higher = better quality, slower.
+diffusion_steps_eval: 10
+# Denoising steps during DAgger collection. Fewer than eval since
+# collection only needs "good enough" plans for efficiency comparison.
+diffusion_steps_collect: 5
+temperature: 0.5
+top_k: 4
+replan_every: 16
+# Soft-penalise hazardous cardinal actions during stochastic sampling.
+# Not active in the reference evaluation pipeline; off by default.
+physics_aware_sampling: false
+# ── Training budget (unified) ────────────────────────────────────────
+# Total environment-step budget for training. Matches the SB3
+# `total_timesteps` convention so runs can be compared apples-to-apples
+# across offline BC, DAgger, and SB3 baselines.
+#
+# • DAgger: cumulative env.step() calls across model + oracle rollouts.
+#   Training stops once this budget is exhausted. `episodes_per_iteration`
+#   and `grad_steps_per_iteration` control the collect/train ratio;
+#   they do NOT change the total compute.
+# • Offline BC: each dataset sample corresponds to one env.step() that
+#   collected it. Total gradient steps = total_timesteps // batch_size,
+#   i.e. the training consumes exactly `total_timesteps` samples.
+#
+# Fairness invariant — parameters that scale AUTOMATICALLY with this
+# budget:
+#   * offline LR cosine T_max (= total_timesteps / offline_batch_size)
+#   * id_eval_every_timesteps / ood_eval_every_timesteps (env-step cadence)
+#   * checkpoint_every_timesteps (env-step cadence; offline converts via
+#     / offline_batch_size)
+#
+# Parameters held FIXED across different budgets (tuning knobs, not
+# fairness knobs):
+#   * offline_batch_size, dagger_batch_size — per-step SNR
+#   * offline_lr, dagger_lr — peak learning rate
+#   * weight_decay, *_grad_clip, efficiency_multiplier, aux_loss_weight,
+#     loss_weight_clip, label_smoothing — optimisation regularisers
+#   * episodes_per_iteration, grad_steps_per_iteration — the collect/train
+#     ratio is itself a design choice; scaling these would confound
+#     collection coverage with update density
+#   * curriculum_queue_size, buffer_capacity — in absolute units by design
+#
+# Fairness caveat — `ema_decay` is an absolute-update-count constant
+# (time constant ≈ 1 / (1 − decay) steps; half-life ≈ ln 2 / (1 − decay)).
+# If total_timesteps shifts by more
+# than ~2x from the default, the fraction of training covered by the EMA
+# window changes. For very short or very long runs, consider manually
+# setting a matching decay (shorter run → lower decay, longer → higher).
+total_timesteps: 2000000
+# Evaluation + checkpoint cadence, in env-step units. These scale with
+# total_timesteps so every run gets ~N eval points and ~M checkpoints
+# regardless of budget. For offline BC, the cadence is converted to
+# gradient-step intervals via `/ offline_batch_size`.
+id_eval_every_timesteps: 25000
+ood_eval_every_timesteps: 25000
+checkpoint_every_timesteps: 125000
+# ── Offline BC ────────────────────────────────────────────────────────
+offline_lr: 0.0003
+offline_batch_size: 3584
+offline_grad_clip: 1.0
+aux_loss_weight: 0.5
+# ── Offline BC compute-match overrides (all opt-in, default null) ───
+# These exist solely to support paper-fair comparisons against a
+# specific DAgger iteration count, where the env-step / grad-step
+# ratio between the two modes (~50x) makes a single shared
+# `total_timesteps` budget unfair to one side. When null, offline
+# falls back to the env-step-derived defaults.
+#
+# offline_total_grad_steps: pin gradient budget (e.g. 60000 to match
+#   600 DAgger iters × 100 grad_steps_per_iter).
+# offline_eval_every_grad_steps: ID/OOD eval cadence in grad-step
+#   units. Without this, dense env-step cadence yields ~500 evals.
+# offline_checkpoint_every_grad_steps: checkpoint cadence in grad-step
+#   units. Same motivation as eval cadence.
+# offline_buffer_capacity: distinct from `buffer_capacity` (which is
+#   sized for DAgger's small FIFO buffer). The full BC dataset has
+#   ~500k–1M sliding windows; using DAgger's cap silently truncates.
+offline_total_grad_steps: null
+offline_eval_every_grad_steps: null
+offline_checkpoint_every_grad_steps: null
+offline_buffer_capacity: null
+# ── DAgger ────────────────────────────────────────────────────────────
+dagger_lr: 0.00003
+dagger_batch_size: 3584
+dagger_grad_clip: 1.0
+weight_decay: 0.0001
+buffer_capacity: 10000
+episodes_per_iteration: 30
+grad_steps_per_iteration: 100
+efficiency_multiplier: 1.5
+curriculum_queue_size: 100
+# Pre-seed curriculum queues with 50/50 prior for uniform early sampling.
+curriculum_preseed: true
+eval_episodes_per_env: 50
+checkpoint_eval_episodes: 50
+# ── Performance ──────────────────────────────────────────────────────
+# Mixed-precision (FP16) training via torch.amp.autocast + GradScaler.
+# Speeds up forward/backward ~1.5-2x on GPU. No effect on CPU.
+use_amp: false
+# torch.compile the model for fused kernels (experimental).
+# May cause slow first iteration due to compilation. No effect on CPU.
+torch_compile: false
+# Number of parallel workers for DAgger episode collection.
+# 0 = sequential (reference behaviour). Recommended: 4-8 on multi-core.
+num_collection_workers: 8
+# ── Data Collection ─────────────────────────────────────────────────
+# Oracle episodes per ID environment for --mode collect.
+collect_episodes_per_env: 5000
+# Parallel environment workers for collection.
+collect_num_workers: 8
+# Output path for collected dataset.
+collect_output: "data/dataset.pt"
+# ── Checkpointing & Logging ──────────────────────────────────────────
+checkpoint_dir: checkpoints
+save_policy: true
+hub_run_id: null
+hub_repo_id: null
+use_wandb: true
+wandb_project: remdm-minihack
+wandb_entity: "mathis-weil-university-college-london-ucl-"
+wandb_run_name: null
+wandb_resume_id: null
+offline_log_every: 10
+seed: null
+# ── SB3 / DT baselines ───────────────────────────────────────────────
+# Baselines compared head-to-head against the diffusion planner.
+# Entry point:
+#   python main.py --mode baselines --algo {ppo,dqn,a2c,ppo-rnn,bc,dt}
+#
+# Algorithm families:
+#   * SB3 RL (ppo, a2c, dqn, ppo-rnn): consume `cfg.total_timesteps` as
+#     the env-step training budget — same convention as DAgger / offline
+#     BC. Use a custom MiniHack CNN feature extractor over the dict
+#     observation {"local": (1,9,9), "global": (1,21,79)}.
+#   * Behavioural Cloning (bc): collects oracle trajectories, trains an
+#     SB3 ActorCriticPolicy with a native PyTorch CE loop, evaluates on
+#     ID + OOD environments.
+#   * Decision Transformer (dt): collects oracle trajectories with
+#     return-to-go labels, trains a small causal transformer over
+#     interleaved (R, s, a) tokens, evaluates with target-return
+#     conditioning on ID + OOD environments.
+#
+# Number of parallel SB3 SubprocVecEnv workers per ID environment.
+# Effective n_envs = baselines_n_envs_per_id * len(id_envs). Default = 2
+# → 8 parallel envs over the 4 ID maps.
+baselines_n_envs_per_id: 2
+# DQN replay buffer capacity (transitions). Used only for --algo dqn.
+baselines_dqn_buffer_size: 100000
+# SB3 EvalCallback cadence in env-steps. Independent from
+# id/ood_eval_every_timesteps because SB3's eval pipeline is per
+# vector-env tick, not shared with the diffusion planner's evaluator.
+baselines_eval_freq_env_steps: 10000
+# Episodes per env at every eval trigger AND at the final BC / DT
+# manual evaluation pass. Falls back to eval_episodes_per_env (50) when
+# null so the comparison stays apples-to-apples with DAgger evals.
+baselines_eval_episodes_per_env: null
+# ── BC baseline ──────────────────────────────────────────────────────
+# Oracle trajectories collected per ID environment (seeds 0..N-1).
+# 5000 matches the offline BC dataset scale used by ReMDM.
+baselines_bc_oracle_episodes_per_env: 5000
+baselines_bc_epochs: 50
+baselines_bc_batch_size: 256
+baselines_bc_lr: 0.0003
+# ── Decision Transformer baseline ────────────────────────────────────
+# 5000 trajectories per ID env to match the BC / ReMDM data scale.
+baselines_dt_oracle_episodes_per_env: 5000
+baselines_dt_epochs: 50
+baselines_dt_context_len: 64
+baselines_dt_embed_dim: 256
+baselines_dt_n_layers: 4
+baselines_dt_n_heads: 4
+baselines_dt_lr: 0.0003
+baselines_dt_batch_size: 256
+# Maximum episode length covered by DT positional embeddings. MUST be
+# >= the longest oracle trajectory observed during data collection.
+# Aligned with baselines_dt_eval_max_steps so positional embeddings
+# cover the full eval-cap horizon.
+baselines_dt_max_ep_len: 200
+# DT eval rollout cap (steps before truncating an episode as a loss).
+baselines_dt_eval_max_steps: 200
+# ── Output / W&B ─────────────────────────────────────────────────────
+# Separate W&B project for baselines (kept distinct from the main
+# remdm-minihack project so baseline runs don't pollute training
+# leaderboards). Set to null to fall back to wandb_project.
+baselines_wandb_project: remdm-baselines
+# Where per-seed checkpoints, SB3 logs, and aggregated results JSON
+# are written. Resolved relative to the project root unless absolute.
+baselines_output_dir: outputs/baselines
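For offline BC, the env-step budget and cadences in `defaults.yaml` translate into gradient-step quantities roughly as sketched below. The function name is hypothetical, and the use of floor division with a minimum of one step for the cadences is an assumption; the config comments only say the conversion divides by `offline_batch_size`.

```python
def offline_schedule(
    total_timesteps: int,
    batch_size: int,
    eval_every_timesteps: int,
    checkpoint_every_timesteps: int,
) -> dict:
    """Convert env-step budget and cadences into gradient-step units."""
    return {
        # Also the cosine schedule length (T_max) per the config comments.
        "total_grad_steps": total_timesteps // batch_size,
        # Assumption: floor division, clamped so the cadence is never zero.
        "eval_every_grad_steps": max(1, eval_every_timesteps // batch_size),
        "checkpoint_every_grad_steps": max(1, checkpoint_every_timesteps // batch_size),
    }


# Values from defaults.yaml.
schedule = offline_schedule(
    total_timesteps=2_000_000,
    batch_size=3584,
    eval_every_timesteps=25_000,
    checkpoint_every_timesteps=125_000,
)
```

With the defaults this gives only 558 gradient steps in total, which illustrates why the `offline_total_grad_steps` override exists for compute-matched comparisons.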

configs/final_qmul_gpu.yaml ADDED Viewed

	@@ -0,0 +1,176 @@

+# =============================================================================
+# QMUL H200 GPU — final paper run config
+# =============================================================================
+#
+# This single config drives BOTH the final DAgger run that produced
+# `checkpoint_final/online/final.pth` AND the compute-matched offline
+# BC baseline used for the paper comparison.
+#
+#   --mode dagger    → reproduces the iter600 DAgger checkpoint recipe
+#   --mode offline   → trains a fair offline BC baseline against it
+#
+# ── Fairness analysis ───────────────────────────────────────────────
+#
+# DAgger compute at iter600 (the checkpointed model):
+#     600 iters × 100 grad_steps_per_iter × 2048 batch_size
+#         = 60,000 AdamW updates
+#         = 122,880,000 sample-equivalents
+#
+# The fair offline BC baseline matches this exactly:
+#     offline_total_grad_steps = 60,000  (override; pinned)
+#     offline_batch_size       = 2048    (matches DAgger; same SNR)
+#     weight_decay, grad_clip, aux_loss_weight, model arch, diffusion
+#         params: all matched. Model is identical between modes.
+#
+# LR strategy follows "best-of-each-method" rather than identical
+# optimisers: DAgger's constant 3e-5 is tuned for online refinement,
+# while offline's 3e-4 with cosine decay to 3e-5 is the standard BC
+# recipe when training from scratch. Both end at the same late-training LR.
+#
+# Eval/checkpoint cadence is matched in *count* across modes (12 evals,
+# 6 checkpoints per run) via the offline_*_every_grad_steps overrides,
+# because the env-step→grad-step ratio differs by ~50× between modes.
+#
+# ── Hardware ─────────────────────────────────────────────────────────
+#
+# QMUL H200 (constrained VRAM allocation). The DAgger checkpoint was
+# produced on this hardware, so batch_size and AMP settings must
+# stay identical to the original run. AMP + torch.compile + 32-worker
+# collection are the original perf settings.
+# ── Environments ─────────────────────────────────────────────────────
+id_envs:
+  - MiniHack-Room-Random-5x5-v0
+  - MiniHack-Room-Random-15x15-v0
+  - MiniHack-Corridor-R2-v0
+  - MiniHack-MazeWalk-9x9-v0
+ood_envs:
+  - MiniHack-Room-Dark-15x15-v0
+  - MiniHack-Corridor-R5-v0
+  - MiniHack-MazeWalk-45x19-v0
+crop_size: 9
+map_h: 21
+map_w: 79
+action_dim: 12
+mask_token: 12
+pad_token: 13
+# ── Model (matches checkpoint) ───────────────────────────────────────
+n_embd: 256
+n_head: 4
+n_layer: 4
+n_global_tokens: 8
+seq_len: 64
+global_gate_init: -3.0
+dropout: 0.0
+ema_decay: 0.999
+# ── Diffusion (MDLM) — matches checkpoint ────────────────────────────
+noise_schedule: linear
+num_diffusion_steps: 100
+loss_weight_clip: 1000.0
+label_smoothing: 0.0
+use_importance_weighting: false
+eta: 0.15
+remask_strategy: conf
+# ── Inference / sampling — matches checkpoint ────────────────────────
+diffusion_steps_eval: 10
+diffusion_steps_collect: 5
+temperature: 0.5
+top_k: 4
+replan_every: 16
+physics_aware_sampling: false
+# ── Shared training budget (DAgger only) ─────────────────────────────
+# 5.65M env-steps reproduces the env-step budget consumed at iter600
+# of the original DAgger run. This figure is calibrated against a real
+# DAgger run with the same recipe (`p7wfp67q`, episodes_per_iteration=30,
+# grad_steps_per_iteration=100): summing the per-iter env steps over
+# the first 600 iterations gives 30 × 600 × mean(model_steps + oracle_steps)
+# ≈ 30 × 600 × (198 + 116) ≈ 5.65 M real env.step() calls.
+# (The earlier 3M figure was based on the buggy single-episode env-step
+# accounting in `online.py:155-169` — fixed in the same commit as this
+# config bump.) Used by `--mode dagger` only. Offline mode bypasses
+# this via `offline_total_grad_steps` below — the unified env-step
+# budget is fundamentally unfair when the sample-to-grad-step ratio
+# differs by ~50× between modes.
+total_timesteps: 5650000
+# Eval/checkpoint cadence in env-step units (DAgger mode).
+# Scaled with the corrected total_timesteps so the run still produces
+# ~12 ID/OOD evals and ~6 checkpoints over its full duration.
+# 470k → ~12 evals; 940k → ~6 checkpoints.
+id_eval_every_timesteps: 470000
+ood_eval_every_timesteps: 470000
+checkpoint_every_timesteps: 940000
+# Final-eval episode count (used by both ID/OOD eval triggers and
+# checkpoint-time evals; matches the original DAgger run).
+eval_episodes_per_env: 50
+checkpoint_eval_episodes: 50
+weight_decay: 0.0001
+aux_loss_weight: 0.5
+# ── DAgger (matches checkpoint_final/online/config_iter600.yaml) ─────
+dagger_lr: 0.00003
+dagger_batch_size: 2048
+dagger_grad_clip: 1.0
+buffer_capacity: 10000
+episodes_per_iteration: 30
+grad_steps_per_iteration: 100
+efficiency_multiplier: 1.5
+curriculum_queue_size: 100
+curriculum_preseed: true
+# ── Offline BC (compute-matched fair baseline) ───────────────────────
+# Per the fairness analysis above:
+#   * Same gradient compute as DAgger (60k AdamW updates × 2048 batch)
+#   * Same model, diffusion, weight_decay, grad_clip, aux_loss
+#   * BC-tuned LR + cosine schedule (best practice when training from scratch)
+#   * Eval/checkpoint counts matched to DAgger via grad-step overrides
+offline_lr: 0.0003
+offline_batch_size: 2048
+offline_grad_clip: 1.0
+# Compute pin: 60,000 AdamW updates = exactly DAgger@iter600.
+offline_total_grad_steps: 60000
+# Eval cadence: 5,000 grad steps → 12 evals (matches DAgger eval count).
+offline_eval_every_grad_steps: 5000
+# Checkpoint cadence: 10,000 grad steps → 6 checkpoints (matches DAgger).
+offline_checkpoint_every_grad_steps: 10000
+# Buffer cap for offline mode only — must hold the full pre-collected
+# dataset (~1M sliding windows from 20k oracle trajectories). DAgger's
+# `buffer_capacity: 10000` would silently FIFO-evict 99% of the data.
+offline_buffer_capacity: 1500000
+# ── Performance (cluster-tuned, matches original DAgger run) ─────────
+use_amp: true
+torch_compile: true
+num_collection_workers: 32
+# ── Data collection (for offline BC dataset) ─────────────────────────
+# 5000 eps × 4 ID envs = 20k oracle trajectories. Strictly more than
+# the ~7k unique trajectories DAgger had in its filtered buffer at
+# iter600 — offline always gets a richer pre-collected pool, which is
+# the standard fairness asymmetry in BC vs DAgger comparisons.
+collect_episodes_per_env: 5000
+collect_num_workers: 32
+collect_output: data/oracle_bc_qmul.pt
+# ── Checkpointing & Logging ──────────────────────────────────────────
+checkpoint_dir: checkpoints_qmul
+save_policy: true
+hub_run_id: null
+hub_repo_id: null
+use_wandb: true
+wandb_project: remdm-minihack
+wandb_entity: "mathis-weil-university-college-london-ucl-"
+wandb_run_name: null
+# wandb_resume_id intentionally omitted — fresh runs by default.
+# Override on the CLI (`wandb_resume_id=...`) to continue an existing run.
+offline_log_every: 50
+seed: null
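The fairness arithmetic in this config's header can be checked directly. The sketch below only restates numbers that appear in the comments above; the variable names are illustrative.

```python
# DAgger recipe at iter600, as stated in the config comments.
ITERS = 600
GRAD_STEPS_PER_ITER = 100
BATCH_SIZE = 2048
EPISODES_PER_ITER = 30
MEAN_MODEL_STEPS = 198  # mean env steps per model rollout
MEAN_ORACLE_STEPS = 116  # mean env steps per oracle rollout

adamw_updates = ITERS * GRAD_STEPS_PER_ITER
sample_equivalents = adamw_updates * BATCH_SIZE
env_steps = EPISODES_PER_ITER * ITERS * (MEAN_MODEL_STEPS + MEAN_ORACLE_STEPS)

# Eval and checkpoint counts implied by the two cadence settings.
TOTAL_TIMESTEPS = 5_650_000
dagger_evals = TOTAL_TIMESTEPS // 470_000
dagger_checkpoints = TOTAL_TIMESTEPS // 940_000
offline_evals = adamw_updates // 5_000
offline_checkpoints = adamw_updates // 10_000
```

The exact env-step product is 5,652,000, which the config rounds to 5.65M, and both modes end up with 12 evals and 6 checkpoints.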

configs/final_ucl_gpu.yaml ADDED Viewed

	@@ -0,0 +1,158 @@

+# =============================================================================
+# UCL 3090 Ti GPU — final paper run config
+# =============================================================================
+#
+# This single config drives BOTH the final DAgger run and the
+# compute-matched offline BC baseline used for the paper comparison.
+#
+#   --mode dagger    → reproduces the iter600 DAgger checkpoint recipe
+#   --mode offline   → trains a fair offline BC baseline against it
+#
+# All training hyperparameters are IDENTICAL to `final_qmul_gpu.yaml`
+# so cross-cluster runs produce directly comparable results. The only
+# differences are hardware-specific perf knobs (collection workers).
+# See the QMUL config header for the full fairness analysis.
+#
+# ── Hardware ─────────────────────────────────────────────────────────
+#
+# UCL 3090 Ti — 24 GB VRAM. The 4-layer × 256-dim model with
+# batch=2048 and AMP fits with comfortable headroom (~6-8 GB peak).
+# Lower core count than the QMUL cluster, so collection workers
+# capped at 8.
+# ── Environments ─────────────────────────────────────────────────────
+id_envs:
+  - MiniHack-Room-Random-5x5-v0
+  - MiniHack-Room-Random-15x15-v0
+  - MiniHack-Corridor-R2-v0
+  - MiniHack-MazeWalk-9x9-v0
+ood_envs:
+  - MiniHack-Room-Dark-15x15-v0
+  - MiniHack-Corridor-R5-v0
+  - MiniHack-MazeWalk-45x19-v0
+crop_size: 9
+map_h: 21
+map_w: 79
+action_dim: 12
+mask_token: 12
+pad_token: 13
+# ── Model (matches checkpoint) ───────────────────────────────────────
+n_embd: 256
+n_head: 4
+n_layer: 4
+n_global_tokens: 8
+seq_len: 64
+global_gate_init: -3.0
+dropout: 0.0
+ema_decay: 0.999
+# ── Diffusion (MDLM) — matches checkpoint ────────────────────────────
+noise_schedule: linear
+num_diffusion_steps: 100
+loss_weight_clip: 1000.0
+label_smoothing: 0.0
+use_importance_weighting: false
+eta: 0.15
+remask_strategy: conf
+# ── Inference / sampling — matches checkpoint ────────────────────────
+diffusion_steps_eval: 10
+diffusion_steps_collect: 5
+temperature: 0.5
+top_k: 4
+replan_every: 16
+physics_aware_sampling: false
+# ── Shared training budget (DAgger only) ─────────────────────────────
+# 5.65M env-steps reproduces the env-step budget consumed at iter600
+# of the original DAgger run. Calibrated against a real DAgger run
+# with the same recipe (see QMUL config header for the full derivation).
+# The earlier 3M figure was based on the buggy single-episode env-step
+# accounting in `online.py:155-169` — fixed in the same commit as this
+# config bump. Used by `--mode dagger` only. Offline mode bypasses
+# this via `offline_total_grad_steps` below.
+total_timesteps: 5650000
+# Eval/checkpoint cadence in env-step units (DAgger mode).
+# Scaled with the corrected total_timesteps so the run still produces
+# ~12 ID/OOD evals and ~6 checkpoints over its full duration.
+id_eval_every_timesteps: 470000
+ood_eval_every_timesteps: 470000
+checkpoint_every_timesteps: 940000
+# Final-eval episode count (used by both ID/OOD eval triggers and
+# checkpoint-time evals; matches the original DAgger run).
+eval_episodes_per_env: 50
+checkpoint_eval_episodes: 50
+weight_decay: 0.0001
+aux_loss_weight: 0.5
+# ── DAgger (matches checkpoint_final/online/config_iter600.yaml) ─────
+dagger_lr: 0.00003
+dagger_batch_size: 2048
+dagger_grad_clip: 1.0
+buffer_capacity: 10000
+episodes_per_iteration: 30
+grad_steps_per_iteration: 100
+efficiency_multiplier: 1.5
+curriculum_queue_size: 100
+curriculum_preseed: true
+# ── Offline BC (compute-matched fair baseline) ───────────────────────
+# Per the fairness analysis (see QMUL config header):
+#   * Same gradient compute as DAgger (60k AdamW updates × 2048 batch)
+#   * Same model, diffusion, weight_decay, grad_clip, aux_loss
+#   * BC-tuned LR + cosine schedule (best practice when training from scratch)
+#   * Eval/checkpoint counts matched to DAgger via grad-step overrides
+#
+# `offline_batch_size: 2048` is matched to DAgger (NOT the 4096 the
+# previous UCL config used) so per-update SNR is identical between
+# modes — this is the cleanest apples-to-apples optimisation
+# comparison. The 24 GB VRAM can hold a larger batch but using one
+# would confound the comparison.
+offline_lr: 0.0003
+offline_batch_size: 2048
+offline_grad_clip: 1.0
+# Compute pin: 60,000 AdamW updates = exactly DAgger@iter600.
+offline_total_grad_steps: 60000
+# Eval cadence: 5,000 grad steps → 12 evals (matches DAgger eval count).
+offline_eval_every_grad_steps: 5000
+# Checkpoint cadence: 10,000 grad steps → 6 checkpoints (matches DAgger).
+offline_checkpoint_every_grad_steps: 10000
+# Buffer cap for offline mode only — must hold the full pre-collected
+# dataset (~1M sliding windows from 20k oracle trajectories). DAgger's
+# `buffer_capacity: 10000` would silently FIFO-evict 99% of the data.
+offline_buffer_capacity: 1500000
+# ── Performance (cluster-tuned for 3090 Ti) ──────────────────────────
+use_amp: true
+torch_compile: true
+num_collection_workers: 8
+# ── Data collection (for offline BC dataset) ─────────────────────────
+# 5000 eps × 4 ID envs = 20k oracle trajectories. Strictly more than
+# the ~7k unique trajectories DAgger had in its filtered buffer at
+# iter600 — offline always gets a richer pre-collected pool, which is
+# the standard fairness asymmetry in BC vs DAgger comparisons.
+collect_episodes_per_env: 5000
+collect_num_workers: 8
+collect_output: data/oracle_bc_ucl.pt
+# ── Checkpointing & Logging ──────────────────────────────────────────
+checkpoint_dir: checkpoints_ucl
+save_policy: true
+hub_run_id: null
+hub_repo_id: null
+use_wandb: true
+wandb_project: remdm-minihack
+wandb_entity: "mathis-weil-university-college-london-ucl-"
+wandb_run_name: null
+# wandb_resume_id intentionally omitted — fresh runs by default.
+# Override on the CLI (`wandb_resume_id=...`) to continue an existing run.
+offline_log_every: 50
+seed: null
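The compute-matching arithmetic stated in the offline BC comments above can be checked with a short sketch. The 600-iteration count comes from the `config_iter600.yaml` reference and the "DAgger@iter600" comment; the helper name `matched_budget` is hypothetical and is not part of the repository.

```python
# Hypothetical helper (not in the repository): reproduces the budget
# arithmetic described in the config comments.
def matched_budget(
    dagger_iters: int,
    grad_steps_per_iteration: int,
    eval_every: int,
    checkpoint_every: int,
) -> dict[str, int]:
    # Total optimiser updates DAgger performs over the whole run.
    total = dagger_iters * grad_steps_per_iteration
    return {
        "total_grad_steps": total,
        "n_evals": total // eval_every,
        "n_checkpoints": total // checkpoint_every,
    }


# Values taken from the config: 100 grad steps per iteration,
# eval every 5,000 grad steps, checkpoint every 10,000 grad steps.
budget = matched_budget(600, 100, 5000, 10000)
print(budget)
```

If the DAgger iteration count or `grad_steps_per_iteration` changes, `offline_total_grad_steps` and the two cadence values would presumably need to be recomputed the same way to keep the comparison compute-matched.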

configs/smoke.yaml ADDED

	@@ -0,0 +1,16 @@

+# Smoke test overrides — fast end-to-end sanity check on CPU.
+# With total_timesteps=5000 and ~2 eps × ~30 avg steps × 2 (model+oracle)
+# = ~120 env steps/iter → ~40 iters → a few seconds per iter on CPU.
+buffer_capacity: 50
+dagger_batch_size: 256
+offline_batch_size: 256
+total_timesteps: 5000
+id_eval_every_timesteps: 2500
+ood_eval_every_timesteps: 2500
+checkpoint_every_timesteps: 2500
+episodes_per_iteration: 2
+grad_steps_per_iteration: 5
+eval_episodes_per_env: 2
+checkpoint_eval_episodes: 2
+num_collection_workers: 0
+use_wandb: false

configs/ucl_gpu_bigger_model.yaml ADDED

	@@ -0,0 +1,103 @@

+# ── Environments ──────────────────────────────────────────────────────
+id_envs:
+  - MiniHack-Room-Random-5x5-v0
+  - MiniHack-Room-Random-15x15-v0
+  - MiniHack-Corridor-R2-v0
+  - MiniHack-MazeWalk-9x9-v0
+ood_envs:
+  - MiniHack-Room-Dark-15x15-v0
+  - MiniHack-Corridor-R5-v0
+  - MiniHack-MazeWalk-45x19-v0
+crop_size: 9
+map_h: 21
+map_w: 79
+action_dim: 12
+mask_token: 12
+pad_token: 13
+# ── Model ─────────────────────────────────────────────────────────────
+n_embd: 384
+n_head: 6
+n_layer: 4
+n_global_tokens: 8
+seq_len: 64
+global_gate_init: -3.0
+# Transformer dropout. 0.0 is deliberate — discrete diffusion forward masking
+# already regularises; dropout on top is redundant.
+dropout: 0.0
+ema_decay: 0.999
+# ── Diffusion (MDLM) ─────────────────────────────────────────────────
+noise_schedule: linear
+num_diffusion_steps: 100
+loss_weight_clip: 1000.0
+label_smoothing: 0.0
+# Use SUBS importance weighting w(t) in loss. Off by default (flat average
+# matching reference). Enable for MDLM ELBO experiments.
+use_importance_weighting: false
+# ReMDM stochastic remask base fraction
+eta: 0.15
+# Remasking strategy: rescale | cap | conf
+remask_strategy: conf
+# ── Inference ─────────────────────────────────────────────────────────
+# Number of reverse denoising steps at inference.
+# Reference uses 5 (aggressive). Higher = better quality, slower.
+diffusion_steps_eval: 10
+diffusion_steps_collect: 5
+temperature: 0.5
+top_k: 4
+replan_every: 16
+# Soft-penalise hazardous cardinal actions during stochastic sampling.
+# Not active in the reference evaluation pipeline; off by default.
+physics_aware_sampling: false
+# ── Training budget (unified) ────────────────────────────────────────
+total_timesteps: 20000000
+id_eval_every_timesteps: 250000
+ood_eval_every_timesteps: 250000
+checkpoint_every_timesteps: 1250000
+# ── Offline BC ────────────────────────────────────────────────────────
+offline_lr: 0.0003
+offline_batch_size: 4608
+offline_grad_clip: 1.0
+aux_loss_weight: 0.5
+# ── DAgger ────────────────────────────────────────────────────────────
+dagger_lr: 0.00003
+dagger_batch_size: 4608
+dagger_grad_clip: 1.0
+weight_decay: 0.0001
+buffer_capacity: 10000
+episodes_per_iteration: 30
+grad_steps_per_iteration: 100
+efficiency_multiplier: 1.5
+curriculum_queue_size: 100
+# Pre-seed curriculum queues with 50/50 prior for uniform early sampling.
+curriculum_preseed: true
+eval_episodes_per_env: 50
+checkpoint_eval_episodes: 50
+# ── Performance ──────────────────────────────────────────────────────
+# Mixed-precision (FP16) training via torch.cuda.amp.
+# Speeds up forward/backward ~1.5-2x on GPU. No effect on CPU.
+use_amp: true
+# torch.compile the model for fused kernels (experimental).
+# May cause slow first iteration due to compilation. No effect on CPU.
+torch_compile: true
+# Number of parallel workers for DAgger episode collection.
+# 0 = sequential (reference behaviour). Recommended: 4-8 on multi-core.
+num_collection_workers: 8
+# ── Checkpointing & Logging ──────────────────────────────────────────
+checkpoint_dir: checkpoints_ucl_bigger_model
+save_policy: true
+hub_run_id: null
+hub_repo_id: null
+use_wandb: true
+wandb_project: remdm-minihack
+wandb_entity: "mathis-weil-university-college-london-ucl-"
+wandb_run_name: null
+offline_log_every: 10
+seed: null

configs/ucl_gpu_learning_behaviour.yaml ADDED

	@@ -0,0 +1,103 @@

+# ── Environments ──────────────────────────────────────────────────────
+id_envs:
+  - MiniHack-Room-Random-5x5-v0
+  - MiniHack-Room-Random-15x15-v0
+  - MiniHack-Corridor-R2-v0
+  - MiniHack-MazeWalk-9x9-v0
+ood_envs:
+  - MiniHack-Room-Dark-15x15-v0
+  - MiniHack-Corridor-R5-v0
+  - MiniHack-MazeWalk-45x19-v0
+crop_size: 9
+map_h: 21
+map_w: 79
+action_dim: 12
+mask_token: 12
+pad_token: 13
+# ── Model ─────────────────────────────────────────────────────────────
+n_embd: 256
+n_head: 4
+n_layer: 4
+n_global_tokens: 8
+seq_len: 64
+global_gate_init: -3.0
+# Transformer dropout. 0.0 is deliberate — discrete diffusion forward masking
+# already regularises; dropout on top is redundant.
+dropout: 0.0
+ema_decay: 0.999
+# ── Diffusion (MDLM) ─────────────────────────────────────────────────
+noise_schedule: linear
+num_diffusion_steps: 100
+loss_weight_clip: 1000.0
+label_smoothing: 0.0
+# Use SUBS importance weighting w(t) in loss. Off by default (flat average
+# matching reference). Enable for MDLM ELBO experiments.
+use_importance_weighting: false
+# ReMDM stochastic remask base fraction
+eta: 0.18
+# Remasking strategy: rescale | cap | conf
+remask_strategy: conf
+# ── Inference ─────────────────────────────────────────────────────────
+# Number of reverse denoising steps at inference.
+# Reference uses 5 (aggressive). Higher = better quality, slower.
+diffusion_steps_eval: 10
+diffusion_steps_collect: 5
+temperature: 0.5
+top_k: 4
+replan_every: 16
+# Soft-penalise hazardous cardinal actions during stochastic sampling.
+# Not active in the reference evaluation pipeline; off by default.
+physics_aware_sampling: false
+# ── Training budget (unified) ────────────────────────────────────────
+total_timesteps: 20000000
+id_eval_every_timesteps: 250000
+ood_eval_every_timesteps: 250000
+checkpoint_every_timesteps: 1250000
+# ── Offline BC ────────────────────────────────────────────────────────
+offline_lr: 0.0003
+offline_batch_size: 6144
+offline_grad_clip: 1.0
+aux_loss_weight: 0.5
+# ── DAgger ────────────────────────────────────────────────────────────
+dagger_lr: 0.00003
+dagger_batch_size: 6144
+dagger_grad_clip: 1.0
+weight_decay: 0.0001
+buffer_capacity: 10000
+episodes_per_iteration: 30
+grad_steps_per_iteration: 100
+efficiency_multiplier: 1.5
+curriculum_queue_size: 100
+# Pre-seed curriculum queues with 50/50 prior for uniform early sampling.
+curriculum_preseed: true
+eval_episodes_per_env: 50
+checkpoint_eval_episodes: 50
+# ── Performance ──────────────────────────────────────────────────────
+# Mixed-precision (FP16) training via torch.cuda.amp.
+# Speeds up forward/backward ~1.5-2x on GPU. No effect on CPU.
+use_amp: true
+# torch.compile the model for fused kernels (experimental).
+# May cause slow first iteration due to compilation. No effect on CPU.
+torch_compile: true
+# Number of parallel workers for DAgger episode collection.
+# 0 = sequential (reference behaviour). Recommended: 4-8 on multi-core.
+num_collection_workers: 8
+# ── Checkpointing & Logging ──────────────────────────────────────────
+checkpoint_dir: checkpoints_ucl_learning_behaviour
+save_policy: true
+hub_run_id: null
+hub_repo_id: null
+use_wandb: true
+wandb_project: remdm-minihack
+wandb_entity: "mathis-weil-university-college-london-ucl-"
+wandb_run_name: null
+offline_log_every: 10
+seed: null

environments/.gitkeep ADDED

File without changes

main.py ADDED

	@@ -0,0 +1,255 @@

+from __future__ import annotations
+import argparse
+import logging
+import random
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+from src.config import load_config
+from src.planners.baselines import ALL_BASELINE_ALGOS, run_baselines
+from src.planners.logging import Logger
+from src.planners.offline import run_offline
+from src.planners.online import run_dagger
+from src.planners.inference import run_inference
+from src.planners.collect_oracle import run_collect
+from src.planners.smoke import run_smoke
+# =============================================================================
+# Logging
+# =============================================================================
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+)
+logger = logging.getLogger(__name__)
+# =============================================================================
+# Utils
+# =============================================================================
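+def _parse_overrides(extras: list[str]) -> dict[str, Any]:
+    """Turn leftover ``key=value`` CLI tokens into an override dict.
+    Leading dashes are stripped from keys, and tokens without ``=`` are
+    ignored. Values stay strings here; ``load_config`` casts them.
+    """
+    return {
+        k.lstrip("-"): v
+        for item in extras if "=" in item
+        for k, v in [item.split("=", 1)]
+    }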
+def _set_seed(seed: int | None) -> int:
+    if seed is None:
+        seed = random.randint(0, 2**31 - 1)
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+    return seed
+# =============================================================================
+# CLI
+# =============================================================================
+def parse_args() -> tuple[argparse.Namespace, list[str]]:
+    parser = argparse.ArgumentParser(
+        description="ReMDM-MiniHack: Masked Diffusion Planner",
+    )
+    parser.add_argument(
+        "--mode",
+        required=True,
+        choices=[
+            "smoke", "offline", "dagger", "inference", "collect", "baselines",
+        ],
+    )
+    parser.add_argument("--config", default="configs/defaults.yaml")
+    parser.add_argument(
+        "--algo", default=None, choices=list(ALL_BASELINE_ALGOS),
+        help="Baseline algorithm (required for --mode baselines)",
+    )
+    parser.add_argument(
+        "--seeds", type=int, nargs="+", default=None,
+        help=(
+            "Explicit list of seeds for --mode baselines "
+            "(e.g. --seeds 0 1 2)."
+        ),
+    )
+    parser.add_argument(
+        "--n-seeds", type=int, default=None,
+        help=(
+            "Number of seeds starting from 0 (alternative to --seeds; "
+            "only used by --mode baselines)."
+        ),
+    )
+    parser.add_argument("--data", default=None)
+    parser.add_argument("--checkpoint", default=None)
+    parser.add_argument(
+        "--wandb-artifact", default=None,
+        help=(
+            "W&B artifact reference to download as checkpoint, e.g. "
+            "'entity/project/checkpoint-iter1000:latest'"
+        ),
+    )
+    parser.add_argument("--no-warm-start", action="store_true")
+    parser.add_argument("--no-ema", action="store_true")
+    parser.add_argument("--envs", nargs="+", default=None)
+    parser.add_argument(
+        "--des", nargs="+", default=None,
+        help="Paths to .des scenario files for custom environment evaluation",
+    )
+    parser.add_argument("--episodes", type=int, default=50)
+    parser.add_argument("--output", default=None)
+    parser.add_argument(
+        "--blind-global", action="store_true",
+        help="Zero out global map observations (local-only ablation)",
+    )
+    return parser.parse_known_args()
+# =============================================================================
+# Config
+# =============================================================================
+def build_config(args, extras):
+    config_path = args.config
+    if args.mode == "smoke" and config_path == "configs/defaults.yaml":
+        config_path = "configs/smoke.yaml"
+    cfg = load_config(config_path, _parse_overrides(extras))
+    seed = _set_seed(cfg.seed)
+    logger.info("Seed: %s", seed)
+    return cfg
+# =============================================================================
+# Validation
+# =============================================================================
+def validate(args) -> None:
+    if args.mode == "inference" and not args.checkpoint and not args.wandb_artifact:
+        raise ValueError(
+            "--checkpoint or --wandb-artifact required for inference mode"
+        )
+    if args.mode == "baselines" and args.algo is None:
+        raise ValueError(
+            "--algo is required for --mode baselines "
+            f"(choose one of {list(ALL_BASELINE_ALGOS)})"
+        )
+def _resolve_seeds(args, cfg) -> list[int]:
+    """Build the seed list for --mode baselines."""
+    if args.seeds is not None:
+        return list(args.seeds)
+    if args.n_seeds is not None:
+        return list(range(int(args.n_seeds)))
+    return [cfg.seed if cfg.seed is not None else 0]
+# =============================================================================
+# Dispatch (no lambdas, cleaner)
+# =============================================================================
+def _resolve_path(p: str | None) -> str | None:
+    """Resolve a user-provided path to absolute, or return None."""
+    if p is None:
+        return None
+    return str(Path(p).resolve())
+def _resolve_checkpoint(args, cfg) -> str | None:
+    """Return a local checkpoint path from --checkpoint or --wandb-artifact."""
+    if args.checkpoint:
+        return _resolve_path(args.checkpoint)
+    artifact_ref = args.wandb_artifact
+    if artifact_ref:
+        from src.planners.logging import download_artifact
+        path = download_artifact(artifact_ref)
+        if path is None:
+            raise RuntimeError(
+                f"Failed to download W&B artifact: {artifact_ref}"
+            )
+        return path
+    return None
+def run_mode(mode: str, cfg, args) -> None:
+    data_path = _resolve_path(args.data)
+    output_path = _resolve_path(args.output)
+    des_files = (
+        [str(Path(d).resolve()) for d in args.des]
+        if args.des else None
+    )
+    if mode == "smoke":
+        run_smoke(cfg)
+    elif mode == "offline":
+        ckpt = _resolve_checkpoint(args, cfg)
+        run_offline(cfg, data_path, checkpoint_path=ckpt)
+    elif mode == "dagger":
+        ckpt = _resolve_checkpoint(args, cfg)
+        run_dagger(cfg, ckpt, args.no_warm_start)
+    elif mode == "collect":
+        run_collect(cfg)
+    elif mode == "baselines":
+        run_baselines(
+            cfg,
+            algo=args.algo,
+            seeds=_resolve_seeds(args, cfg),
+            output_path=output_path,
+        )
+    elif mode == "inference":
+        ckpt = _resolve_checkpoint(args, cfg)
+        if ckpt is None:
+            raise ValueError(
+                "--checkpoint or --wandb-artifact required for inference"
+            )
+        log = Logger(cfg)
+        run_inference(
+            cfg,
+            ckpt,
+            args.envs,
+            args.episodes,
+            output_path,
+            not args.no_ema,
+            log=log,
+            des_files=des_files,
+            blind_global=args.blind_global,
+        )
+        log.finish()
+# =============================================================================
+# Entry point
+# =============================================================================
+def main() -> None:
+    args, extras = parse_args()
+    validate(args)
+    cfg = build_config(args, extras)
+    if torch.cuda.is_available():
+        torch.set_float32_matmul_precision("high")
+    run_mode(args.mode, cfg, args)
+if __name__ == "__main__":
+    main()
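The override path in `main.py` (unknown CLI tokens are collected by `parse_known_args`, turned into a dict, and cast inside `load_config`) can be sketched in isolation as follows. The two functions mirror `_parse_overrides` and `_cast_value` from this commit under public names chosen here for illustration; the example tokens are made up.

```python
from typing import Any


def parse_overrides(extras: list[str]) -> dict[str, str]:
    # Keep only key=value tokens; strip leading dashes from the key and
    # split on the first "=" so values may themselves contain "=".
    return {
        k.lstrip("-"): v
        for item in extras if "=" in item
        for k, v in [item.split("=", 1)]
    }


def cast_value(value: str) -> Any:
    # Best-effort cast: bool, None, int, float, then fall back to str.
    lowered = value.lower()
    if lowered in ("true", "yes"):
        return True
    if lowered in ("false", "no"):
        return False
    if lowered == "null":
        return None
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            continue
    return value


# Illustrative tokens; "stray" has no "=" and is dropped silently.
extras = ["--seed=7", "dagger_lr=3e-5", "use_wandb=false",
          "wandb_run_name=a=b", "stray"]
overrides = {k: cast_value(v) for k, v in parse_overrides(extras).items()}
print(overrides)
```

One consequence of this design is that a space-separated override such as `--seed 7` produces two tokens without `=`, so both are ignored rather than reported as an error.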

pyproject.toml ADDED

	@@ -0,0 +1,22 @@

+[project]
+name = "minihack-remdm-planner"
+version = "0.1.0"
+description = "ReMDM-MiniHack: Masked Diffusion Planner"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "huggingface-hub>=1.8.0",
+    "ipython>=9.12.0",
+    "matplotlib>=3.10.8",
+    "minihack>=1.0.2",
+    "nle>=1.2.0",
+    "numpy>=2.4.4",
+    "orjson>=3.11.8",
+    "polars>=1.39.3",
+    "pyyaml>=6.0.3",
+    "sb3-contrib>=2.8.0",
+    "scipy>=1.17.1",
+    "stable-baselines3>=2.8.0",
+    "torch>=2.11.0",
+    "wandb>=0.25.1",
+]

src/__init__.py ADDED

File without changes

src/buffer.py ADDED

	@@ -0,0 +1,268 @@

+"""Replay buffer with offline-protected FIFO eviction.
+Ported from minihack_reference/src/buffer.py. Stores observation-action
+windows of fixed length ``seq_len``. Offline data is pinned at the front
+and never evicted; online samples use FIFO.
+"""
+from __future__ import annotations
+import numpy as np
+class ReplayBuffer:
+    """Fixed-capacity buffer with offline-protected FIFO eviction.
+    Offline samples (loaded once via ``load_offline_data``) are pinned
+    and never evicted. Online samples added via ``add`` are FIFO-evicted
+    when the total count exceeds ``capacity``.
+    Args:
+        capacity: Maximum total number of windows.
+        seq_len: Action-sequence window length.
+        pad_token: Token used to pad short sequences.
+    """
+    def __init__(
+        self, capacity: int, seq_len: int, pad_token: int,
+    ) -> None:
+        self._capacity = capacity
+        self._seq_len = seq_len
+        self._pad_token = pad_token
+        # Each element: (local [9,9], global [21,79], actions [seq_len])
+        self._offline: list[tuple[np.ndarray, np.ndarray, np.ndarray]] = []
+        self._online: list[tuple[np.ndarray, np.ndarray, np.ndarray]] = []
+        # Stacked array cache for fast sampling
+        self._cache_valid = False
+        self._cached_local: np.ndarray | None = None
+        self._cached_global: np.ndarray | None = None
+        self._cached_actions: np.ndarray | None = None
+    # ── Offline data ─────────────────────────────────────────────
+    def load_offline_data(
+        self,
+        data: dict | list,
+        allowed_envs: list[str],
+        metadata: dict | None = None,
+    ) -> None:
+        """Load pre-collected trajectories and slice into windows.
+        Supports two dataset formats:
+        **New format** (dict): ``{"trajectories": [...]}`` where each entry
+        is a dict with ``"local"``, ``"global"``, ``"actions"``, ``"env_id"``.
+        **Legacy format** (list): Flat list of ``((local, global), action_seq)``
+        tuples produced by the reference pipeline (pre-windowed, already
+        ``seq_len``-length). Env filtering uses an optional *metadata* dict
+        with a ``"samples_per_env"`` key mapping env IDs to sample counts.
+        Args:
+            data: Dataset in new dict format or legacy list format.
+            allowed_envs: Only samples from these env IDs are kept.
+            metadata: Optional sidecar metadata for legacy format env
+                filtering. Ignored for the new format.
+        """
+        if isinstance(data, list):
+            self._load_legacy_offline_data(data, allowed_envs, metadata)
+            return
+        trajectories = data.get("trajectories", [data])
+        for traj in trajectories:
+            if traj.get("env_id", "") not in allowed_envs:
+                continue
+            windows = self._slice_trajectory(traj)
+            self._offline.extend(windows)
+        # Truncate to capacity
+        if len(self._offline) > self._capacity:
+            self._offline = self._offline[: self._capacity]
+        self._invalidate_cache()
+    def _load_legacy_offline_data(
+        self,
+        data: list,
+        allowed_envs: list[str],
+        metadata: dict | None = None,
+    ) -> None:
+        """Load reference-format datasets (pre-windowed tuples).
+        Args:
+            data: List of ``((local_crop, global_map), action_seq)`` tuples.
+                ``local_crop`` is ``[9, 9]``, ``global_map`` is ``[21, 79]``,
+                ``action_seq`` is a sequence of length ``seq_len``.
+            allowed_envs: Env IDs to retain.
+            metadata: Optional dict with ``"samples_per_env"`` key mapping
+                env IDs to per-env sample counts for precise filtering.
+        """
+        allowed = set(allowed_envs)
+        if metadata and "samples_per_env" in metadata:
+            # Build a per-sample env_id index from the metadata ordering
+            sample_to_env: list[str] = []
+            for env_id in sorted(metadata["samples_per_env"].keys()):
+                count = metadata["samples_per_env"][env_id]
+                sample_to_env.extend([env_id] * count)
+            for i, sample in enumerate(data):
+                env_id = (
+                    sample_to_env[i] if i < len(sample_to_env) else None
+                )
+                if env_id is None or env_id in allowed:
+                    self._offline.append(self._unpack_legacy_sample(sample))
+        else:
+            # No metadata — keep all samples (caller is responsible for
+            # pre-filtering)
+            for sample in data:
+                self._offline.append(self._unpack_legacy_sample(sample))
+        if len(self._offline) > self._capacity:
+            self._offline = self._offline[: self._capacity]
+        self._invalidate_cache()
+    @staticmethod
+    def _unpack_legacy_sample(
+        sample: tuple,
+    ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+        """Convert a legacy ``((local, global), action_seq)`` sample.
+        Args:
+            sample: Tuple of ``(state, action_seq)`` where state is
+                ``(local_crop, global_map)``.
+        Returns:
+            ``(local [9,9], global [21,79], actions [seq_len])`` as
+            numpy int16/int64 arrays.
+        """
+        (local, glb), action_seq = sample
+        return (
+            np.asarray(local, dtype=np.int16),
+            np.asarray(glb, dtype=np.int16),
+            np.asarray(action_seq, dtype=np.int64),
+        )
+    # ── Online data ──────────────────────────────────────────────
+    def _invalidate_cache(self) -> None:
+        """Mark the stacked array cache as stale."""
+        self._cache_valid = False
+    def _ensure_cache(self) -> None:
+        """Rebuild stacked arrays from offline + online windows."""
+        if self._cache_valid:
+            return
+        combined = self._offline + self._online
+        if not combined:
+            return
+        n = len(combined)
+        l0, g0, a0 = combined[0]
+        self._cached_local = np.empty(
+            (n, *l0.shape), dtype=l0.dtype,
+        )
+        self._cached_global = np.empty(
+            (n, *g0.shape), dtype=g0.dtype,
+        )
+        self._cached_actions = np.empty(
+            (n, *a0.shape), dtype=a0.dtype,
+        )
+        for i, (l, g, a) in enumerate(combined):
+            self._cached_local[i] = l
+            self._cached_global[i] = g
+            self._cached_actions[i] = a
+        self._cache_valid = True
+    def add(self, trajectory: dict) -> None:
+        """Add a trajectory, sliced into overlapping windows.
+        FIFO-evicts oldest online samples when over capacity.
+        Args:
+            trajectory: Dict with ``"local"`` ``[T,9,9]``,
+                ``"global"`` ``[T,21,79]``, ``"actions"`` ``[T]``.
+        """
+        windows = self._slice_trajectory(trajectory)
+        self._online.extend(windows)
+        max_online = self._capacity - len(self._offline)
+        if len(self._online) > max_online:
+            excess = len(self._online) - max_online
+            self._online = self._online[excess:]
+        self._invalidate_cache()
+    # ── Sampling ─────────────────────────────────────────────────
+    def sample(
+        self, batch_size: int,
+    ) -> tuple[np.ndarray, np.ndarray, np.ndarray] | None:
+        """Random sample from offline + online combined.
+        Args:
+            batch_size: Number of windows to sample.
+        Returns:
+            ``(local [B,9,9], global [B,21,79], actions [B,seq_len])``
+            as numpy arrays, or ``None`` if the buffer is empty.
+        """
+        if len(self) == 0:
+            return None
+        self._ensure_cache()
+        if self._cached_local is None:
+            return None
+        indices = np.random.randint(0, len(self), size=batch_size)
+        return (
+            self._cached_local[indices],
+            self._cached_global[indices],
+            self._cached_actions[indices],
+        )
+    # ── Properties ───────────────────────────────────────────────
+    def __len__(self) -> int:
+        """Total number of windows (offline + online)."""
+        return len(self._offline) + len(self._online)
+    @property
+    def n_offline(self) -> int:
+        """Number of pinned offline windows."""
+        return len(self._offline)
+    @property
+    def offline_size(self) -> int:
+        """Number of pinned offline windows (alias)."""
+        return len(self._offline)
+    # ── Internals ────────────────────────────────────────────────
+    def _slice_trajectory(
+        self, traj: dict,
+    ) -> list[tuple[np.ndarray, np.ndarray, np.ndarray]]:
+        """Slice a trajectory into overlapping seq_len windows.
+        Args:
+            traj: Trajectory dict with ``"local"``, ``"global"``,
+                ``"actions"`` arrays.
+        Returns:
+            List of ``(local, global, actions)`` tuples.
+        """
+        local_arr = np.asarray(traj["local"])
+        global_arr = np.asarray(traj["global"])
+        actions_arr = np.asarray(traj["actions"])
+        T = len(actions_arr)
+        windows: list[tuple[np.ndarray, np.ndarray, np.ndarray]] = []
+        for start in range(T):
+            end = start + self._seq_len
+            if end <= T:
+                # Copy as int64 so full and padded windows share a dtype
+                # and do not keep a view into the whole trajectory.
+                a = actions_arr[start:end].astype(np.int64)
+            else:
+                # Tail window: right-pad with pad_token up to seq_len.
+                a = np.full(self._seq_len, self._pad_token, dtype=np.int64)
+                a[: T - start] = actions_arr[start:]
+            # Use the observation at the window start
+            l = local_arr[min(start, len(local_arr) - 1)]
+            g = global_arr[min(start, len(global_arr) - 1)]
+            windows.append((l.copy(), g.copy(), a))
+        return windows
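The windowing and eviction rules in `ReplayBuffer` can be illustrated with a minimal pure-Python sketch. It works on plain lists instead of NumPy arrays and ignores observations; `slice_windows` and `evict_fifo` are hypothetical names that only mirror the logic of `_slice_trajectory` and `add`.

```python
def slice_windows(
    actions: list[int], seq_len: int, pad_token: int,
) -> list[list[int]]:
    # One window per start index; tail windows are right-padded.
    windows = []
    for start in range(len(actions)):
        window = actions[start:start + seq_len]
        window = window + [pad_token] * (seq_len - len(window))
        windows.append(window)
    return windows


def evict_fifo(online: list, n_offline: int, capacity: int) -> list:
    # Offline windows are pinned, so only the online part shrinks.
    max_online = capacity - n_offline
    excess = len(online) - max_online
    return online[excess:] if excess > 0 else online


# A 4-step trajectory with seq_len=3 and pad_token=13 (the config value).
windows = slice_windows([0, 1, 2, 3], seq_len=3, pad_token=13)
print(windows)

# Six online windows, two pinned offline windows, capacity five:
# the three oldest online windows are dropped.
kept = evict_fifo(list(range(6)), n_offline=2, capacity=5)
print(kept)
```

Because every start index yields a window, a trajectory of length `T` always contributes exactly `T` windows, and the last `seq_len - 1` of them contain padding.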

src/config.py ADDED

	@@ -0,0 +1,164 @@

+"""Configuration loader for ReMDM-MiniHack.
+Loads YAML configs with deep-merge and CLI override support,
+following the Craftax config pattern.
+"""
+from __future__ import annotations
+import logging
+import os
+import secrets
+from datetime import datetime, timezone
+from pathlib import Path
+from types import SimpleNamespace
+import yaml
+logger = logging.getLogger(__name__)
+_PROJECT_ROOT = Path(__file__).resolve().parent.parent
+def _deep_merge(base: dict, override: dict) -> dict:
+    """Recursively merge *override* into *base* (mutates *base*).
+    Args:
+        base: Base dictionary to merge into.
+        override: Dictionary whose values take precedence.
+    Returns:
+        The merged dictionary (same object as *base*).
+    """
+    for key, value in override.items():
+        if (
+            key in base
+            and isinstance(base[key], dict)
+            and isinstance(value, dict)
+        ):
+            _deep_merge(base[key], value)
+        else:
+            base[key] = value
+    return base
+def _cast_value(value: str) -> int | float | bool | str | None:
+    """Best-effort cast of a CLI string to a Python scalar.
+    Args:
+        value: Raw string from the command line.
+    Returns:
+        Parsed Python value (int, float, bool, str, or None).
+    """
+    if value.lower() in ("true", "yes"):
+        return True
+    if value.lower() in ("false", "no"):
+        return False
+    if value.lower() == "null":
+        return None
+    try:
+        return int(value)
+    except ValueError:
+        pass
+    try:
+        return float(value)
+    except ValueError:
+        pass
+    return value
+def load_config(
+    config_path: str | None = None,
+    cli_overrides: dict | None = None,
+) -> SimpleNamespace:
+    """Load configuration from YAML with optional overrides.
+    1. Load ``configs/defaults.yaml``.
+    2. Deep-merge *config_path* on top (if provided and different from defaults).
+    3. Apply *cli_overrides* key=value pairs.
+    4. Auto-select device (``cuda`` if available, else ``cpu``; honour
+       ``DEVICE`` env-var).
+    5. Validate invariants.
+    Args:
+        config_path: Path to a YAML file merged on top of defaults.
+            ``None`` uses defaults only.
+        cli_overrides: ``{key: value}`` pairs applied last.
+    Returns:
+        A ``SimpleNamespace`` containing all hyperparameters.
+    Raises:
+        AssertionError: If ``mask_token != action_dim`` or
+            ``pad_token != action_dim + 1``.
+    """
+    if cli_overrides is None:
+        cli_overrides = {}
+    defaults_path = _PROJECT_ROOT / "configs" / "defaults.yaml"
+    with open(defaults_path, "r") as fh:
+        cfg = yaml.safe_load(fh)
+    if config_path is not None:
+        config_path_resolved = Path(config_path)
+        if not config_path_resolved.is_absolute():
+            config_path_resolved = _PROJECT_ROOT / config_path_resolved
+        if config_path_resolved.resolve() != defaults_path.resolve():
+            with open(config_path_resolved, "r") as fh:
+                overrides = yaml.safe_load(fh) or {}
+            _deep_merge(cfg, overrides)
+    for key, value in cli_overrides.items():
+        if isinstance(value, str):
+            value = _cast_value(value)
+        cfg[key] = value
+    # Device selection
+    env_device = os.environ.get("DEVICE")
+    if env_device:
+        cfg["device"] = env_device
+    elif "device" not in cfg:
+        try:
+            import torch
+            cfg["device"] = "cuda" if torch.cuda.is_available() else "cpu"
+        except ImportError:
+            cfg["device"] = "cpu"
+    ns = SimpleNamespace(**cfg)
+    # Validation
+    assert ns.mask_token == ns.action_dim, (
+        f"mask_token ({ns.mask_token}) must equal action_dim ({ns.action_dim})"
+    )
+    assert ns.pad_token == ns.action_dim + 1, (
+        f"pad_token ({ns.pad_token}) must equal action_dim + 1 "
+        f"({ns.action_dim + 1})"
+    )
+    return ns
+def make_run_dir(cfg: SimpleNamespace, tag: str = "run") -> Path:
+    """Create a unique run subdirectory under ``cfg.checkpoint_dir``.
+    Generates a directory named ``{tag}_{YYYYMMDD}_{HHMMSS}_{hex4}``
+    to prevent concurrent runs from overwriting each other's
+    checkpoints. Updates ``cfg.checkpoint_dir`` in place.
+    Args:
+        cfg: Config namespace (``checkpoint_dir`` is mutated).
+        tag: Prefix for the directory name (e.g. ``"dagger"``,
+            ``"offline"``).
+    Returns:
+        The created directory path.
+    """
+    ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
+    suffix = secrets.token_hex(2)
+    run_dir = Path(cfg.checkpoint_dir).resolve() / f"{tag}_{ts}_{suffix}"
+    run_dir.mkdir(parents=True, exist_ok=True)
+    cfg.checkpoint_dir = str(run_dir)
+    logger.info("Checkpoint directory: %s", run_dir)
+    return run_dir
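The layering that `load_config` performs (defaults, then a config file deep-merged on top, then CLI overrides assigned last) can be sketched without touching the filesystem. The dictionaries below are illustrative stand-ins, not the contents of `configs/defaults.yaml`, and the nested `model` key is a hypothetical example: the configs in this commit are flat.

```python
def deep_merge(base: dict, override: dict) -> dict:
    # Recursively merge override into base; base is mutated and returned.
    for key, value in override.items():
        if (
            key in base
            and isinstance(base[key], dict)
            and isinstance(value, dict)
        ):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base


# Illustrative values only (assumed, not read from the repository).
defaults = {
    "buffer_capacity": 10000,
    "use_wandb": True,
    "model": {"n_embd": 256, "n_head": 4},
}
file_overrides = {
    "buffer_capacity": 50,
    "use_wandb": False,
    "model": {"n_embd": 384},
}

cfg = deep_merge(defaults, file_overrides)

# CLI overrides are applied with plain assignment, so they replace a
# top-level key outright instead of merging into it.
cfg["buffer_capacity"] = 100
print(cfg)
```

Nested keys missing from the override file keep their default values, whereas a CLI override always replaces the whole top-level entry.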

src/curriculum.py ADDED

	@@ -0,0 +1,143 @@

+"""Dynamic environment curriculum and efficiency filter.
+Ported from minihack_reference/src/curriculum.py. Tracks per-environment
+win rates in a rolling window and uses bucket-based sampling weights to
+focus training on environments where the model is struggling.
+"""
+from __future__ import annotations
+import random
+from collections import deque
+class DynamicCurriculum:
+    """Rolling-window curriculum with bucket-based sampling weights.
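+    Each environment maintains a deque of recent win/loss outcomes.
+    Sampling weights are assigned per bucket: environments with a
+    mid-range win rate (0.15 to 0.85) get weight 1.0, rarely solved
+    ones (below 0.15) get 0.2, and mostly solved ones (above 0.85)
+    get 0.1.
+    Args:
+        env_ids: List of environment IDs to track.
+        queue_size: Rolling window size per environment.
+        preseed: If ``True``, fill each window with a 50/50 win/loss
+            prior so that early sampling is uniform.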
+    """
+    # Bucket thresholds and weights
+    _LOW_THRESHOLD = 0.15
+    _HIGH_THRESHOLD = 0.85
+    _WEIGHT_LOW = 0.2
+    _WEIGHT_MID = 1.0
+    _WEIGHT_HIGH = 0.1
+    def __init__(
+        self,
+        env_ids: list[str],
+        queue_size: int = 100,
+        preseed: bool = True,
+    ) -> None:
+        self._env_ids = list(env_ids)
+        self._queue_size = queue_size
+        self._queues: dict[str, deque[bool]] = {}
+        for eid in self._env_ids:
+            q: deque[bool] = deque(maxlen=queue_size)
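+            if preseed:
+                # 50/50 prior for uniform early sampling. Scaled to the
+                # window so a queue_size below 100 does not evict the wins.
+                half = queue_size // 2
+                for _ in range(half):
+                    q.append(True)
+                for _ in range(half):
+                    q.append(False)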
+            self._queues[eid] = q
+    def update(self, env_id: str, won: bool) -> None:
+        """Record an episode outcome.
+        Args:
+            env_id: Environment ID.
+            won: Whether the episode was won.
+        """
+        if env_id not in self._queues:
+            self._queues[env_id] = deque(maxlen=self._queue_size)
+        self._queues[env_id].append(won)
+    def win_rate(self, env_id: str) -> float:
+        """Rolling win rate for an environment.
+        Args:
+            env_id: Environment ID.
+        Returns:
+            Win rate in ``[0, 1]``. Default 0.5 if empty.
+        """
+        q = self._queues.get(env_id)
+        if q is None or len(q) == 0:
+            return 0.5
+        return sum(q) / len(q)
+    def sample_env(self) -> str:
+        """Sample an environment ID using bucket-weighted probabilities.
+        Returns:
+            Sampled environment ID.
+        """
+        weights: list[float] = []
+        for eid in self._env_ids:
+            w = self.win_rate(eid)
+            if w < self._LOW_THRESHOLD:
+                weights.append(self._WEIGHT_LOW)
+            elif w > self._HIGH_THRESHOLD:
+                weights.append(self._WEIGHT_HIGH)
+            else:
+                weights.append(self._WEIGHT_MID)
+        return random.choices(self._env_ids, weights=weights, k=1)[0]
+    def state_dict(self) -> dict:
+        """Serialise curriculum state.
+        Returns:
+            Dict with ``env_ids``, ``queue_size``, and per-env queues.
+        """
+        return {
+            "env_ids": self._env_ids,
+            "queue_size": self._queue_size,
+            "queues": {
+                eid: list(q) for eid, q in self._queues.items()
+            },
+        }
+    def load_state_dict(self, sd: dict) -> None:
+        """Restore curriculum state.
+        Args:
+            sd: State dict from ``state_dict()``.
+        """
+        self._queue_size = sd.get("queue_size", self._queue_size)
+        for eid, items in sd.get("queues", {}).items():
+            q: deque[bool] = deque(maxlen=self._queue_size)
+            q.extend(items)
+            self._queues[eid] = q
+def efficiency_filter(
+    model_won: bool,
+    model_steps: int,
+    oracle_steps: int,
+    multiplier: float = 1.5,
+) -> bool:
+    """Decide whether to add oracle trajectory to the buffer.
+    Returns ``True`` (add oracle data) when the model either failed
+    or was substantially less efficient than the oracle.
+    Args:
+        model_won: Whether the model solved the episode.
+        model_steps: Steps the model took.
+        oracle_steps: Steps the oracle took.
+        multiplier: Efficiency threshold multiplier.
+    Returns:
+        ``True`` if oracle data should be added to the buffer.
+    """
+    if not model_won:
+        return True
+    return model_steps > multiplier * oracle_steps
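
To make the bucket weighting concrete, here is a small standalone sketch of how rolling win rates turn into sampling probabilities. The thresholds and weights are copied from `DynamicCurriculum`; the helper names and the three example environments are hypothetical.

```python
def bucket_weight(win_rate: float) -> float:
    """Map a rolling win rate to its bucket weight."""
    if win_rate < 0.15:   # rarely solved
        return 0.2
    if win_rate > 0.85:   # mostly solved
        return 0.1
    return 1.0            # mid-range: sampled most often


def sampling_probs(win_rates: dict[str, float]) -> dict[str, float]:
    """Normalise bucket weights into probabilities."""
    weights = {eid: bucket_weight(w) for eid, w in win_rates.items()}
    total = sum(weights.values())
    return {eid: w / total for eid, w in weights.items()}


# Hypothetical win rates for three environments
probs = sampling_probs({"hard": 0.05, "mid": 0.50, "easy": 0.95})
for env_id, p in probs.items():
    print(f"{env_id}: {p:.3f}")
```

With these inputs the mid-range environment takes roughly three quarters of the samples, and the rarely solved one is drawn about twice as often as the mostly solved one.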

src/diffusion/__init__.py ADDED Viewed

File without changes

src/diffusion/forward.py ADDED Viewed

	@@ -0,0 +1,50 @@

+"""Forward masking process q(z_t | x_0).
+Ported from the Craftax JAX implementation (src/diffusion/forward.py).
+Each token is independently replaced with mask_token with probability
+sigma_t = 1 - alpha_t. PAD positions are never masked.
+"""
+from __future__ import annotations
+from typing import Callable
+import torch
+from torch import Tensor
+def q_sample(
+    x0: Tensor,
+    t: Tensor,
+    mask_token: int,
+    pad_token: int,
+    schedule_fn: Callable[[Tensor], Tensor],
+) -> Tensor:
+    """Sample z_t from the forward masking process.
+    Args:
+        x0: Clean action sequences. Shape ``[B, L]``, dtype int64.
+        t: Per-sample diffusion time in [0, 1]. Shape ``[B]``.
+        mask_token: Integer ID of the MASK token.
+        pad_token: Integer ID of the PAD token.
+        schedule_fn: Noise schedule returning alpha(t).
+    Returns:
+        Noisy sequence z_t. Shape ``[B, L]``, dtype int64.
+        PAD positions are preserved unchanged.
+    """
+    alpha_t = schedule_fn(t)  # [B]
+    sigma_t = 1.0 - alpha_t  # mask probability per sample
+    sigma_t = sigma_t.unsqueeze(-1)  # [B, 1]
+    # Independent Bernoulli masking per position
+    mask_draws = torch.rand_like(x0, dtype=torch.float32)  # [B, L]
+    do_mask = mask_draws < sigma_t  # [B, L]
+    zt = torch.where(do_mask, mask_token, x0)
+    # Restore PAD positions — never mask padding
+    pad_mask = x0 == pad_token  # [B, L]
+    zt = torch.where(pad_mask, pad_token, zt)
+    return zt
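
The masking rule in `q_sample` can be sketched for a single sequence in plain Python, with no PyTorch dependency. The token IDs assume a hypothetical `action_dim` of 4, and `q_sample_py` is an illustrative name rather than anything in the repo.

```python
import random

# Hypothetical IDs for action_dim = 4: mask = action_dim, pad = action_dim + 1
MASK, PAD = 4, 5


def linear_alpha(t: float) -> float:
    """Linear schedule: fraction of tokens left unmasked at time t."""
    return 1.0 - t


def q_sample_py(x0, t, alpha, rng):
    """Mask each non-PAD token independently with probability 1 - alpha(t)."""
    sigma_t = 1.0 - alpha(t)
    zt = []
    for tok in x0:
        if tok == PAD:
            zt.append(PAD)       # padding is never masked
        elif rng.random() < sigma_t:
            zt.append(MASK)
        else:
            zt.append(tok)
    return zt


rng = random.Random(0)
x0 = [0, 1, 2, 3, PAD, PAD]
clean = q_sample_py(x0, 0.0, linear_alpha, rng)  # sigma = 0: nothing masked
noisy = q_sample_py(x0, 1.0, linear_alpha, rng)  # sigma = 1: all actions masked
half = q_sample_py(x0, 0.5, linear_alpha, rng)   # each action masked w.p. 0.5
print(clean, noisy, half)
```

Unlike the tensor version, this sketch skips the random draw for PAD positions instead of restoring them afterwards; the resulting distribution over `zt` is the same.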

src/diffusion/loss.py ADDED Viewed

	@@ -0,0 +1,162 @@

+"""MDLM ELBO loss with SUBS parameterisation.
+Ported from the Craftax JAX implementation (src/diffusion/loss.py).
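+Computes cross-entropy on masked, non-PAD positions only. The
+continuous-time SUBS weighting is optional; when enabled, the weight
+uses a finite-difference derivative of the schedule and is clipped for
+numerical stability.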
+"""
+from __future__ import annotations
+from typing import Callable
+import torch
+import torch.nn.functional as F
+from torch import Tensor
+from src.diffusion.schedules import alpha_prime
+_MAX_WEIGHT: float = 1000.0
+def mdlm_loss(
+    logits: Tensor,
+    x0: Tensor,
+    zt: Tensor,
+    t: Tensor,
+    mask_token: int,
+    pad_token: int,
+    schedule_fn: Callable[[Tensor], Tensor],
+    weight_clip: float = _MAX_WEIGHT,
+    label_smoothing: float = 0.0,
+    use_importance_weighting: bool = False,
+) -> Tensor:
+    """Compute masked diffusion loss.
+    By default uses a simple masked cross-entropy average (matching the
+    reference implementation).  When ``use_importance_weighting=True``,
+    applies SUBS weighting ``w(t) = -alpha'(t) / (1 - alpha_t)``.
+    Args:
+        logits: Model output. Shape ``[B, L, vocab]``.
+        x0: Clean action sequences. Shape ``[B, L]``, int64.
+        zt: Noisy sequences. Shape ``[B, L]``, int64.
+        t: Per-sample diffusion time in [0, 1]. Shape ``[B]``.
+        mask_token: MASK token ID.
+        pad_token: PAD token ID.
+        schedule_fn: Noise schedule returning alpha(t).
+        weight_clip: Upper clamp for SUBS weight (default 1000).
+        label_smoothing: Smoothing epsilon for cross-entropy.
+        use_importance_weighting: If ``True``, apply SUBS w(t) per sample.
+    Returns:
+        Scalar loss. Returns ``0.0`` when no masked positions exist.
+    """
+    B, L, V = logits.shape
+    # Mask: compute loss only on masked, non-PAD positions
+    is_masked = (zt == mask_token) & (x0 != pad_token)  # [B, L]
+    if not is_masked.any():
+        return logits.new_tensor(0.0)
+    # Per-position cross-entropy
+    # Clamp targets to valid vocab range — out-of-range positions (PAD,
+    # MASK) will be zeroed out by is_masked anyway.
+    safe_targets = x0.clamp(0, V - 1)  # [B, L]
+    ce = F.cross_entropy(
+        logits.reshape(-1, V),
+        safe_targets.reshape(-1),
+        reduction="none",
+        label_smoothing=label_smoothing,
+    )  # [B*L]
+    ce = ce.reshape(B, L)  # [B, L]
+    # Zero out non-masked positions
+    ce = ce * is_masked.float()  # [B, L]
+    # Global average over all masked positions (matches reference)
+    n_masked_total = is_masked.float().sum().clamp(min=1.0)
+    loss = ce.sum() / n_masked_total
+    if use_importance_weighting:
+        # SUBS weight: w_t = -alpha'(t) / (1 - alpha_t + eps)
+        alpha_t = schedule_fn(t)  # [B]
+        d_alpha = alpha_prime(t, schedule_fn)  # [B]
+        w_t = (-d_alpha) / (1.0 - alpha_t + 1e-8)  # [B]
+        w_t = w_t.clamp(0.0, weight_clip)  # [B]
+        # Per-sample weighted loss (needed for SUBS)
+        n_masked_per = is_masked.float().sum(dim=1).clamp(min=1.0)  # [B]
+        per_sample = ce.sum(dim=1) / n_masked_per  # [B]
+        loss = (per_sample * w_t).mean()
+    return loss
+def auxiliary_goal_loss(
+    goal_pred: Tensor,
+    global_obs: Tensor,
+    pad_value: float = -1.0,
+) -> Tensor:
+    """MSE loss for auxiliary staircase-coordinate prediction.
+    Args:
+        goal_pred: Predicted normalised staircase coords. Shape ``[B, 2]``.
+        global_obs: Full map glyphs. Shape ``[B, 21, 79]``, int.
+        pad_value: Coordinate value used when staircase is not visible.
+    Returns:
+        Scalar MSE loss over samples where the staircase is visible.
+        Returns ``0.0`` when no staircase is visible in the batch.
+    """
+    targets = find_staircase_from_glyphs(global_obs)  # [B, 2]
+    targets = targets.to(goal_pred.device, dtype=goal_pred.dtype)
+    # Only supervise where staircase is visible
+    valid = (targets[:, 0] != pad_value)  # [B]
+    if not valid.any():
+        return goal_pred.new_tensor(0.0)
+    diff = (goal_pred[valid] - targets[valid]) ** 2  # [N, 2]
+    return diff.mean()
+def find_staircase_from_glyphs(global_obs: Tensor) -> Tensor:
+    """Locate the staircase '>' in the global glyph map.
+    Matches character code 62 ('>') as well as the NLE glyph IDs listed
+    in the body. If several cells match, the first one in row-major
+    order is used.
+    Returns normalised (row/(H-1), col/(W-1)) coordinates per batch
+    element, or (-1, -1) when the staircase is not visible.
+    Args:
+        global_obs: Glyph map. Shape ``[B, H, W]`` or ``[H, W]``, int.
+    Returns:
+        Normalised coordinates. Shape ``[B, 2]`` (float32).
+    """
+    if global_obs.ndim == 2:
+        global_obs = global_obs.unsqueeze(0)
+    B, H, W = global_obs.shape
+    # NLE staircase-down glyphs: ord('>') = 62, plus NLE tile variants
+    # 2310 (S_dnstair), 2368 (S_dnstairs), 2383 (S_vodoor).
+    is_stair = (
+        (global_obs == 62)
+        | (global_obs == 2310)
+        | (global_obs == 2368)
+        | (global_obs == 2383)
+    )
+    coords = torch.full(
+        (B, 2), -1.0, dtype=torch.float32, device=global_obs.device
+    )
+    for b in range(B):
+        positions = is_stair[b].nonzero(as_tuple=False)  # [N, 2]
+        if positions.shape[0] > 0:
+            row = positions[0, 0].float() / max(1, H - 1)
+            col = positions[0, 1].float() / max(1, W - 1)
+            coords[b, 0] = row
+            coords[b, 1] = col
+    return coords
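
For intuition about the SUBS weight `w(t) = -alpha'(t) / (1 - alpha_t)` and why it is clipped, the sketch below evaluates it in closed form for the two schedules in this repo. It uses analytic derivatives, whereas `mdlm_loss` obtains them numerically through `alpha_prime`; the function names here are hypothetical.

```python
import math


def subs_weight(t, alpha, d_alpha, clip=1000.0, eps=1e-8):
    """Clipped SUBS weight for a scalar diffusion time t."""
    w = -d_alpha(t) / (1.0 - alpha(t) + eps)
    return min(max(w, 0.0), clip)


# Linear schedule: alpha(t) = 1 - t, so w(t) = 1 / t
def lin(t):
    return 1.0 - t


def d_lin(t):
    return -1.0


# Cosine schedule: alpha(t) = cos^2(pi t / 2), alpha'(t) = -(pi / 2) sin(pi t)
def cos2(t):
    return math.cos(0.5 * math.pi * t) ** 2


def d_cos2(t):
    return -0.5 * math.pi * math.sin(math.pi * t)


w_lin_mid = subs_weight(0.5, lin, d_lin)      # close to 2
w_cos_mid = subs_weight(0.5, cos2, d_cos2)    # close to pi
w_lin_small = subs_weight(1e-4, lin, d_lin)   # unclipped value near 1e4
print(w_lin_mid, w_cos_mid, w_lin_small)
```

Both weights diverge as `t` approaches 0, which is the regime the `weight_clip` argument guards against.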

src/diffusion/sampling.py ADDED Viewed

	@@ -0,0 +1,398 @@

+"""ReMDM reverse denoising with remasking strategies.
+Ported from the Craftax JAX implementation (src/diffusion/sampling.py).
+Implements MaskGIT-style progressive unmasking with optional stochastic
+remasking (ReMDM) using three strategy variants.
+"""
+from __future__ import annotations
+from types import SimpleNamespace
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch import Tensor
+from torch.distributions import Categorical
+from src.diffusion.schedules import get_schedule
+# NLE hazard glyph IDs and char codes (walls, locked doors, lava, water)
+_HAZARD_GLYPHS: frozenset[int] = frozenset({2359, 2360, 2389, 2390})
+_HAZARD_CHARS: frozenset[int] = frozenset(
+    {ord("|"), ord("-"), ord("+"), ord("L"), ord("W")}
+)
+# Cardinal action → (dy, dx) offsets
+_CARDINAL_OFFSETS: dict[int, tuple[int, int]] = {
+    0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1),
+}
+_N_PHYSICS_CHECK = 8  # only inspect the first N plan positions
+def _check_hazard(local_crop: np.ndarray, action: int) -> bool:
+    """Return True if *action* from the agent's centre steps into a hazard.
+    Args:
+        local_crop: ``[crop_size, crop_size]`` glyph array.
+        action: Cardinal action index (0=N, 1=E, 2=S, 3=W).
+    Returns:
+        ``True`` when the target cell contains a hazard glyph.
+    """
+    if action not in _CARDINAL_OFFSETS:
+        return False
+    cs = local_crop.shape[0]
+    cy, cx = cs // 2, cs // 2
+    dy, dx = _CARDINAL_OFFSETS[action]
+    ny, nx = cy + dy, cx + dx
+    if not (0 <= ny < cs and 0 <= nx < cs):
+        return True
+    glyph = int(local_crop[ny, nx])
+    return glyph in _HAZARD_GLYPHS or glyph in _HAZARD_CHARS
+def top_k_filter(logits: Tensor, k: int) -> Tensor:
+    """Mask all but the top-k logits per position by setting them to -inf.
+    Args:
+        logits: Raw logits. Shape ``[..., V]``.
+        k: Number of top entries to keep.
+    Returns:
+        Filtered logits with non-top-k set to ``-inf``.
+    """
+    if k <= 0 or k >= logits.shape[-1]:
+        return logits
+    topk_vals, _ = logits.topk(k, dim=-1)  # [..., k]
+    threshold = topk_vals[..., -1:]  # [..., 1]
+    return logits.masked_fill(logits < threshold, float("-inf"))
+def _compute_remask_prob(
+    strategy: str,
+    eta: float,
+    sigma_max: float,
+    confidence: Tensor | None,
+) -> Tensor | float:
+    """Compute per-token remasking probability.
+    Args:
+        strategy: One of ``"rescale"``, ``"cap"``, ``"conf"``.
+        eta: Base remasking strength hyperparameter.
+        sigma_max: ``1 - alpha_t(ratio)`` at current step.
+        confidence: Per-token confidence scores. Shape ``[B, L]``.
+            Required only for the ``"conf"`` strategy.
+    Returns:
+        Scalar or ``[B, L]`` tensor of remasking probabilities.
+    """
+    if strategy == "rescale":
+        return eta * sigma_max
+    if strategy == "cap":
+        return min(eta, sigma_max)
+    if strategy == "conf":
+        assert confidence is not None, "conf strategy requires confidence"
+        return eta * sigma_max * (1.0 - confidence)
+    raise ValueError(f"Unknown remask strategy: {strategy}")
+@torch.no_grad()
+def remdm_sample(
+    model: torch.nn.Module,
+    local_obs: Tensor,
+    global_obs: Tensor,
+    cfg: SimpleNamespace,
+    device: torch.device | str,
+    physics_aware: bool = True,
+    blind_global: bool = False,
+    return_analytics: bool = False,
+    num_steps: int | None = None,
+) -> Tensor | tuple[Tensor, list, list[float], list[int]]:
+    """Generate action sequences via iterative ReMDM denoising.
+    Args:
+        model: Denoising model with forward signature
+            ``(local_obs, global_obs, action_seq, t_discrete) -> dict``.
+        local_obs: Local crop observations. Shape ``[B, 9, 9]``.
+        global_obs: Global map observations. Shape ``[B, 21, 79]``.
+        cfg: Config namespace with ``seq_len``, ``mask_token``,
+            ``action_dim``, ``num_diffusion_steps``,
+            ``diffusion_steps_eval``, ``temperature``, ``top_k``,
+            ``eta``, ``remask_strategy``, ``noise_schedule``.
+        device: Torch device.
+        physics_aware: If ``True``, soft-penalise hazardous cardinal actions
+            by overriding their confidence to ``0.001`` before commitment
+            ranking. Only checks the first ``_N_PHYSICS_CHECK`` positions.
+        blind_global: If ``True``, zero out the global map observation
+            (local-only ablation).
+        return_analytics: If ``True``, also return per-step analytics as
+            ``(seq, path_per_step, tracking_confidence, tracking_masked)``.
+        num_steps: Override number of denoising steps (default uses
+            ``cfg.diffusion_steps_eval``).
+    Returns:
+        When ``return_analytics=False`` (default): fully committed action
+        sequence of shape ``[B, seq_len]``, int64, with no MASK tokens.
+        When ``return_analytics=True``: tuple
+        ``(seq, path_per_step, tracking_confidence, tracking_masked_count)``
+        where ``path_per_step`` is a list of ``[seq_len]`` numpy arrays,
+        ``tracking_confidence`` a list of per-step avg unmasked confidence
+        floats, and ``tracking_masked_count`` a list of masked-token counts.
+    """
+    B = local_obs.shape[0]
+    seq_len = cfg.seq_len
+    mask_token = cfg.mask_token
+    action_dim = cfg.action_dim
+    K = num_steps if num_steps is not None else cfg.diffusion_steps_eval
+    schedule_fn = get_schedule(cfg.noise_schedule)
+    min_keep = max(1, int(seq_len * 0.10))  # Safety Net: always unmask ≥10%
+    local_obs = local_obs.to(device)
+    global_obs = global_obs.to(device)
+    if blind_global:
+        global_obs = torch.zeros_like(global_obs)
+    # Pre-compute numpy local crops for physics checks (CPU, batch loop)
+    local_np: np.ndarray | None = None  # [B, crop, crop]
+    if physics_aware:
+        local_np = local_obs.cpu().numpy()
+    # Analytics buffers (only populated when return_analytics=True)
+    path_per_step: list[np.ndarray] = []
+    tracking_confidence: list[float] = []
+    tracking_masked_count: list[int] = []
+    # Start fully masked
+    seq = torch.full(
+        (B, seq_len), mask_token, dtype=torch.long, device=device
+    )
+    for k in range(1, K + 1):
+        ratio = k / K
+        # Pass as tensor (not Python int) to avoid torch.compile recompilation
+        t_discrete = torch.full(
+            (B,), int(cfg.num_diffusion_steps * (1.0 - ratio)),
+            dtype=torch.long, device=device,
+        )
+        # Forward pass
+        out = model(local_obs, global_obs, seq, t_discrete)
+        logits = out["actions"]  # [B, seq_len, vocab]
+        # Mask invalid action tokens (indices >= action_dim)
+        logits[:, :, action_dim:] = float("-inf")
+        # Temperature scaling
+        logits = logits / cfg.temperature
+        # Top-K filtering
+        logits = top_k_filter(logits, cfg.top_k)
+        # Sample predictions
+        probs = F.softmax(logits, dim=-1)  # [B, seq_len, action_dim]
+        preds = Categorical(probs=probs).sample()  # [B, seq_len]
+        # Confidence: probability of the sampled token
+        conf = probs.gather(
+            -1, preds.unsqueeze(-1)
+        ).squeeze(-1)  # [B, seq_len]
+        # Physics softener: demote hazardous cardinal actions to conf=0.001
+        if physics_aware and local_np is not None:
+            preds_np = preds.cpu().numpy()  # [B, seq_len]
+            conf_override = conf.clone()
+            for b in range(B):
+                crop_b = np.asarray(local_np[b])  # [crop, crop]
+                for pos in range(min(_N_PHYSICS_CHECK, seq_len)):
+                    action = int(preds_np[b, pos])
+                    if _check_hazard(crop_b, action):
+                        conf_override[b, pos] = 0.001
+            conf = conf_override
+        is_masked = seq == mask_token  # [B, seq_len]
+        if k < K:
+            # MaskGIT progressive unmasking with min-keep guarantee
+            n_unmask = max(min_keep, max(1, int(seq_len * ratio)))
+            # Set confidence of non-masked positions to -1 so they
+            # are not selected for unmasking
+            unmask_scores = conf.clone()
+            unmask_scores[~is_masked] = -1.0
+            # For each batch element, unmask top-confidence masked positions
+            _, topk_indices = unmask_scores.topk(
+                n_unmask, dim=-1
+            )  # [B, n_unmask]
+            # Build scatter mask for positions to unmask
+            unmask_mask = torch.zeros_like(seq, dtype=torch.bool)
+            unmask_mask.scatter_(1, topk_indices, True)
+            unmask_mask = unmask_mask & is_masked  # only unmask masked pos
+            seq = torch.where(unmask_mask, preds, seq)
+            # ReMDM stochastic remasking of committed (non-masked) positions
+            is_committed = seq != mask_token  # [B, seq_len]
+            alpha_t_ratio = schedule_fn(
+                torch.tensor(ratio, device=device)
+            )
+            sigma_max = (1.0 - alpha_t_ratio).item()
+            remask_prob = _compute_remask_prob(
+                cfg.remask_strategy, cfg.eta, sigma_max, conf
+            )
+            if isinstance(remask_prob, Tensor):
+                do_remask = (
+                    torch.rand_like(conf) < remask_prob
+                ) & is_committed
+            else:
+                do_remask = (
+                    torch.rand(B, seq_len, device=device) < remask_prob
+                ) & is_committed
+            seq = torch.where(do_remask, mask_token, seq)
+        else:
+            # Final step: commit all remaining MASK tokens
+            seq = torch.where(is_masked, preds, seq)
+        # Analytics tracking
+        if return_analytics:
+            path_per_step.append(seq[0].cpu().numpy().copy())
+            still_masked = (seq[0] == mask_token)
+            unmasked_conf = conf[0][~still_masked]
+            avg_conf = (
+                unmasked_conf.mean().item()
+                if unmasked_conf.numel() > 0 else 0.0
+            )
+            tracking_confidence.append(avg_conf)
+            tracking_masked_count.append(int(still_masked.sum().item()))
+    assert (seq != mask_token).all(), (
+        "remdm_sample produced MASK tokens in final output"
+    )
+    if return_analytics:
+        return seq, path_per_step, tracking_confidence, tracking_masked_count
+    return seq
+@torch.no_grad()
+def greedy_sample(
+    model: torch.nn.Module,
+    local_obs: Tensor,
+    global_obs: Tensor,
+    cfg: SimpleNamespace,
+    device: torch.device | str,
+    blind_global: bool = False,
+    num_steps: int | None = None,
+) -> Tensor:
+    """Greedy (argmax) MaskGIT sampling — no temperature, top-K, or remasking.
+    Used by ``DataCollector`` during DAgger for deterministic rollouts,
+    matching the reference ``run_model_episode`` behaviour.
+    Args:
+        model: Denoising model.
+        local_obs: Shape ``[B, 9, 9]``.
+        global_obs: Shape ``[B, 21, 79]``.
+        cfg: Config namespace.
+        device: Torch device.
+        blind_global: Zero out global map (local-only ablation).
+        num_steps: Override number of denoising steps (default uses
+            ``cfg.diffusion_steps_eval``).
+    Returns:
+        Fully committed action sequence ``[B, seq_len]``, int64.
+    """
+    B = local_obs.shape[0]
+    seq_len = cfg.seq_len
+    mask_token = cfg.mask_token
+    action_dim = cfg.action_dim
+    K = num_steps if num_steps is not None else cfg.diffusion_steps_eval
+    local_obs = local_obs.to(device)
+    global_obs = global_obs.to(device)
+    if blind_global:
+        global_obs = torch.zeros_like(global_obs)
+    seq = torch.full(
+        (B, seq_len), mask_token, dtype=torch.long, device=device,
+    )
+    for k in range(1, K + 1):
+        ratio = k / K
+        t_discrete = torch.full(
+            (B,), int(cfg.num_diffusion_steps * (1.0 - ratio)),
+            dtype=torch.long, device=device,
+        )
+        out = model(local_obs, global_obs, seq, t_discrete)
+        logits = out["actions"]  # [B, seq_len, vocab]
+        # Mask invalid action tokens
+        logits[:, :, action_dim:] = float("-inf")
+        # Greedy: argmax over softmax (no temperature, no top-K)
+        probs = F.softmax(logits, dim=-1)  # [B, seq_len, action_dim]
+        confidences, preds = probs.max(dim=-1)  # [B, seq_len] each
+        # MaskGIT progressive unmasking by confidence
+        num_to_unmask = max(1, int(seq_len * ratio))
+        is_masked = seq == mask_token  # [B, seq_len]
+        # Score only masked positions for unmasking
+        scores = confidences.clone()
+        scores[~is_masked] = -1.0
+        _, topk_idx = scores.topk(num_to_unmask, dim=-1)
+        unmask_mask = torch.zeros_like(seq, dtype=torch.bool)
+        unmask_mask.scatter_(1, topk_idx, True)
+        unmask_mask = unmask_mask & is_masked
+        seq = torch.where(unmask_mask, preds, seq)
+        # No remasking in greedy mode
+    # Force-commit any remaining masked tokens
+    still_masked = seq == mask_token
+    if still_masked.any():
+        t_zero = torch.zeros(B, dtype=torch.long, device=device)
+        out = model(local_obs, global_obs, seq, t_zero)
+        logits = out["actions"]
+        logits[:, :, action_dim:] = float("-inf")
+        preds = logits.argmax(dim=-1)
+        seq = torch.where(still_masked, preds, seq)
+    return seq
+def select_action(
+    model: torch.nn.Module,
+    local_obs: Tensor,
+    global_obs: Tensor,
+    cfg: SimpleNamespace,
+    device: torch.device | str,
+    physics_aware: bool = True,
+    blind_global: bool = False,
+) -> int:
+    """Sample a single action from a length-1 batch.
+    Args:
+        model: Denoising model.
+        local_obs: Shape ``[9, 9]`` or ``[1, 9, 9]``.
+        global_obs: Shape ``[21, 79]`` or ``[1, 21, 79]``.
+        cfg: Config namespace.
+        device: Torch device.
+        physics_aware: Forward to ``remdm_sample``.
+        blind_global: Forward to ``remdm_sample``.
+    Returns:
+        The first action of the generated plan (int).
+    """
+    if local_obs.ndim == 2:
+        local_obs = local_obs.unsqueeze(0)
+    if global_obs.ndim == 2:
+        global_obs = global_obs.unsqueeze(0)
+    seq = remdm_sample(
+        model, local_obs, global_obs, cfg, device,
+        physics_aware=physics_aware, blind_global=blind_global,
+    )
+    return seq[0, 0].item()
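
The three remasking strategies differ only in how they combine `eta`, `sigma_max`, and per-token confidence. The scalar sketch below mirrors the formulas in `_compute_remask_prob`; the numeric values are hypothetical and chosen only to show the relative behaviour.

```python
def remask_prob(strategy, eta, sigma_max, confidence=None):
    """Scalar version of the remasking probability for one token."""
    if strategy == "rescale":
        return eta * sigma_max
    if strategy == "cap":
        return min(eta, sigma_max)
    if strategy == "conf":
        if confidence is None:
            raise ValueError("conf strategy requires confidence")
        # Low-confidence tokens are remasked more often
        return eta * sigma_max * (1.0 - confidence)
    raise ValueError(f"Unknown remask strategy: {strategy}")


eta, sigma_max = 0.2, 0.5  # hypothetical values
rescale = remask_prob("rescale", eta, sigma_max)
cap = remask_prob("cap", eta, sigma_max)
conf_sure = remask_prob("conf", eta, sigma_max, confidence=0.9)
conf_hazard = remask_prob("conf", eta, sigma_max, confidence=0.001)
print(rescale, cap, conf_sure, conf_hazard)
```

Under the `"conf"` strategy, a token whose confidence was overridden to `0.001` by the physics check ends up with nearly the full `eta * sigma_max` remasking probability, while a confident token is rarely revisited.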

src/diffusion/schedules.py ADDED Viewed

	@@ -0,0 +1,88 @@

+"""Noise schedule functions for MDLM diffusion.
+Ported from the Craftax JAX implementation (src/diffusion/schedules.py).
+All functions operate on PyTorch tensors and are pure (no global state).
+Convention: alpha(t) is the fraction of tokens that remain *unmasked*.
+  - alpha(0) = 1.0  (fully clean)
+  - alpha(1) = 0.0  (fully masked)
+"""
+from __future__ import annotations
+import math
+from typing import Callable
+import torch
+from torch import Tensor
+def linear_schedule(t: Tensor) -> Tensor:
+    """Linear noise schedule: alpha(t) = 1 - t.
+    Args:
+        t: Diffusion time in [0, 1]. Any shape.
+    Returns:
+        Retention probability alpha_t, same shape as *t*.
+    """
+    return 1.0 - t
+def cosine_schedule(t: Tensor) -> Tensor:
+    """Cosine noise schedule: alpha(t) = cos(pi/2 * t)^2.
+    Args:
+        t: Diffusion time in [0, 1]. Any shape.
+    Returns:
+        Retention probability alpha_t, same shape as *t*.
+    """
+    return torch.cos(t * (math.pi / 2.0)) ** 2
+_SCHEDULE_MAP: dict[str, Callable[[Tensor], Tensor]] = {
+    "linear": linear_schedule,
+    "cosine": cosine_schedule,
+}
+def get_schedule(name: str) -> Callable[[Tensor], Tensor]:
+    """Look up a noise schedule by name.
+    Args:
+        name: One of ``"linear"`` or ``"cosine"``.
+    Returns:
+        The schedule function ``alpha(t)``.
+    Raises:
+        KeyError: If *name* is not registered.
+    """
+    if name not in _SCHEDULE_MAP:
+        raise KeyError(
+            f"Unknown schedule '{name}'. "
+            f"Available: {list(_SCHEDULE_MAP.keys())}"
+        )
+    return _SCHEDULE_MAP[name]
+def alpha_prime(
+    t: Tensor,
+    schedule_fn: Callable[[Tensor], Tensor],
+    eps: float = 1e-5,
+) -> Tensor:
+    """Numerical derivative d(alpha)/dt via central difference.
+    Args:
+        t: Diffusion time in [0, 1]. Any shape.
+        schedule_fn: Noise schedule returning alpha(t).
+        eps: Half-width for finite-difference stencil.
+    Returns:
+        Approximate derivative, same shape as *t*.
+    """
+    t_clamped = t.clamp(eps, 1.0 - eps)
+    return (schedule_fn(t_clamped + eps) - schedule_fn(t_clamped - eps)) / (
+        2.0 * eps
+    )
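
As a sanity check on the central-difference stencil in `alpha_prime`, the sketch below compares it with the exact derivatives of both schedules. It works on Python floats instead of tensors, and `cosine_prime_exact` is a hypothetical helper derived by hand from `alpha(t) = cos(pi/2 * t)^2`.

```python
import math


def linear(t):
    return 1.0 - t


def cosine(t):
    return math.cos(t * (math.pi / 2.0)) ** 2


def alpha_prime(t, schedule, eps=1e-5):
    """Central difference with the same clamping as the tensor version."""
    t = min(max(t, eps), 1.0 - eps)
    return (schedule(t + eps) - schedule(t - eps)) / (2.0 * eps)


def cosine_prime_exact(t):
    """d/dt cos^2(pi t / 2) = -(pi / 2) * sin(pi t)."""
    return -(math.pi / 2.0) * math.sin(math.pi * t)


err_linear = abs(alpha_prime(0.3, linear) - (-1.0))
err_cosine = abs(alpha_prime(0.3, cosine) - cosine_prime_exact(0.3))
print(err_linear, err_cosine)
```

Python floats are double precision, so the errors here are tiny. With float32 tensors and the same `eps`, rounding in the subtraction is likely to make the estimate noticeably noisier.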

src/envs/__init__.py ADDED Viewed

File without changes

src/envs/discovery.py ADDED Viewed

	@@ -0,0 +1,166 @@

+"""MiniHack environment discovery and diagnostic utilities.
+Provides tools for scanning the gymnasium registry, validating action-space
+consistency across environments, and benchmarking inference throughput.
+"""
+from __future__ import annotations
+import logging
+import time
+from types import SimpleNamespace
+import torch
+logger = logging.getLogger(__name__)
+_NAV_KEYWORDS = ("Room", "Corridor", "Maze", "River")
+_EXCLUDED_KEYWORDS = ("KeyRoom",)
+_REFERENCE_ENV_ID = "MiniHack-Room-15x15-v0"
+def list_working_minihack_tasks() -> list[str]:
+    """Scan the gymnasium registry for working MiniHack navigation tasks.
+    Filters to environments whose names contain at least one navigation
+    keyword and attempts to instantiate each. Returns the IDs of all
+    successfully created environments.
+    Returns:
+        Sorted list of working MiniHack navigation environment IDs.
+    """
+    import gymnasium as gym
+    import minihack  # noqa: F401 — registers envs
+    all_ids = list(gym.envs.registry.keys())
+    candidates = [
+        e for e in all_ids
+        if "MiniHack" in e
+        and any(k in e for k in _NAV_KEYWORDS)
+        and not any(x in e for x in _EXCLUDED_KEYWORDS)
+    ]
+    working: list[str] = []
+    broken: list[str] = []
+    for env_id in sorted(candidates):
+        try:
+            env = gym.make(env_id)
+            working.append(env_id)
+            env.close()
+        except Exception:
+            broken.append(env_id)
+    logger.info(
+        f"MiniHack navigation tasks — working: {len(working)}, "
+        f"broken: {len(broken)}"
+    )
+    return working
+def check_action_consistency_with_fixed_ref(
+    env_list: list[str],
+) -> list[tuple[str, str, int]]:
+    """Validate action-space ordering against a fixed reference environment.
+    Compares each environment's action list against
+    ``MiniHack-Room-15x15-v0`` and classifies the relationship as one of:
+    ``REFERENCE``, ``EXACT``, ``SUPERSET (+N)``, ``SUBSET (-N)``,
+    ``CONFLICT``, or ``CRASHED``.
+    Args:
+        env_list: MiniHack environment IDs to check.
+    Returns:
+        List of ``(env_id, status, action_space_size)`` tuples.
+    """
+    import gymnasium as gym
+    import minihack  # noqa: F401
+    ref_env = gym.make(_REFERENCE_ENV_ID)
+    reference_actions = ref_env.unwrapped.actions  # type: ignore[attr-defined]
+    ref_env.close()
+    results: list[tuple[str, str, int]] = []
+    for env_id in sorted(env_list):
+        if env_id == _REFERENCE_ENV_ID:
+            results.append((env_id, "REFERENCE", len(reference_actions)))
+            continue
+        try:
+            env = gym.make(env_id)
+            try:
+                env_actions = env.unwrapped.actions  # type: ignore[attr-defined]
+                limit = min(len(reference_actions), len(env_actions))
+                is_match = all(
+                    reference_actions[i] == env_actions[i]
+                    for i in range(limit)
+                )
+                diff = len(env_actions) - len(reference_actions)
+                if not is_match:
+                    # Shared prefix differs: ordering conflict, whatever
+                    # the relative sizes
+                    status = "CONFLICT"
+                elif diff == 0:
+                    status = "EXACT"
+                elif diff > 0:
+                    status = f"SUPERSET (+{diff})"
+                else:
+                    status = f"SUBSET ({diff})"
+                results.append((env_id, status, len(env_actions)))
+            finally:
+                env.close()
+        except Exception:
+            results.append((env_id, "CRASHED", 0))
+    for name, status, size in results:
+        logger.info(f"  {name:<40} | {status:<14} | n_actions={size}")
+    return results
+def benchmark_inference(
+    model: torch.nn.Module,
+    cfg: SimpleNamespace,
+    device: torch.device | str,
+    n_actions: int = 100,
+) -> tuple[float, float]:
+    """Measure ReMDM inference throughput.
+    Runs ``n_actions`` planning calls with dummy observations and
+    measures wall-clock time.
+    Args:
+        model: Denoising model in eval mode.
+        cfg: Config namespace (used for ``seq_len``, ``mask_token``, etc.).
+        device: Torch device.
+        n_actions: Number of planning calls to benchmark.
+    Returns:
+        ``(diffusion_steps_per_sec, actions_per_sec)`` as floats.
+    """
+    from src.diffusion.sampling import remdm_sample
+    model.eval()
+    local_dummy = torch.zeros(
+        (1, cfg.crop_size, cfg.crop_size), dtype=torch.long, device=device,
+    )
+    global_dummy = torch.zeros(
+        (1, cfg.map_h, cfg.map_w), dtype=torch.long, device=device,
+    )
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
+    t0 = time.perf_counter()
+    for _ in range(n_actions):
+        remdm_sample(model, local_dummy, global_dummy, cfg, device)
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
+    elapsed = time.perf_counter() - t0
+    total_steps = n_actions * cfg.diffusion_steps_eval
+    steps_per_sec = total_steps / elapsed if elapsed > 0 else 0.0
+    actions_per_sec = n_actions / elapsed if elapsed > 0 else 0.0
+    logger.info(
+        f"Benchmark ({n_actions} actions): "
+        f"{steps_per_sec:.1f} diffusion-steps/s | "
+        f"{actions_per_sec:.1f} actions/s"
+    )
+    return steps_per_sec, actions_per_sec
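The action-space comparison earlier in this file can be isolated as a pure function, which makes the four status labels easier to reason about. This is a minimal sketch, not the committed implementation: it assumes plain integer sequences stand in for the MiniHack action enums, and it assumes that any mismatch in the shared prefix should be reported as `CONFLICT` regardless of length. `classify_action_space` is a hypothetical name that does not exist in the repo.

```python
from __future__ import annotations

from typing import Sequence


def classify_action_space(reference: Sequence[int], candidate: Sequence[int]) -> str:
    """Classify `candidate` against `reference` by comparing their shared prefix.

    Hypothetical helper: the labels mirror the ones logged by the diff above.
    """
    limit = min(len(reference), len(candidate))
    prefix_matches = all(reference[i] == candidate[i] for i in range(limit))
    diff = len(candidate) - len(reference)

    if not prefix_matches:
        # Assumption: a mismatch inside the shared prefix is always a conflict.
        return "CONFLICT"
    if diff == 0:
        return "EXACT"
    if diff > 0:
        return f"SUPERSET (+{diff})"
    return f"SUBSET ({diff})"


if __name__ == "__main__":
    reference = [0, 1, 2, 3]
    for candidate in ([0, 1, 2, 3], [0, 1, 2, 3, 4], [0, 1], [0, 9, 2, 3]):
        print(candidate, "->", classify_action_space(reference, candidate))
```

Keeping the classification free of `gym.make` calls means it can be unit-tested without installing MiniHack.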

src/envs/minihack_env.py ADDED

	@@ -0,0 +1,454 @@

+"""MiniHack environment wrapper with BFS oracle and shaped rewards.
+Ported from minihack_reference/src/env.py. Provides dual-stream
+observations (9x9 local crop + 21x79 global map), a multi-tier BFS
+oracle, and reward shaping (win bonus, BFS progress, exploration, step
+penalty).
+"""
+from __future__ import annotations
+import collections
+import logging
+from types import SimpleNamespace
+import gymnasium as gym
+import minihack  # noqa: F401 — registers MiniHack envs
+import numpy as np
+logger = logging.getLogger(__name__)
+# Suppress noisy NLE INFO spam ("Not saving any NLE data." on every env create)
+logging.getLogger("nle.env.base").setLevel(logging.WARNING)
+# ── Staircase detection ──────────────────────────────────────────────
+def find_staircase_from_glyphs(global_obs: np.ndarray) -> np.ndarray:
+    """Locate the staircase '>' in the global glyph map.
+    Args:
+        global_obs: Glyph map, shape ``[B, H, W]`` or ``[H, W]``.
+    Returns:
+        Normalised ``(row/(H-1), col/(W-1))`` coords, shape ``[B, 2]``
+        (float32; ``B = 1`` for 2-D input). ``(-1, -1)`` when not visible.
+    """
+    squeeze = global_obs.ndim == 2
+    if squeeze:
+        global_obs = global_obs[np.newaxis]
+    B, H, W = global_obs.shape
+    coords = np.full((B, 2), -1.0, dtype=np.float32)
+    for b in range(B):
+        is_stair = (
+            (global_obs[b] == 62)
+            | (global_obs[b] == 2310)
+            | (global_obs[b] == 2368)
+            | (global_obs[b] == 2383)
+        )
+        positions = np.argwhere(is_stair)
+        if positions.shape[0] > 0:
+            coords[b, 0] = positions[0, 0] / max(1, H - 1)
+            coords[b, 1] = positions[0, 1] / max(1, W - 1)
+    return coords
+# ── Environment wrapper ──────────────────────────────────────────────
+class AdvancedObservationEnv(gym.Env):
+    """MiniHack wrapper with dual-stream obs, BFS oracle, shaped rewards.
+    Observations are ``(local_crop, global_map)`` where
+    ``local_crop`` is a ``[crop_size, crop_size]`` glyph window centred
+    on the agent and ``global_map`` is the full ``[21, 79]`` glyph grid.
+    Args:
+        env_id: MiniHack registry ID.
+        des_file: Optional ``.des`` file content (for custom levels).
+        cfg: Configuration namespace with ``crop_size``, ``action_dim``,
+            ``pad_token``, ``map_h``, ``map_w``.
+    """
+    _UNWALKABLE = frozenset({32, 45, 124, 125})  # space, -, |, }
+    _CLOSED_DOOR = 43  # '+'
+    _DIR_MAP = {(-1, 0): 0, (0, 1): 1, (1, 0): 2, (0, -1): 3}
+    _CARDINAL = [(-1, 0), (0, 1), (1, 0), (0, -1)]
+    def __init__(
+        self,
+        env_id: str,
+        des_file: str | None,
+        cfg: SimpleNamespace,
+    ) -> None:
+        super().__init__()
+        self.env_id = env_id
+        self._cfg = cfg
+        self._crop_half = cfg.crop_size // 2
+        obs_keys = ("glyphs", "chars", "pixel")
+        if des_file is not None:
+            self._inner = gym.make(
+                "MiniHack-Navigation-Custom-v0",
+                des_file=des_file,
+                observation_keys=obs_keys,
+            )
+        else:
+            self._inner = gym.make(
+                env_id, observation_keys=obs_keys,
+            )
+        self.observation_space = gym.spaces.Box(
+            low=0, high=6000,
+            shape=(cfg.crop_size, cfg.crop_size),
+            dtype=np.int16,
+        )
+        self.action_space: gym.spaces.Discrete = gym.spaces.Discrete(cfg.action_dim)
+        self._visited: set[tuple[int, int]] = set()
+        self._prev_bfs_dist: int | None = None
+        self.last_raw_obs: dict | None = None
+    # ── gym.Env interface ────────────────────────────────────────────
+    def reset(
+        self, seed: int | None = None, options: dict | None = None,
+    ) -> tuple[tuple[np.ndarray, np.ndarray], dict]:
+        """Reset environment and tracking state.
+        Args:
+            seed: Optional RNG seed.
+            options: Passed through to the inner env.
+        Returns:
+            ``((local_crop, global_map), info)``
+        """
+        obs, info = self._inner.reset(seed=seed, options=options)
+        self.last_raw_obs = obs
+        self._prev_bfs_dist = self._get_bfs_distance(obs)
+        self._visited = set()
+        agent_pos = self._get_agent_pos(obs)
+        if agent_pos is not None:
+            self._visited.add(agent_pos)
+        return self._get_obs(obs), info
+    def step(
+        self, action: int,
+    ) -> tuple[tuple[np.ndarray, np.ndarray], float, bool, bool, dict]:
+        """Execute one environment step with shaped reward.
+        Reward shaping:
+        - Win bonus: ``+20.0``
+        - BFS progress toward staircase: ``+0.5 * (prev - curr)``
+        - New-tile exploration: ``+0.05``
+        - Step penalty: ``-0.01``
+        Args:
+            action: Integer action in ``[0, action_dim)``.
+        Returns:
+            ``(obs, shaped_reward, terminated, truncated, info)``
+        """
+        inner_n = self._inner.action_space.n
+        if action >= inner_n:
+            action = action % inner_n
+        obs, raw_reward, terminated, truncated, info = self._inner.step(action)
+        self.last_raw_obs = obs
+        reward = float(raw_reward)
+        # Win bonus
+        if terminated and reward > 0:
+            info["won"] = True
+            reward += 20.0
+        else:
+            info["won"] = False
+        # BFS shaping
+        curr_dist = self._get_bfs_distance(obs)
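+        if curr_dist is not None:
+            if self._prev_bfs_dist is not None:
+                reward += (self._prev_bfs_dist - curr_dist) * 0.5
+            # Track the distance even on the first sighting of the staircase;
+            # otherwise shaping never starts when '>' is hidden at reset.
+            self._prev_bfs_dist = curr_dist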
+        # Exploration bonus
+        agent_pos = self._get_agent_pos(obs)
+        if agent_pos is not None and agent_pos not in self._visited:
+            reward += 0.05
+            self._visited.add(agent_pos)
+        # Step penalty
+        reward -= 0.01
+        return self._get_obs(obs), reward, terminated, truncated, info
+    @property
+    def unwrapped(self):
+        """Access the inner MiniHack env."""
+        return self._inner.unwrapped
+    def close(self) -> None:
+        """Close the inner environment."""
+        self._inner.close()
+    # ── Observation helpers ──────────────────────────────────────────
+    def _get_obs(
+        self, obs: dict,
+    ) -> tuple[np.ndarray, np.ndarray]:
+        """Extract dual-stream observation.
+        Args:
+            obs: Raw NLE observation dict.
+        Returns:
+            ``(local_crop [crop,crop], global_map [H,W])`` as int16.
+        """
+        return self._get_crop(obs), obs["glyphs"].copy().astype(np.int16)
+    def _get_crop(self, obs: dict) -> np.ndarray:
+        """Crop local glyph window centred on agent.
+        Args:
+            obs: Raw NLE observation dict.
+        Returns:
+            ``[crop_size, crop_size]`` int16 array.
+        """
+        glyphs = obs["glyphs"]
+        chars = obs["chars"]
+        agent_pos = np.argwhere(chars == ord("@"))
+        cs = self._cfg.crop_size
+        if len(agent_pos) == 0:
+            return np.full((cs, cs), self._cfg.pad_token, dtype=np.int16)
+        y, x = agent_pos[0]
+        h = self._crop_half
+        padded = np.pad(
+            glyphs, h, mode="constant",
+            constant_values=self._cfg.pad_token,
+        )
+        return padded[y:y + cs, x:x + cs].astype(np.int16)
+    def _get_agent_pos(self, obs: dict) -> tuple[int, int] | None:
+        """Find agent '@' position in the chars grid.
+        Args:
+            obs: Raw NLE observation dict.
+        Returns:
+            ``(row, col)`` or ``None``.
+        """
+        chars = obs["chars"]
+        pos = np.argwhere(chars == ord("@"))
+        return tuple(pos[0]) if len(pos) > 0 else None
+    def _get_bfs_distance(self, obs: dict) -> int | None:
+        """BFS shortest-path distance from agent to staircase.
+        Args:
+            obs: Raw NLE observation dict.
+        Returns:
+            Integer distance or ``None`` if unreachable / not visible.
+        """
+        chars = obs["chars"]
+        start = np.argwhere(chars == ord("@"))
+        target = np.argwhere(chars == ord(">"))
+        if len(start) == 0 or len(target) == 0:
+            return None
+        start = tuple(start[0])
+        target = tuple(target[0])
+        if start == target:
+            return 0
+        queue: collections.deque = collections.deque([(start, 0)])
+        visited = {start}
+        while queue:
+            (r, c), dist = queue.popleft()
+            if (r, c) == target:
+                return dist
+            for dr, dc in self._CARDINAL:
+                nr, nc = r + dr, c + dc
+                if (
+                    0 <= nr < self._cfg.map_h
+                    and 0 <= nc < self._cfg.map_w
+                    and (nr, nc) not in visited
+                    and chars[nr, nc] not in self._UNWALKABLE
+                ):
+                    visited.add((nr, nc))
+                    queue.append(((nr, nc), dist + 1))
+        return None
+    # ── BFS Oracle ───────────────────────────────────────────────────
+    def get_oracle_action(self, obs: dict) -> int:
+        """5-tier BFS oracle action.
+        Priority:
+        1. Kick adjacent closed door.
+        2. BFS to staircase '>'.
+        3. BFS to frontier (adjacent to unexplored space).
+        4. BFS to farthest reachable tile.
+        5. Random cardinal direction.
+        Args:
+            obs: Raw NLE observation dict (needs ``'chars'`` key).
+        Returns:
+            Action index in ``[0, action_dim)``.
+        """
+        if obs is None:
+            return 0
+        chars = obs["chars"]
+        start = np.argwhere(chars == ord("@"))
+        if len(start) == 0:
+            return np.random.randint(0, 4)
+        start = tuple(start[0])
+        target_list = np.argwhere(chars == ord(">"))
+        # 1. Adjacent closed door → kick
+        for dr, dc in self._CARDINAL:
+            nr, nc = start[0] + dr, start[1] + dc
+            if (
+                0 <= nr < self._cfg.map_h
+                and 0 <= nc < self._cfg.map_w
+                and chars[nr, nc] == self._CLOSED_DOOR
+            ):
+                return 11  # KICK
+        # BFS to gather reachable tiles + check staircase
+        queue: collections.deque = collections.deque([(start, [])])
+        visited = {start}
+        reachable: list[tuple[tuple[int, int], list[tuple[int, int]]]] = []
+        target_path: list[tuple[int, int]] | None = None
+        while queue:
+            (r, c), path = queue.popleft()
+            reachable.append(((r, c), path))
+            for t_r, t_c in target_list:
+                if r == t_r and c == t_c:
+                    target_path = path
+                    break
+            if target_path is not None:
+                break
+            for dr, dc in self._CARDINAL:
+                nr, nc = r + dr, c + dc
+                if (
+                    0 <= nr < self._cfg.map_h
+                    and 0 <= nc < self._cfg.map_w
+                    and (nr, nc) not in visited
+                ):
+                    ch = chars[nr, nc]
+                    if ch not in self._UNWALKABLE and ch != self._CLOSED_DOOR:
+                        visited.add((nr, nc))
+                        queue.append(((nr, nc), path + [(dr, dc)]))
+        # 2. Path to staircase
+        if target_path:
+            return self._DIR_MAP.get(target_path[0], 0)
+        # 3. Frontier exploration — tiles adjacent to unexplored space
+        frontier: list[list[tuple[int, int]]] = []
+        for (r, c), path in reachable:
+            if not path:
+                continue
+            for dr, dc in self._CARDINAL:
+                nr, nc = r + dr, c + dc
+                if (
+                    0 <= nr < self._cfg.map_h
+                    and 0 <= nc < self._cfg.map_w
+                    and chars[nr, nc] == 32
+                ):
+                    frontier.append(path)
+                    break
+        if frontier:
+            frontier.sort(key=len)
+            return self._DIR_MAP.get(frontier[0][0], 0)
+        # 4. Farthest reachable tile
+        if reachable:
+            reachable.sort(key=lambda x: len(x[1]), reverse=True)
+            farthest = reachable[0][1]
+            if farthest:
+                return self._DIR_MAP.get(farthest[0], 0)
+        # 5. Random cardinal
+        return np.random.randint(0, 4)
+# ── Factory ──────────────────────────────────────────────────────────
+def make_env(
+    env_id: str,
+    des_file: str | None,
+    cfg: SimpleNamespace,
+) -> AdvancedObservationEnv:
+    """Create a wrapped MiniHack environment.
+    Args:
+        env_id: MiniHack registry ID.
+        des_file: Optional ``.des`` file content.
+        cfg: Configuration namespace.
+    Returns:
+        Wrapped environment.
+    """
+    return AdvancedObservationEnv(env_id, des_file, cfg)
+def collect_oracle_trajectory(
+    env_id: str,
+    seed: int,
+    cfg: SimpleNamespace,
+    max_steps: int = 500,
+) -> dict | None:
+    """Roll out the BFS oracle on a single episode.
+    Args:
+        env_id: MiniHack registry ID.
+        seed: RNG seed for the episode.
+        cfg: Configuration namespace.
+        max_steps: Maximum episode length.
+    Returns:
+        ``{"local": [T,9,9], "global": [T,21,79],
+          "actions": [T], "env_id": str}`` on success,
+        or ``None`` on failure.
+    """
+    env = make_env(env_id, None, cfg)
+    try:
+        (local, glb), _info = env.reset(seed=seed)
+        locals_list = [local]
+        globals_list = [glb]
+        actions_list: list[int] = []
+        for _ in range(max_steps):
+            action = env.get_oracle_action(env.last_raw_obs)
+            actions_list.append(action)
+            (local, glb), _reward, terminated, truncated, _info = env.step(
+                action
+            )
+            locals_list.append(local)
+            globals_list.append(glb)
+            if terminated or truncated:
+                break
+        # Trim trailing obs (one more obs than actions)
+        locals_arr = np.stack(locals_list[:-1], axis=0).astype(np.int16)
+        globals_arr = np.stack(globals_list[:-1], axis=0).astype(np.int16)
+        actions_arr = np.array(actions_list, dtype=np.int64)
+        return {
+            "local": locals_arr,
+            "global": globals_arr,
+            "actions": actions_arr,
+            "env_id": env_id,
+        }
+    except Exception:
+        logger.error(
+            f"Oracle trajectory failed for {env_id} seed={seed}",
+            exc_info=True,
+        )
+        return None
+    finally:
+        env.close()
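The BFS distance and the reward shaping described in the `step` docstring can be checked by hand on a toy map. The sketch below is an illustration under stated assumptions, not the wrapper itself: it works on a list of strings instead of the NLE `chars` array, and `bfs_distance` and `shaped_reward` are hypothetical stand-alone names. The constants (`+20.0`, `0.5`, `+0.05`, `-0.01`) and the unwalkable characters are taken from the diff above.

```python
from __future__ import annotations

from collections import deque

UNWALKABLE = {" ", "-", "|", "}"}
CARDINAL = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def bfs_distance(grid: list[str]) -> int | None:
    """Shortest 4-connected path length from '@' to '>', or None."""
    start = target = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "@":
                start = (r, c)
            elif ch == ">":
                target = (r, c)
    if start is None or target is None:
        return None

    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == target:
            return dist
        for dr, dc in CARDINAL:
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < len(grid)
                and 0 <= nc < len(grid[nr])
                and (nr, nc) not in visited
                and grid[nr][nc] not in UNWALKABLE
            ):
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def shaped_reward(
    raw: float,
    won: bool,
    prev_dist: int | None,
    curr_dist: int | None,
    new_tile: bool,
) -> float:
    """Combine the four shaping terms listed in the step docstring."""
    reward = float(raw)
    if won:
        reward += 20.0  # win bonus
    if prev_dist is not None and curr_dist is not None:
        reward += 0.5 * (prev_dist - curr_dist)  # BFS progress
    if new_tile:
        reward += 0.05  # exploration bonus
    return reward - 0.01  # step penalty


if __name__ == "__main__":
    room = [
        "-----",
        "|@..|",
        "|.-.|",
        "|..>|",
        "-----",
    ]
    print(bfs_distance(room))
    print(shaped_reward(0.0, False, prev_dist=4, curr_dist=3, new_tile=True))
```

One step that moves a tile closer to the staircase onto unvisited ground therefore earns `0.5 + 0.05 - 0.01`, which is why progress dominates the per-step signal until the win bonus arrives.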

src/models/__init__.py ADDED

File without changes

src/models/denoiser.py ADDED

	@@ -0,0 +1,415 @@

+"""Dual-stream denoising transformer for MiniHack.
+Ported from minihack_reference/src/model.py. Architecture follows the
+Craftax denoiser conventions (forward return format, obs-encoder pattern)
+while using the MiniHack dual-stream design (local CNN + gated global
+CNN + auxiliary goal head).
+"""
+from __future__ import annotations
+import copy
+import logging
+import shutil
+from types import SimpleNamespace
+import torch
+import torch.nn as nn
+from torch import Tensor
+logger = logging.getLogger(__name__)
+class LocalDiffusionPlannerWithGlobal(nn.Module):
+    """Dual-stream transformer for masked diffusion action planning.
+    Combines a local 9x9 glyph crop with a gated global 21x79 map
+    context. Produces action logits and an auxiliary staircase-coordinate
+    prediction.
+    Architecture:
+        Local stream:  Embedding(6000,64) -> CNN(64->32->64) -> Linear -> 1 token
+        Global stream: Embedding(6000,32) -> CNN(32->32->64) -> Pool(2,4)
+                       -> Linear -> 8 tokens, gated by sigmoid(learnable scalar)
+        Goal head:     mean(global_tokens) -> MLP -> [B,2] (before gate)
+        Action stream: Embedding(action_dim + 2, n_embd) + timestep + position
+        Transformer:   concat all -> TransformerEncoder -> last seq_len tokens -> head
+    Args:
+        cfg: Config namespace with ``action_dim``, ``n_embd``, ``n_head``,
+            ``n_layer``, ``n_global_tokens``, ``seq_len``,
+            ``global_gate_init``, ``num_diffusion_steps``.
+    """
+    def __init__(self, cfg: SimpleNamespace) -> None:
+        super().__init__()
+        action_dim = cfg.action_dim
+        n_embd = cfg.n_embd
+        n_head = cfg.n_head
+        n_layer = cfg.n_layer
+        n_global_tokens = cfg.n_global_tokens
+        seq_len = cfg.seq_len
+        assert n_embd % n_head == 0, (
+            f"n_embd ({n_embd}) must be divisible by n_head ({n_head})"
+        )
+        self.n_global_tokens = n_global_tokens
+        # ── Local stream: 9x9 crop -> 1 token ──────────────────────
+        self.embedding = nn.Embedding(6000, 64)
+        self.cnn = nn.Sequential(
+            nn.Conv2d(64, 32, 3, padding=1),
+            nn.GELU(),
+            nn.Conv2d(32, 64, 3, padding=1),
+            nn.GELU(),
+            nn.Flatten(),
+            nn.Linear(64 * 9 * 9, n_embd),
+        )
+        # ── Action stream ──────────────────────────────────────────
+        self.action_emb = nn.Embedding(action_dim + 2, n_embd)
+        self.timestep_emb = nn.Embedding(
+            cfg.num_diffusion_steps, n_embd,
+        )
+        self.pos_emb = nn.Embedding(seq_len, n_embd)
+        # ── Transformer ───────────────────────────────────────────
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=n_embd,
+            nhead=n_head,
+            dim_feedforward=n_embd * 4,
+            dropout=getattr(cfg, "dropout", 0.0),
+            activation="gelu",
+            norm_first=True,
+            batch_first=True,
+        )
+        self.transformer = nn.TransformerEncoder(
+            encoder_layer, num_layers=n_layer, enable_nested_tensor=False,
+        )
+        self.head = nn.Linear(n_embd, action_dim)
+        # ── Global stream: 21x79 map -> 8 tokens ──────────────────
+        self.global_embedding = nn.Embedding(6000, 32)
+        self.global_cnn = nn.Sequential(
+            nn.Conv2d(32, 32, 5, stride=2, padding=2),
+            nn.GELU(),
+            nn.Conv2d(32, 64, 3, stride=2, padding=1),
+            nn.GELU(),
+        )
+        self.global_pool = nn.AdaptiveAvgPool2d((2, 4))
+        self.global_proj = nn.Linear(64, n_embd)
+        self.global_gate = nn.Parameter(
+            torch.tensor(cfg.global_gate_init)
+        )
+        # ── Auxiliary goal head (before gate) ──────────────────────
+        self.goal_head = nn.Sequential(
+            nn.Linear(n_embd, 128),
+            nn.GELU(),
+            nn.Linear(128, 2),
+        )
+    def forward(
+        self,
+        local_obs: Tensor,
+        global_obs: Tensor,
+        action_seq: Tensor,
+        t_discrete: int | Tensor,
+    ) -> dict[str, Tensor]:
+        """Forward pass producing action logits and goal prediction.
+        Args:
+            local_obs: Local glyph crop. Shape ``[B, 9, 9]``, int.
+            global_obs: Full glyph map. Shape ``[B, 21, 79]``, int.
+            action_seq: Noisy action sequence. Shape ``[B, seq_len]``, int.
+            t_discrete: Discrete timestep index (scalar int or ``[B]``).
+        Returns:
+            Dict with keys:
+            - ``"actions"``: ``[B, seq_len, action_dim]`` logits.
+            - ``"goal_pred"``: ``[B, 2]`` normalised staircase coords.
+        """
+        B, Seq = action_seq.shape
+        device = local_obs.device
+        # Local stream -> [B, 1, n_embd]
+        x_local = self.embedding(local_obs)  # [B, 9, 9, 64]
+        x_local = x_local.permute(0, 3, 1, 2)  # [B, 64, 9, 9]
+        local_token = self.cnn(x_local).unsqueeze(1)  # [B, 1, n_embd]
+        # Global stream -> [B, 8, n_embd]
+        x_global = self.global_embedding(global_obs)  # [B, 21, 79, 32]
+        x_global = x_global.permute(0, 3, 1, 2)  # [B, 32, 21, 79]
+        gf = self.global_cnn(x_global)  # [B, 64, H', W']
+        gf = self.global_pool(gf)  # [B, 64, 2, 4]
+        global_tokens = gf.permute(0, 2, 3, 1)  # [B, 2, 4, 64]
+        global_tokens = global_tokens.reshape(
+            B, self.n_global_tokens, -1
+        )  # [B, 8, 64]
+        global_tokens = self.global_proj(global_tokens)  # [B, 8, n_embd]
+        # Aux goal head (before gate for direct gradient to CNN)
+        goal_pred = self.goal_head(
+            global_tokens.mean(dim=1)
+        )  # [B, 2]
+        # Apply gate
+        gate = torch.sigmoid(self.global_gate)
+        global_tokens = global_tokens * gate  # [B, 8, n_embd]
+        # Action stream -> [B, seq_len, n_embd]
+        positions = torch.arange(
+            Seq, device=device,
+        ).unsqueeze(0).expand(B, -1)  # [B, seq_len]
+        if isinstance(t_discrete, int):
+            t_tensor = torch.full(
+                (B,), t_discrete, dtype=torch.long, device=device,
+            )
+        else:
+            t_tensor = t_discrete.long().to(device)
+        seq_emb = (
+            self.action_emb(action_seq)
+            + self.timestep_emb(t_tensor).unsqueeze(1)
+            + self.pos_emb(positions)
+        )  # [B, seq_len, n_embd]
+        # Concatenate: [local(1), global(8), actions(seq_len)]
+        x = torch.cat(
+            [local_token, global_tokens, seq_emb], dim=1,
+        )  # [B, 1+8+seq_len, n_embd]
+        # Transformer
+        out = self.transformer(x)  # [B, 1+8+seq_len, n_embd]
+        # Take last seq_len tokens for action predictions
+        n_prefix = 1 + self.n_global_tokens
+        action_logits = self.head(
+            out[:, n_prefix:, :]
+        )  # [B, seq_len, action_dim]
+        return {"actions": action_logits, "goal_pred": goal_pred}
+class LocalDiffusionPlanner(nn.Module):
+    """Local-only ablation model (no global stream, no goal head).
+    Args:
+        cfg: Config namespace.
+    """
+    def __init__(self, cfg: SimpleNamespace) -> None:
+        super().__init__()
+        action_dim = cfg.action_dim
+        n_embd = cfg.n_embd
+        seq_len = cfg.seq_len
+        self.embedding = nn.Embedding(6000, 64)
+        self.cnn = nn.Sequential(
+            nn.Conv2d(64, 32, 3, padding=1),
+            nn.GELU(),
+            nn.Conv2d(32, 64, 3, padding=1),
+            nn.GELU(),
+            nn.Flatten(),
+            nn.Linear(64 * 9 * 9, n_embd),
+        )
+        self.action_emb = nn.Embedding(action_dim + 2, n_embd)
+        self.timestep_emb = nn.Embedding(cfg.num_diffusion_steps, n_embd)
+        self.pos_emb = nn.Embedding(seq_len, n_embd)
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=n_embd,
+            nhead=cfg.n_head,
+            dim_feedforward=n_embd * 4,
+            dropout=getattr(cfg, "dropout", 0.0),
+            activation="gelu",
+            norm_first=True,
+            batch_first=True,
+        )
+        self.transformer = nn.TransformerEncoder(
+            encoder_layer, num_layers=cfg.n_layer, enable_nested_tensor=False,
+        )
+        self.head = nn.Linear(n_embd, action_dim)
+    def forward(
+        self,
+        local_obs: Tensor,
+        global_obs: Tensor,
+        action_seq: Tensor,
+        t_discrete: int | Tensor,
+    ) -> dict[str, Tensor]:
+        """Forward pass (ignores global_obs).
+        Args:
+            local_obs: ``[B, 9, 9]`` int.
+            global_obs: ``[B, 21, 79]`` int (ignored).
+            action_seq: ``[B, seq_len]`` int.
+            t_discrete: Timestep index.
+        Returns:
+            Dict with ``"actions"`` key only (no goal_pred).
+        """
+        B, Seq = action_seq.shape
+        device = local_obs.device
+        x_state = self.embedding(local_obs).permute(0, 3, 1, 2)
+        state_emb = self.cnn(x_state).unsqueeze(1)  # [B, 1, n_embd]
+        positions = torch.arange(
+            Seq, device=device,
+        ).unsqueeze(0).expand(B, -1)
+        if isinstance(t_discrete, int):
+            t_tensor = torch.full(
+                (B,), t_discrete, dtype=torch.long, device=device,
+            )
+        else:
+            t_tensor = t_discrete.long().to(device)
+        seq_emb = (
+            self.action_emb(action_seq)
+            + self.timestep_emb(t_tensor).unsqueeze(1)
+            + self.pos_emb(positions)
+        )
+        x = torch.cat([state_emb, seq_emb], dim=1)
+        out = self.transformer(x)
+        return {"actions": self.head(out[:, 1:, :])}
+# ── Factory ──────────────────────────────────────────────────────────
+def make_model(cfg: SimpleNamespace) -> nn.Module:
+    """Instantiate the default MiniHack denoising model.
+    Args:
+        cfg: Config namespace.
+    Returns:
+        ``LocalDiffusionPlannerWithGlobal`` instance.
+    """
+    return LocalDiffusionPlannerWithGlobal(cfg)
+def _has_c_compiler() -> bool:
+    """Check whether a C compiler is reachable by Triton.
+    Checks the ``CC`` env var (set by conda activation scripts),
+    then falls back to ``cc`` and ``gcc`` on ``PATH``.
+    """
+    import os
+    cc_env = os.environ.get("CC")
+    if cc_env and shutil.which(cc_env):
+        return True
+    return shutil.which("cc") is not None or shutil.which("gcc") is not None
+def try_compile(model: nn.Module, cfg: SimpleNamespace) -> nn.Module:
+    """Wrap *model* with ``torch.compile`` if enabled and a C compiler exists.
+    Falls back to the uncompiled model when ``torch.compile`` is
+    unavailable or Triton cannot find a C compiler (common on managed
+    GPU nodes that lack ``gcc``/``cc``).
+    Args:
+        model: The raw (uncompiled) model.
+        cfg: Config namespace; reads ``torch_compile`` bool.
+    Returns:
+        Compiled model, or *model* unchanged on fallback.
+    """
+    if not getattr(cfg, "torch_compile", False):
+        return model
+    if not hasattr(torch, "compile"):
+        return model
+    if not _has_c_compiler():
+        logger.warning(
+            "torch.compile requested but no C compiler found "
+            "(CC env var, cc, gcc); falling back to eager mode"
+        )
+        return model
+    logger.info("Compiling model with torch.compile")
+    return torch.compile(model, mode="default")  # type: ignore[return-value]
+# ── EMA ──────────────────────────────────────────────────────────────
+class ModelEMA:
+    """Exponential moving average of model parameters.
+    Maintains a shadow copy of parameters updated as
+    ``theta_ema <- decay * theta_ema + (1 - decay) * theta``.
+    Args:
+        model: Source model.
+        decay: EMA decay factor (default 0.999).
+    """
+    def __init__(self, model: nn.Module, decay: float = 0.999) -> None:
+        self._decay = decay
+        self._shadow: dict[str, Tensor] = {}
+        for name, param in model.named_parameters():
+            self._shadow[name] = param.data.clone()
+    @torch.no_grad()
+    def update(self, model: nn.Module) -> None:
+        """Update shadow parameters from *model*.
+        Args:
+            model: Source model whose parameters are blended in.
+        """
+        for name, param in model.named_parameters():
+            self._shadow[name].mul_(self._decay).add_(
+                param.data, alpha=1.0 - self._decay,
+            )
+    def apply_to(self, model: nn.Module) -> None:
+        """Copy shadow parameters into *model* (for inference).
+        Args:
+            model: Target model to overwrite.
+        """
+        for name, param in model.named_parameters():
+            param.data.copy_(self._shadow[name])
+    def state_dict(self) -> dict[str, Tensor]:
+        """Return shadow parameter dict for serialisation.
+        Returns:
+            Dict mapping parameter names to EMA tensors.
+        """
+        return {k: v.clone() for k, v in self._shadow.items()}
+    def load_state_dict(self, sd: dict[str, Tensor]) -> None:
+        """Restore shadow parameters from *sd*.
+        Args:
+            sd: State dict from a prior ``state_dict()`` call.
+        """
+        for k, v in sd.items():
+            if k in self._shadow:
+                self._shadow[k].copy_(v)
+    def parameters(self):
+        """Iterate over shadow parameter tensors.
+        Yields:
+            EMA parameter tensors.
+        """
+        yield from self._shadow.values()
+    def make_eval_model(self, model: nn.Module) -> nn.Module:
+        """Return a deep copy of *model* with EMA weights applied.
+        Args:
+            model: Template model (architecture).
+        Returns:
+            New model with shadow parameters.
+        """
+        eval_model = copy.deepcopy(model)
+        self.apply_to(eval_model)
+        eval_model.eval()
+        return eval_model
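The `[B, 64, H', W']` comment in the global stream leaves the intermediate spatial size implicit. It can be worked out with the standard convolution output-size formula, `floor((n + 2p - k) / s) + 1` for dilation 1. The sketch below is plain arithmetic under assumptions: `conv_out`, `global_feature_shape` and `transformer_input_length` are hypothetical helper names, and `seq_len = 64` is an assumed config value rather than one confirmed by this diff.

```python
from __future__ import annotations


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis of a Conv2d with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def global_feature_shape(map_h: int = 21, map_w: int = 79) -> tuple[int, int]:
    """Spatial size after the two strided convs of the global stream."""
    # Conv2d(32, 32, 5, stride=2, padding=2)
    h = conv_out(map_h, kernel=5, stride=2, padding=2)
    w = conv_out(map_w, kernel=5, stride=2, padding=2)
    # Conv2d(32, 64, 3, stride=2, padding=1)
    h = conv_out(h, kernel=3, stride=2, padding=1)
    w = conv_out(w, kernel=3, stride=2, padding=1)
    return h, w


def transformer_input_length(seq_len: int, n_global_tokens: int = 8) -> int:
    """Token count fed to the encoder: 1 local + global tokens + actions."""
    return 1 + n_global_tokens + seq_len


if __name__ == "__main__":
    print(global_feature_shape())        # spatial size before AdaptiveAvgPool2d((2, 4))
    print(transformer_input_length(64))  # assumed seq_len of 64
```

For the default 21x79 map this gives a 6x20 feature map, which `AdaptiveAvgPool2d((2, 4))` then reduces to the 8 global tokens, so the action logits are read from the positions after the first 9 prefix tokens.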

src/planners/__init__.py ADDED

File without changes

src/planners/baselines.py ADDED

	@@ -0,0 +1,1247 @@

+"""SB3 + Decision Transformer baselines for the ReMDM diffusion planner.
+This module wraps standard discrete-action RL baselines (PPO, A2C, DQN,
+recurrent PPO) plus two imitation baselines (Behavioural Cloning and
+Decision Transformer) into the project's unified config + dispatch
+surface so they can be compared head-to-head against the DAgger /
+offline-BC diffusion planner on the same MiniHack environments.
+Entry point: :func:`run_baselines`.
+Hyperparameters live in ``configs/defaults.yaml`` under the
+``baselines_*`` namespace; the unified env-step training budget
+(``cfg.total_timesteps``) is shared with DAgger and offline BC.
+W&B logging routes through the project's :class:`Logger` (with the W&B
+project temporarily swapped to ``cfg.baselines_wandb_project``); the
+``WandbCallback`` from W&B's SB3 integration piggybacks on the active run
+and syncs its tensorboard scalars automatically. No code in this module
+calls ``wandb.log(...)`` directly.
+"""
+from __future__ import annotations
+import logging
+import os
+import random
+from pathlib import Path
+from types import SimpleNamespace
+from typing import Any
+import gymnasium as gym
+import numpy as np
+import orjson
+import torch
+import torch.nn as nn
+from sb3_contrib import RecurrentPPO
+from stable_baselines3 import A2C, DQN, PPO
+from stable_baselines3.common.callbacks import CallbackList, EvalCallback
+from stable_baselines3.common.monitor import Monitor
+from stable_baselines3.common.policies import ActorCriticPolicy
+from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
+from stable_baselines3.common.vec_env import SubprocVecEnv
+from torch.utils.data import DataLoader, Dataset
+from wandb.integration.sb3 import WandbCallback
+from src.envs.minihack_env import (
+    AdvancedObservationEnv,
+    collect_oracle_trajectory,
+)
+from src.planners.logging import Logger
+logger = logging.getLogger(__name__)
+SB3_RL_ALGOS: tuple[str, ...] = ("ppo", "a2c", "dqn", "ppo-rnn")
+IMITATION_ALGOS: tuple[str, ...] = ("bc", "dt")
+ALL_BASELINE_ALGOS: tuple[str, ...] = SB3_RL_ALGOS + IMITATION_ALGOS
+# =============================================================================
+# Observation wrapper for SB3 dict-policies
+# =============================================================================
+class _SB3MiniHackWrapper(gym.Wrapper):
+    """Reshape ``AdvancedObservationEnv`` tuple obs into an SB3 dict obs.
+    The underlying env returns ``(local_crop, global_map)`` with shapes
+    ``(crop, crop)`` and ``(map_h, map_w)``; SB3's ``MultiInputPolicy``
+    needs a ``Dict`` space with explicit channel dims. Also remaps
+    ``info["won"]`` -> ``info["is_success"]`` so SB3's success tracking
+    reports our win rate.
+    """
+    def __init__(self, env: AdvancedObservationEnv) -> None:
+        super().__init__(env)
+        local_h, local_w = env.observation_space.shape
+        cfg = env._cfg  # AdvancedObservationEnv stores cfg here
+        self.observation_space = gym.spaces.Dict(
+            {
+                "local": gym.spaces.Box(
+                    low=0, high=6000, shape=(1, local_h, local_w), dtype=np.int16,
+                ),
+                "global": gym.spaces.Box(
+                    low=0, high=6000, shape=(1, cfg.map_h, cfg.map_w), dtype=np.int16,
+                ),
+            }
+        )
+    def reset(self, **kwargs: Any) -> tuple[dict[str, np.ndarray], dict]:
+        (local, glob), info = self.env.reset(**kwargs)
+        return self._pack(local, glob), info
+    def step(
+        self, action: int,
+    ) -> tuple[dict[str, np.ndarray], float, bool, bool, dict]:
+        (local, glob), reward, terminated, truncated, info = self.env.step(action)
+        if "won" in info:
+            info["is_success"] = info["won"]
+        return self._pack(local, glob), reward, terminated, truncated, info
+    @staticmethod
+    def _pack(
+        local: np.ndarray, glob: np.ndarray,
+    ) -> dict[str, np.ndarray]:
+        return {
+            "local": np.expand_dims(local, axis=0),  # [1, crop, crop]
+            "global": np.expand_dims(glob, axis=0),  # [1, H, W]
+        }
+# =============================================================================
+# CNN feature extractor (shared by SB3 RL + BC)
+# =============================================================================
+class _MiniHackCNN(BaseFeaturesExtractor):
+    """Dual-stream CNN for the SB3 dict observation.
+    Local stream: ``Conv(1->16, 3) -> Conv(16->32, 3)``.
+    Global stream: ``Conv(1->16, 5, stride 2) -> Conv(16->32, 3, stride 2)``.
+    Both streams are flattened and concatenated, then projected to
+    ``features_dim`` via a single linear + ReLU.
+    """
+    def __init__(
+        self, observation_space: gym.spaces.Dict, features_dim: int = 256,
+    ) -> None:
+        super().__init__(observation_space, features_dim)
+        self.local_cnn = nn.Sequential(
+            nn.Conv2d(1, 16, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.Conv2d(16, 32, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.Flatten(),
+        )
+        self.global_cnn = nn.Sequential(
+            nn.Conv2d(1, 16, kernel_size=5, stride=2),
+            nn.ReLU(),
+            nn.Conv2d(16, 32, kernel_size=3, stride=2),
+            nn.ReLU(),
+            nn.Flatten(),
+        )
+        with torch.no_grad():
+            dummy_loc = torch.zeros(1, *observation_space["local"].shape)
+            dummy_glob = torch.zeros(1, *observation_space["global"].shape)
+            n_flatten = (
+                self.local_cnn(dummy_loc).shape[1]
+                + self.global_cnn(dummy_glob).shape[1]
+            )
+        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())
+    def forward(
+        self, observations: dict[str, torch.Tensor],
+    ) -> torch.Tensor:
+        loc = self.local_cnn(observations["local"].float())  # [B, F_l]
+        glob = self.global_cnn(observations["global"].float())  # [B, F_g]
+        return self.linear(torch.cat([loc, glob], dim=1))
+# =============================================================================
+# Decision Transformer
+# =============================================================================
+class _MiniHackStateEncoder(nn.Module):
+    """CNN encoder mapping a (local, global) obs pair to a state embedding."""
+    def __init__(
+        self,
+        embed_dim: int = 128,
+        crop_h: int = 9,
+        crop_w: int = 9,
+        map_h: int = 21,
+        map_w: int = 79,
+    ) -> None:
+        super().__init__()
+        self.local_cnn = nn.Sequential(
+            nn.Conv2d(1, 16, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.Conv2d(16, 32, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.Flatten(),
+        )
+        self.global_cnn = nn.Sequential(
+            nn.Conv2d(1, 16, kernel_size=5, stride=2),
+            nn.ReLU(),
+            nn.Conv2d(16, 32, kernel_size=3, stride=2),
+            nn.ReLU(),
+            nn.Flatten(),
+        )
+        with torch.no_grad():
+            dummy_loc = torch.zeros(1, 1, crop_h, crop_w)
+            dummy_glob = torch.zeros(1, 1, map_h, map_w)
+            local_flat = self.local_cnn(dummy_loc).shape[1]
+            global_flat = self.global_cnn(dummy_glob).shape[1]
+        self.proj = nn.Linear(local_flat + global_flat, embed_dim)
+    def forward(
+        self, local_obs: torch.Tensor, global_obs: torch.Tensor,
+    ) -> torch.Tensor:
+        # Accepts (B, T, 1, H, W) or (B, 1, H, W).
+        if local_obs.dim() == 5:
+            B, T = local_obs.shape[:2]
+            local_obs = local_obs.view(B * T, *local_obs.shape[2:])
+            global_obs = global_obs.view(B * T, *global_obs.shape[2:])
+            reshape = True
+        else:
+            B, T = local_obs.shape[0], 1
+            reshape = False
+        loc_feat = self.local_cnn(local_obs.float())  # [B*T, F_l]
+        glob_feat = self.global_cnn(global_obs.float())  # [B*T, F_g]
+        out = self.proj(torch.cat([loc_feat, glob_feat], dim=-1))  # [B*T, D]
+        if reshape:
+            out = out.view(B, T, -1)
+        return out
+class _DecisionTransformer(nn.Module):
+    """Causal Decision Transformer over interleaved (R, s, a) tokens."""
+    def __init__(
+        self,
+        n_actions: int,
+        embed_dim: int = 128,
+        n_heads: int = 4,
+        n_layers: int = 3,
+        context_len: int = 30,
+        max_ep_len: int = 500,
+        dropout: float = 0.1,
+        crop_h: int = 9,
+        crop_w: int = 9,
+        map_h: int = 21,
+        map_w: int = 79,
+    ) -> None:
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.context_len = context_len
+        self.n_actions = n_actions
+        self.max_ep_len = max_ep_len
+        self.state_encoder = _MiniHackStateEncoder(
+            embed_dim, crop_h, crop_w, map_h, map_w,
+        )
+        self.action_embed = nn.Embedding(n_actions + 1, embed_dim)  # +1 for pad
+        self.return_embed = nn.Linear(1, embed_dim)
+        self.pos_embed = nn.Embedding(max_ep_len, embed_dim)
+        self.token_type_embed = nn.Embedding(3, embed_dim)
+        self.embed_ln = nn.LayerNorm(embed_dim)
+        self.dropout = nn.Dropout(dropout)
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=embed_dim,
+            nhead=n_heads,
+            dim_feedforward=embed_dim * 4,
+            dropout=dropout,
+            activation="gelu",
+            batch_first=True,
+        )
+        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
+        self.action_head = nn.Linear(embed_dim, n_actions)
+        self.apply(self._init_weights)
+    @staticmethod
+    def _init_weights(module: nn.Module) -> None:
+        if isinstance(module, nn.Linear):
+            nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            nn.init.normal_(module.weight, mean=0.0, std=0.02)
+        elif isinstance(module, nn.LayerNorm):
+            nn.init.ones_(module.weight)
+            nn.init.zeros_(module.bias)
+    def forward(
+        self,
+        returns_to_go: torch.Tensor,  # [B, T, 1]
+        local_obs: torch.Tensor,      # [B, T, 1, H_l, W_l]
+        global_obs: torch.Tensor,     # [B, T, 1, H_g, W_g]
+        actions: torch.Tensor,        # [B, T]
+        timesteps: torch.Tensor,      # [B, T]
+        attention_mask: torch.Tensor | None = None,  # [B, T]
+    ) -> torch.Tensor:
+        B, T = returns_to_go.shape[:2]
+        device = returns_to_go.device
+        rtg_embed = self.return_embed(returns_to_go)  # [B, T, D]
+        state_embed = self.state_encoder(local_obs, global_obs)  # [B, T, D]
+        action_embed = self.action_embed(actions)  # [B, T, D]
+        pos_embed = self.pos_embed(timesteps)  # [B, T, D]
+        rtg_embed = rtg_embed + pos_embed + self.token_type_embed.weight[0]
+        state_embed = state_embed + pos_embed + self.token_type_embed.weight[1]
+        action_embed = action_embed + pos_embed + self.token_type_embed.weight[2]
+        # Interleave (R_0, s_0, a_0, R_1, s_1, a_1, ...) -> [B, 3T, D]
+        stacked = torch.stack([rtg_embed, state_embed, action_embed], dim=2)
+        stacked = stacked.view(B, 3 * T, self.embed_dim)
+        stacked = self.dropout(self.embed_ln(stacked))
+        seq_len = 3 * T
+        causal_mask = torch.triu(
+            torch.ones(seq_len, seq_len, device=device), diagonal=1,
+        ).bool()
+        key_padding_mask = None
+        if attention_mask is not None:
+            expanded = attention_mask.unsqueeze(-1).repeat(1, 1, 3).view(B, 3 * T)
+            key_padding_mask = expanded == 0
+        hidden = self.transformer(
+            stacked, mask=causal_mask, src_key_padding_mask=key_padding_mask,
+        )
+        # State token positions are 1, 4, 7, ... -> stride 3.
+        state_hidden = hidden[:, 1::3, :]  # [B, T, D]
+        return self.action_head(state_hidden)  # [B, T, A]
+    @torch.no_grad()
+    def get_action(
+        self,
+        returns_to_go: torch.Tensor,
+        local_obs: torch.Tensor,
+        global_obs: torch.Tensor,
+        actions: torch.Tensor,
+        timesteps: torch.Tensor,
+    ) -> torch.Tensor:
+        self.eval()
+        logits = self.forward(
+            returns_to_go, local_obs, global_obs, actions, timesteps,
+        )
+        return logits[:, -1, :].argmax(dim=-1)
+class _DTDataset(Dataset):
+    """Sliding-window dataset over Decision Transformer trajectories."""
+    def __init__(
+        self,
+        trajectories: list[dict[str, np.ndarray]],
+        context_len: int,
+        max_ep_len: int,
+        n_actions: int,
+    ) -> None:
+        self.trajectories = trajectories
+        self.context_len = context_len
+        self.max_ep_len = max_ep_len
+        self.n_actions = n_actions
+        self.indices: list[tuple[int, int]] = [
+            (traj_idx, start)
+            for traj_idx, traj in enumerate(trajectories)
+            for start in range(len(traj["actions"]))
+        ]
+    def __len__(self) -> int:
+        return len(self.indices)
+    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
+        traj_idx, start = self.indices[idx]
+        traj = self.trajectories[traj_idx]
+        traj_len = len(traj["actions"])
+        end = min(start + self.context_len, traj_len)
+        actual_len = end - start
+        local = traj["local"][start:end].copy()
+        glob = traj["global"][start:end].copy()
+        actions = traj["actions"][start:end].copy()
+        rtg = traj["returns_to_go"][start:end].copy()
+        timesteps = np.arange(start, end)
+        # Clamp to valid embedding ranges.
+        timesteps = np.clip(timesteps, 0, self.max_ep_len - 1)
+        actions = np.clip(actions, 0, self.n_actions - 1)
+        pad_len = self.context_len - actual_len
+        if pad_len > 0:
+            local = np.pad(
+                local, ((0, pad_len), (0, 0), (0, 0), (0, 0)), mode="constant",
+            )
+            glob = np.pad(
+                glob, ((0, pad_len), (0, 0), (0, 0), (0, 0)), mode="constant",
+            )
+            actions = np.pad(actions, (0, pad_len), mode="constant")
+            rtg = np.pad(rtg, (0, pad_len), mode="constant")
+            timesteps = np.pad(timesteps, (0, pad_len), mode="constant")
+        attention_mask = np.zeros(self.context_len, dtype=np.float32)
+        attention_mask[:actual_len] = 1.0
+        return {
+            "local": torch.tensor(local, dtype=torch.float32),
+            "global": torch.tensor(glob, dtype=torch.float32),
+            "actions": torch.tensor(actions, dtype=torch.long),
+            "returns_to_go": torch.tensor(rtg, dtype=torch.float32).unsqueeze(-1),
+            "timesteps": torch.tensor(timesteps, dtype=torch.long),
+            "attention_mask": torch.tensor(attention_mask, dtype=torch.float32),
+        }
+# =============================================================================
+# SB3 callbacks + env factory
+# =============================================================================
+class _PrefixedEvalCallback(EvalCallback):
+    """``EvalCallback`` that records mean_reward / avg_steps / win_rate
+    under a unique per-environment prefix.
+    SB3 truncates metric names at 36 chars, so keys built from long
+    MiniHack env IDs can collide; a short prefix (the env ID with
+    ``MiniHack-`` / ``-v0`` stripped) keeps each key unique.
+    """
+    def __init__(
+        self, eval_env: SubprocVecEnv, prefix: str, **kwargs: Any,
+    ) -> None:
+        super().__init__(eval_env, **kwargs)
+        self.prefix = prefix
+    def _on_step(self) -> bool:
+        cont = super()._on_step()
+        if self.evaluations_results:
+            self.logger.record(
+                f"{self.prefix}/mean_reward", float(np.mean(self.evaluations_results[-1])),
+            )
+            self.logger.record(
+                f"{self.prefix}/avg_steps", float(np.mean(self.evaluations_length[-1])),
+            )
+        if self.evaluations_successes:
+            self.logger.record(
+                f"{self.prefix}/win_rate",
+                float(np.mean(self.evaluations_successes[-1])),
+            )
+        return cont
+def _make_sb3_env_fn(env_id: str, cfg: SimpleNamespace, log_dir: str):
+    """Return a picklable thunk that builds one wrapped+monitored env."""
+    def _init() -> Monitor:
+        os.makedirs(log_dir, exist_ok=True)
+        env = AdvancedObservationEnv(env_id, des_file=None, cfg=cfg)
+        env = _SB3MiniHackWrapper(env)
+        return Monitor(env, log_dir)
+    return _init
+# =============================================================================
+# Helpers
+# =============================================================================
+def _short(env_id: str) -> str:
+    return env_id.replace("MiniHack-", "").replace("-v0", "")
+def _eval_episodes_per_env(cfg: SimpleNamespace) -> int:
+    override = getattr(cfg, "baselines_eval_episodes_per_env", None)
+    if override is not None:
+        return int(override)
+    return int(cfg.eval_episodes_per_env)
+def _seed_everything(seed: int) -> None:
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed(seed)
+        torch.cuda.manual_seed_all(seed)
+def _resolve_output_dir(cfg: SimpleNamespace, override: str | None) -> Path:
+    if override:
+        out = Path(override)
+    else:
+        out = Path(cfg.baselines_output_dir)
+    out.mkdir(parents=True, exist_ok=True)
+    return out
+def _init_baseline_logger(
+    cfg: SimpleNamespace, run_name: str,
+) -> Logger:
+    """Init the project Logger with W&B project swapped to baselines.
+    Mutates ``cfg.wandb_project`` / ``cfg.wandb_run_name`` /
+    ``cfg.wandb_resume_id`` in place so the existing Logger constructor
+    picks them up. The originals are deliberately not restored: each
+    baseline seed reuses this helper, and main.py exits after
+    ``run_baselines`` returns.
+    """
+    project_override = getattr(cfg, "baselines_wandb_project", None)
+    if project_override:
+        cfg.wandb_project = project_override
+    cfg.wandb_run_name = run_name
+    cfg.wandb_resume_id = None
+    return Logger(cfg)
+# =============================================================================
+# BC training
+# =============================================================================
+def _collect_bc_dataset(
+    cfg: SimpleNamespace,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """Roll out the BFS oracle on each ID env and stack flat (s, a) pairs."""
+    n_per_env = int(cfg.baselines_bc_oracle_episodes_per_env)
+    locals_, globals_, actions_ = [], [], []
+    for env_id in cfg.id_envs:
+        for traj_seed in range(n_per_env):
+            traj = collect_oracle_trajectory(env_id, traj_seed, cfg)
+            if traj is None:
+                continue
+            # (T, H, W) -> (T, 1, H, W)
+            locals_.append(np.expand_dims(traj["local"], axis=1))
+            globals_.append(np.expand_dims(traj["global"], axis=1))
+            actions_.append(traj["actions"])
+    if not actions_:
+        raise RuntimeError("BC oracle collection produced zero trajectories")
+    return (
+        np.concatenate(locals_, axis=0),
+        np.concatenate(globals_, axis=0),
+        np.concatenate(actions_, axis=0),
+    )
+class _BCDataset(Dataset):
+    def __init__(
+        self, loc: np.ndarray, glob: np.ndarray, acts: np.ndarray,
+    ) -> None:
+        self.loc = torch.tensor(loc, dtype=torch.float32)
+        self.glob = torch.tensor(glob, dtype=torch.float32)
+        self.acts = torch.tensor(acts, dtype=torch.int64)
+    def __len__(self) -> int:
+        return len(self.acts)
+    def __getitem__(
+        self, idx: int,
+    ) -> dict[str, dict[str, torch.Tensor] | torch.Tensor]:
+        return {
+            "obs": {"local": self.loc[idx], "global": self.glob[idx]},
+            "acts": self.acts[idx],
+        }
+def _eval_sb3_policy_manually(
+    policy: ActorCriticPolicy,
+    env_id: str,
+    cfg: SimpleNamespace,
+    log_dir: str,
+    n_episodes: int,
+) -> tuple[float, float]:
+    """Run ``policy.predict`` on a Monitor-wrapped vec env and return
+    (win_rate, avg_steps)."""
+    eval_env = SubprocVecEnv([_make_sb3_env_fn(env_id, cfg, log_dir)])
+    try:
+        obs = eval_env.reset()
+        wins = 0
+        total_steps = 0
+        completed = 0
+        while completed < n_episodes:
+            action, _ = policy.predict(obs, deterministic=True)
+            obs, _rewards, dones, infos = eval_env.step(action)
+            if dones[0]:
+                completed += 1
+                if infos[0].get("won", False):
+                    wins += 1
+                total_steps += infos[0]["episode"]["l"]
+    finally:
+        eval_env.close()
+    return wins / n_episodes, total_steps / n_episodes
+def _train_bc(
+    cfg: SimpleNamespace,
+    train_env: SubprocVecEnv,
+    log: Logger,
+    log_dir: str,
+    seed: int,
+) -> tuple[ActorCriticPolicy, dict[str, float]]:
+    """Train a Behavioural Cloning baseline. Returns (policy, seed_metrics)."""
+    device = torch.device(cfg.device)
+    n_eval = _eval_episodes_per_env(cfg)
+    logger.info("Collecting oracle demonstrations for BC...")
+    loc_arr, glob_arr, acts_arr = _collect_bc_dataset(cfg)
+    logger.info("BC dataset: %d transitions", len(acts_arr))
+    bc_loader = DataLoader(
+        _BCDataset(loc_arr, glob_arr, acts_arr),
+        batch_size=int(cfg.baselines_bc_batch_size),
+        shuffle=True,
+        num_workers=4,
+        pin_memory=torch.cuda.is_available(),
+    )
+    lr = float(cfg.baselines_bc_lr)
+    policy = ActorCriticPolicy(
+        observation_space=train_env.observation_space,
+        action_space=train_env.action_space,
+        lr_schedule=lambda _progress: lr,
+        features_extractor_class=_MiniHackCNN,
+        features_extractor_kwargs={"features_dim": 256},
+    ).to(device)
+    n_epochs = int(cfg.baselines_bc_epochs)
+    optimizer = torch.optim.AdamW(
+        policy.parameters(),
+        lr=lr,
+        weight_decay=float(cfg.weight_decay),
+    )
+    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+        optimizer, T_max=n_epochs,
+    )
+    policy.train()
+    for epoch in range(n_epochs):
+        total_loss = 0.0
+        for batch in bc_loader:
+            obs = {k: v.to(policy.device) for k, v in batch["obs"].items()}
+            acts = batch["acts"].to(policy.device)
+            _values, log_prob, _entropy = policy.evaluate_actions(obs, acts)
+            loss = -log_prob.mean()
+            optimizer.zero_grad()
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
+            optimizer.step()
+            total_loss += loss.item()
+        scheduler.step()
+        avg_loss = total_loss / max(1, len(bc_loader))
+        current_lr = scheduler.get_last_lr()[0]
+        log.log(
+            {
+                "train/bc_loss": avg_loss,
+                "train/lr": current_lr,
+                "train/epoch": epoch + 1,
+            },
+            step=epoch + 1,
+        )
+        logger.info(
+            "BC epoch %02d/%02d | loss=%.4f | lr=%.2e",
+            epoch + 1, n_epochs, avg_loss, current_lr,
+        )
+    seed_metrics: dict[str, float] = {}
+    for split, env_list in (("ID", cfg.id_envs), ("OOD", cfg.ood_envs)):
+        logger.info("--- BC %s evaluation (seed=%d) ---", split, seed)
+        for env_id in env_list:
+            short = _short(env_id)
+            win_rate, avg_steps = _eval_sb3_policy_manually(
+                policy,
+                env_id,
+                cfg,
+                f"{log_dir}/eval_{split.lower()}/{env_id}",
+                n_eval,
+            )
+            seed_metrics[f"{split}/{short}/win_rate"] = win_rate * 100
+            seed_metrics[f"{split}/{short}/avg_steps"] = avg_steps
+            logger.info(
+                "%-30s | win_rate=%5.1f%% | avg_steps=%5.1f",
+                short, win_rate * 100, avg_steps,
+            )
+    log.log(seed_metrics, step=n_epochs + 1)
+    return policy, seed_metrics
+# =============================================================================
+# Decision Transformer training
+# =============================================================================
+def _collect_dt_trajectories(
+    cfg: SimpleNamespace,
+) -> list[dict[str, np.ndarray]]:
+    """Collect oracle trajectories with sparse reward + return-to-go labels."""
+    n_per_env = int(cfg.baselines_dt_oracle_episodes_per_env)
+    trajectories: list[dict[str, np.ndarray]] = []
+    for env_id in cfg.id_envs:
+        for traj_seed in range(n_per_env):
+            traj = collect_oracle_trajectory(env_id, traj_seed, cfg)
+            if traj is None:
+                continue
+            T = len(traj["actions"])
+            rewards = np.zeros(T, dtype=np.float32)
+            rewards[-1] = 1.0  # sparse goal reward
+            rtg = np.zeros(T, dtype=np.float32)
+            rtg[-1] = rewards[-1]
+            for t in range(T - 2, -1, -1):
+                rtg[t] = rewards[t] + rtg[t + 1]
+            trajectories.append(
+                {
+                    "local": np.expand_dims(traj["local"], axis=1),
+                    "global": np.expand_dims(traj["global"], axis=1),
+                    "actions": traj["actions"],
+                    "rewards": rewards,
+                    "returns_to_go": rtg,
+                }
+            )
+    return trajectories
+def _eval_dt(
+    model: _DecisionTransformer,
+    env_id: str,
+    cfg: SimpleNamespace,
+    target_return: float,
+    n_episodes: int,
+    max_ep_len: int,
+    eval_max_steps: int,
+    context_len: int,
+) -> tuple[float, float]:
+    """Roll out a trained Decision Transformer with target-return conditioning."""
+    device = torch.device(cfg.device)
+    env = AdvancedObservationEnv(env_id, des_file=None, cfg=cfg)
+    env = _SB3MiniHackWrapper(env)
+    model.eval()
+    wins = 0
+    total_steps = 0
+    try:
+        for _ep in range(n_episodes):
+            obs, _ = env.reset()
+            done = False
+            local_hist: list[np.ndarray] = []
+            global_hist: list[np.ndarray] = []
+            action_hist: list[int] = []
+            rtg_hist: list[float] = []
+            ts_hist: list[int] = []
+            current_rtg = float(target_return)
+            t = 0
+            info: dict = {}
+            while not done and t < eval_max_steps:
+                local_hist.append(obs["local"])
+                global_hist.append(obs["global"])
+                rtg_hist.append(current_rtg)
+                ts_hist.append(min(t, max_ep_len - 1))
+                ctx = min(len(local_hist), context_len)
+                local_in = np.stack(local_hist[-ctx:], axis=0)
+                global_in = np.stack(global_hist[-ctx:], axis=0)
+                rtg_in = np.array(rtg_hist[-ctx:], dtype=np.float32)
+                ts_in = np.array(ts_hist[-ctx:], dtype=np.int64)
+                # Align a_i with s_i as in training. The current step's
+                # action is not known yet, so its slot holds the 0 placeholder
+                # (the causal mask hides it from the state token anyway).
+                act_in = np.array((action_hist + [0])[-ctx:], dtype=np.int64)
+                local_t = torch.tensor(local_in, dtype=torch.float32).unsqueeze(0).to(device)
+                global_t = torch.tensor(global_in, dtype=torch.float32).unsqueeze(0).to(device)
+                rtg_t = torch.tensor(rtg_in, dtype=torch.float32).unsqueeze(0).unsqueeze(-1).to(device)
+                act_t = torch.tensor(act_in, dtype=torch.long).unsqueeze(0).to(device)
+                ts_t = torch.tensor(ts_in, dtype=torch.long).unsqueeze(0).to(device)
+                with torch.no_grad():
+                    action = int(
+                        model.get_action(rtg_t, local_t, global_t, act_t, ts_t).item()
+                    )
+                action = max(0, min(action, int(cfg.action_dim) - 1))
+                action_hist.append(action)
+                obs, reward, terminated, truncated, info = env.step(action)
+                done = terminated or truncated
+                current_rtg -= float(reward)
+                t += 1
+            if info.get("won", False):
+                wins += 1
+            total_steps += t
+    finally:
+        env.close()
+    return wins / n_episodes, total_steps / n_episodes
+def _train_dt(
+    cfg: SimpleNamespace,
+    log: Logger,
+    log_dir: str,
+    seed: int,
+) -> tuple[_DecisionTransformer, dict[str, float]]:
+    """Train a Decision Transformer baseline. Returns (model, seed_metrics)."""
+    device = torch.device(cfg.device)
+    context_len = int(cfg.baselines_dt_context_len)
+    max_ep_len = int(cfg.baselines_dt_max_ep_len)
+    eval_max_steps = int(cfg.baselines_dt_eval_max_steps)
+    n_eval = _eval_episodes_per_env(cfg)
+    n_epochs = int(cfg.baselines_dt_epochs)
+    logger.info("Collecting oracle demonstrations for DT...")
+    trajectories = _collect_dt_trajectories(cfg)
+    if not trajectories:
+        raise RuntimeError("DT oracle collection produced zero trajectories")
+    traj_lengths = [len(t["actions"]) for t in trajectories]
+    logger.info(
+        "DT dataset: %d trajectories, %d transitions (len: min=%d max=%d mean=%.1f)",
+        len(trajectories),
+        sum(traj_lengths),
+        min(traj_lengths),
+        max(traj_lengths),
+        float(np.mean(traj_lengths)),
+    )
+    if max(traj_lengths) > max_ep_len:
+        logger.warning(
+            "Longest oracle trajectory (%d) exceeds baselines_dt_max_ep_len (%d); "
+            "positions will be clamped.",
+            max(traj_lengths),
+            max_ep_len,
+        )
+    target_return = float(max(t["returns_to_go"][0] for t in trajectories))
+    dataset = _DTDataset(
+        trajectories,
+        context_len=context_len,
+        max_ep_len=max_ep_len,
+        n_actions=int(cfg.action_dim),
+    )
+    loader = DataLoader(
+        dataset,
+        batch_size=int(cfg.baselines_dt_batch_size),
+        shuffle=True,
+        num_workers=4,
+        pin_memory=torch.cuda.is_available(),
+    )
+    model = _DecisionTransformer(
+        n_actions=int(cfg.action_dim),
+        embed_dim=int(cfg.baselines_dt_embed_dim),
+        n_heads=int(cfg.baselines_dt_n_heads),
+        n_layers=int(cfg.baselines_dt_n_layers),
+        context_len=context_len,
+        max_ep_len=max_ep_len,
+        crop_h=int(cfg.crop_size),
+        crop_w=int(cfg.crop_size),
+        map_h=int(cfg.map_h),
+        map_w=int(cfg.map_w),
+    ).to(device)
+    n_params = sum(p.numel() for p in model.parameters())
+    logger.info("DT parameters: %d", n_params)
+    optimizer = torch.optim.AdamW(
+        model.parameters(),
+        lr=float(cfg.baselines_dt_lr),
+        weight_decay=float(cfg.weight_decay),
+    )
+    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+        optimizer, T_max=n_epochs,
+    )
+    for epoch in range(n_epochs):
+        model.train()
+        total_loss = 0.0
+        n_batches = 0
+        for batch in loader:
+            local = batch["local"].to(device)
+            glob = batch["global"].to(device)
+            actions = batch["actions"].to(device)
+            rtg = batch["returns_to_go"].to(device)
+            timesteps = batch["timesteps"].to(device)
+            attention_mask = batch["attention_mask"].to(device)
+            logits = model(rtg, local, glob, actions, timesteps, attention_mask)
+            logits_flat = logits.reshape(-1, int(cfg.action_dim))
+            targets_flat = actions.reshape(-1)
+            mask_flat = attention_mask.reshape(-1)
+            ce = nn.functional.cross_entropy(
+                logits_flat, targets_flat, reduction="none",
+            )
+            loss = (ce * mask_flat).sum() / mask_flat.sum().clamp(min=1.0)
+            optimizer.zero_grad()
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+            optimizer.step()
+            total_loss += loss.item()
+            n_batches += 1
+        scheduler.step()
+        avg_loss = total_loss / max(1, n_batches)
+        log.log(
+            {
+                "train/dt_loss": avg_loss,
+                "train/lr": float(scheduler.get_last_lr()[0]),
+                "train/epoch": epoch + 1,
+            },
+            step=epoch + 1,
+        )
+        logger.info(
+            "DT epoch %02d/%02d | loss=%.4f | lr=%.2e",
+            epoch + 1,
+            n_epochs,
+            avg_loss,
+            float(scheduler.get_last_lr()[0]),
+        )
+    seed_metrics: dict[str, float] = {}
+    logger.info("DT eval target return = %.2f", target_return)
+    for split, env_list in (("ID", cfg.id_envs), ("OOD", cfg.ood_envs)):
+        logger.info("--- DT %s evaluation (seed=%d) ---", split, seed)
+        for env_id in env_list:
+            short = _short(env_id)
+            win_rate, avg_steps = _eval_dt(
+                model,
+                env_id,
+                cfg,
+                target_return=target_return,
+                n_episodes=n_eval,
+                max_ep_len=max_ep_len,
+                eval_max_steps=eval_max_steps,
+                context_len=context_len,
+            )
+            seed_metrics[f"{split}/{short}/win_rate"] = win_rate * 100
+            seed_metrics[f"{split}/{short}/avg_steps"] = avg_steps
+            logger.info(
+                "%-30s | win_rate=%5.1f%% | avg_steps=%5.1f",
+                short, win_rate * 100, avg_steps,
+            )
+    log.log(seed_metrics, step=n_epochs + 1)
+    return model, seed_metrics
+# =============================================================================
+# SB3 RL training
+# =============================================================================
+def _build_sb3_model(
+    algo: str,
+    train_env: SubprocVecEnv,
+    cfg: SimpleNamespace,
+    seed: int,
+    tb_log_dir: str,
+):
+    """Construct one of {ppo, a2c, dqn, ppo-rnn} with the MiniHack CNN."""
+    policy_kwargs = {
+        "features_extractor_class": _MiniHackCNN,
+        "features_extractor_kwargs": {"features_dim": 256},
+    }
+    if algo == "ppo":
+        return PPO(
+            "MultiInputPolicy", train_env, policy_kwargs=policy_kwargs,
+            verbose=1, tensorboard_log=tb_log_dir, seed=seed,
+        )
+    if algo == "ppo-rnn":
+        return RecurrentPPO(
+            "MultiInputLstmPolicy", train_env, policy_kwargs=policy_kwargs,
+            verbose=1, tensorboard_log=tb_log_dir, seed=seed,
+        )
+    if algo == "a2c":
+        return A2C(
+            "MultiInputPolicy", train_env, policy_kwargs=policy_kwargs,
+            verbose=1, tensorboard_log=tb_log_dir, seed=seed,
+        )
+    if algo == "dqn":
+        return DQN(
+            "MultiInputPolicy", train_env, policy_kwargs=policy_kwargs,
+            verbose=1, tensorboard_log=tb_log_dir, seed=seed,
+            buffer_size=int(cfg.baselines_dqn_buffer_size),
+        )
+    raise ValueError(f"Unknown SB3 algo: {algo!r}")
+def _build_sb3_callbacks(
+    cfg: SimpleNamespace,
+    train_env: SubprocVecEnv,
+    log_dir: str,
+    model_dir: str,
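+) -> CallbackList:
+    """Build the W&B callback plus one eval callback per ID and OOD env.
+    Only ID environments save a best-model checkpoint; OOD environments
+    are evaluated without checkpointing.
+    """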
+    callbacks: list = [WandbCallback(model_save_path=model_dir)]
+    n_eval = _eval_episodes_per_env(cfg)
+    eval_freq = max(
+        1, int(cfg.baselines_eval_freq_env_steps) // train_env.num_envs,
+    )
+    for env_id in cfg.id_envs:
+        short = _short(env_id)
+        eval_env = SubprocVecEnv(
+            [_make_sb3_env_fn(env_id, cfg, f"{log_dir}/eval_id/{env_id}")]
+        )
+        callbacks.append(
+            _PrefixedEvalCallback(
+                eval_env,
+                prefix=f"ID/{short}",
+                best_model_save_path=f"{model_dir}/best_{env_id}/",
+                log_path=f"{log_dir}/eval_id/{env_id}/",
+                eval_freq=eval_freq,
+                n_eval_episodes=n_eval,
+                deterministic=True,
+            )
+        )
+    for env_id in cfg.ood_envs:
+        short = _short(env_id)
+        eval_env = SubprocVecEnv(
+            [_make_sb3_env_fn(env_id, cfg, f"{log_dir}/eval_ood/{env_id}")]
+        )
+        callbacks.append(
+            _PrefixedEvalCallback(
+                eval_env,
+                prefix=f"OOD/{short}",
+                best_model_save_path=None,
+                log_path=f"{log_dir}/eval_ood/{env_id}/",
+                eval_freq=eval_freq,
+                n_eval_episodes=n_eval,
+                deterministic=True,
+            )
+        )
+    return CallbackList(callbacks)
+# =============================================================================
+# Aggregation
+# =============================================================================
+def _aggregate(
+    all_seed_results: list[dict[str, Any]],
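+) -> dict[str, dict[str, float | list[float]]]:
+    """Compute mean/std across seeds for each metric key of the first seed.
+    Keys are taken from the first seed's results; a seed that lacks a key
+    is skipped for that key. ``std`` is the population standard deviation
+    (NumPy's default ``ddof=0``).
+    """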
+    if not all_seed_results:
+        return {}
+    metric_keys = [k for k in all_seed_results[0].keys() if k != "seed"]
+    agg: dict[str, dict[str, float | list[float]]] = {}
+    for key in metric_keys:
+        values = [r[key] for r in all_seed_results if key in r]
+        if values:
+            agg[key] = {
+                "mean": float(np.mean(values)),
+                "std": float(np.std(values)),
+                "values": [float(v) for v in values],
+            }
+    return agg
+def _print_aggregated(seeds: list[int], agg: dict[str, dict[str, Any]]) -> None:
+    if not agg:
+        logger.info("No per-environment metrics to aggregate (RL eval is callback-driven)")
+        return
+    logger.info("Aggregated results across %d seeds: %s", len(seeds), seeds)
+    for split in ("ID", "OOD"):
+        env_metrics: dict[str, dict[str, dict[str, Any]]] = {}
+        for key, stats in agg.items():
+            if not key.startswith(f"{split}/"):
+                continue
+            _split, env_name, metric_name = key.split("/", 2)
+            env_metrics.setdefault(env_name, {})[metric_name] = stats
+        if not env_metrics:
+            continue
+        logger.info("--- %s environments ---", split)
+        for env_name, metrics in sorted(env_metrics.items()):
+            wr = metrics.get("win_rate", {})
+            steps = metrics.get("avg_steps", {})
+            logger.info(
+                "%-30s | win_rate=%5.1f%% +/- %4.1f | avg_steps=%5.1f +/- %4.1f",
+                env_name,
+                wr.get("mean", 0.0),
+                wr.get("std", 0.0),
+                steps.get("mean", 0.0),
+                steps.get("std", 0.0),
+            )
+def _save_aggregated(
+    out_path: Path,
+    algo: str,
+    seeds: list[int],
+    all_seed_results: list[dict[str, Any]],
+    agg: dict[str, dict[str, Any]],
+) -> None:
+    """Write per-seed results and mean/std aggregates to ``out_path`` as JSON."""
+    payload = {
+        "algorithm": algo,
+        "seeds": seeds,
+        "n_seeds": len(seeds),
+        "per_seed_results": all_seed_results,
+        "aggregated": {
+            k: {"mean": v["mean"], "std": v["std"]} for k, v in agg.items()
+        },
+    }
+    out_path.write_bytes(orjson.dumps(payload, option=orjson.OPT_INDENT_2))
+    logger.info("Aggregated results written to %s", out_path)
+# =============================================================================
+# Public entry point
+# =============================================================================
+def run_baselines(
+    cfg: SimpleNamespace,
+    algo: str,
+    seeds: list[int] | None = None,
+    output_path: str | None = None,
+) -> None:
+    """Train and evaluate one baseline algorithm across one or more seeds.
+    Args:
+        cfg: Project config namespace (must contain ``baselines_*`` keys).
+        algo: One of ``ppo``, ``a2c``, ``dqn``, ``ppo-rnn``, ``bc``, ``dt``.
+        seeds: Optional list of seeds. ``None`` -> ``[cfg.seed]`` (or
+            a single seed of ``0`` if ``cfg.seed`` is ``None``).
+        output_path: Optional override for the aggregated-results JSON
+            destination. When ``None``, results land under
+            ``cfg.baselines_output_dir``.
+    """
+    if algo not in ALL_BASELINE_ALGOS:
+        raise ValueError(
+            f"Unknown algo {algo!r}. Choose one of {ALL_BASELINE_ALGOS}."
+        )
+    if seeds is None:
+        seeds = [cfg.seed if cfg.seed is not None else 0]
+    if not seeds:
+        raise ValueError("seeds must be non-empty")
+    out_dir = _resolve_output_dir(cfg, None)
+    if output_path is not None:
+        agg_json_path = Path(output_path)
+        agg_json_path.parent.mkdir(parents=True, exist_ok=True)
+    else:
+        agg_json_path = out_dir / f"results_{algo}_{len(seeds)}seeds.json"
+    logger.info(
+        "Running baseline %s on %d seed(s): %s (output -> %s)",
+        algo, len(seeds), seeds, agg_json_path,
+    )
+    all_seed_results: list[dict[str, Any]] = []
+    n_envs_per_id = int(cfg.baselines_n_envs_per_id)
+    for seed_idx, seed in enumerate(seeds):
+        logger.info(
+            "============================================================\n"
+            " %s seed %d (%d/%d)\n"
+            "============================================================",
+            algo.upper(), seed, seed_idx + 1, len(seeds),
+        )
+        _seed_everything(seed)
+        run_name = f"{algo}-multitask-seed{seed}"
+        log = _init_baseline_logger(cfg, run_name)
+        run_id = (
+            log._run.id  # type: ignore[union-attr]
+            if log._use_wandb and log._run is not None
+            else f"local-{algo}-seed{seed}"
+        )
+        log_dir = str(out_dir / "logs" / run_id)
+        model_dir = str(out_dir / "models" / run_id)
+        os.makedirs(log_dir, exist_ok=True)
+        os.makedirs(model_dir, exist_ok=True)
+        seed_results: dict[str, Any] = {"seed": seed}
+        try:
+            if algo == "dt":
+                model, dt_metrics = _train_dt(cfg, log, log_dir, seed)
+                seed_results.update(dt_metrics)
+                torch.save(
+                    {
+                        "model_state_dict": model.state_dict(),
+                        "config": {
+                            "n_actions": int(cfg.action_dim),
+                            "embed_dim": int(cfg.baselines_dt_embed_dim),
+                            "n_heads": int(cfg.baselines_dt_n_heads),
+                            "n_layers": int(cfg.baselines_dt_n_layers),
+                            "context_len": int(cfg.baselines_dt_context_len),
+                            "max_ep_len": int(cfg.baselines_dt_max_ep_len),
+                        },
+                    },
+                    f"{model_dir}/dt_final_seed{seed}.pt",
+                )
+            else:
+                # SB3 RL families and BC both need the parallel train env.
+                train_env_fns = [
+                    _make_sb3_env_fn(env_id, cfg, log_dir)
+                    for env_id in list(cfg.id_envs) * n_envs_per_id
+                ]
+                train_env = SubprocVecEnv(train_env_fns)
+                try:
+                    if algo == "bc":
+                        policy, bc_metrics = _train_bc(
+                            cfg, train_env, log, log_dir, seed,
+                        )
+                        seed_results.update(bc_metrics)
+                        policy.save(f"{model_dir}/bc_final_seed{seed}")
+                    else:
+                        sb3_model = _build_sb3_model(
+                            algo, train_env, cfg, seed,
+                            tb_log_dir=str(out_dir / "tb" / run_id),
+                        )
+                        callbacks = _build_sb3_callbacks(
+                            cfg, train_env, log_dir, model_dir,
+                        )
+                        logger.info(
+                            "Training %s for %d env-steps across %d ID maps "
+                            "(%d parallel envs)...",
+                            algo.upper(),
+                            int(cfg.total_timesteps),
+                            len(cfg.id_envs),
+                            train_env.num_envs,
+                        )
+                        sb3_model.learn(
+                            total_timesteps=int(cfg.total_timesteps),
+                            callback=callbacks,
+                        )
+                        sb3_model.save(f"{model_dir}/{algo}_final_seed{seed}")
+                finally:
+                    train_env.close()
+            all_seed_results.append(seed_results)
+        finally:
+            log.finish()
+        logger.info("%s seed %d complete.", algo.upper(), seed)
+    agg = _aggregate(all_seed_results)
+    _print_aggregated(seeds, agg)
+    if agg:
+        _save_aggregated(agg_json_path, algo, seeds, all_seed_results, agg)
+        # Final summary write to the project Logger so the aggregated
+        # numbers land on a dedicated W&B run.
+        summary_run_name = f"{algo}-multitask-summary"
+        summary_log = _init_baseline_logger(cfg, summary_run_name)
+        try:
+            summary_payload: dict[str, float] = {}
+            for key, stats in agg.items():
+                summary_payload[f"summary/{key}/mean"] = stats["mean"]
+                summary_payload[f"summary/{key}/std"] = stats["std"]
+            summary_log.log_summary(summary_payload)
+        finally:
+            summary_log.finish()
+    logger.info("All %d seed(s) complete.", len(seeds))
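
Review note on the aggregation step in `src/planners/baselines.py`: `_aggregate` reads its metric keys from the first seed's results and reports NumPy's default population standard deviation. The sketch below is a minimal standalone illustration of that behaviour, not the repository's code. It uses the standard library's `statistics.pstdev`, which matches `np.std` with `ddof=0`; the function name `aggregate_seeds` and the metric values are made up for the example.

```python
from __future__ import annotations

import statistics


def aggregate_seeds(all_seed_results: list[dict]) -> dict[str, dict]:
    """Mean and population std per metric key, keyed off the first seed."""
    if not all_seed_results:
        return {}
    # Keys come from the first seed only; "seed" itself is not a metric.
    metric_keys = [k for k in all_seed_results[0] if k != "seed"]
    agg: dict[str, dict] = {}
    for key in metric_keys:
        # Seeds that lack this key are skipped rather than counted as zero.
        values = [float(r[key]) for r in all_seed_results if key in r]
        if values:
            agg[key] = {
                "mean": statistics.fmean(values),
                "std": statistics.pstdev(values),  # population std (ddof=0)
                "values": values,
            }
    return agg


# Illustrative numbers only, not results from this repository.
demo_results = [
    {"seed": 0, "ID/room/win_rate": 80.0},
    {"seed": 1, "ID/room/win_rate": 100.0},
]
demo_agg = aggregate_seeds(demo_results)
```

With only a handful of seeds, the population standard deviation is noticeably smaller than the sample standard deviation (`ddof=1`), which is worth keeping in mind when reading the `+/-` columns.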

src/planners/collect.py ADDED

	@@ -0,0 +1,588 @@

+"""Data collection with DAgger and oracle replay.
+Implements model episode rollout with replanning and DAgger-style
+data collection using the BFS oracle and efficiency filter.
+Supports parallel episode collection via ``ThreadPoolExecutor`` and
+GPU-batched model rollouts.
+"""
+from __future__ import annotations
+import copy
+import logging
+import os
+import random
+import time
+from concurrent.futures import ThreadPoolExecutor
+from typing import TYPE_CHECKING
+from types import SimpleNamespace
+import numpy as np
+import torch
+from src.buffer import ReplayBuffer
+from src.curriculum import DynamicCurriculum, efficiency_filter
+from src.diffusion.sampling import greedy_sample, remdm_sample
+from src.envs.minihack_env import collect_oracle_trajectory, make_env
+if TYPE_CHECKING:
+    from src.models.denoiser import ModelEMA
+logger = logging.getLogger(__name__)
+@torch.no_grad()
+def run_model_episode(
+    model: torch.nn.Module,
+    env_id: str,
+    cfg: SimpleNamespace,
+    device: torch.device | str,
+    seed: int | None = None,
+    max_steps: int = 500,
+    des_file: str | None = None,
+    blind_global: bool = False,
+    stochastic: bool = False,
+) -> dict:
+    """Roll out the diffusion model on a single episode.
+    Maintains a ``seq_len``-length plan and replans every
+    ``cfg.replan_every`` steps.
+    Args:
+        model: Denoising model (eval mode).
+        env_id: MiniHack registry ID.
+        cfg: Config namespace.
+        device: Torch device.
+        seed: Optional RNG seed.
+        max_steps: Maximum episode length.
+        des_file: Optional ``.des`` file content for custom scenarios.
+        blind_global: If ``True``, zero out global map (local-only ablation).
+        stochastic: If ``True``, use stochastic ReMDM sampling (evaluation).
+            If ``False`` (default), use greedy argmax (DAgger collection).
+    Returns:
+        Dict with ``"local"`` ``[T,9,9]``, ``"global"`` ``[T,21,79]``,
+        ``"actions"`` ``[T]``, ``"won"`` bool, ``"steps"`` int,
+        ``"total_reward"`` float, ``"seed"`` int.
+    """
+    if seed is None:
+        seed = random.randint(0, 2**31 - 1)
+    _use_stochastic = stochastic
+    env = make_env(env_id, des_file, cfg)
+    try:
+        (local, glb), _info = env.reset(seed=seed)
+        locals_list = [local]
+        globals_list = [glb]
+        actions_list: list[int] = []
+        won = False
+        total_reward = 0.0
+        plan: torch.Tensor | None = None
+        step_in_plan = 0
+        model.eval()
+        for step_idx in range(max_steps):
+            # Replan when needed
+            if plan is None or step_in_plan >= cfg.replan_every:
+                local_t = torch.from_numpy(
+                    local[np.newaxis]
+                ).long().to(device)  # [1, 9, 9]
+                glb_t = torch.from_numpy(
+                    glb[np.newaxis]
+                ).long().to(device)  # [1, 21, 79]
+                if _use_stochastic:
+                    plan = remdm_sample(
+                        model, local_t, glb_t, cfg, device,
+                        physics_aware=getattr(
+                            cfg, "physics_aware_sampling", False,
+                        ),
+                        blind_global=blind_global,
+                    )
+                else:
+                    plan = greedy_sample(
+                        model, local_t, glb_t, cfg, device,
+                        blind_global=blind_global,
+                    )  # [1, seq_len]
+                step_in_plan = 0
+            action = plan[0, step_in_plan].item()
+            action = max(0, min(action, cfg.action_dim - 1))
+            actions_list.append(action)
+            step_in_plan += 1
+            (local, glb), reward, terminated, truncated, info = env.step(
+                action,
+            )
+            total_reward += reward
+            locals_list.append(local)
+            globals_list.append(glb)
+            if info.get("won", False):
+                won = True
+            if terminated or truncated:
+                break
+    finally:
+        env.close()
+    # Drop the final observation so that obs[t] pairs with actions[t]
+    locals_arr = np.stack(locals_list[:-1], axis=0).astype(np.int16)
+    globals_arr = np.stack(globals_list[:-1], axis=0).astype(np.int16)
+    actions_arr = np.array(actions_list, dtype=np.int64)
+    return {
+        "local": locals_arr,
+        "global": globals_arr,
+        "actions": actions_arr,
+        "won": won,
+        "steps": len(actions_list),
+        "total_reward": total_reward,
+        "seed": seed,
+    }
+def _collect_episode_thread(
+    model: torch.nn.Module,
+    env_id: str,
+    seed: int,
+    cfg: SimpleNamespace,
+) -> dict | None:
+    """Thread worker: run one paired (model + oracle) episode.
+    Both NLE (C code) and PyTorch CPU inference release the GIL,
+    so threads can run episodes in parallel. Each call builds its own
+    env instance; the caller assigns CPU model copies round-robin, so
+    calls may share a copy when episodes outnumber copies.
+    Args:
+        model: CPU-resident eval-mode model copy assigned by the caller.
+        env_id: MiniHack environment ID.
+        seed: RNG seed for the episode.
+        cfg: Config namespace.
+    Returns:
+        Stats dict or ``None`` on failure.
+    """
+    try:
+        model_result = run_model_episode(
+            model, env_id, cfg, "cpu", seed,
+        )
+        oracle_result = collect_oracle_trajectory(env_id, seed, cfg)
+        oracle_steps = (
+            len(oracle_result["actions"]) if oracle_result else 999
+        )
+        return {
+            "env_id": env_id,
+            "seed": seed,
+            "model_won": model_result["won"],
+            "model_steps": model_result["steps"],
+            "oracle_steps": oracle_steps,
+            "oracle_result": oracle_result,
+        }
+    except Exception:
+        logger.exception(
+            "Thread worker failed for %s seed=%d", env_id, seed,
+        )
+        return None
+class DataCollector:
+    """DAgger-style data collector.
+    Each iteration: sample an environment from the curriculum, run the
+    model, run the oracle on the same seed, apply efficiency filter, and
+    optionally add the oracle trajectory to the buffer.
+    Supports parallel episode collection via ``cfg.num_collection_workers``.
+    Uses a live reference to the ``ModelEMA`` object so the collector
+    always uses the latest EMA weights (synced before each rollout).
+    Args:
+        ema: EMA tracker holding shadow weights.
+        model: Training model (architecture template for EMA snapshot).
+        buffer: Replay buffer to populate.
+        curriculum: Dynamic environment curriculum.
+        cfg: Config namespace.
+        device: Torch device.
+    """
+    def __init__(
+        self,
+        ema: "ModelEMA",
+        model: torch.nn.Module,
+        buffer: ReplayBuffer,
+        curriculum: DynamicCurriculum,
+        cfg: SimpleNamespace,
+        device: torch.device | str,
+    ) -> None:
+        self._ema = ema
+        self._model_template = model
+        # Materialise an eval-mode copy; refreshed before each rollout
+        self.ema_model = ema.make_eval_model(model)
+        self.buffer = buffer
+        self.curriculum = curriculum
+        self.cfg = cfg
+        self.device = device
+        self._num_workers = getattr(cfg, "num_collection_workers", 0)
+        self._last_profile: dict[str, float] = {}
+        self._thread_pool: ThreadPoolExecutor | None = None
+        self._thread_models: list[torch.nn.Module] = []
+        if self._num_workers > 0:
+            n = min(self._num_workers, os.cpu_count() or 4)
+            self._thread_pool = ThreadPoolExecutor(max_workers=n)
+            # Create one CPU model copy per thread
+            for _ in range(n):
+                m = copy.deepcopy(model).cpu()
+                m.eval()
+                self._thread_models.append(m)
+    def _sync_ema(self) -> None:
+        """Copy latest EMA shadow weights into the eval model."""
+        self._ema.apply_to(self.ema_model)
+        self.ema_model.eval()
+    def collect_one_iteration(self) -> dict:
+        """Run one DAgger collection iteration (single episode).
+        Returns:
+            Stats dict with ``"env_id"``, ``"model_won"``,
+            ``"model_steps"``, ``"oracle_steps"``,
+            ``"added_to_buffer"`` keys.
+        """
+        self._sync_ema()
+        env_id = self.curriculum.sample_env()
+        seed = random.randint(0, 2**31 - 1)
+        # Model rollout
+        model_result = run_model_episode(
+            self.ema_model, env_id, self.cfg, self.device, seed,
+        )
+        # Oracle rollout (same seed)
+        oracle_result = collect_oracle_trajectory(
+            env_id, seed, self.cfg,
+        )
+        oracle_steps = (
+            len(oracle_result["actions"]) if oracle_result else 999
+        )
+        # Efficiency filter
+        add = efficiency_filter(
+            model_result["won"],
+            model_result["steps"],
+            oracle_steps,
+            self.cfg.efficiency_multiplier,
+        )
+        if add and oracle_result is not None:
+            self.buffer.add(oracle_result)
+        self.curriculum.update(env_id, model_result["won"])
+        return {
+            "env_id": env_id,
+            "model_won": model_result["won"],
+            "model_steps": model_result["steps"],
+            "oracle_steps": oracle_steps,
+            "added_to_buffer": add and oracle_result is not None,
+        }
+    def collect_batch_parallel(
+        self, n_episodes: int,
+    ) -> list[dict]:
+        """Collect multiple episodes in parallel using threads.
+        Both NLE env calls and PyTorch CPU inference release the GIL,
+        so threads can run episodes in parallel. Pre-allocated CPU model
+        copies are assigned to episodes round-robin. Weights are synced
+        from EMA once per call.
+        Args:
+            n_episodes: Number of episodes to collect.
+        Returns:
+            List of per-episode stats dicts.
+        """
+        assert self._thread_pool is not None, (
+            "collect_batch_parallel requires num_collection_workers > 0"
+        )
+        self._sync_ema()
+        # Sync EMA weights to all thread-local CPU models
+        ema_sd = self.ema_model.state_dict()
+        cpu_sd = {k: v.cpu() for k, v in ema_sd.items()}
+        for tm in self._thread_models:
+            tm.load_state_dict(cpu_sd)
+            tm.eval()
+        # Build task list
+        tasks = []
+        for _ in range(n_episodes):
+            env_id = self.curriculum.sample_env()
+            seed = random.randint(0, 2**31 - 1)
+            tasks.append((env_id, seed))
+        # Round-robin assign models to tasks
+        n_models = len(self._thread_models)
+        futures = []
+        for i, (env_id, seed) in enumerate(tasks):
+            model = self._thread_models[i % n_models]
+            f = self._thread_pool.submit(
+                _collect_episode_thread, model, env_id, seed, self.cfg,
+            )
+            futures.append(f)
+        results = [f.result() for f in futures]
+        # Process results: efficiency filter + buffer add
+        stats_list = []
+        for res in results:
+            if res is None:
+                continue
+            add = efficiency_filter(
+                res["model_won"],
+                res["model_steps"],
+                res["oracle_steps"],
+                self.cfg.efficiency_multiplier,
+            )
+            oracle_result = res["oracle_result"]
+            if add and oracle_result is not None:
+                self.buffer.add(oracle_result)
+            self.curriculum.update(res["env_id"], res["model_won"])
+            stats_list.append({
+                "env_id": res["env_id"],
+                "model_won": res["model_won"],
+                "model_steps": res["model_steps"],
+                "oracle_steps": res["oracle_steps"],
+                "added_to_buffer": add and oracle_result is not None,
+            })
+        return stats_list
+    # ── GPU-batched collection ──────────────────────────────────
+    def collect_batch_gpu(self, n_episodes: int) -> list[dict]:
+        """Collect episodes with GPU-batched model inference.
+        Runs all model episodes with batched GPU forward passes
+        (B=n_episodes instead of B=1), then runs oracle rollouts
+        in parallel threads for efficiency filtering.
+        Args:
+            n_episodes: Number of episodes to collect.
+        Returns:
+            List of per-episode stats dicts.
+        """
+        self._sync_ema()
+        cfg = self.cfg
+        self._last_profile = {}
+        tasks = [
+            (self.curriculum.sample_env(), random.randint(0, 2**31 - 1))
+            for _ in range(n_episodes)
+        ]
+        # Phase 1: GPU-batched model rollouts
+        t0 = time.perf_counter()
+        model_results = self._run_model_episodes_batched(tasks)
+        model_time = time.perf_counter() - t0
+        # Phase 2: Oracle rollouts (threaded, CPU-only BFS)
+        t0 = time.perf_counter()
+        # ThreadPoolExecutor rejects max_workers=0, so keep at least one
+        n_workers = max(1, min(n_episodes, os.cpu_count() or 4))
+        with ThreadPoolExecutor(max_workers=n_workers) as pool:
+            oracle_futures = [
+                pool.submit(
+                    collect_oracle_trajectory, env_id, seed, cfg,
+                )
+                for env_id, seed in tasks
+            ]
+            oracle_results = [f.result() for f in oracle_futures]
+        oracle_time = time.perf_counter() - t0
+        # Phase 3: Efficiency filter + buffer add
+        stats_list: list[dict] = []
+        for (env_id, _seed), m_res, o_res in zip(
+            tasks, model_results, oracle_results,
+        ):
+            oracle_steps = (
+                len(o_res["actions"]) if o_res else 999
+            )
+            add = efficiency_filter(
+                m_res["won"],
+                m_res["steps"],
+                oracle_steps,
+                cfg.efficiency_multiplier,
+            )
+            if add and o_res is not None:
+                self.buffer.add(o_res)
+            self.curriculum.update(env_id, m_res["won"])
+            stats_list.append({
+                "env_id": env_id,
+                "model_won": m_res["won"],
+                "model_steps": m_res["steps"],
+                "oracle_steps": oracle_steps,
+                "added_to_buffer": add and o_res is not None,
+            })
+        self._last_profile["model_rollout_sec"] = model_time
+        self._last_profile["oracle_rollout_sec"] = oracle_time
+        return stats_list
+    @torch.no_grad()
+    def _run_model_episodes_batched(
+        self,
+        tasks: list[tuple[str, int]],
+    ) -> list[dict]:
+        """Run model episodes with batched GPU forward passes.
+        Creates one env per episode, steps them in lockstep, and
+        batches all replanning into single GPU forward passes
+        (B = number of active envs needing a replan).
+        Args:
+            tasks: List of ``(env_id, seed)`` pairs.
+        Returns:
+            List of trajectory dicts matching
+            ``run_model_episode`` output format.
+        """
+        cfg = self.cfg
+        device = self.device
+        model = self.ema_model
+        model.eval()
+        n = len(tasks)
+        max_steps = 500
+        K = getattr(
+            cfg, "diffusion_steps_collect", cfg.diffusion_steps_eval,
+        )
+        cs = cfg.crop_size
+        # Create and reset all envs
+        envs: list = []
+        cur_local = np.zeros((n, cs, cs), dtype=np.int16)
+        cur_global = np.zeros(
+            (n, cfg.map_h, cfg.map_w), dtype=np.int16,
+        )
+        t_reset = time.perf_counter()
+        for i, (env_id, seed) in enumerate(tasks):
+            env = make_env(env_id, None, cfg)
+            (local, glb), _ = env.reset(seed=seed)
+            envs.append(env)
+            cur_local[i] = local
+            cur_global[i] = glb
+        reset_time = time.perf_counter() - t_reset
+        # Pre-allocate history buffers
+        obs_local = np.zeros(
+            (n, max_steps + 1, cs, cs), dtype=np.int16,
+        )
+        obs_global = np.zeros(
+            (n, max_steps + 1, cfg.map_h, cfg.map_w),
+            dtype=np.int16,
+        )
+        act_buf = np.zeros((n, max_steps), dtype=np.int64)
+        obs_local[:, 0] = cur_local
+        obs_global[:, 0] = cur_global
+        # Per-episode state vectors
+        plans = np.zeros((n, cfg.seq_len), dtype=np.int64)
+        step_in_plan = np.zeros(n, dtype=np.int32)
+        need_replan = np.ones(n, dtype=bool)
+        done = np.zeros(n, dtype=bool)
+        won = np.zeros(n, dtype=bool)
+        total_reward = np.zeros(n, dtype=np.float64)
+        n_steps = np.zeros(n, dtype=np.int32)
+        inference_time = 0.0
+        env_step_time = 0.0
+        try:
+            for _ in range(max_steps):
+                # Batch replan on GPU
+                replan_idx = np.where(
+                    need_replan & ~done,
+                )[0]
+                if len(replan_idx) > 0:
+                    t0 = time.perf_counter()
+                    local_t = torch.from_numpy(
+                        cur_local[replan_idx],
+                    ).long().to(device)
+                    glb_t = torch.from_numpy(
+                        cur_global[replan_idx],
+                    ).long().to(device)
+                    batch_plans = greedy_sample(
+                        model, local_t, glb_t, cfg, device,
+                        num_steps=K,
+                    ).cpu().numpy()
+                    plans[replan_idx] = batch_plans
+                    step_in_plan[replan_idx] = 0
+                    need_replan[replan_idx] = False
+                    inference_time += time.perf_counter() - t0
+                # Step all active envs
+                t0 = time.perf_counter()
+                any_active = False
+                for i in range(n):
+                    if done[i]:
+                        continue
+                    any_active = True
+                    action = int(plans[i, step_in_plan[i]])
+                    action = max(
+                        0, min(action, cfg.action_dim - 1),
+                    )
+                    act_buf[i, n_steps[i]] = action
+                    step_in_plan[i] += 1
+                    n_steps[i] += 1
+                    if step_in_plan[i] >= cfg.replan_every:
+                        need_replan[i] = True
+                    obs, reward, term, trunc, info = (
+                        envs[i].step(action)
+                    )
+                    local, glb = obs
+                    total_reward[i] += reward
+                    cur_local[i] = local
+                    cur_global[i] = glb
+                    obs_local[i, n_steps[i]] = local
+                    obs_global[i, n_steps[i]] = glb
+                    if info.get("won", False):
+                        won[i] = True
+                    if term or trunc:
+                        done[i] = True
+                env_step_time += time.perf_counter() - t0
+                if not any_active:
+                    break
+        finally:
+            for env in envs:
+                env.close()
+        # Build result dicts
+        results: list[dict] = []
+        for i in range(n):
+            T = int(n_steps[i])
+            results.append({
+                "local": obs_local[i, :T].copy(),
+                "global": obs_global[i, :T].copy(),
+                "actions": act_buf[i, :T].copy(),
+                "won": bool(won[i]),
+                "steps": T,
+                "total_reward": float(total_reward[i]),
+                "seed": tasks[i][1],
+            })
+        self._last_profile.update({
+            "env_reset_sec": reset_time,
+            "gpu_inference_sec": inference_time,
+            "env_step_sec": env_step_time,
+        })
+        return results
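
The lockstep loop above replans only for episodes whose plan cursor has reached `cfg.replan_every`, and it stops once every episode is done. The bookkeeping can be sketched with the standard library alone, assuming fixed episode lengths stand in for real environments. `simulate_replans` and `episode_lengths` are hypothetical names used only for illustration.

```python
def simulate_replans(episode_lengths, replan_every):
    """Return, per lockstep tick, the indices of episodes that replan.

    Hypothetical stand-in for the real loop: each episode simply ends
    after a fixed number of steps instead of stepping an environment.
    """
    n = len(episode_lengths)
    step_in_plan = [0] * n
    need_replan = [True] * n          # every episode plans on tick 0
    steps = [0] * n
    done = [length == 0 for length in episode_lengths]
    schedule = []
    while not all(done):
        # One batched "forward pass" for active episodes needing a plan.
        batch = [i for i in range(n) if need_replan[i] and not done[i]]
        schedule.append(batch)
        for i in batch:
            step_in_plan[i] = 0
            need_replan[i] = False
        # Step every active episode once.
        for i in range(n):
            if done[i]:
                continue
            step_in_plan[i] += 1
            steps[i] += 1
            if step_in_plan[i] >= replan_every:
                need_replan[i] = True
            if steps[i] >= episode_lengths[i]:
                done[i] = True
    return schedule


schedule = simulate_replans([3, 5], replan_every=2)
```

In this sketch the batch size shrinks as episodes finish, which is why the replanning batch is described as the number of active episodes rather than the total episode count.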

src/planners/collect_oracle.py ADDED

	@@ -0,0 +1,185 @@

+"""Standalone BFS oracle data collection for offline training datasets.
+Runs the BFS oracle across in-distribution MiniHack environments using
+multiprocessing and saves the resulting trajectories in the dict format
+expected by ``ReplayBuffer.load_offline_data()``.
+Usage::
+    python main.py --mode collect
+    python main.py --mode collect collect_episodes_per_env=2000
+    python main.py --mode collect collect_output=data/small.pt
+"""
+from __future__ import annotations
+import logging
+import os
+import time
+from concurrent.futures import ProcessPoolExecutor, as_completed
+from pathlib import Path
+from types import SimpleNamespace
+import torch
+from src.envs.minihack_env import collect_oracle_trajectory
+logger = logging.getLogger(__name__)
+def _collect_single(
+    args: tuple[str, int, SimpleNamespace],
+) -> dict | None:
+    """Process-pool worker: collect one oracle trajectory.
+    Module-level function so ``ProcessPoolExecutor`` can pickle it.
+    Args:
+        args: ``(env_id, seed, cfg)`` tuple.
+    Returns:
+        Trajectory dict with ``"local"``, ``"global"``,
+        ``"actions"``, ``"env_id"`` keys, or ``None`` on failure.
+    """
+    env_id, seed, cfg = args
+    return collect_oracle_trajectory(env_id, seed, cfg)
+def _format_eta(seconds: float) -> str:
+    """Format seconds into a human-readable ETA string.
+    Args:
+        seconds: Remaining time in seconds.
+    Returns:
+        Formatted string like ``"2m 30s"`` or ``"45s"``.
+    """
+    if seconds < 60:
+        return f"{seconds:.0f}s"
+    minutes = int(seconds // 60)
+    secs = int(seconds % 60)
+    return f"{minutes}m {secs:02d}s"
+def run_collect(cfg: SimpleNamespace) -> None:
+    """Collect BFS oracle demonstrations and save as a .pt dataset.
+    Collects ``collect_episodes_per_env`` episodes per ID environment
+    using ``ProcessPoolExecutor`` for parallelism, then saves the
+    trajectories in the dict format consumed by
+    ``ReplayBuffer.load_offline_data()``.
+    The output file can be loaded directly by ``--mode offline``::
+        python main.py --mode collect
+        python main.py --mode offline --data data/dataset.pt
+    Args:
+        cfg: Config namespace. Reads ``collect_episodes_per_env``,
+            ``collect_num_workers``, ``collect_output``, ``id_envs``,
+            ``seed``.
+    """
+    eps_per_env: int = cfg.collect_episodes_per_env
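+    # Clamp to at least one worker: ProcessPoolExecutor rejects
+    # max_workers <= 0 with a ValueError.
+    max_workers: int = max(
+        1, min(cfg.collect_num_workers, os.cpu_count() or 4),
+    )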
+    output_path: str = cfg.collect_output
+    id_envs: list[str] = cfg.id_envs
+    base_seed: int = cfg.seed if cfg.seed is not None else 0
+    total_episodes = eps_per_env * len(id_envs)
+    logger.info(
+        "Collecting %d oracle episodes "
+        "(%d per env, %d envs, %d workers)",
+        total_episodes, eps_per_env, len(id_envs), max_workers,
+    )
+    # Deterministic task list: (env_id, seed, cfg) per episode
+    tasks: list[tuple[str, int, SimpleNamespace]] = []
+    for env_idx, env_id in enumerate(id_envs):
+        for ep in range(eps_per_env):
+            seed = base_seed + env_idx * eps_per_env + ep
+            tasks.append((env_id, seed, cfg))
+    trajectories: list[dict] = []
+    per_env_count: dict[str, int] = {eid: 0 for eid in id_envs}
+    per_env_steps: dict[str, int] = {eid: 0 for eid in id_envs}
+    failures = 0
+    completed = 0
+    t_start = time.perf_counter()
+    log_interval = max(1, total_episodes // 50)
+    with ProcessPoolExecutor(max_workers=max_workers) as executor:
+        future_to_env: dict = {
+            executor.submit(_collect_single, task): task[0]
+            for task in tasks
+        }
+        for future in as_completed(future_to_env):
+            env_id = future_to_env[future]
+            completed += 1
+            try:
+                result = future.result()
+            except Exception:
+                logger.error(
+                    "Worker crashed for %s", env_id, exc_info=True,
+                )
+                result = None
+            if result is not None:
+                trajectories.append(result)
+                per_env_count[env_id] += 1
+                per_env_steps[env_id] += len(result["actions"])
+            else:
+                failures += 1
+            if (
+                completed % log_interval == 0
+                or completed == total_episodes
+            ):
+                elapsed = time.perf_counter() - t_start
+                rate = completed / max(elapsed, 1e-6)
+                eta = (total_episodes - completed) / max(rate, 1e-6)
+                env_summary = "  ".join(
+                    f"{eid.split('-')[-2]}:{per_env_count[eid]}"
+                    for eid in id_envs
+                )
+                logger.info(
+                    "  %d/%d (%.1f%%)  %.1f eps/s  ETA: %s  |  %s",
+                    completed, total_episodes,
+                    100 * completed / total_episodes,
+                    rate, _format_eta(eta), env_summary,
+                )
+    elapsed = time.perf_counter() - t_start
+    # Summary
+    total_steps = sum(per_env_steps.values())
+    logger.info("Collection complete in %.1fs", elapsed)
+    logger.info(
+        "  Trajectories: %d (%d failures)",
+        len(trajectories), failures,
+    )
+    logger.info("  Total steps: %d", total_steps)
+    for env_id in id_envs:
+        n = per_env_count[env_id]
+        s = per_env_steps[env_id]
+        avg = s / max(n, 1)
+        logger.info(
+            "  %s: %d eps, %d steps, avg %.1f steps/ep",
+            env_id, n, s, avg,
+        )
+    # Save in the dict format expected by ReplayBuffer.load_offline_data()
+    out = Path(output_path).resolve()
+    out.parent.mkdir(parents=True, exist_ok=True)
+    dataset: dict = {"trajectories": trajectories}
+    torch.save(dataset, str(out))
+    file_mb = out.stat().st_size / (1024 * 1024)
+    logger.info(
+        "Saved %d trajectories to %s (%.1f MB)",
+        len(trajectories), out, file_mb,
+    )
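
The task list in `run_collect` gives each environment its own contiguous block of seeds through `base_seed + env_idx * eps_per_env + ep`, so no two episodes share a seed. A minimal sketch of that layout follows; `build_tasks` and the environment names are hypothetical, and the `cfg` element of the real task tuples is left out.

```python
def build_tasks(id_envs, eps_per_env, base_seed=0):
    """Return (env_id, seed) pairs with one contiguous seed block per env."""
    tasks = []
    for env_idx, env_id in enumerate(id_envs):
        for ep in range(eps_per_env):
            # Same arithmetic as the collection loop: blocks never overlap
            # because each env is offset by a full eps_per_env stride.
            seed = base_seed + env_idx * eps_per_env + ep
            tasks.append((env_id, seed))
    return tasks


tasks = build_tasks(["env-a", "env-b"], eps_per_env=3, base_seed=100)
seeds = [seed for _, seed in tasks]
```

Because the seed depends on `eps_per_env`, changing the per-environment episode count shifts the seed blocks of every environment after the first.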

src/planners/inference.py ADDED

	@@ -0,0 +1,360 @@

+"""Stateless evaluation runner.
+Runs episodes using the diffusion model and collects per-environment
+win rates, average rewards, and step counts.  All episodes for a given
+environment are rolled out in lockstep so that replanning calls are
+batched into single GPU forward passes (B = active episodes needing a
+replan, at most n_episodes).
+"""
+from __future__ import annotations
+import json
+import logging
+from datetime import datetime, timezone
+from pathlib import Path
+from types import SimpleNamespace
+import numpy as np
+import torch
+from src.models.denoiser import ModelEMA, make_model
+from src.planners.logging import Logger
+logger = logging.getLogger(__name__)
+class Evaluator:
+    """Stateless evaluation runner.
+    Runs the model on a set of environments and returns aggregate
+    statistics per environment.  Episodes within each environment are
+    executed in lockstep so replanning calls are GPU-batched.
+    """
+    @torch.no_grad()
+    def evaluate(
+        self,
+        env_ids: list[str],
+        model: torch.nn.Module,
+        n_episodes: int,
+        cfg: SimpleNamespace,
+        device: torch.device | str,
+        des_files: list[str] | None = None,
+        blind_global: bool = False,
+    ) -> dict[str, dict]:
+        """Evaluate *model* on each environment in *env_ids*.
+        All *n_episodes* for a given environment run in lockstep so
+        that replanning forward passes are batched (B = active envs
+        needing a replan).
+        Args:
+            env_ids: List of MiniHack environment IDs.
+            model: Denoising model (eval mode).
+            n_episodes: Episodes per environment.
+            cfg: Config namespace.
+            device: Torch device.
+            des_files: Optional list of ``.des`` file paths for custom
+                scenario evaluation. Each file yields one extra env entry
+                keyed by its filename stem.
+            blind_global: If ``True``, zero out global map observations
+                (local-only ablation mode).
+        Returns:
+            ``{env_id: {"win_rate", "wins", "avg_reward", "avg_steps",
+            "n_episodes"}}``
+        """
+        model.eval()
+        results: dict[str, dict] = {}
+        # Build list of (env_id, des_content) pairs
+        eval_targets: list[tuple[str, str | None]] = [
+            (eid, None) for eid in env_ids
+        ]
+        if des_files:
+            for des_path in des_files:
+                stem = Path(des_path).stem
+                with open(des_path) as fh:
+                    eval_targets.append((stem, fh.read()))
+        for env_id, des_content in eval_targets:
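+            # zlib.crc32 is stable across processes; the built-in hash()
+            # is salted per process for str keys (PYTHONHASHSEED), which
+            # would give different seeds on every run.
+            import zlib
+            seeds = [
+                42 + zlib.crc32(f"{env_id}:{ep}".encode()) % (2**31)
+                for ep in range(n_episodes)
+            ]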
+            ep_results = self._run_episodes_batched(
+                model, env_id, n_episodes, cfg, device,
+                seeds=seeds,
+                des_content=des_content,
+                blind_global=blind_global,
+            )
+            wins = sum(1 for r in ep_results if r["won"])
+            total_reward = sum(r["total_reward"] for r in ep_results)
+            total_steps = sum(r["steps"] for r in ep_results)
+            n = max(len(ep_results), 1)
+            results[env_id] = {
+                "win_rate": wins / n,
+                "wins": wins,
+                "avg_reward": total_reward / n,
+                "avg_steps": total_steps / n,
+                "n_episodes": len(ep_results),
+            }
+        return results
+    @torch.no_grad()
+    def _run_episodes_batched(
+        self,
+        model: torch.nn.Module,
+        env_id: str,
+        n_episodes: int,
+        cfg: SimpleNamespace,
+        device: torch.device | str,
+        seeds: list[int],
+        des_content: str | None = None,
+        blind_global: bool = False,
+    ) -> list[dict]:
+        """Run episodes in lockstep with batched model inference.
+        Creates one environment per episode, steps them in lockstep,
+        and batches all replanning calls into single forward passes
+        (B = number of active envs needing a replan at each step).
+        Args:
+            model: Denoising model (eval mode).
+            env_id: MiniHack environment ID.
+            n_episodes: Number of episodes to run.
+            cfg: Config namespace.
+            device: Torch device.
+            seeds: Per-episode RNG seeds (length *n_episodes*).
+            des_content: Optional ``.des`` file content for custom
+                scenarios.
+            blind_global: If ``True``, zero out global map observations.
+        Returns:
+            List of per-episode dicts with ``"won"``, ``"steps"``,
+            ``"total_reward"`` keys.  Failed episodes report
+            ``won=False``.
+        """
+        from src.diffusion.sampling import remdm_sample
+        from src.envs.minihack_env import make_env
+        n = n_episodes
+        max_steps = 500
+        cs = cfg.crop_size
+        # Create and reset all envs
+        envs: list = []
+        cur_local = np.zeros((n, cs, cs), dtype=np.int16)
+        cur_global = np.zeros(
+            (n, cfg.map_h, cfg.map_w), dtype=np.int16,
+        )
+        failed = np.zeros(n, dtype=bool)
+        for i in range(n):
+            try:
+                env = make_env(env_id, des_content, cfg)
+                (local, glb), _ = env.reset(seed=seeds[i])
+                envs.append(env)
+                cur_local[i] = local
+                cur_global[i] = glb
+            except Exception:
+                logger.warning(
+                    "Failed to create env %s (ep %d)",
+                    env_id, i, exc_info=True,
+                )
+                envs.append(None)
+                failed[i] = True
+        # Per-episode state vectors
+        plans = np.zeros((n, cfg.seq_len), dtype=np.int64)
+        step_in_plan = np.zeros(n, dtype=np.int32)
+        need_replan = np.ones(n, dtype=bool)
+        done = failed.copy()
+        won = np.zeros(n, dtype=bool)
+        total_reward = np.zeros(n, dtype=np.float64)
+        n_steps = np.zeros(n, dtype=np.int32)
+        try:
+            for _ in range(max_steps):
+                # Batch replan for active envs that need it
+                replan_idx = np.where(need_replan & ~done)[0]
+                if len(replan_idx) > 0:
+                    local_t = torch.from_numpy(
+                        cur_local[replan_idx],
+                    ).long().to(device)  # [B_r, cs, cs]
+                    glb_t = torch.from_numpy(
+                        cur_global[replan_idx],
+                    ).long().to(device)  # [B_r, map_h, map_w]
+                    batch_plans = remdm_sample(
+                        model, local_t, glb_t, cfg, device,
+                        physics_aware=getattr(
+                            cfg, "physics_aware_sampling", False,
+                        ),
+                        blind_global=blind_global,
+                    ).cpu().numpy()  # [B_r, seq_len]
+                    plans[replan_idx] = batch_plans
+                    step_in_plan[replan_idx] = 0
+                    need_replan[replan_idx] = False
+                # Step all active envs
+                any_active = False
+                for i in range(n):
+                    if done[i]:
+                        continue
+                    any_active = True
+                    action = int(plans[i, step_in_plan[i]])
+                    action = max(
+                        0, min(action, cfg.action_dim - 1),
+                    )
+                    step_in_plan[i] += 1
+                    n_steps[i] += 1
+                    if step_in_plan[i] >= cfg.replan_every:
+                        need_replan[i] = True
+                    try:
+                        obs, reward, term, trunc, info = (
+                            envs[i].step(action)
+                        )
+                        local, glb = obs
+                        total_reward[i] += reward
+                        cur_local[i] = local
+                        cur_global[i] = glb
+                        if info.get("won", False):
+                            won[i] = True
+                        if term or trunc:
+                            done[i] = True
+                    except Exception:
+                        logger.warning(
+                            "Episode %d step failed for %s",
+                            i, env_id, exc_info=True,
+                        )
+                        done[i] = True
+                if not any_active:
+                    break
+        finally:
+            for env in envs:
+                if env is not None:
+                    env.close()
+        return [
+            {
+                "won": bool(won[i]),
+                "steps": int(n_steps[i]),
+                "total_reward": float(total_reward[i]),
+            }
+            for i in range(n)
+        ]
+def format_eval_results(
+    results: dict[str, dict], label: str = "Eval",
+) -> str:
+    """Format evaluation results as an ASCII table.
+    Args:
+        results: Output of ``Evaluator.evaluate``.
+        label: Table header label.
+    Returns:
+        Formatted string.
+    """
+    lines = [f"{'=' * 60}", f"  {label} Results", f"{'=' * 60}"]
+    lines.append(
+        f"  {'Environment':<35} {'WinRate':>8} {'Steps':>8}"
+    )
+    lines.append(f"  {'-' * 53}")
+    for env_id, stats in results.items():
+        wr = f"{stats['win_rate']:.2%}"
+        st = f"{stats['avg_steps']:.1f}"
+        lines.append(f"  {env_id:<35} {wr:>8} {st:>8}")
+    lines.append(f"{'=' * 60}")
+    return "\n".join(lines)
+def save_eval_json(
+    results: dict,
+    path: str,
+    metadata: dict | None = None,
+) -> None:
+    """Save evaluation results to a JSON file.
+    Args:
+        results: Evaluation results dict.
+        path: Output file path.
+        metadata: Optional extra metadata (e.g. iteration).
+    """
+    payload = {
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "results": results,
+    }
+    if metadata:
+        payload["metadata"] = metadata
+    resolved = str(Path(path).resolve())
+    Path(resolved).parent.mkdir(parents=True, exist_ok=True)
+    try:
+        with open(resolved, "w") as f:
+            json.dump(payload, f, indent=2, default=str)
+    except Exception:
+        logger.error(f"Failed to save eval JSON to {resolved}", exc_info=True)
+def run_inference(
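+    cfg: SimpleNamespace,
+    checkpoint_path: str,
+    env_ids: list[str] | None,
+    episodes: int,
+    output_path: str | None,
+    use_ema: bool,
+    log: Logger | None = None,
+    des_files: list[str] | None = None,
+    blind_global: bool = False,
+) -> None:
+    """Evaluate a checkpoint on specified environments.
+    Args:
+        cfg: Config namespace. Reads ``device``, ``ema_decay``,
+            ``id_envs``, ``ood_envs``.
+        checkpoint_path: Path to the ``.pth`` checkpoint file.
+        env_ids: Environment IDs to evaluate. ``None`` means
+            ``cfg.id_envs + cfg.ood_envs``.
+        episodes: Episodes per environment.
+        output_path: Optional path for the JSON results file.
+        use_ema: If ``True``, apply EMA weights when the checkpoint
+            contains ``"ema_state_dict"``.
+        log: Optional ``Logger`` for W&B metrics.
+        des_files: Optional ``.des`` scenario files (see
+            ``Evaluator.evaluate``).
+        blind_global: If ``True``, zero out global map observations.
+    """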
+    device = cfg.device
+    logger.info(f"Inference on {device}")
+    model = make_model(cfg).to(device)
+    ckpt = torch.load(
+        checkpoint_path, map_location=device, weights_only=False,
+    )
+    if "model_state_dict" in ckpt:
+        model.load_state_dict(ckpt["model_state_dict"])
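+        if use_ema and "ema_state_dict" in ckpt:
+            ema = ModelEMA(model, decay=cfg.ema_decay)
+            ema.load_state_dict(ckpt["ema_state_dict"])
+            ema.apply_to(model)
+        elif use_ema:
+            # Make the fallback visible instead of silently using raw weights.
+            logger.warning(
+                "use_ema=True but checkpoint has no 'ema_state_dict'; "
+                "evaluating raw model weights",
+            )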
+    else:
+        model.load_state_dict(ckpt)
+    model.eval()
+    if env_ids is None:
+        env_ids = cfg.id_envs + cfg.ood_envs
+    evaluator = Evaluator()
+    results = evaluator.evaluate(
+        env_ids, model, episodes, cfg, device,
+        des_files=des_files, blind_global=blind_global,
+    )
+    print(format_eval_results(results, label="Inference"))
+    if log is not None:
+        log.log_eval(results, step=0, prefix="inference")
+        log.log_summary(
+            {f"inference/{env_id}/win_rate": stats["win_rate"]
+             for env_id, stats in results.items()}
+        )
+    if output_path:
+        save_eval_json(results, output_path)
+        logger.info(f"Results saved to {output_path}")
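
Per-episode evaluation seeds need to be identical across processes for win rates to be comparable between runs. Python's built-in `hash()` is salted per process for strings unless `PYTHONHASHSEED` is fixed, so it is not suited to this. The following is a minimal sketch of a process-stable derivation, assuming CRC32 over an `env_id:episode` key; `stable_seed` is a hypothetical helper, not a function from this repository.

```python
import zlib


def stable_seed(env_id: str, episode: int, offset: int = 42) -> int:
    """Derive a seed that is identical in every Python process."""
    key = f"{env_id}:{episode}".encode("utf-8")
    # crc32 returns an unsigned 32-bit int; fold it into 31 bits.
    return offset + zlib.crc32(key) % (2**31)


seed_a = stable_seed("MiniHack-Room-5x5-v0", 0)
seed_b = stable_seed("MiniHack-Room-5x5-v0", 1)
```

CRC32 is not a cryptographic hash, but that does not matter here: the seeds only need to be deterministic and reasonably spread out.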

src/planners/logging.py ADDED

	@@ -0,0 +1,291 @@

+"""Centralised W&B and stdout logging.
+Mirrors the Craftax logging conventions with metric namespaces:
+``diffusion/``, ``train/``, ``eval_id/``, ``eval_ood/``.
+"""
+from __future__ import annotations
+import logging
+from types import SimpleNamespace
+from typing import TYPE_CHECKING
+import torch
+if TYPE_CHECKING:
+    from wandb.sdk.wandb_run import Run as _WandbRun
+logger = logging.getLogger(__name__)
+def download_artifact(
+    artifact_ref: str, dst_dir: str = "artifacts",
+) -> str | None:
+    """Download a W&B artifact via the public API (no active run needed).
+    Args:
+        artifact_ref: Fully qualified artifact reference, e.g.
+            ``"entity/project/checkpoint-iter1000:latest"``.
+        dst_dir: Local directory to download into.
+    Returns:
+        Path to the ``.pth`` file inside the downloaded artifact
+        directory, or ``None`` on failure.
+    """
+    try:
+        import wandb
+        from pathlib import Path
+        api = wandb.Api()
+        artifact = api.artifact(artifact_ref)
+        artifact_dir = artifact.download(root=dst_dir)
+        pth_files = list(Path(artifact_dir).glob("*.pth"))
+        if not pth_files:
+            logger.error(
+                f"No .pth file found in artifact {artifact_ref}"
+            )
+            return None
+        path = str(pth_files[0])
+        logger.info(f"Downloaded artifact {artifact_ref} -> {path}")
+        return path
+    except Exception:
+        logger.error(
+            f"Failed to download artifact {artifact_ref}",
+            exc_info=True,
+        )
+        return None
+def _auto_run_name(cfg: SimpleNamespace) -> str:
+    """Generate a descriptive W&B run name from key hyperparameters.
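+    Format: ``seq{seq_len}_d{n_embd}_L{n_layer}_lr{dagger_lr}_bs{batch}_eta{eta}_{remask}``
+    followed by the optional suffixes ``_subs`` (when
+    ``use_importance_weighting`` is set), ``_phys`` (when
+    ``physics_aware_sampling`` is set), and ``_s{seed}``.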
+    Args:
+        cfg: Config namespace.
+    Returns:
+        A concise, human-readable run name.
+    """
+    parts = [
+        f"seq{cfg.seq_len}",
+        f"d{cfg.n_embd}",
+        f"L{cfg.n_layer}",
+        f"lr{cfg.dagger_lr:.0e}",
+        f"bs{cfg.dagger_batch_size}",
+        f"eta{cfg.eta}",
+        f"{cfg.remask_strategy}",
+    ]
+    if cfg.use_importance_weighting:
+        parts.append("subs")
+    if getattr(cfg, "physics_aware_sampling", False):
+        parts.append("phys")
+    if cfg.seed is not None:
+        parts.append(f"s{cfg.seed}")
+    return "_".join(parts)
+class Logger:
+    """Centralised logger for W&B and stdout.
+    Args:
+        cfg: Config namespace with ``use_wandb``, ``wandb_project``,
+            ``wandb_entity``, ``seed``. The optional ``wandb_run_name``
+            and ``wandb_resume_id`` are read via ``getattr``.
+    """
+    def __init__(self, cfg: SimpleNamespace) -> None:
+        self._use_wandb = cfg.use_wandb
+        self._run: _WandbRun | None = None
+        if self._use_wandb:
+            try:
+                import wandb
+                run_name = getattr(cfg, "wandb_run_name", None)
+                if not run_name:
+                    run_name = _auto_run_name(cfg)
+                resume_id = getattr(cfg, "wandb_resume_id", None)
+                self._run = wandb.init(
+                    project=cfg.wandb_project,
+                    entity=cfg.wandb_entity or None,
+                    name=run_name,
+                    config=vars(cfg),
+                    id=resume_id or None,
+                    resume="must" if resume_id else "never",
+                )
+                # Define custom metric x-axes
+                wandb.define_metric("iteration")
+                for ns in (
+                    "diffusion/*", "train/*", "perf/*", "speed/*",
+                    "model/*",
+                    "eval_id/*", "eval_ood/*",
+                    "curriculum/*",
+                    "ckpt_eval_id/*", "ckpt_eval_ood/*", "ckpt_eval/*",
+                    "inference/*",
+                ):
+                    wandb.define_metric(ns, step_metric="iteration")
+            except Exception:
+                logger.error("W&B init failed", exc_info=True)
+                self._use_wandb = False
+    def log_summary(self, metrics: dict) -> None:
+        """Write key/value pairs to the wandb run summary (final aggregates).
+        Args:
+            metrics: Flat ``{key: value}`` dict.
+        """
+        if self._use_wandb and self._run is not None:
+            try:
+                self._run.summary.update(metrics)
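+            except Exception:
+                # Never let a logging failure interrupt training, but
+                # leave a trace for debugging.
+                logger.debug("W&B summary update failed", exc_info=True)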
+    def log(self, metrics: dict, step: int) -> None:
+        """Log a dict of metrics.
+        Args:
+            metrics: Flat ``{namespace/key: value}`` dict.
+            step: Global step index.
+        """
+        if self._use_wandb and self._run is not None:
+            try:
+                import wandb
+                # Include "iteration" so define_metric(step_metric="iteration") works
+                wandb.log({**metrics, "iteration": step}, step=step)
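+            except Exception:
+                # Never let a logging failure interrupt training, but
+                # leave a trace for debugging.
+                logger.debug("W&B log call failed", exc_info=True)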
+        # Stdout summary every 10 steps
+        if step % 10 == 0:
+            parts = [f"step={step}"]
+            for k, v in metrics.items():
+                if isinstance(v, float):
+                    if abs(v) < 1e-3 and v != 0.0:
+                        parts.append(f"{k}={v:.2e}")
+                    else:
+                        parts.append(f"{k}={v:.4f}")
+                else:
+                    parts.append(f"{k}={v}")
+            logger.info("  ".join(parts))
+    def log_eval(
+        self, results: dict[str, dict], step: int, prefix: str,
+    ) -> None:
+        """Flatten evaluation results and log them.
+        Args:
+            results: ``{env_id: {"win_rate", ...}}``
+            step: Global step.
+            prefix: Metric namespace prefix (e.g. ``"eval_id"``).
+        """
+        flat: dict[str, float] = {}
+        for env_id, stats in results.items():
+            for key, val in stats.items():
+                if isinstance(val, (int, float)):
+                    flat[f"{prefix}/{env_id}/{key}"] = val
+        self.log(flat, step=step)
+    def log_checkpoint_artifact(
+        self,
+        checkpoint_path: str,
+        config_path: str | None,
+        iteration: int,
+        metadata: dict | None = None,
+        artifact_name: str | None = None,
+    ) -> None:
+        """Upload a checkpoint as a W&B artifact with config attached.
+        Args:
+            checkpoint_path: Path to the ``.pth`` checkpoint file.
+            config_path: Path to the YAML config snapshot to attach.
+                If ``None``, only the checkpoint is uploaded.
+            iteration: Iteration number (used in the default artifact
+                name when ``artifact_name`` is not provided).
+            metadata: Optional metadata dict stored on the artifact.
+            artifact_name: Optional explicit artifact name. When
+                ``None``, defaults to ``f"checkpoint-iter{iteration}"``.
+                Offline BC passes a step-based name to avoid the
+                misleading "iter" prefix.
+        """
+        if not self._use_wandb or self._run is None:
+            return
+        try:
+            import wandb
+            name = artifact_name or f"checkpoint-iter{iteration}"
+            artifact = wandb.Artifact(
+                name=name,
+                type="model",
+                metadata=metadata or {},
+            )
+            artifact.add_file(checkpoint_path)
+            if config_path is not None:
+                artifact.add_file(config_path, name="config.yaml")
+            logged = self._run.log_artifact(artifact)  # type: ignore[union-attr]
+            logged.wait()  # block until upload completes
+            logger.info("W&B artifact uploaded: %s", name)
+        except Exception:
+            logger.error("W&B artifact upload failed", exc_info=True)
+    def finish(self) -> None:
+        """Close the W&B run if active."""
+        if self._use_wandb and self._run is not None:
+            try:
+                import wandb
+                wandb.finish()
+            except Exception:
+                logger.debug("W&B finish failed", exc_info=True)
+# ---------------------------------------------------------------------------
+# Metric helper functions (used by both src/ and experiments/)
+# ---------------------------------------------------------------------------
+def gpu_memory_mb() -> float:
+    """Return peak GPU memory allocated in MB since last reset.
+    Returns:
+        Peak memory in MB, or 0.0 if CUDA is unavailable.
+    """
+    if torch.cuda.is_available():
+        return torch.cuda.max_memory_allocated() / (1024 * 1024)
+    return 0.0
+def reset_gpu_memory_stats() -> None:
+    """Reset GPU peak memory stats for the current device."""
+    if torch.cuda.is_available():
+        torch.cuda.reset_peak_memory_stats()
+def compute_param_norm(model: torch.nn.Module) -> float:
+    """Compute total L2 norm of all model parameters.
+    Args:
+        model: The model.
+    Returns:
+        Total L2 norm as a float.
+    """
+    total = 0.0
+    for p in model.parameters():
+        total += p.data.norm(2).item() ** 2
+    return total ** 0.5
+def compute_param_drift(
+    model: torch.nn.Module,
+    ref_state: dict[str, torch.Tensor],
+) -> float:
+    """Compute L2 distance between current model params and a reference state.
+    Args:
+        model: Current model.
+        ref_state: Reference state_dict (e.g. pretrained weights).
+    Returns:
+        L2 distance as a float.
+    """
+    total = 0.0
+    for name, p in model.named_parameters():
+        if name in ref_state:
+            total += (p.data - ref_state[name]).norm(2).item() ** 2
+    return total ** 0.5
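
`Logger.log_eval` flattens nested per-environment statistics into `prefix/env_id/key` metric names and keeps only numeric values. The standalone sketch below shows the same flattening without W&B; `flatten_eval` and the sample data are hypothetical and exist only to show the resulting key layout.

```python
def flatten_eval(results, prefix):
    """Flatten {env_id: {key: value}} into {"prefix/env_id/key": value}."""
    flat = {}
    for env_id, stats in results.items():
        for key, val in stats.items():
            # Non-numeric entries are dropped, as in log_eval.
            if isinstance(val, (int, float)):
                flat[f"{prefix}/{env_id}/{key}"] = val
    return flat


example = {"room": {"win_rate": 0.75, "wins": 3, "note": "skipped"}}
flat = flatten_eval(example, "eval_id")
```

Note that `bool` is a subclass of `int` in Python, so boolean statistics would pass this filter and be logged as 0 or 1.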

src/planners/offline.py ADDED

	@@ -0,0 +1,727 @@

+"""Offline behavioural cloning trainer.
+Mirrors the Craftax ``make_train`` closure pattern. Trains the diffusion
+model on pre-collected oracle demonstrations using the MDLM ELBO loss
+with optional auxiliary goal loss.
+"""
+from __future__ import annotations
+import sys
+import time
+from pathlib import Path
+import logging
+from types import SimpleNamespace
+from typing import Callable
+import torch
+import torch.nn as nn
+import yaml
+from src.buffer import ReplayBuffer
+from src.config import make_run_dir
+from src.diffusion.forward import q_sample
+from src.diffusion.loss import auxiliary_goal_loss, mdlm_loss
+from src.diffusion.schedules import get_schedule
+from src.models.denoiser import ModelEMA, make_model, try_compile
+from src.planners.inference import Evaluator, save_eval_json
+from src.planners.logging import (
+    Logger,
+    compute_param_drift,
+    compute_param_norm,
+    gpu_memory_mb,
+    reset_gpu_memory_stats,
+)
+logger = logging.getLogger(__name__)
+def make_offline_trainer(cfg: SimpleNamespace) -> Callable:
+    """Build the offline BC training closure.
+    Args:
+        cfg: Config namespace.
+    Returns:
+        ``train_offline(model, ema_model, buffer, cfg, device) -> dict``
+    """
+    schedule_fn = get_schedule(cfg.noise_schedule)
+    def train_offline(
+        model: nn.Module,
+        ema_model: ModelEMA,
+        buffer: ReplayBuffer,
+        cfg: SimpleNamespace,
+        device: torch.device | str,
+        log: Logger | None = None,
+        raw_model: nn.Module | None = None,
+        resume_state: dict | None = None,
+        evaluator: Evaluator | None = None,
+        id_envs: list[str] | None = None,
+        ood_envs: list[str] | None = None,
+    ) -> dict:
+        """Run offline BC training.
+        Args:
+            model: Denoising model (may be torch.compiled).
+            ema_model: EMA tracker.
+            buffer: Replay buffer with offline data.
+            cfg: Config namespace.
+            device: Torch device.
+            log: Optional Logger for wandb and stdout metrics.
+            raw_model: Uncompiled model for EMA updates. If ``None``,
+                uses *model* directly.
+            resume_state: Checkpoint dict to resume from. If provided,
+                restores optimizer, scheduler, epoch, and step state.
+            evaluator: Optional ``Evaluator`` instance for periodic ID/OOD
+                evaluation. When ``None``, no eval is run during training.
+            id_envs: In-distribution environment IDs for periodic eval.
+                Required (non-empty) if ``evaluator`` is provided and
+                ``cfg.id_eval_every_timesteps > 0``.
+            ood_envs: Out-of-distribution environment IDs for periodic
+                eval. Required (non-empty) if ``evaluator`` is provided
+                and ``cfg.ood_eval_every_timesteps > 0``.
+        Returns:
+            Dict with ``"final_loss"`` and ``"loss_history"``.
+        """
+        _ema_source = raw_model if raw_model is not None else model
+        model.train()
+        optimizer = torch.optim.AdamW(
+            model.parameters(), lr=cfg.offline_lr,
+            weight_decay=cfg.weight_decay,
+        )
+        # Unified budget: `total_timesteps` counts env.step()-equivalent
+        # samples consumed during training. Each gradient step consumes
+        # `offline_batch_size` samples, so total grad steps derives
+        # directly from the budget and is independent of dataset size
+        # — this is what gives offline / DAgger / SB3 runs a common
+        # denominator when comparing curves.
+        total_grad_steps = max(
+            1, cfg.total_timesteps // cfg.offline_batch_size,
+        )
+        # Optional override: pin offline gradient budget independently
+        # of `total_timesteps`. Used for paper-fair compute matching
+        # against a specific DAgger iteration count, e.g.
+        # `offline_total_grad_steps: 60000` to match 600 DAgger iters
+        # × `grad_steps_per_iteration: 100` AdamW updates regardless of
+        # what env-step budget DAgger consumed in those iters.
+        _grad_override = getattr(cfg, "offline_total_grad_steps", None)
+        if _grad_override is not None and _grad_override > 0:
+            total_grad_steps = int(_grad_override)
+            logger.info(
+                "Offline grad budget pinned via offline_total_grad_steps="
+                f"{total_grad_steps} (overrides total_timesteps)"
+            )
+        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+            optimizer, T_max=total_grad_steps,
+            eta_min=cfg.offline_lr * 0.1,
+        )
+        # Checkpoint cadence — defaults to deriving from
+        # `checkpoint_every_timesteps` (env-step units → grad-step units
+        # via // batch_size). The optional `offline_checkpoint_every_grad_steps`
+        # override is used when an offline run is pinned via
+        # `offline_total_grad_steps` and needs an aligned cadence in
+        # grad-step units (env-step cadence diverges wildly from grad-step
+        # cadence between offline and DAgger because their sample-to-step
+        # ratios differ by ~50x).
+        _ckpt_grad_override = getattr(
+            cfg, "offline_checkpoint_every_grad_steps", None,
+        )
+        if _ckpt_grad_override is not None and _ckpt_grad_override > 0:
+            ckpt_every_step = int(_ckpt_grad_override)
+        else:
+            ckpt_every_step = (
+                cfg.checkpoint_every_timesteps // cfg.offline_batch_size
+                if cfg.checkpoint_every_timesteps > 0 else 0
+            )
+        # Eval cadence — same override pattern. Without this, an offline
+        # run pinned at e.g. 60k grad steps with the default
+        # `id_eval_every_timesteps=250000` and a batch size of 2048 would
+        # fire ~490 evals (about one every 250000 / 2048 ≈ 122 grad
+        # steps), which is impractically dense.
+        _eval_grad_override = getattr(
+            cfg, "offline_eval_every_grad_steps", None,
+        )
+        if _eval_grad_override is not None and _eval_grad_override > 0:
+            id_eval_every_env_steps = (
+                int(_eval_grad_override) * cfg.offline_batch_size
+            )
+            ood_eval_every_env_steps = id_eval_every_env_steps
+        else:
+            id_eval_every_env_steps = cfg.id_eval_every_timesteps
+            ood_eval_every_env_steps = cfg.ood_eval_every_timesteps
+        # Logging cadence. `offline_log_every` is the *minimum* cadence;
+        # the actual `log_every` is clamped on both ends so the number of
+        # log points stays in [~10, ~1000] regardless of run length:
+        #
+        #   * Lower bound (`floor`): on very long runs, force `log_every`
+        #     up so total log points cap at ~1000. Without this, a 600k
+        #     grad-step run with the default `offline_log_every=10` would
+        #     emit 60,000 W&B points — silent log spam.
+        #
+        #   * Upper bound (`ceiling`): on very short runs (smoke, fast
+        #     ablations) clamp `log_every` down so every run emits at
+        #     least ~10 log points and curves stay comparable across
+        #     budgets.
+        #
+        # When the configured value sits inside the [floor, ceiling]
+        # window (the common case), it is used unchanged.
+        _floor = max(1, total_grad_steps // 1000)
+        _ceiling = max(1, total_grad_steps // 10)
+        log_every = min(
+            _ceiling, max(_floor, cfg.offline_log_every),
+        )
+        # Restore optimizer/scheduler state if resuming
+        step = 0
+        if resume_state is not None:
+            if "optimizer_state_dict" in resume_state:
+                optimizer.load_state_dict(
+                    resume_state["optimizer_state_dict"],
+                )
+            if "scheduler_state_dict" in resume_state:
+                scheduler.load_state_dict(
+                    resume_state["scheduler_state_dict"],
+                )
+            step = resume_state.get("step", 0)
+            logger.info(
+                f"Resumed offline training from step {step}/"
+                f"{total_grad_steps}"
+            )
+        # AMP: enabled when use_amp=true and on CUDA
+        _use_amp = (
+            getattr(cfg, "use_amp", False)
+            and str(device).startswith("cuda")
+        )
+        scaler = torch.amp.GradScaler("cuda", enabled=_use_amp)
+        loss_history: list[float] = []
+        _batch_start = time.perf_counter()
+        last_ckpt_step = step
+        # Periodic eval anchors (env-step units, mirroring online.py).
+        # Snapping to current env_steps avoids accumulated drift across
+        # resumes; the next eval fires once another full interval has
+        # been processed since the resume point.
+        last_id_eval_env_steps = step * cfg.offline_batch_size
+        last_ood_eval_env_steps = step * cfg.offline_batch_size
+        # Snapshot of initial weights for `model/param_drift_from_init`.
+        # Mirrors online.py:Trainer.__init__.
+        _init_state = {
+            k: v.detach().clone()
+            for k, v in _ema_source.state_dict().items()
+            if v.is_floating_point()
+        }
+        # Counts logging emissions (not raw grad steps), used to gate
+        # the once-per-10-windows model health metrics analogously to
+        # online.py's `iteration % 10 == 0` cadence.
+        log_windows = 0
+        reset_gpu_memory_stats()
+        while step < total_grad_steps:
+            batch = buffer.sample(cfg.offline_batch_size)
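+            if batch is None:
+                # Surface the early exit instead of stopping silently.
+                logger.warning(
+                    f"Buffer returned no batch at step {step}/"
+                    f"{total_grad_steps}; stopping offline training early."
+                )
+                break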
+            local_np, global_np, actions_np = batch
+            local_t = torch.from_numpy(local_np).long().to(device)
+            global_t = torch.from_numpy(global_np).long().to(device)
+            actions_t = torch.from_numpy(actions_np).long().to(device)
+            B = actions_t.shape[0]
+            t = torch.rand(B, device=device)  # [B] in [0, 1)
+            t = t.clamp(1e-5, 1.0 - 1e-5)
+            zt = q_sample(
+                actions_t, t, cfg.mask_token, cfg.pad_token,
+                schedule_fn,
+            )
+            t_discrete = (
+                t * cfg.num_diffusion_steps
+            ).long().clamp(0, cfg.num_diffusion_steps - 1)  # [B]
+            optimizer.zero_grad()
+            with torch.amp.autocast("cuda", enabled=_use_amp):
+                out = model(local_t, global_t, zt, t_discrete)
+                loss_diff = mdlm_loss(
+                    out["actions"], actions_t, zt, t,
+                    cfg.mask_token, cfg.pad_token, schedule_fn,
+                    weight_clip=cfg.loss_weight_clip,
+                    label_smoothing=cfg.label_smoothing,
+                    use_importance_weighting=cfg.use_importance_weighting,
+                )
+                loss_aux = torch.tensor(0.0, device=device)
+                if "goal_pred" in out:
+                    loss_aux = auxiliary_goal_loss(
+                        out["goal_pred"], global_t,
+                    )
+                loss = loss_diff + cfg.aux_loss_weight * loss_aux
+            scaler.scale(loss).backward()
+            scaler.unscale_(optimizer)
+            grad_norm = nn.utils.clip_grad_norm_(
+                model.parameters(), cfg.offline_grad_clip,
+            )
+            scaler.step(optimizer)
+            scaler.update()
+            scheduler.step()
+            ema_model.update(_ema_source)
+            loss_history.append(loss.item())
+            step += 1
+            # env-step equivalent: samples processed so far.
+            env_steps = step * cfg.offline_batch_size
+            if log is not None and step % log_every == 0:
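+                # Wall time since the previous log emission: covers
+                # `log_every` grad steps plus any eval / checkpoint work
+                # that ran in between, not a single step.
+                step_time = time.perf_counter() - _batch_start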
+                log_windows += 1
+                # Buffer state — for offline mode `offline_size` always
+                # equals `len(buffer)` (no online appends), so the
+                # online fraction is always 0.0. Logged anyway for
+                # symmetry with the DAgger curves.
+                buf_total = len(buffer)
+                buf_online_frac = (
+                    (buf_total - buffer.offline_size) / max(buf_total, 1)
+                    if hasattr(buffer, "offline_size")
+                    else 0.0
+                )
+                # Throughput: samples processed in this logging window.
+                samples_window = log_every * cfg.offline_batch_size
+                samples_per_sec = samples_window / max(step_time, 1e-6)
+                _ema_source_ref = _ema_source
+                metrics = {
+                    "diffusion/loss": loss.item(),
+                    "diffusion/loss_diff": loss_diff.item(),
+                    "diffusion/loss_aux": loss_aux.item(),
+                    "train/buffer_size": buf_total,
+                    "train/buffer_online_frac": buf_online_frac,
+                    "train/lr": scheduler.get_last_lr()[0],
+                    "train/env_steps": env_steps,
+                    "train/progress": step / total_grad_steps,
+                    "train/grad_norm": grad_norm.item(),
+                    "speed/train_step_time_sec": step_time,
+                    "speed/samples_per_sec": samples_per_sec,
+                    "speed/gpu_memory_mb": gpu_memory_mb(),
+                    # Legacy `perf/` mirror keys (kept for backward compat
+                    # with existing dashboards / DAgger curves).
+                    "perf/train_time_s": step_time,
+                    "perf/grad_steps_per_sec": (
+                        log_every / max(step_time, 1e-6)
+                    ),
+                }
+                if hasattr(_ema_source_ref, "global_gate"):
+                    gate_val = torch.sigmoid(
+                        _ema_source_ref.global_gate,
+                    ).item()
+                    metrics["train/global_gate"] = gate_val
+                    metrics["model/ema_gate_value"] = gate_val
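+                # Model health (every 10 logging windows to keep overhead
+                # low). `log_windows` is already incremented here, so
+                # `% 10 == 1` fires on windows 1, 11, 21, ..., the
+                # 1-based analogue of online.py's `iteration % 10 == 0`.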
+                if log_windows % 10 == 1:
+                    metrics["model/param_norm"] = compute_param_norm(
+                        _ema_source_ref,
+                    )
+                    metrics["model/param_drift_from_init"] = (
+                        compute_param_drift(
+                            _ema_source_ref, _init_state,
+                        )
+                    )
+                log.log(metrics, step=step)
+                _batch_start = time.perf_counter()
+                reset_gpu_memory_stats()
+                logger.info(
+                    f"step {step}/{total_grad_steps} "
+                    f"(env_steps={env_steps}) loss={loss.item():.4f}"
+                )
+            # Periodic ID eval — env-step delta-check (mirrors
+            # online.py:277-305). Eval is opt-in: skipped entirely when
+            # no Evaluator was threaded through. The cadence variable
+            # already accounts for the optional
+            # `offline_eval_every_grad_steps` override.
+            if (
+                evaluator is not None
+                and id_envs
+                and id_eval_every_env_steps > 0
+                and env_steps - last_id_eval_env_steps
+                >= id_eval_every_env_steps
+            ):
+                eval_model = ema_model.make_eval_model(_ema_source)
+                results = evaluator.evaluate(
+                    id_envs, eval_model, cfg.eval_episodes_per_env,
+                    cfg, device,
+                )
+                if log is not None:
+                    log.log_eval(results, step=step, prefix="eval_id")
+                    mean_id_wr = (
+                        sum(s["win_rate"] for s in results.values())
+                        / len(results)
+                    ) if results else 0.0
+                    log.log(
+                        {"eval_id/mean_win_rate": mean_id_wr},
+                        step=step,
+                    )
+                last_id_eval_env_steps = env_steps
+            # Periodic OOD eval — same delta-check pattern.
+            if (
+                evaluator is not None
+                and ood_envs
+                and ood_eval_every_env_steps > 0
+                and env_steps - last_ood_eval_env_steps
+                >= ood_eval_every_env_steps
+            ):
+                eval_model = ema_model.make_eval_model(_ema_source)
+                results = evaluator.evaluate(
+                    ood_envs, eval_model, cfg.eval_episodes_per_env,
+                    cfg, device,
+                )
+                if log is not None:
+                    log.log_eval(results, step=step, prefix="eval_ood")
+                    mean_ood_wr = (
+                        sum(s["win_rate"] for s in results.values())
+                        / len(results)
+                    ) if results else 0.0
+                    log.log(
+                        {"eval_ood/mean_win_rate": mean_ood_wr},
+                        step=step,
+                    )
+                last_ood_eval_env_steps = env_steps
+            # Periodic step-level checkpoint (cadence derived from
+            # `checkpoint_every_timesteps`, or pinned in grad-step units
+            # via `offline_checkpoint_every_grad_steps`)
+            if (
+                ckpt_every_step > 0
+                and step - last_ckpt_step >= ckpt_every_step
+            ):
+                _save_offline_checkpoint(
+                    _ema_source, ema_model, optimizer, scheduler,
+                    step, cfg, log,
+                    evaluator=evaluator,
+                    id_envs=id_envs,
+                    ood_envs=ood_envs,
+                    device=device,
+                )
+                last_ckpt_step = step
+        if log is not None:
+            log.log_summary({
+                "offline/final_loss": loss_history[-1] if loss_history else 0.0,
+                "offline/total_steps": step,
+                "offline/total_timesteps": step * cfg.offline_batch_size,
+            })
+        return {
+            "final_loss": loss_history[-1] if loss_history else 0.0,
+            "loss_history": loss_history,
+        }
+    return train_offline
+def _save_offline_checkpoint(
+    model: nn.Module,
+    ema_model: ModelEMA,
+    optimizer: torch.optim.Optimizer,
+    scheduler: torch.optim.lr_scheduler.LRScheduler,
+    step: int,
+    cfg: SimpleNamespace,
+    log: Logger | None,
+    evaluator: Evaluator | None = None,
+    id_envs: list[str] | None = None,
+    ood_envs: list[str] | None = None,
+    device: torch.device | str | None = None,
+) -> None:
+    """Save an offline training checkpoint, eval, and W&B artifact.
+    Mirrors the DAgger ``Trainer.save_checkpoint`` flow:
+        1. Persist model + EMA + optimizer + scheduler state to disk.
+        2. Save a YAML config snapshot alongside the checkpoint.
+        3. Run an EMA-weight ID + OOD eval and emit ``ckpt_eval_*``
+           metrics + an eval JSON sidecar.
+        4. Upload the checkpoint + config snapshot as a W&B artifact.
+    Step 3 is skipped gracefully when ``evaluator``, either env list,
+    or ``device`` is not provided. Step 4 is a no-op when ``log`` is
+    ``None`` or W&B is not initialised. Callers that just want the
+    bare state dump therefore still work.
+    Args:
+        model: Raw (uncompiled) model — used both for ``state_dict``
+            persistence and as the source argument to
+            ``ema_model.make_eval_model``.
+        ema_model: EMA tracker.
+        optimizer: Optimizer.
+        scheduler: LR scheduler.
+        step: Global gradient step count (used in filenames + metadata).
+        cfg: Config namespace.
+        log: Logger (used to extract W&B run ID, log eval metrics,
+            and upload artifact).
+        evaluator: Optional evaluator. When ``None``, the checkpoint
+            eval is skipped.
+        id_envs: ID env IDs for the checkpoint eval.
+        ood_envs: OOD env IDs for the checkpoint eval.
+        device: Torch device for the checkpoint eval.
+    """
+    wandb_run_id: str | None = None
+    if log is not None and log._use_wandb and log._run is not None:
+        wandb_run_id = log._run.id
+    ckpt_dir = Path(cfg.checkpoint_dir)
+    ckpt_dir.mkdir(parents=True, exist_ok=True)
+    path = ckpt_dir / f"offline_step{step}.pth"
+    torch.save(
+        {
+            "model_state_dict": model.state_dict(),
+            "ema_state_dict": ema_model.state_dict(),
+            "optimizer_state_dict": optimizer.state_dict(),
+            "scheduler_state_dict": scheduler.state_dict(),
+            "step": step,
+            "env_steps": step * cfg.offline_batch_size,
+            "wandb_run_id": wandb_run_id,
+        },
+        path,
+    )
+    logger.info(f"Offline checkpoint saved: {path}")
+    # Save config snapshot alongside checkpoint (mirrors DAgger).
+    config_path: Path | None = ckpt_dir / f"config_offline_step{step}.yaml"
+    try:
+        cfg_dict = {
+            k: v for k, v in vars(cfg).items() if not k.startswith("_")
+        }
+        with open(config_path, "w") as f:
+            yaml.dump(cfg_dict, f, default_flow_style=False)
+    except Exception:
+        logger.error("Failed to save config snapshot", exc_info=True)
+        config_path = None
+    # Checkpoint-time eval — mirrors Trainer.save_checkpoint in online.py.
+    # Skipped when the caller did not thread an evaluator through.
+    if (
+        evaluator is not None
+        and id_envs
+        and ood_envs
+        and device is not None
+    ):
+        try:
+            eval_model = ema_model.make_eval_model(model)
+            id_results = evaluator.evaluate(
+                id_envs, eval_model, cfg.checkpoint_eval_episodes,
+                cfg, device,
+            )
+            ood_results = evaluator.evaluate(
+                ood_envs, eval_model, cfg.checkpoint_eval_episodes,
+                cfg, device,
+            )
+            id_winrate = (
+                sum(s["win_rate"] for s in id_results.values())
+                / len(id_results)
+            ) if id_results else 0.0
+            ood_winrate = (
+                sum(s["win_rate"] for s in ood_results.values())
+                / len(ood_results)
+            ) if ood_results else 0.0
+            current_lr = scheduler.get_last_lr()[0]
+            training_meta = {
+                "step": step,
+                "env_steps": step * cfg.offline_batch_size,
+                "total_timesteps": cfg.total_timesteps,
+                "lr": current_lr,
+                "offline_batch_size": cfg.offline_batch_size,
+                "aux_loss_weight": cfg.aux_loss_weight,
+                "ema_decay": cfg.ema_decay,
+                "id_winrate": id_winrate,
+                "ood_winrate": ood_winrate,
+                "per_env_id": {
+                    env_id: {
+                        "win_rate": s["win_rate"],
+                        "wins": s.get("wins", 0),
+                        "avg_reward": s["avg_reward"],
+                        "avg_steps": s["avg_steps"],
+                        "n_episodes": s["n_episodes"],
+                    }
+                    for env_id, s in id_results.items()
+                },
+                "per_env_ood": {
+                    env_id: {
+                        "win_rate": s["win_rate"],
+                        "wins": s.get("wins", 0),
+                        "avg_reward": s["avg_reward"],
+                        "avg_steps": s["avg_steps"],
+                        "n_episodes": s["n_episodes"],
+                    }
+                    for env_id, s in ood_results.items()
+                },
+            }
+            json_path = ckpt_dir / f"eval_offline_step{step}.json"
+            save_eval_json(
+                {"id": id_results, "ood": ood_results},
+                str(json_path),
+                metadata=training_meta,
+            )
+            if log is not None:
+                log.log_eval(
+                    id_results, step=step, prefix="ckpt_eval_id",
+                )
+                log.log_eval(
+                    ood_results, step=step, prefix="ckpt_eval_ood",
+                )
+                log.log(
+                    {
+                        "ckpt_eval/id_winrate": id_winrate,
+                        "ckpt_eval/ood_winrate": ood_winrate,
+                    },
+                    step=step,
+                )
+                log.log_summary({
+                    f"ckpt_offline_step{step}/id_winrate": id_winrate,
+                    f"ckpt_offline_step{step}/ood_winrate": ood_winrate,
+                })
+        except Exception:
+            logger.error(
+                "Offline checkpoint eval failed", exc_info=True,
+            )
+    # W&B artifact upload (no-op when wandb is not initialised).
+    if log is not None:
+        log.log_checkpoint_artifact(
+            checkpoint_path=str(path),
+            config_path=str(config_path) if config_path else None,
+            iteration=step,
+            metadata={"step": step, "mode": "offline"},
+            artifact_name=f"checkpoint-offline-step{step}",
+        )
+def load_offline_dataset(
+    path: str | None, cfg: SimpleNamespace,
+) -> dict | None:
+    """Load an offline dataset from disk.
+    Args:
+        path: Path to a ``.pt`` file, or ``None``.
+        cfg: Config namespace (unused, reserved for future).
+    Returns:
+        Loaded dict or ``None``.
+    """
+    if path is None:
+        return None
+    try:
+        return torch.load(path, map_location="cpu", weights_only=False)
+    except Exception:
+        logger.error(f"Failed to load dataset from {path}", exc_info=True)
+        return None
+def run_offline(
+    cfg: SimpleNamespace,
+    data_path: str | None,
+    checkpoint_path: str | None = None,
+) -> None:
+    """Offline BC training on pre-collected data.
+    Args:
+        cfg: Config namespace.
+        data_path: Path to ``.pt`` dataset file.
+        checkpoint_path: Optional checkpoint to resume from. Restores
+            model, EMA, optimizer, scheduler, and W&B run for curve
+            continuity.
+    """
+    make_run_dir(cfg, tag="offline")
+    device = cfg.device
+    logger.info(f"Offline BC on {device}")
+    data = load_offline_dataset(data_path, cfg)
+    if data is None:
+        logger.error("No dataset provided or failed to load. Exiting.")
+        sys.exit(1)
+    # Offline buffer must hold the full pre-collected dataset. DAgger's
+    # `buffer_capacity` (typically 10k) would silently FIFO-evict 99% of
+    # the dataset, so honour the optional `offline_buffer_capacity`
+    # override when present.
+    _offline_buf_cap = (
+        getattr(cfg, "offline_buffer_capacity", None) or cfg.buffer_capacity
+    )
+    buffer = ReplayBuffer(_offline_buf_cap, cfg.seq_len, cfg.pad_token)
+    buffer.load_offline_data(data, cfg.id_envs)
+    logger.info(f"Loaded {len(buffer)} windows")
+    if len(buffer) == 0:
+        logger.error(
+            "Buffer is empty after loading dataset — no trajectories matched "
+            f"id_envs={cfg.id_envs}. Exiting."
+        )
+        sys.exit(1)
+    raw_model = make_model(cfg).to(device)
+    # torch.compile: wrap for training only; shares params with raw_model
+    model = try_compile(raw_model, cfg)
+    ema = ModelEMA(raw_model, decay=cfg.ema_decay)
+    # If resuming, extract W&B run ID from checkpoint before Logger init
+    resume_state: dict | None = None
+    if checkpoint_path:
+        resume_state = torch.load(
+            checkpoint_path, map_location=device, weights_only=False,
+        )
+        raw_model.load_state_dict(resume_state["model_state_dict"])
+        ema.load_state_dict(resume_state["ema_state_dict"])
+        resume_id = getattr(cfg, "wandb_resume_id", None)
+        if not resume_id:
+            saved_id = resume_state.get("wandb_run_id")
+            if saved_id:
+                cfg.wandb_resume_id = saved_id
+                logger.info(f"W&B run ID from checkpoint: {saved_id}")
+    log = Logger(cfg)
+    evaluator = Evaluator()
+    train_fn = make_offline_trainer(cfg)
+    result = train_fn(
+        model, ema, buffer, cfg, device, log=log,
+        raw_model=raw_model, resume_state=resume_state,
+        evaluator=evaluator,
+        id_envs=cfg.id_envs,
+        ood_envs=cfg.ood_envs,
+    )
+    logger.info(
+        f"Offline training done. Final loss: {result['final_loss']:.4f}"
+    )
+    # Save final checkpoint for downstream compatibility (DAgger, inference)
+    wandb_run_id: str | None = None
+    if log._use_wandb and log._run is not None:
+        wandb_run_id = log._run.id
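+    ckpt_dir = Path(cfg.checkpoint_dir)
+    # The directory may not exist yet if no periodic checkpoint fired.
+    ckpt_dir.mkdir(parents=True, exist_ok=True)
+    path = ckpt_dir / "offline_final.pth"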
+    torch.save(
+        {
+            "model_state_dict": raw_model.state_dict(),
+            "ema_state_dict": ema.state_dict(),
+            "wandb_run_id": wandb_run_id,
+        },
+        path,
+    )
+    logger.info(f"Saved offline checkpoint: {path}")
+    log.finish()
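Reviewer note: the budget and logging-cadence arithmetic in `train_offline` can be checked in isolation. The sketch below re-implements only that arithmetic as a standalone function. `derive_cadence` and its argument names are hypothetical and are not part of `src/planners/offline.py`. The batch size of 2048 is the value implied by the `250000 // 2048` comment, not a confirmed default.

```python
from __future__ import annotations


def derive_cadence(
    total_timesteps: int,
    batch_size: int,
    log_every_cfg: int,
    grad_override: int | None = None,
) -> tuple[int, int]:
    """Return (total_grad_steps, log_every) using the same arithmetic."""
    # Env-step budget -> gradient steps (one step consumes one batch).
    total_grad_steps = max(1, total_timesteps // batch_size)
    # Optional pin, mirroring `offline_total_grad_steps`.
    if grad_override is not None and grad_override > 0:
        total_grad_steps = int(grad_override)
    # Clamp the configured cadence into [floor, ceiling].
    floor = max(1, total_grad_steps // 1000)
    ceiling = max(1, total_grad_steps // 10)
    log_every = min(ceiling, max(floor, log_every_cfg))
    return total_grad_steps, log_every


if __name__ == "__main__":
    # Hypothetical budgets chosen to hit each branch of the clamp.
    for args in [
        (1_000_000, 2048, 10, None),
        (1_000_000, 2048, 10, 600_000),
        (20_480, 2048, 10, None),
    ]:
        print(args, "->", derive_cadence(*args))
```

With these inputs the configured cadence of 10 is kept for the 488-step run, raised to 600 for the pinned 600k-step run (1000 log points), and lowered to 1 for the 10-step run (10 log points).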

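Reviewer note: the periodic-eval delta check in `train_offline` can be simulated without a model to see when evals would fire, including after a resume. This is a minimal sketch under stated assumptions. `eval_fire_steps` is a hypothetical helper, and it models only the `env_steps - last_eval_env_steps >= interval` test and the resume-time anchor, not the evaluator itself.

```python
def eval_fire_steps(
    total_grad_steps: int,
    batch_size: int,
    eval_every_env_steps: int,
    start_step: int = 0,
) -> list[int]:
    """Return the gradient steps at which the delta check would fire."""
    fired: list[int] = []
    # Anchor snaps to the resume point, as in the training loop.
    last_eval_env_steps = start_step * batch_size
    step = start_step
    while step < total_grad_steps:
        step += 1
        env_steps = step * batch_size
        if (
            eval_every_env_steps > 0
            and env_steps - last_eval_env_steps >= eval_every_env_steps
        ):
            fired.append(step)
            last_eval_env_steps = env_steps
    return fired


if __name__ == "__main__":
    dense = eval_fire_steps(60_000, 2048, 250_000)
    print(len(dense), dense[:3])
    # Cadence pinned in grad-step units (hypothetical 5000-step override).
    print(eval_fire_steps(60_000, 2048, 5_000 * 2048))
```

Because the check fires on the first step at or past the interval, a 250000 env-step cadence at batch size 2048 triggers every 123 gradient steps. A 60k-step run therefore evaluates 487 times, close to the approximate count quoted in the eval-cadence comment. Expressing the cadence as a whole number of batches makes it fire at exact gradient-step multiples.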
src/planners/online.py ADDED

	@@ -0,0 +1,721 @@

+"""DAgger online training loop.
+Orchestrates the full DAgger pipeline: collect data via model + oracle,
+train on buffer, evaluate periodically, and checkpoint.
+"""
+from __future__ import annotations
+import logging
+import random
+import time
+from pathlib import Path
+from types import SimpleNamespace
+import numpy as np
+import torch
+import torch.nn as nn
+import yaml
+from src.buffer import ReplayBuffer
+from src.config import make_run_dir
+from src.diffusion.forward import q_sample
+from src.diffusion.loss import auxiliary_goal_loss, mdlm_loss
+from src.diffusion.schedules import get_schedule
+from src.models.denoiser import ModelEMA, make_model, try_compile
+from src.planners.collect import DataCollector
+from src.planners.inference import Evaluator, save_eval_json
+from src.planners.logging import (
+    Logger, gpu_memory_mb, reset_gpu_memory_stats,
+    compute_param_norm, compute_param_drift,
+)
+from src.curriculum import DynamicCurriculum
+from src.envs.minihack_env import collect_oracle_trajectory
+logger = logging.getLogger(__name__)
+class Trainer:
+    """Full DAgger training loop.
+    Args:
+        model: Denoising model.
+        ema_model: EMA tracker.
+        optimizer: Torch optimizer.
+        scheduler: Optional LR scheduler.
+        buffer: Replay buffer.
+        collector: DAgger data collector.
+        evaluator: Evaluation runner.
+        log: Centralised logger.
+        cfg: Config namespace.
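+        device: Torch device.
+        raw_model: Uncompiled model used for EMA updates and eval
+            deep-copies. If ``None``, uses *model* directly.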
+    """
+    def __init__(
+        self,
+        model: nn.Module,
+        ema_model: ModelEMA,
+        optimizer: torch.optim.Optimizer,
+        scheduler: torch.optim.lr_scheduler.LRScheduler | None,
+        buffer: ReplayBuffer,
+        collector: DataCollector,
+        evaluator: Evaluator,
+        log: Logger,
+        cfg: SimpleNamespace,
+        device: torch.device | str,
+        raw_model: nn.Module | None = None,
+    ) -> None:
+        self.model = model
+        # raw_model is the uncompiled model used for eval deep-copies.
+        # When torch.compile is off, raw_model is the same as model.
+        self._raw_model = raw_model if raw_model is not None else model
+        self.ema_model = ema_model
+        self.optimizer = optimizer
+        self.scheduler = scheduler
+        self.buffer = buffer
+        self.collector = collector
+        self.evaluator = evaluator
+        self.log = log
+        self.cfg = cfg
+        self.device = device
+        self._schedule_fn = get_schedule(cfg.noise_schedule)
+        # Snapshot of initial weights for param drift tracking
+        self._init_state = {
+            k: v.clone() for k, v in self._raw_model.state_dict().items()
+            if v.is_floating_point()
+        }
+        # AMP scaler: enabled only when use_amp=true and on CUDA
+        self._use_amp = (
+            getattr(cfg, "use_amp", False) and str(device).startswith("cuda")
+        )
+        self._scaler = torch.amp.GradScaler("cuda", enabled=self._use_amp)
+    # ── Main loop ────────────────────────────────────────────────
+    def train(
+        self, start_iter: int = 0, start_env_steps: int = 0,
+    ) -> None:
+        """Run the DAgger training loop.
+        The budget is ``cfg.total_timesteps`` — total env.step() calls
+        across model + oracle rollouts. Iteration count is derived; it
+        depends on how many env steps each iteration consumes (which in
+        turn depends on episode length and efficiency filter outcomes).
+        Args:
+            start_iter: Iteration index to resume from (for logging).
+            start_env_steps: Cumulative env steps already consumed.
+        """
+        cfg = self.cfg
+        env_steps_total = start_env_steps
+        iteration = start_iter
+        last_id_eval_step = start_env_steps
+        last_ood_eval_step = start_env_steps
+        last_ckpt_step = start_env_steps
+        while env_steps_total < cfg.total_timesteps:
+            reset_gpu_memory_stats()
+            iter_start = time.perf_counter()
+            # 1. Collect N episodes per iteration
+            n_eps = getattr(cfg, "episodes_per_iteration", 1)
+            num_workers = getattr(cfg, "num_collection_workers", 0)
+            model_wins = 0
+            added_total = 0
+            # Accumulators across all n_eps episodes — must be summed,
+            # NOT taken from a single (last) episode, otherwise the
+            # unified env-step budget undercounts by ~n_eps×.
+            model_steps_iter = 0
+            oracle_steps_iter = 0
+            last_env_id: str = ""
+            collect_start = time.perf_counter()
+            use_gpu_batch = (
+                str(self.device).startswith("cuda") and n_eps > 1
+            )
+            if use_gpu_batch:
+                # GPU-batched collection (all envs in lockstep)
+                batch_stats = self.collector.collect_batch_gpu(n_eps)
+                for s in batch_stats:
+                    model_wins += int(s["model_won"])
+                    added_total += int(s["added_to_buffer"])
+                    model_steps_iter += int(s["model_steps"])
+                    oracle_steps_iter += int(s["oracle_steps"])
+                    last_env_id = s.get("env_id", last_env_id)
+            elif num_workers > 0 and n_eps > 1:
+                # Threaded CPU collection (fallback)
+                batch_stats = self.collector.collect_batch_parallel(
+                    n_eps,
+                )
+                for s in batch_stats:
+                    model_wins += int(s["model_won"])
+                    added_total += int(s["added_to_buffer"])
+                    model_steps_iter += int(s["model_steps"])
+                    oracle_steps_iter += int(s["oracle_steps"])
+                    last_env_id = s.get("env_id", last_env_id)
+            else:
+                # Sequential collection (reference behaviour)
+                for _ in range(n_eps):
+                    s = self.collector.collect_one_iteration()
+                    model_wins += int(s["model_won"])
+                    added_total += int(s["added_to_buffer"])
+                    model_steps_iter += int(s["model_steps"])
+                    oracle_steps_iter += int(s["oracle_steps"])
+                    last_env_id = s.get("env_id", last_env_id)
+            collect_time = time.perf_counter() - collect_start
+            collect_stats = {
+                "env_id": last_env_id,
+                "model_won": model_wins,
+                "added_to_buffer": added_total,
+                "model_steps": model_steps_iter,
+                "oracle_steps": oracle_steps_iter,
+            }
+            # Advance the unified env-step budget. Both model and oracle
+            # rollouts consume real env.step() calls (the oracle rollout
+            # runs in its own env instance in collect_oracle_trajectory),
+            # so both contribute to the budget.
+            iter_env_steps = model_steps_iter + oracle_steps_iter
+            env_steps_total += iter_env_steps
+            # 2. Gradient steps (EMA updated after each step)
+            self.model.train()
+            step_metrics: list[dict[str, float]] = []
+            train_start = time.perf_counter()
+            for _ in range(cfg.grad_steps_per_iteration):
+                m = self._train_step()
+                step_metrics.append(m)
+                self.ema_model.update(self._raw_model)
+            train_time = time.perf_counter() - train_start
+            iter_time = time.perf_counter() - iter_start
+            # 4. Log
+            n_steps = len(step_metrics) or 1
+            avg_loss = sum(m["loss"] for m in step_metrics) / n_steps
+            avg_loss_diff = sum(m["loss_diff"] for m in step_metrics) / n_steps
+            avg_loss_aux = sum(m["loss_aux"] for m in step_metrics) / n_steps
+            avg_grad_norm = sum(m["grad_norm"] for m in step_metrics) / n_steps
+            current_lr = (
+                self.scheduler.get_last_lr()[0]
+                if self.scheduler is not None
+                else self.cfg.dagger_lr
+            )
+            # Global gate value (how open is the global stream)
+            gate_val = None
+            if hasattr(self._raw_model, "global_gate"):
+                gate_val = torch.sigmoid(
+                    self._raw_model.global_gate
+                ).item()
+            # Buffer online fraction
+            buf_total = len(self.buffer)
+            buf_online_frac = (
+                (buf_total - self.buffer.offline_size) / max(buf_total, 1)
+                if hasattr(self.buffer, "offline_size")
+                else 0.0
+            )
+            # Samples per second
+            total_samples = n_steps * cfg.dagger_batch_size
+            samples_per_sec = total_samples / max(train_time, 1e-6)
+            # Env steps per second (uses the iter-summed total, not a
+            # single episode — same bug class as the env-step budget).
+            env_steps_per_sec = iter_env_steps / max(collect_time, 1e-6)
+            metrics = {
+                "diffusion/loss": avg_loss,
+                "diffusion/loss_diff": avg_loss_diff,
+                "diffusion/loss_aux": avg_loss_aux,
+                "train/buffer_size": buf_total,
+                "train/buffer_online_frac": buf_online_frac,
+                "train/model_won": int(collect_stats["model_won"]),
+                "train/added_to_buffer": int(
+                    collect_stats["added_to_buffer"]
+                ),
+                "train/episodes_collected": n_eps,
+                "train/model_steps": collect_stats["model_steps"],
+                "train/oracle_steps": collect_stats["oracle_steps"],
+                "train/efficiency_ratio": (
+                    collect_stats["model_steps"]
+                    / max(collect_stats["oracle_steps"], 1)
+                ),
+                "train/lr": current_lr,
+                "train/grad_norm": avg_grad_norm,
+                "train/env_steps": env_steps_total,
+                "train/progress": env_steps_total / cfg.total_timesteps,
+                "speed/iter_time_sec": iter_time,
+                "speed/collect_time_sec": collect_time,
+                "speed/train_step_time_sec": train_time,
+                "speed/samples_per_sec": samples_per_sec,
+                "speed/env_steps_per_sec": env_steps_per_sec,
+                "speed/gpu_memory_mb": gpu_memory_mb(),
+                # Keep old perf/ keys for backward compat
+                "perf/iter_time_s": iter_time,
+                "perf/collect_time_s": collect_time,
+                "perf/train_time_s": train_time,
+                "perf/grad_steps_per_sec": (
+                    cfg.grad_steps_per_iteration / max(train_time, 1e-6)
+                ),
+            }
+            if gate_val is not None:
+                metrics["train/global_gate"] = gate_val
+                metrics["model/ema_gate_value"] = gate_val
+            # Model health (every 10 iters to avoid overhead)
+            if iteration % 10 == 0:
+                metrics["model/param_norm"] = compute_param_norm(
+                    self._raw_model
+                )
+                metrics["model/param_drift_from_init"] = compute_param_drift(
+                    self._raw_model, self._init_state
+                )
+            # Profile breakdown from GPU-batched collection
+            _profile = getattr(self.collector, "_last_profile", {})
+            for _pk, _pv in _profile.items():
+                metrics[f"profile/{_pk}"] = _pv
+            self.log.log(metrics, step=iteration)
+            # 5. ID eval — triggered when env-step delta crosses threshold
+            if (
+                cfg.id_eval_every_timesteps > 0
+                and env_steps_total - last_id_eval_step
+                >= cfg.id_eval_every_timesteps
+            ):
+                eval_model = self.ema_model.make_eval_model(self._raw_model)
+                results = self.evaluator.evaluate(
+                    cfg.id_envs,
+                    eval_model,
+                    cfg.eval_episodes_per_env,
+                    cfg,
+                    self.device,
+                )
+                self.log.log_eval(results, step=iteration, prefix="eval_id")
+                mean_id_wr = float(np.mean(
+                    [s["win_rate"] for s in results.values()]
+                )) if results else 0.0
+                self.log.log(
+                    {
+                        "eval_id/mean_win_rate": mean_id_wr,
+                        **{
+                            f"curriculum/{env_id}/win_rate":
+                                self.collector.curriculum.win_rate(env_id)
+                            for env_id in self.cfg.id_envs
+                        },
+                    },
+                    step=iteration,
+                )
+                last_id_eval_step = env_steps_total
+            # 6. OOD eval — env-step-triggered
+            if (
+                cfg.ood_eval_every_timesteps > 0
+                and env_steps_total - last_ood_eval_step
+                >= cfg.ood_eval_every_timesteps
+            ):
+                eval_model = self.ema_model.make_eval_model(self._raw_model)
+                results = self.evaluator.evaluate(
+                    cfg.ood_envs,
+                    eval_model,
+                    cfg.eval_episodes_per_env,
+                    cfg,
+                    self.device,
+                )
+                self.log.log_eval(results, step=iteration, prefix="eval_ood")
+                mean_ood_wr = float(np.mean(
+                    [s["win_rate"] for s in results.values()]
+                )) if results else 0.0
+                self.log.log(
+                    {"eval_ood/mean_win_rate": mean_ood_wr}, step=iteration,
+                )
+                last_ood_eval_step = env_steps_total
+            # 7. Checkpoint — env-step-triggered
+            if (
+                cfg.checkpoint_every_timesteps > 0
+                and env_steps_total - last_ckpt_step
+                >= cfg.checkpoint_every_timesteps
+            ):
+                self.save_checkpoint(iteration, env_steps_total)
+                last_ckpt_step = env_steps_total
+            iteration += 1
+        # Final checkpoint
+        if cfg.save_policy:
+            self.save_checkpoint(iteration, env_steps_total)
+    # ── Single gradient step ─────────────────────────────────────
+    def _train_step(self) -> dict[str, float]:
+        """One gradient step on a buffer sample.
+        Uses AMP (mixed precision) when ``cfg.use_amp`` is ``True``
+        and training on CUDA.
+        Returns:
+            Dict with ``"loss"``, ``"loss_diff"``, ``"loss_aux"``,
+            and ``"grad_norm"`` scalars.
+        """
+        cfg = self.cfg
+        batch = self.buffer.sample(cfg.dagger_batch_size)
+        if batch is None:
+            return {"loss": 0.0, "loss_diff": 0.0,
+                    "loss_aux": 0.0, "grad_norm": 0.0}
+        local_np, global_np, actions_np = batch
+        local_t = torch.from_numpy(local_np).long().to(self.device)
+        global_t = torch.from_numpy(global_np).long().to(self.device)
+        actions_t = torch.from_numpy(actions_np).long().to(self.device)
+        B = actions_t.shape[0]
+        t = torch.rand(B, device=self.device).clamp(1e-5, 1.0 - 1e-5)
+        zt = q_sample(
+            actions_t, t, cfg.mask_token, cfg.pad_token,
+            self._schedule_fn,
+        )
+        t_discrete = (t * cfg.num_diffusion_steps).long().clamp(
+            0, cfg.num_diffusion_steps - 1,
+        )
+        self.optimizer.zero_grad()
+        with torch.amp.autocast("cuda", enabled=self._use_amp):
+            out = self.model(local_t, global_t, zt, t_discrete)
+            loss_diff = mdlm_loss(
+                out["actions"], actions_t, zt, t,
+                cfg.mask_token, cfg.pad_token, self._schedule_fn,
+                weight_clip=cfg.loss_weight_clip,
+                label_smoothing=cfg.label_smoothing,
+                use_importance_weighting=cfg.use_importance_weighting,
+            )
+            loss_aux = torch.tensor(0.0, device=self.device)
+            if "goal_pred" in out:
+                loss_aux = auxiliary_goal_loss(out["goal_pred"], global_t)
+            loss = loss_diff + cfg.aux_loss_weight * loss_aux
+        self._scaler.scale(loss).backward()
+        self._scaler.unscale_(self.optimizer)
+        grad_norm = nn.utils.clip_grad_norm_(
+            self.model.parameters(), cfg.dagger_grad_clip,
+        )
+        self._scaler.step(self.optimizer)
+        self._scaler.update()
+        if self.scheduler is not None:
+            self.scheduler.step()
+        return {
+            "loss": loss.item(),
+            "loss_diff": loss_diff.item(),
+            "loss_aux": loss_aux.item(),
+            "grad_norm": grad_norm.item(),
+        }
+    # ── Checkpointing ────────────────────────────────────────────
+    def save_checkpoint(
+        self, iteration: int, env_steps: int,
+    ) -> None:
+        """Save a training checkpoint.
+        Args:
+            iteration: Current iteration number (for filename + metadata).
+            env_steps: Cumulative env.step() count consumed so far.
+        """
+        ckpt_dir = Path(self.cfg.checkpoint_dir)
+        ckpt_dir.mkdir(parents=True, exist_ok=True)
+        path = ckpt_dir / f"iter{iteration}.pth"
+        # Capture W&B run ID for seamless resumption
+        wandb_run_id: str | None = None
+        if self.log._use_wandb and self.log._run is not None:
+            wandb_run_id = self.log._run.id
+        state = {
+            "model_state_dict": self._raw_model.state_dict(),
+            "ema_state_dict": self.ema_model.state_dict(),
+            "optimizer_state_dict": self.optimizer.state_dict(),
+            "scheduler_state_dict": (
+                self.scheduler.state_dict()
+                if self.scheduler is not None
+                else None
+            ),
+            "curriculum_state": self.collector.curriculum.state_dict(),
+            "iteration": iteration,
+            "env_steps": env_steps,
+            "wandb_run_id": wandb_run_id,
+            "rng_states": {
+                "torch": torch.get_rng_state(),
+                "numpy": np.random.get_state(),
+                "python": random.getstate(),
+            },
+        }
+        try:
+            torch.save(state, path)
+            logger.info(f"Checkpoint saved: {path}")
+        except Exception:
+            logger.error(
+                f"Failed to save checkpoint to {path}", exc_info=True,
+            )
+        # Save config snapshot alongside checkpoint
+        config_path = ckpt_dir / f"config_iter{iteration}.yaml"
+        try:
+            cfg_dict = {
+                k: v for k, v in vars(self.cfg).items()
+                if not k.startswith("_")
+            }
+            with open(config_path, "w") as f:
+                yaml.dump(cfg_dict, f, default_flow_style=False)
+        except Exception:
+            logger.error("Failed to save config snapshot", exc_info=True)
+            config_path = None
+        # Run eval at checkpoint and save JSON
+        try:
+            eval_model = self.ema_model.make_eval_model(self._raw_model)
+            id_results = self.evaluator.evaluate(
+                self.cfg.id_envs, eval_model,
+                self.cfg.checkpoint_eval_episodes,
+                self.cfg, self.device,
+            )
+            ood_results = self.evaluator.evaluate(
+                self.cfg.ood_envs, eval_model,
+                self.cfg.checkpoint_eval_episodes,
+                self.cfg, self.device,
+            )
+            id_winrate = float(np.mean(
+                [s["win_rate"] for s in id_results.values()]
+            )) if id_results else 0.0
+            ood_winrate = float(np.mean(
+                [s["win_rate"] for s in ood_results.values()]
+            )) if ood_results else 0.0
+            current_lr = (
+                self.scheduler.get_last_lr()[0]
+                if self.scheduler is not None
+                else self.cfg.dagger_lr
+            )
+            training_meta = {
+                "iteration": iteration,
+                "env_steps": env_steps,
+                "total_timesteps": self.cfg.total_timesteps,
+                "lr": current_lr,
+                "dagger_batch_size": self.cfg.dagger_batch_size,
+                "aux_loss_weight": self.cfg.aux_loss_weight,
+                "buffer_size": len(self.buffer),
+                "buffer_capacity": self.cfg.buffer_capacity,
+                "ema_decay": self.cfg.ema_decay,
+                "grad_steps_per_iteration": self.cfg.grad_steps_per_iteration,
+                "episodes_per_iteration": getattr(
+                    self.cfg, "episodes_per_iteration", 1
+                ),
+                "id_winrate": id_winrate,
+                "ood_winrate": ood_winrate,
+                "per_env_id": {
+                    env_id: {
+                        "win_rate": s["win_rate"],
+                        "wins": s.get("wins", 0),
+                        "avg_reward": s["avg_reward"],
+                        "avg_steps": s["avg_steps"],
+                        "n_episodes": s["n_episodes"],
+                    }
+                    for env_id, s in id_results.items()
+                },
+                "per_env_ood": {
+                    env_id: {
+                        "win_rate": s["win_rate"],
+                        "wins": s.get("wins", 0),
+                        "avg_reward": s["avg_reward"],
+                        "avg_steps": s["avg_steps"],
+                        "n_episodes": s["n_episodes"],
+                    }
+                    for env_id, s in ood_results.items()
+                },
+            }
+            json_path = ckpt_dir / f"eval_iter{iteration}.json"
+            save_eval_json(
+                {"id": id_results, "ood": ood_results},
+                str(json_path),
+                metadata=training_meta,
+            )
+            # W&B checkpoint log — per-env step metrics + aggregates
+            self.log.log_eval(
+                id_results, step=iteration, prefix="ckpt_eval_id",
+            )
+            self.log.log_eval(
+                ood_results, step=iteration, prefix="ckpt_eval_ood",
+            )
+            self.log.log(
+                {
+                    "ckpt_eval/id_winrate": id_winrate,
+                    "ckpt_eval/ood_winrate": ood_winrate,
+                },
+                step=iteration,
+            )
+            self.log.log_summary({
+                f"ckpt_{iteration}/id_winrate": id_winrate,
+                f"ckpt_{iteration}/ood_winrate": ood_winrate,
+            })
+        except Exception:
+            logger.error("Checkpoint eval failed", exc_info=True)
+        # Hugging Face Hub upload (no-op if HF_TOKEN or hub_run_id is unset)
+        try:
+            from scripts.hf_upload import maybe_upload_checkpoint
+            maybe_upload_checkpoint(
+                str(ckpt_dir),
+                getattr(self.cfg, "hub_run_id", None),
+                getattr(self.cfg, "hub_repo_id", None),
+            )
+        except Exception:
+            logger.error("HF Hub upload failed", exc_info=True)
+        # W&B artifact upload
+        self.log.log_checkpoint_artifact(
+            checkpoint_path=str(path),
+            config_path=str(config_path) if config_path else None,
+            iteration=iteration,
+            metadata={
+                "iteration": iteration,
+                "buffer_size": len(self.buffer),
+            },
+        )
+    def load_checkpoint(self, path: str) -> tuple[int, int]:
+        """Load a training checkpoint.
+        Args:
+            path: Path to ``.pth`` checkpoint file.
+        Returns:
+            ``(start_iter, start_env_steps)`` — the iteration and
+            cumulative env-step count to resume from.
+        """
+        ckpt = torch.load(
+            path, map_location=self.device, weights_only=False,
+        )
+        self._raw_model.load_state_dict(ckpt["model_state_dict"])
+        self.ema_model.load_state_dict(ckpt["ema_state_dict"])
+        self.optimizer.load_state_dict(ckpt["optimizer_state_dict"])
+        if (
+            self.scheduler is not None
+            and ckpt.get("scheduler_state_dict") is not None
+        ):
+            self.scheduler.load_state_dict(ckpt["scheduler_state_dict"])
+        if "curriculum_state" in ckpt:
+            self.collector.curriculum.load_state_dict(
+                ckpt["curriculum_state"],
+            )
+        # Restore RNG states (best-effort)
+        rng = ckpt.get("rng_states", {})
+        try:
+            if "torch" in rng:
+                # map_location may have moved the saved state off the CPU;
+                # torch.set_rng_state expects a CPU ByteTensor.
+                torch.set_rng_state(rng["torch"].cpu())
+            if "numpy" in rng:
+                np.random.set_state(rng["numpy"])
+            if "python" in rng:
+                random.setstate(rng["python"])
+        except Exception:
+            logger.warning(
+                "RNG state restore failed; continuing with fresh state",
+                exc_info=True,
+            )
+        iteration = ckpt.get("iteration", 0)
+        env_steps = ckpt.get("env_steps", 0)
+        resume_from = iteration + 1
+        logger.info(
+            f"Resumed from checkpoint: {path} (iter {iteration}, "
+            f"env_steps={env_steps}), starting at iter {resume_from}"
+        )
+        return resume_from, env_steps
+def run_dagger(
+    cfg: SimpleNamespace,
+    checkpoint_path: str | None,
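+    no_warm_start: bool,
+) -> None:
+    """Run the DAgger online training loop.
+    Args:
+        cfg: Run configuration namespace.
+        checkpoint_path: Checkpoint to resume from, or ``None`` to start
+            fresh.
+        no_warm_start: If ``True``, ignore ``checkpoint_path`` and start
+            fresh.
+    """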
+    make_run_dir(cfg, tag="dagger")
+    device = cfg.device
+    logger.info(f"DAgger training on {device}")
+    raw_model = make_model(cfg).to(device)
+    # EMA and eval always use the raw (uncompiled) model — deep-copying
+    # a compiled model breaks FX tracing.
+    ema = ModelEMA(raw_model, decay=cfg.ema_decay)
+    # torch.compile: wrap for training only; shares parameters with raw_model
+    model = try_compile(raw_model, cfg)
+    optimizer = torch.optim.AdamW(
+        raw_model.parameters(), lr=cfg.dagger_lr,
+        weight_decay=cfg.weight_decay,
+    )
+    buffer = ReplayBuffer(cfg.buffer_capacity, cfg.seq_len, cfg.pad_token)
+    curriculum = DynamicCurriculum(
+        cfg.id_envs, cfg.curriculum_queue_size, cfg.curriculum_preseed,
+    )
+    # Seed the buffer with up to three oracle trajectories per ID env
+    # (a failed oracle rollout returns None and is skipped).
+    for i, env_id in enumerate(cfg.id_envs):
+        for s in range(3):
+            traj = collect_oracle_trajectory(env_id, seed=i * 100 + s, cfg=cfg)
+            if traj is not None:
+                buffer.add(traj)
+    logger.info(f"Buffer seeded with {len(buffer)} windows")
+    # If resuming, extract W&B run ID from checkpoint before Logger init
+    # so the same W&B run is continued (curve continuity).
+    if checkpoint_path and not no_warm_start:
+        resume_id = getattr(cfg, "wandb_resume_id", None)
+        if not resume_id:
+            ckpt_peek = torch.load(
+                checkpoint_path, map_location="cpu", weights_only=False,
+            )
+            saved_id = ckpt_peek.get("wandb_run_id")
+            if saved_id:
+                cfg.wandb_resume_id = saved_id
+                logger.info(
+                    f"W&B run ID from checkpoint: {saved_id}"
+                )
+            del ckpt_peek
+    # DataCollector uses raw_model for eval copies (not compiled)
+    collector = DataCollector(ema, raw_model, buffer, curriculum, cfg, device)
+    evaluator = Evaluator()
+    log = Logger(cfg)
+    trainer = Trainer(
+        model, ema, optimizer, None, buffer, collector,
+        evaluator, log, cfg, device, raw_model=raw_model,
+    )
+    start_iter = 0
+    start_env_steps = 0
+    if checkpoint_path and not no_warm_start:
+        start_iter, start_env_steps = trainer.load_checkpoint(
+            checkpoint_path,
+        )
+    trainer.train(
+        start_iter=start_iter, start_env_steps=start_env_steps,
+    )
+    log.finish()
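
Reviewer sketch (not part of the commit). The ID eval, OOD eval, and checkpoint steps in `Trainer.train` all fire once the env-step count since the last firing reaches a configured interval, instead of on a fixed iteration cadence. The block below is a minimal stand-alone illustration of that gating pattern under stated assumptions: `StepTrigger`, `should_fire`, and `simulate` are hypothetical names that do not exist in the repository, and the per-iteration step counts are made up for the demo.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StepTrigger:
    """Fire when at least `every` env steps have passed since the last firing.

    Hypothetical helper that mirrors the inline
    `env_steps_total - last_*_step >= cfg.*_every_timesteps` checks.
    A non-positive `every` disables the trigger.
    """

    every: int
    last_fired_at: int = 0

    def should_fire(self, env_steps_total: int) -> bool:
        if self.every <= 0:
            return False
        if env_steps_total - self.last_fired_at >= self.every:
            # Record the actual step count (as the training loop does), so
            # the next interval is measured from here, not from a multiple
            # of `every`.
            self.last_fired_at = env_steps_total
            return True
        return False


def simulate(steps_per_iter: list[int], every: int) -> list[int]:
    """Return the iteration indices at which the trigger fires."""
    trigger = StepTrigger(every=every)
    total = 0
    fired: list[int] = []
    for iteration, n_steps in enumerate(steps_per_iter):
        total += n_steps
        if trigger.should_fire(total):
            fired.append(iteration)
    return fired
```

One consequence of measuring from the last firing: events are spaced at least `every` env steps apart, but the spacing drifts with how many steps each iteration happens to consume.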

src/planners/smoke.py ADDED

	@@ -0,0 +1,63 @@

+import logging
+import torch
+from src.buffer import ReplayBuffer
+from src.curriculum import DynamicCurriculum
+from src.envs.minihack_env import collect_oracle_trajectory
+from src.models.denoiser import ModelEMA, make_model, try_compile
+from src.planners.collect import DataCollector
+from src.planners.inference import Evaluator, format_eval_results
+from src.planners.logging import Logger
+from src.planners.online import Trainer
+logger = logging.getLogger(__name__)
+def run_smoke(cfg) -> None:
+    """Smoke test: collect oracle data, train briefly, eval."""
+    device = cfg.device
+    logger.info(f"Smoke test on {device}")
+    # Collect a few oracle trajectories into the buffer
+    buffer = ReplayBuffer(cfg.buffer_capacity, cfg.seq_len, cfg.pad_token)
+    for i, env_id in enumerate(cfg.id_envs):
+        traj = collect_oracle_trajectory(env_id, seed=i, cfg=cfg)
+        if traj is not None:
+            buffer.add(traj)
+    logger.info(f"Buffer seeded with {len(buffer)} windows")
+    raw_model = make_model(cfg).to(device)
+    model = try_compile(raw_model, cfg)
+    ema = ModelEMA(raw_model, decay=cfg.ema_decay)
+    optimizer = torch.optim.AdamW(
+        raw_model.parameters(), lr=cfg.dagger_lr,
+        weight_decay=cfg.weight_decay,
+    )
+    curriculum = DynamicCurriculum(
+        cfg.id_envs, cfg.curriculum_queue_size, cfg.curriculum_preseed,
+    )
+    collector = DataCollector(ema, raw_model, buffer, curriculum, cfg, device)
+    evaluator = Evaluator()
+    log = Logger(cfg)
+    trainer = Trainer(
+        model, ema, optimizer, None, buffer, collector,
+        evaluator, log, cfg, device, raw_model=raw_model,
+    )
+    trainer.train(start_iter=0)
+    # Final eval
+    eval_model = ema.make_eval_model(raw_model)
+    results = evaluator.evaluate(
+        cfg.id_envs, eval_model, cfg.eval_episodes_per_env, cfg, device,
+    )
+    print(format_eval_results(results, label="Smoke"))
+    log.log_eval(results, step=0, prefix="smoke_eval")
+    mean_wr = (
+        float(sum(s["win_rate"] for s in results.values()) / len(results))
+        if results
+        else 0.0
+    )
+    log.log({"smoke_eval/mean_win_rate": mean_wr}, step=0)
+    log.finish()
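
Reviewer sketch (not part of the commit). The smoke test reduces per-env eval results to a single mean win rate, with a guard for an empty result dict. The block below isolates that reduction as a small function, assuming each per-env entry is a mapping with a `win_rate` key as in the code above; `mean_win_rate` and the env names in the usage lines are hypothetical.

```python
from __future__ import annotations

from typing import Mapping


def mean_win_rate(results: Mapping[str, Mapping[str, float]]) -> float:
    """Unweighted mean of per-env win rates; 0.0 when there are no results."""
    if not results:
        return 0.0
    return float(sum(s["win_rate"] for s in results.values()) / len(results))


# Usage with made-up numbers: every env counts equally, regardless of how
# many episodes were played in it.
example = {
    "env_a": {"win_rate": 1.0},
    "env_b": {"win_rate": 0.5},
}
average = mean_win_rate(example)
```

The mean is unweighted across envs, so an env evaluated on few episodes influences the aggregate as much as one evaluated on many.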