Maze 17x17 SFT – bs=256 checkpoint sweep

SFT checkpoints for a small Qwen2-style decoder (~3.94M params, hidden=256, 4 layers, vocab=32) trained on the max-rl/maze_17x17_diverse_1.3m 17×17 maze dataset. Two training runs at batch size 256, differing only in which path is used as the SFT target per maze:

  • random/ – for each maze, pick one of the 16 sampled paths uniformly at random each epoch (length-diverse target distribution).
  • shortest/ – for each maze, always use the shortest sampled path (single near-optimal target).

Both runs share data, seed, scheduler, and stopping rule; only the SFT label distribution differs.
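
A minimal sketch of the two label policies, assuming each dataset record exposes its 16 sampled paths as a list (field names here are illustrative; the real loader lives in the repo linked under Reproduction):

  import random

  def pick_sft_target(maze_record, policy, rng=random):
      # maze_record["paths"] is assumed to hold the 16 sampled solution paths
      # for this maze; actual field names come from the dataset/loader code.
      paths = maze_record["paths"]
      if policy == "random":
          # random/: resample the target uniformly from the sampled paths each epoch
          return rng.choice(paths)
      if policy == "shortest":
          # shortest/: always train on the shortest sampled path
          return min(paths, key=len)
      raise ValueError(f"unknown policy: {policy}")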

Training config

  • architecture: Qwen2ForCausalLM, 4 layers × hidden 256 × 2 heads
  • vocab: 32 tokens (custom maze tokenizer: WALL, PATH, START, GOAL, NEWLINE, GRID_START, GRID_END, PATH_START, DONE, UP/DOWN/LEFT/RIGHT, etc.)
  • batch size: 256 (no gradient accumulation; micro-batch 32)
  • learning rate: 1.41e-3 (sqrt-scaled from a 5e-4 baseline at bs=32)
  • scheduler: cosine with 2% warmup
  • max steps: 5000, with early stop at eval pass@1 ≥ 0.05
  • eval / save freq: every 250 steps
  • seed: 0
  • eval prompts: 256 held-out mazes from the diverse-1.3M test split, 4 generations per prompt at temperature 1.0 (see the pass@1 sketch below)
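
A hedged sketch of the eval loop implied by these settings; check_fn stands in for the maze path-validity check (not defined here), the 512-token generation cap is an assumption, and pass@1 is estimated as the fraction of sampled generations that solve the maze:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  @torch.no_grad()
  def estimate_pass_at_1(ckpt_dir, prompts, check_fn, n_samples=4, temperature=1.0):
      # check_fn(prompt, completion) -> bool is a placeholder for the real
      # solution checker used during training-time evals.
      tok = AutoTokenizer.from_pretrained(ckpt_dir)
      model = AutoModelForCausalLM.from_pretrained(ckpt_dir).eval()
      correct, total = 0, 0
      for prompt in prompts:
          inputs = tok(prompt, return_tensors="pt")
          out = model.generate(
              **inputs,
              do_sample=True,
              temperature=temperature,
              num_return_sequences=n_samples,
              max_new_tokens=512,
          )
          prompt_len = inputs["input_ids"].shape[1]
          for seq in out:
              completion = tok.decode(seq[prompt_len:], skip_special_tokens=True)
              correct += int(check_fn(prompt, completion))
              total += 1
      return correct / total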

Per-batch-size LR scaling: 5e-4 * sqrt(bs/32), so bs=256 → 1.41e-3.
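
The same rule in code (just the arithmetic stated above):

  import math

  def scaled_lr(batch_size, base_lr=5e-4, base_bs=32):
      # sqrt LR scaling relative to the bs=32 baseline
      return base_lr * math.sqrt(batch_size / base_bs)

  print(scaled_lr(256))  # 0.001414..., reported as 1.41e-3 above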

Repo layout

random/
  init_model/         # vocab + scratch-init weights (shared starting point)
  ckpt-250/
  ckpt-500/
  ...
  ckpt-4250/          # last saved step (run was capped early)
  train.log
shortest/
  init_model/
  ckpt-250/
  ...
  ckpt-3000/
  train.log

Each ckpt-<step>/ directory is a self-contained HF checkpoint (config.json, model.safetensors, tokenizer files) loadable with AutoModelForCausalLM.from_pretrained(...).
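
For example (paths assume a local clone of this repo; any ckpt-<step>/ or init_model/ directory loads the same way):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  ckpt = "random/ckpt-4250"   # or e.g. shortest/ckpt-3000
  tok = AutoTokenizer.from_pretrained(ckpt)
  model = AutoModelForCausalLM.from_pretrained(ckpt)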

Reproduction

Code lives in stablegradients/GOS-17X17-Maze under maze_v2/. Sweep entry point: maze_v2/src/run_sft_sweep.sh; trainer: maze/sft.py. The jsonl/json dataset files used for training live in max-rl/maze_17x17_diverse_1.3m.
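
A quick way to pull the data for inspection (a sketch; split and column names, and whether data_files must be passed, depend on how the dataset repo is laid out):

  from datasets import load_dataset

  ds = load_dataset("max-rl/maze_17x17_diverse_1.3m")
  print(ds)  # check splits/columns before running maze_v2/src/run_sft_sweep.sh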

Caveats

  • Both runs were stopped before completing all 5000 steps (random reached step 4250, shortest reached step 3000); this was most likely manual cancellation rather than the pass@1 early stop.
  • The 32-token vocabulary is custom and covers only maze tokens, so these checkpoints are not general language models.