Maze 17x17 SFT – bs=256 checkpoint sweep

SFT checkpoints for a small Qwen2-style decoder (~3.94M params, hidden=256, 4 layers, vocab=32) trained on the max-rl/maze_17x17_diverse_1.3m 17×17 maze dataset. Two training runs at batch size 256, differing only in which path is used as the SFT target per maze:

  • random/ – for each maze, pick one of the 16 sampled paths uniformly at random each epoch (length-diverse target distribution).
  • shortest/ – for each maze, always use the shortest sampled path (single near-optimal target).

Both runs share data, seed, scheduler, and stopping rule; only the SFT label distribution differs.
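
A minimal sketch of the two label policies, assuming each dataset record exposes its 16 sampled paths as a list (field names here are illustrative; the real loader lives in the repo linked under Reproduction):

  import random

  def pick_sft_target(maze_record, policy, rng=random):
      # maze_record["paths"] is assumed to hold the 16 sampled solution paths
      # for this maze; actual field names come from the dataset/loader code.
      paths = maze_record["paths"]
      if policy == "random":
          # random/: resample the target uniformly from the sampled paths each epoch
          return rng.choice(paths)
      if policy == "shortest":
          # shortest/: always train on the shortest sampled path
          return min(paths, key=len)
      raise ValueError(f"unknown policy: {policy}")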

Training config

  • architecture: Qwen2ForCausalLM, 4 layers × hidden 256 × 2 heads
  • vocab: 32 tokens (custom maze tokenizer: WALL, PATH, START, GOAL, NEWLINE, GRID_START, GRID_END, PATH_START, DONE, UP/DOWN/LEFT/RIGHT, etc.)
  • batch size: 256 (no gradient accumulation; micro-batch 32)
  • learning rate: 1.41e-3 (sqrt-scaled from a 5e-4 baseline at bs=32)
  • scheduler: cosine with 2% warmup
  • max steps: 5000, with early stop at eval pass@1 ≥ 0.05
  • eval / save freq: every 250 steps
  • seed: 0
  • eval prompts: 256 held-out mazes from the diverse-1.3M test split, 4 generations per prompt at temperature 1.0 (see the pass@1 sketch below)
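
A hedged sketch of the eval loop implied by these settings; check_fn stands in for the maze path-validity check (not defined here), the 512-token generation cap is an assumption, and pass@1 is estimated as the fraction of sampled generations that solve the maze:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  @torch.no_grad()
  def estimate_pass_at_1(ckpt_dir, prompts, check_fn, n_samples=4, temperature=1.0):
      # check_fn(prompt, completion) -> bool is a placeholder for the real
      # solution checker used during training-time evals.
      tok = AutoTokenizer.from_pretrained(ckpt_dir)
      model = AutoModelForCausalLM.from_pretrained(ckpt_dir).eval()
      correct, total = 0, 0
      for prompt in prompts:
          inputs = tok(prompt, return_tensors="pt")
          out = model.generate(
              **inputs,
              do_sample=True,
              temperature=temperature,
              num_return_sequences=n_samples,
              max_new_tokens=512,
          )
          prompt_len = inputs["input_ids"].shape[1]
          for seq in out:
              completion = tok.decode(seq[prompt_len:], skip_special_tokens=True)
              correct += int(check_fn(prompt, completion))
              total += 1
      return correct / total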

Per-batch-size LR scaling: 5e-4 * sqrt(bs/32), so bs=256 → 1.41e-3.
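
The same rule in code (just the arithmetic stated above):

  import math

  def scaled_lr(batch_size, base_lr=5e-4, base_bs=32):
      # sqrt LR scaling relative to the bs=32 baseline
      return base_lr * math.sqrt(batch_size / base_bs)

  print(scaled_lr(256))  # 0.001414..., reported as 1.41e-3 above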

Repo layout

random/
  init_model/         # vocab + scratch-init weights (shared starting point)
  ckpt-250/
  ckpt-500/
  ...
  ckpt-4250/          # last saved step (run was capped early)
  train.log
shortest/
  init_model/
  ckpt-250/
  ...
  ckpt-3000/
  train.log

Each ckpt-<step>/ directory is a self-contained HF checkpoint (config.json, model.safetensors, tokenizer files) loadable with AutoModelForCausalLM.from_pretrained(...).
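
For example (paths assume a local clone of this repo; any ckpt-<step>/ or init_model/ directory loads the same way):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  ckpt = "random/ckpt-4250"   # or e.g. shortest/ckpt-3000
  tok = AutoTokenizer.from_pretrained(ckpt)
  model = AutoModelForCausalLM.from_pretrained(ckpt)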

Reproduction

Code lives in stablegradients/GOS-17X17-Maze under maze_v2/. Sweep entry point: maze_v2/src/run_sft_sweep.sh; trainer: maze/sft.py. The jsonl/json dataset files used for training live in max-rl/maze_17x17_diverse_1.3m.
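
A quick way to pull the data for inspection (a sketch; split and column names, and whether data_files must be passed, depend on how the dataset repo is laid out):

  from datasets import load_dataset

  ds = load_dataset("max-rl/maze_17x17_diverse_1.3m")
  print(ds)  # check splits/columns before running maze_v2/src/run_sft_sweep.sh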

Caveats

  • Both runs were stopped before completing all 5000 steps (random reached step 4250, shortest reached step 3000); this was most likely manual cancellation rather than the pass@1 early stop.
  • The 32-token vocabulary is custom and covers only maze tokens, so these checkpoints are not general language models.