# Maze 17x17 SFT: bs=256 checkpoint sweep
SFT checkpoints for a small Qwen2-style decoder (~3.94M params, hidden=256,
4 layers, vocab=32) trained on the
max-rl/maze_17x17_diverse_1.3m
17×17 maze dataset. Two training runs at batch size 256, differing only in
which path is used as the SFT target per maze:

- `random/`: for each maze, pick one of 16 sampled paths uniformly at random per epoch (length-diverse target distribution).
- `shortest/`: for each maze, always use the shortest sampled path (single near-optimal target).

Both runs share data, seed, scheduler, and stopping rule; only the SFT label distribution differs.
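A minimal sketch of the two labeling rules, assuming each dataset record carries its 16 sampled paths in a list (the `paths` field name and record shape are hypothetical; see the dataset repo for the actual schema):

```python
import random

def pick_sft_target(maze_record, mode, rng=random):
    # maze_record["paths"]: the 16 sampled paths for this maze
    # (field name hypothetical).
    paths = maze_record["paths"]
    if mode == "random":
        # Re-drawn each epoch -> length-diverse target distribution.
        return rng.choice(paths)
    if mode == "shortest":
        # Deterministic: always the single near-optimal target.
        return min(paths, key=len)
    raise ValueError(f"unknown mode: {mode!r}")
```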
## Training config
- architecture: Qwen2ForCausalLM, 4 layers × hidden 256 × 2 heads
- vocab: 32 tokens (custom maze tokenizer: WALL, PATH, START, GOAL, NEWLINE, GRID_START, GRID_END, PATH_START, DONE, UP/DOWN/LEFT/RIGHT, etc.)
- batch size: 256 (no gradient accumulation; micro-batch 32)
- learning rate: 1.41e-3 (sqrt-scaled from a 5e-4 baseline at bs=32)
- scheduler: cosine with 2% warmup
- max steps: 5000, with early stop at eval pass@1 ≥ 0.05
- eval / save freq: every 250 steps
- seed: 0
- eval prompts: 256 held-out mazes from the diverse-1.3M test split, 4 generations per prompt at temperature 1.0 (pass@1 is computed over these; see the sketch after this list)
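One way to estimate pass@1 from the 4 samples per maze (the input shape is hypothetical; for k=1 the standard unbiased pass@k estimator reduces to the mean success rate):

```python
def pass_at_1(success_flags_per_prompt):
    # success_flags_per_prompt: one list per held-out maze, each holding
    # 4 booleans (did that temperature-1.0 sample solve the maze?).
    per_prompt = [sum(flags) / len(flags) for flags in success_flags_per_prompt]
    return sum(per_prompt) / len(per_prompt)

# e.g. pass_at_1([[True, False, False, False]] * 256) == 0.25
```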
Per-batch-size LR scaling: `5e-4 * sqrt(bs/32)` (so bs=256 → 1.41e-3).
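The scaling rule in code, with values taken directly from the config above:

```python
import math

def sqrt_scaled_lr(batch_size, base_lr=5e-4, base_bs=32):
    # LR grows with the square root of the batch-size ratio.
    return base_lr * math.sqrt(batch_size / base_bs)

print(f"{sqrt_scaled_lr(256):.2e}")  # 1.41e-03
```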
## Repo layout

```
random/
  init_model/   # vocab + scratch-init weights (shared starting point)
  ckpt-250/
  ckpt-500/
  ...
  ckpt-4250/    # last saved step (run was capped early)
  train.log
shortest/
  init_model/
  ckpt-250/
  ...
  ckpt-3000/
  train.log
```
Each `ckpt-<step>/` directory is a self-contained HF checkpoint
(`config.json`, `model.safetensors`, tokenizer files) loadable with
`AutoModelForCausalLM.from_pretrained(...)`.
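For example, loading the last saved `random/` checkpoint via the standard `subfolder` argument (the path follows the layout above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "max-rl/maze_17x17_sft_bs256_checkpoints"
subfolder = "random/ckpt-4250"  # any ckpt-<step>/ from either run

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
```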
## Reproduction

Code lives in stablegradients/GOS-17X17-Maze under `maze_v2/`. Sweep entry
point: `maze_v2/src/run_sft_sweep.sh`; trainer: `maze/sft.py`. The JSONL/JSON
dataset files this repo was trained on live in
max-rl/maze_17x17_diverse_1.3m.
## Caveats

- Both runs were stopped before completing all 5000 steps (random reached step 4250, shortest reached step 3000); this was likely manual cancellation, not the early stop on pass@1.
- The 32-token vocabulary is custom and covers only maze tokens; these checkpoints are not general language models.