---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---
# HRM Sudoku-Extreme Checkpoints
Trained checkpoints for reproducing Hierarchical Reasoning Model (HRM) results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.
## Models
### Original HRM (`sudoku-extreme/original-hrm/`)
- Run name: liberal-bee
- Architecture: `HierarchicalReasoningModel_ACTV1` (~27M parameters)
- Dataset: `sudoku-extreme-1k-aug-1000` (vanilla: 1,000 base puzzles, each with 1,000 augmentations)
- Training: 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- Test exact accuracy: 53% (paper: 55% ±2%)
- Checkpoints: 20 checkpoints from step 2604 to step 52080 (see the loading sketch below)
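These are PyTorch checkpoint files. A minimal loading sketch, assuming the checkpoints follow the HRM repo's convention of a plain `torch.save` per step; the repo id and filename below are placeholders, not confirmed paths:

```python
# Download one checkpoint and inspect it before wiring it into the HRM code.
# Assumptions: repo_id and filename are placeholders; the file is a plain
# torch.save payload (a state dict, or a dict containing one).
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="your-username/hrm-sudoku-extreme",         # placeholder repo id
    filename="sudoku-extreme/original-hrm/step_52080",  # placeholder filename
)

state = torch.load(ckpt_path, map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:10])  # peek at the first few entries
```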
### Augmented HRM (`sudoku-extreme/augmented-hrm/`)
- Run name: hopeful-quetzal
- Architecture: `HierarchicalReasoningModel_ACTV1` (~27M parameters)
- Dataset: `sudoku-extreme-1k-aug-1000-hint` (with easier puzzles mixed in)
- Training: 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- Peak single-checkpoint test accuracy: 54.2% (paper: 59.9%)
- Ensemble accuracy (10 checkpoints + 9 permutations, 1000 samples): 90.5% (paper: 96.9%); see the ensembling sketch after this list
- Checkpoints: 40 checkpoints from step 1302 to step 52080
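The ensemble aggregates predictions across checkpoints and permuted views of each puzzle. As an illustration of the idea only (the actual aggregation is whatever `batch_inference.py` implements; per-cell majority voting here is an assumption):

```python
# Per-cell majority vote over ensemble members' Sudoku predictions.
# Assumption: each member produces an (81,) array of digits 1-9 per puzzle;
# 10 checkpoints x ~10 views each would give roughly 100 members.
import numpy as np

def ensemble_vote(member_preds: np.ndarray) -> np.ndarray:
    """member_preds: (num_members, 81) -> (81,) consensus grid."""
    consensus = np.empty(member_preds.shape[1], dtype=np.int64)
    for cell in range(member_preds.shape[1]):
        digits, counts = np.unique(member_preds[:, cell], return_counts=True)
        consensus[cell] = digits[np.argmax(counts)]
    return consensus

members = np.random.randint(1, 10, size=(100, 81))  # stand-in predictions
print(ensemble_vote(members)[:9])
```

Note that if the permutations relabel digits, each member's prediction would have to be mapped back through the inverse permutation before voting.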
## Source Papers
- Original HRM: *Hierarchical Reasoning Model* (Wang et al., 2025)
- Augmented HRM: *Are Your Reasoning Models Reasoning or Guessing?* (Ren & Liu, 2026)
## How to Evaluate
### Original HRM: Single Checkpoint
Train and check `eval/exact_accuracy` in W&B, as described in the HRM repo.
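Here, exact accuracy means the entire 81-cell grid must match the target solution; partially correct grids do not count. A minimal sketch of the metric (the whole-grid definition is my reading of the metric name, not quoted from the repo):

```python
import numpy as np

def exact_accuracy(preds: np.ndarray, targets: np.ndarray) -> float:
    """preds, targets: (num_puzzles, 81) digit grids.
    A puzzle counts as correct only if all 81 cells match."""
    return float((preds == targets).all(axis=1).mean())
```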
### Augmented HRM: Ensemble (10 Checkpoints + 9 Permutations)
Using `batch_inference.py`:
```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9
```
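`--permutes 9` presumably generates permuted views of each puzzle; relabeling the digits is one validity-preserving choice (an assumption about the flag's behavior, shown only to make the idea concrete):

```python
# Validity-preserving Sudoku augmentation: relabel digits 1-9 under a random
# permutation, keeping 0 as the blank marker. Assumption: this is the kind of
# transform --permutes applies; the actual transform may differ.
import numpy as np

def permute_digits(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """grid: (81,) ints with 0 = blank, 1-9 = filled. Returns a relabeled copy."""
    mapping = np.concatenate(([0], rng.permutation(np.arange(1, 10))))
    return mapping[grid]

rng = np.random.default_rng(0)
puzzle = np.zeros(81, dtype=np.int64)
puzzle[:9] = np.arange(1, 10)  # toy first row
print(permute_digits(puzzle, rng)[:9])
```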
## Reproduction Notes
- All models were trained on a single NVIDIA GH200 GPU (102 GB VRAM); the papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM gap (90.5% vs 96.9%) is attributed to differences between single-GPU and multi-GPU training dynamics.
- Optimizer: `adam-atan2-pytorch` v0.2.8 (usage sketch below)
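A minimal sketch of wiring up that optimizer. The import and class name come from the `adam-atan2-pytorch` package; treating the runs above as a plain call like this is an assumption, and only the learning rate matches the original-hrm run:

```python
# Sketch: adam-atan2-pytorch in a single training step.
import torch
from adam_atan2_pytorch import AdamAtan2

model = torch.nn.Linear(81 * 10, 81 * 10)     # stand-in for the ~27M-param HRM
opt = AdamAtan2(model.parameters(), lr=7e-5)  # lr from the original-hrm run

x = torch.randn(384, 81 * 10)   # batch=384, as in the original-hrm run
loss = model(x).pow(2).mean()   # dummy loss, just to exercise the optimizer
loss.backward()
opt.step()
opt.zero_grad()
```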