---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---
# HRM Sudoku-Extreme Checkpoints

Trained checkpoints for reproducing **Hierarchical Reasoning Model (HRM)** results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.
## Models

### Original HRM (`sudoku-extreme/original-hrm/`)

- **Run name:** liberal-bee
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000 (vanilla, 1000 puzzles, 1000x augmented)
- **Training:** 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- **Test exact accuracy:** 53% (paper: 55% ±2%)
- **Checkpoints:** 20 checkpoints from step 2604 to step 52080
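The 20 checkpoints appear to be saved at a fixed interval; assuming even spacing between the stated endpoints (an inference from the numbers above, not something the training scripts guarantee), the step schedule can be reconstructed as:

```python
# Hypothetical reconstruction of the checkpoint step schedule,
# assuming 20 evenly spaced checkpoints from step 2604 to step 52080
# (52080 / 20 = 2604, so the interval would be 2604 steps).
steps = [2604 * k for k in range(1, 21)]

print(steps[0], steps[-1], len(steps))  # 2604 52080 20
```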
### Augmented HRM (`sudoku-extreme/augmented-hrm/`)

- **Run name:** hopeful-quetzal
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000-hint (with easier puzzles mixed in)
- **Training:** 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- **Peak single-checkpoint test accuracy:** 54.2% (paper: 59.9%)
- **Ensemble accuracy (10 ckpts + 9 permutations, 1000 samples):** 90.5% (paper: 96.9%)
- **Checkpoints:** 40 checkpoints from step 1302 to step 52080
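The ensemble result aggregates predictions from 10 checkpoints under 9 digit permutations each. A minimal sketch of one plausible aggregation scheme, cell-wise majority voting (the exact voting rule used by `batch_inference.py` is an assumption here; `predictions` holds each ensemble member's 81-cell solution grid, already mapped back to canonical digits):

```python
from collections import Counter

def majority_vote(predictions):
    """Cell-wise majority vote over ensemble member predictions.

    predictions: list of equal-length digit lists, one per ensemble
    member (checkpoint x permutation). Returns the voted grid.
    """
    return [Counter(cells).most_common(1)[0][0] for cells in zip(*predictions)]

# Toy example: three members, disagreeing on the last cell
members = [[5, 3, 1], [5, 3, 2], [5, 3, 1]]
print(majority_vote(members))  # [5, 3, 1]
```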
## Source Papers

- **Original HRM:** [Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734) (Wang et al., 2025)
- **Augmented HRM:** [Are Your Reasoning Models Reasoning or Guessing?](https://arxiv.org/abs/2601.10679) (Ren & Liu, 2026)
## How to Evaluate

### Original HRM — Single Checkpoint

Train and check `eval/exact_accuracy` in W&B, as described in the [HRM repo](https://github.com/sapientinc/HRM).
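`eval/exact_accuracy` counts a puzzle as solved only if every cell matches the reference solution. A minimal sketch of that metric (the function name and data layout are illustrative assumptions, not the HRM repo's API):

```python
def exact_accuracy(preds, targets):
    """Fraction of puzzles whose predicted grid matches its target exactly.

    preds, targets: lists of equal-length cell sequences
    (e.g. 81 digits per Sudoku grid).
    """
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)

# Toy example: one of two "grids" fully correct -> 0.5
preds = [[1, 2, 3], [4, 5, 6]]
targets = [[1, 2, 3], [4, 5, 0]]
print(exact_accuracy(preds, targets))  # 0.5
```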
### Augmented HRM — Ensemble (10 checkpoints + 9 permutations)

Using [batch_inference.py](https://github.com/renrua52/hrm-mechanistic-analysis):
```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
  --checkpoints "step_40362,step_41664,...,step_52080" \
  --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
  --checkpoints "step_40362,step_41664,...,step_52080" \
  --permutes 9
```

## Reproduction Notes
- All models trained on a single NVIDIA GH200 GPU (102GB VRAM). The papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM gap (90.5% vs 96.9%) is attributed to single-GPU vs multi-GPU training dynamics.
- Optimizer: adam-atan2-pytorch v0.2.8