---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---

# HRM Sudoku-Extreme Checkpoints

Trained checkpoints for reproducing **Hierarchical Reasoning Model (HRM)** results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.

## Models

### Original HRM (`sudoku-extreme/original-hrm/`)

- **Run name:** liberal-bee
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000 (vanilla, 1000 puzzles, 1000x augmented)
- **Training:** 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- **Test exact accuracy:** 53% (paper: 55% ±2%)
- **Checkpoints:** 20 checkpoints from step 2604 to step 52080

### Augmented HRM (`sudoku-extreme/augmented-hrm/`)

- **Run name:** hopeful-quetzal
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000-hint (with easier puzzles mixed in)
- **Training:** 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- **Peak single-checkpoint test accuracy:** 54.2% (paper: 59.9%)
- **Ensemble accuracy (10 ckpts + 9 permutations, 1000 samples):** 90.5% (paper: 96.9%)
- **Checkpoints:** 40 checkpoints from step 1302 to step 52080

## Source Papers

- **Original HRM:** [Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734) (Wang et al., 2025)
- **Augmented HRM:** [Are Your Reasoning Models Reasoning or Guessing?](https://arxiv.org/abs/2601.10679) (Ren & Liu, 2026)

## How to Evaluate

### Original HRM — Single Checkpoint

Train and monitor `eval/exact_accuracy` in W&B, as described in the [HRM repo](https://github.com/sapientinc/HRM).
### Augmented HRM — Ensemble (10 checkpoints + 9 permutations)

Using [batch_inference.py](https://github.com/renrua52/hrm-mechanistic-analysis):

```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9
```

## Reproduction Notes

- All models were trained on a single NVIDIA GH200 GPU (102 GB VRAM); the papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM ensemble gap (90.5% vs. 96.9%) is attributed to single-GPU vs. multi-GPU training dynamics.
- Optimizer: adam-atan2-pytorch v0.2.8
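## How Permutation Ensembling Works

The `--permutes 9` flag exploits the fact that Sudoku is invariant under relabeling the digits 1–9: each permuted copy of a puzzle is a "new" puzzle whose solution maps back through the inverse relabeling, and per-cell majority voting across checkpoints and permutations combines the predictions. The sketch below is purely illustrative (it is not code from `batch_inference.py`); the model is stubbed as any callable mapping a flat board (0 = blank) to a flat solution.

```python
# Illustrative sketch of digit-permutation ensembling for Sudoku.
# A "board" is a flat list of ints; 0 marks a blank cell.
import random
from collections import Counter

def permute_board(board, perm):
    """Relabel digits 1-9 via perm (perm[d-1] is the new label for d); blanks stay 0."""
    return [perm[c - 1] if c != 0 else 0 for c in board]

def invert_perm(perm):
    """Inverse relabeling: if perm maps d -> p, the inverse maps p -> d."""
    inv = [0] * 9
    for i, p in enumerate(perm):
        inv[p - 1] = i + 1
    return inv

def ensemble_solve(board, models, n_perms=9, seed=0):
    """Per-cell majority vote over (model x permutation) predictions."""
    rng = random.Random(seed)
    perms = [list(range(1, 10))]          # identity permutation
    for _ in range(n_perms):              # plus n_perms random relabelings
        p = list(range(1, 10))
        rng.shuffle(p)
        perms.append(p)
    votes = [Counter() for _ in board]
    for model in models:                  # e.g. one model per checkpoint
        for perm in perms:
            pred = model(permute_board(board, perm))
            inv = invert_perm(perm)
            for i, d in enumerate(pred):  # map prediction back to original labels
                votes[i][inv[d - 1]] += 1
    return [v.most_common(1)[0][0] for v in votes]
```

Because the relabeled puzzles are equally valid inputs, disagreement across permutations signals guessing rather than reasoning, which is the lens the Augmented HRM paper applies.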