---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---
# HRM Sudoku-Extreme Checkpoints
Trained checkpoints for reproducing Hierarchical Reasoning Model (HRM) results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.
## Models
### Original HRM (`sudoku-extreme/original-hrm/`)
- Run name: liberal-bee
- Architecture: `HierarchicalReasoningModel_ACTV1` (~27M parameters)
- Dataset: `sudoku-extreme-1k-aug-1000` (vanilla: 1,000 base puzzles, each with 1,000 augmentations)
- Training: 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- Test exact accuracy: 53% (paper: 55% ±2%)
- Checkpoints: 20 checkpoints from step 2604 to step 52080 (see the loading sketch below)
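These are PyTorch checkpoint files. A minimal loading sketch, assuming the checkpoints follow the HRM repo's convention of a plain `torch.save` per step; the repo id and filename below are placeholders, not confirmed paths:

```python
# Download one checkpoint and inspect it before wiring it into the HRM code.
# Assumptions: repo_id and filename are placeholders; the file is a plain
# torch.save payload (a state dict, or a dict containing one).
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="your-username/hrm-sudoku-extreme",         # placeholder repo id
    filename="sudoku-extreme/original-hrm/step_52080",  # placeholder filename
)

state = torch.load(ckpt_path, map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:10])  # peek at the first few entries
```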
### Augmented HRM (`sudoku-extreme/augmented-hrm/`)
- Run name: hopeful-quetzal
- Architecture: `HierarchicalReasoningModel_ACTV1` (~27M parameters)
- Dataset: `sudoku-extreme-1k-aug-1000-hint` (with easier puzzles mixed in)
- Training: 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- Peak single-checkpoint test accuracy: 54.2% (paper: 59.9%)
- Ensemble accuracy (10 checkpoints + 9 permutations, 1000 samples): 90.5% (paper: 96.9%); see the ensembling sketch after this list
- Checkpoints: 40 checkpoints from step 1302 to step 52080
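The ensemble aggregates predictions across checkpoints and permuted views of each puzzle. As an illustration of the idea only (the actual aggregation is whatever `batch_inference.py` implements; per-cell majority voting here is an assumption):

```python
# Per-cell majority vote over ensemble members' Sudoku predictions.
# Assumption: each member produces an (81,) array of digits 1-9 per puzzle;
# 10 checkpoints x ~10 views each would give roughly 100 members.
import numpy as np

def ensemble_vote(member_preds: np.ndarray) -> np.ndarray:
    """member_preds: (num_members, 81) -> (81,) consensus grid."""
    consensus = np.empty(member_preds.shape[1], dtype=np.int64)
    for cell in range(member_preds.shape[1]):
        digits, counts = np.unique(member_preds[:, cell], return_counts=True)
        consensus[cell] = digits[np.argmax(counts)]
    return consensus

members = np.random.randint(1, 10, size=(100, 81))  # stand-in predictions
print(ensemble_vote(members)[:9])
```

Note that if the permutations relabel digits, each member's prediction would have to be mapped back through the inverse permutation before voting.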
## Source Papers
- Original HRM: *Hierarchical Reasoning Model* (Wang et al., 2025)
- Augmented HRM: *Are Your Reasoning Models Reasoning or Guessing?* (Ren & Liu, 2026)
## How to Evaluate
### Original HRM: Single Checkpoint
Train and check `eval/exact_accuracy` in W&B, as described in the HRM repo.
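Here, exact accuracy means the entire 81-cell grid must match the target solution; partially correct grids do not count. A minimal sketch of the metric (the whole-grid definition is my reading of the metric name, not quoted from the repo):

```python
import numpy as np

def exact_accuracy(preds: np.ndarray, targets: np.ndarray) -> float:
    """preds, targets: (num_puzzles, 81) digit grids.
    A puzzle counts as correct only if all 81 cells match."""
    return float((preds == targets).all(axis=1).mean())
```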
### Augmented HRM: Ensemble (10 Checkpoints + 9 Permutations)
Using `batch_inference.py`:
```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9
```
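`--permutes 9` presumably generates permuted views of each puzzle; relabeling the digits is one validity-preserving choice (an assumption about the flag's behavior, shown only to make the idea concrete):

```python
# Validity-preserving Sudoku augmentation: relabel digits 1-9 under a random
# permutation, keeping 0 as the blank marker. Assumption: this is the kind of
# transform --permutes applies; the actual transform may differ.
import numpy as np

def permute_digits(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """grid: (81,) ints with 0 = blank, 1-9 = filled. Returns a relabeled copy."""
    mapping = np.concatenate(([0], rng.permutation(np.arange(1, 10))))
    return mapping[grid]

rng = np.random.default_rng(0)
puzzle = np.zeros(81, dtype=np.int64)
puzzle[:9] = np.arange(1, 10)  # toy first row
print(permute_digits(puzzle, rng)[:9])
```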
## Reproduction Notes
- All models were trained on a single NVIDIA GH200 GPU (102 GB VRAM); the papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM gap (90.5% vs 96.9%) is attributed to differences between single-GPU and multi-GPU training dynamics.
- Optimizer: `adam-atan2-pytorch` v0.2.8 (usage sketch below)
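A minimal sketch of wiring up that optimizer. The import and class name come from the `adam-atan2-pytorch` package; treating the runs above as a plain call like this is an assumption, and only the learning rate matches the original-hrm run:

```python
# Sketch: adam-atan2-pytorch in a single training step.
import torch
from adam_atan2_pytorch import AdamAtan2

model = torch.nn.Linear(81 * 10, 81 * 10)     # stand-in for the ~27M-param HRM
opt = AdamAtan2(model.parameters(), lr=7e-5)  # lr from the original-hrm run

x = torch.randn(384, 81 * 10)   # batch=384, as in the original-hrm run
loss = model(x).pow(2).mean()   # dummy loss, just to exercise the optimizer
loss.backward()
opt.step()
opt.zero_grad()
```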