---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---

# HRM Sudoku-Extreme Checkpoints

Trained checkpoints for reproducing **Hierarchical Reasoning Model (HRM)** results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.

## Models

### Original HRM (`sudoku-extreme/original-hrm/`)

- **Run name:** liberal-bee
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000 (vanilla, 1000 puzzles, 1000x augmented)
- **Training:** 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- **Test exact accuracy:** 53% (paper: 55% ±2%)
- **Checkpoints:** 20 checkpoints from step 2604 to step 52080

### Augmented HRM (`sudoku-extreme/augmented-hrm/`)

- **Run name:** hopeful-quetzal
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000-hint (with easier puzzles mixed in)
- **Training:** 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- **Peak single-checkpoint test accuracy:** 54.2% (paper: 59.9%)
- **Ensemble accuracy (10 ckpts + 9 permutations, 1000 samples):** 90.5% (paper: 96.9%)
- **Checkpoints:** 40 checkpoints from step 1302 to step 52080

## Source Papers

- **Original HRM:** [Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734) (Wang et al., 2025)
- **Augmented HRM:** [Are Your Reasoning Models Reasoning or Guessing?](https://arxiv.org/abs/2601.10679) (Ren & Liu, 2026)

## How to Evaluate

### Original HRM — Single Checkpoint

Train and monitor `eval/exact_accuracy` in W&B, as described in the [HRM repo](https://github.com/sapientinc/HRM).
### Augmented HRM — Ensemble (10 checkpoints + 9 permutations)

Using [batch_inference.py](https://github.com/renrua52/hrm-mechanistic-analysis):

```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9
```

## Reproduction Notes

- All models were trained on a single NVIDIA GH200 GPU (102 GB VRAM); the papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM ensemble gap (90.5% vs. 96.9%) is attributed to single-GPU vs. multi-GPU training dynamics.
- Optimizer: adam-atan2-pytorch v0.2.8
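## How Permutation Ensembling Works

The `--permutes 9` flag exploits the fact that Sudoku is invariant under relabeling the digits 1–9: each permuted copy of a puzzle is a "new" puzzle whose solution maps back through the inverse relabeling, and per-cell majority voting across checkpoints and permutations combines the predictions. The sketch below is purely illustrative (it is not code from `batch_inference.py`); the model is stubbed as any callable mapping a flat board (0 = blank) to a flat solution.

```python
# Illustrative sketch of digit-permutation ensembling for Sudoku.
# A "board" is a flat list of ints; 0 marks a blank cell.
import random
from collections import Counter

def permute_board(board, perm):
    """Relabel digits 1-9 via perm (perm[d-1] is the new label for d); blanks stay 0."""
    return [perm[c - 1] if c != 0 else 0 for c in board]

def invert_perm(perm):
    """Inverse relabeling: if perm maps d -> p, the inverse maps p -> d."""
    inv = [0] * 9
    for i, p in enumerate(perm):
        inv[p - 1] = i + 1
    return inv

def ensemble_solve(board, models, n_perms=9, seed=0):
    """Per-cell majority vote over (model x permutation) predictions."""
    rng = random.Random(seed)
    perms = [list(range(1, 10))]          # identity permutation
    for _ in range(n_perms):              # plus n_perms random relabelings
        p = list(range(1, 10))
        rng.shuffle(p)
        perms.append(p)
    votes = [Counter() for _ in board]
    for model in models:                  # e.g. one model per checkpoint
        for perm in perms:
            pred = model(permute_board(board, perm))
            inv = invert_perm(perm)
            for i, d in enumerate(pred):  # map prediction back to original labels
                votes[i][inv[d - 1]] += 1
    return [v.most_common(1)[0][0] for v in votes]
```

Because the relabeled puzzles are equally valid inputs, disagreement across permutations signals guessing rather than reasoning, which is the lens the Augmented HRM paper applies.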