---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---
# HRM Sudoku-Extreme Checkpoints

Trained checkpoints for reproducing **Hierarchical Reasoning Model (HRM)** results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.
## Models

### Original HRM (`sudoku-extreme/original-hrm/`)

- **Run name:** liberal-bee
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000 (vanilla, 1000 puzzles, 1000x augmented)
- **Training:** 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- **Test exact accuracy:** 53% (paper: 55% ±2%)
- **Checkpoints:** 20 checkpoints from step 2604 to step 52080
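The 20 checkpoints appear to be saved at a fixed interval; assuming even spacing between the stated endpoints (an inference from the numbers above, not something the training scripts guarantee), the step schedule can be reconstructed as:

```python
# Hypothetical reconstruction of the checkpoint step schedule,
# assuming 20 evenly spaced checkpoints from step 2604 to step 52080
# (52080 / 20 = 2604, so the interval would be 2604 steps).
steps = [2604 * k for k in range(1, 21)]

print(steps[0], steps[-1], len(steps))  # 2604 52080 20
```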
### Augmented HRM (`sudoku-extreme/augmented-hrm/`)

- **Run name:** hopeful-quetzal
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000-hint (with easier puzzles mixed in)
- **Training:** 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- **Peak single-checkpoint test accuracy:** 54.2% (paper: 59.9%)
- **Ensemble accuracy (10 ckpts + 9 permutations, 1000 samples):** 90.5% (paper: 96.9%)
- **Checkpoints:** 40 checkpoints from step 1302 to step 52080
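The ensemble result aggregates predictions from 10 checkpoints under 9 digit permutations each. A minimal sketch of one plausible aggregation scheme, cell-wise majority voting (the exact voting rule used by `batch_inference.py` is an assumption here; `predictions` holds each ensemble member's 81-cell solution grid, already mapped back to canonical digits):

```python
from collections import Counter

def majority_vote(predictions):
    """Cell-wise majority vote over ensemble member predictions.

    predictions: list of equal-length digit lists, one per ensemble
    member (checkpoint x permutation). Returns the voted grid.
    """
    return [Counter(cells).most_common(1)[0][0] for cells in zip(*predictions)]

# Toy example: three members, disagreeing on the last cell
members = [[5, 3, 1], [5, 3, 2], [5, 3, 1]]
print(majority_vote(members))  # [5, 3, 1]
```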
## Source Papers

- **Original HRM:** [Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734) (Wang et al., 2025)
- **Augmented HRM:** [Are Your Reasoning Models Reasoning or Guessing?](https://arxiv.org/abs/2601.10679) (Ren & Liu, 2026)
## How to Evaluate

### Original HRM — Single Checkpoint

Train and check `eval/exact_accuracy` in W&B, as described in the [HRM repo](https://github.com/sapientinc/HRM).
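`eval/exact_accuracy` counts a puzzle as solved only if every cell matches the reference solution. A minimal sketch of that metric (the function name and data layout are illustrative assumptions, not the HRM repo's API):

```python
def exact_accuracy(preds, targets):
    """Fraction of puzzles whose predicted grid matches its target exactly.

    preds, targets: lists of equal-length cell sequences
    (e.g. 81 digits per Sudoku grid).
    """
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)

# Toy example: one of two "grids" fully correct -> 0.5
preds = [[1, 2, 3], [4, 5, 6]]
targets = [[1, 2, 3], [4, 5, 0]]
print(exact_accuracy(preds, targets))  # 0.5
```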
### Augmented HRM — Ensemble (10 checkpoints + 9 permutations)

Using [batch_inference.py](https://github.com/renrua52/hrm-mechanistic-analysis):
```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
  --checkpoints "step_40362,step_41664,...,step_52080" \
  --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
  --checkpoints "step_40362,step_41664,...,step_52080" \
  --permutes 9
```

## Reproduction Notes
- All models trained on a single NVIDIA GH200 GPU (102GB VRAM). The papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM gap (90.5% vs 96.9%) is attributed to single-GPU vs multi-GPU training dynamics.
- Optimizer: adam-atan2-pytorch v0.2.8