---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---

# HRM Sudoku-Extreme Checkpoints

Trained checkpoints for reproducing **Hierarchical Reasoning Model (HRM)** results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.

## Models

### Original HRM (`sudoku-extreme/original-hrm/`)

- **Run name:** liberal-bee
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000 (vanilla; 1,000 puzzles, 1,000× augmented)
- **Training:** 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- **Test exact accuracy:** 53% (paper: 55% ± 2%)
- **Checkpoints:** 20 checkpoints, from step 2604 to step 52080

### Augmented HRM (`sudoku-extreme/augmented-hrm/`)

- **Run name:** hopeful-quetzal
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000-hint (with easier puzzles mixed in)
- **Training:** 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- **Peak single-checkpoint test accuracy:** 54.2% (paper: 59.9%)
- **Ensemble accuracy (10 checkpoints + 9 permutations, 1000 samples):** 90.5% (paper: 96.9%)
- **Checkpoints:** 40 checkpoints, from step 1302 to step 52080

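The "9 permutations" in the ensemble evaluation refer to digit-relabeling test-time augmentation: any bijection of the digits 1–9 maps a valid Sudoku to another valid Sudoku, so each puzzle can be presented to the model under several relabelings and the predictions mapped back. A minimal sketch of the idea (an illustration, not the repo's exact implementation):

```python
import random

def permute_digits(grid, perm):
    """Relabel digits 1-9 via the bijection `perm`; 0 (empty cell) stays 0."""
    return [[perm[v] if v else 0 for v in row] for row in grid]

def invert(perm):
    """Inverse of a bijection given as a dict."""
    return {v: k for k, v in perm.items()}

# A random bijection on 1..9
digits = list(range(1, 10))
perm = dict(zip(digits, random.sample(digits, 9)))

grid = [[5, 3, 0], [6, 0, 0], [0, 9, 8]]  # a 3x3 corner of a puzzle, for brevity
relabeled = permute_digits(grid, perm)

# Applying the inverse permutation recovers the original grid exactly
assert permute_digits(relabeled, invert(perm)) == grid
```

Because relabeling preserves validity, a model's prediction on a relabeled puzzle can be mapped back through the inverse permutation and compared against the original solution.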
## Source Papers

- **Original HRM:** [Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734) (Wang et al., 2025)
- **Augmented HRM:** [Are Your Reasoning Models Reasoning or Guessing?](https://arxiv.org/abs/2601.10679) (Ren & Liu, 2026)

## How to Evaluate

### Original HRM — Single Checkpoint

Train and check `eval/exact_accuracy` in W&B, as described in the [HRM repo](https://github.com/sapientinc/HRM).

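`eval/exact_accuracy` gives credit only when every cell of the predicted solution matches the target; a single wrong cell makes the whole puzzle count as incorrect. A sketch of that metric (an assumption about the definition, not the repo's code):

```python
import numpy as np

def exact_accuracy(preds, targets):
    """Fraction of puzzles whose full predicted grid matches the target.

    preds, targets: integer arrays of shape (num_puzzles, 81),
    one flattened 9x9 solution per row.
    """
    preds, targets = np.asarray(preds), np.asarray(targets)
    # all(axis=1): a puzzle counts only if every cell matches
    return float((preds == targets).all(axis=1).mean())

# Two puzzles: one solved exactly, one with a single wrong cell
targets = np.tile(np.arange(81) % 9 + 1, (2, 1))
preds = targets.copy()
preds[1, 0] = 9  # corrupt one cell of the second puzzle
print(exact_accuracy(preds, targets))  # 0.5
```

This all-or-nothing scoring is why the reported accuracies are far below per-cell accuracy.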
### Augmented HRM — Ensemble (10 checkpoints + 9 permutations)

Using [batch_inference.py](https://github.com/renrua52/hrm-mechanistic-analysis):

```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
    --checkpoints "step_40362,step_41664,...,step_52080" \
    --permutes 9
```
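
The ensemble combines the predictions of all (checkpoint, permutation) pairs per puzzle. The actual aggregation lives in `batch_inference.py`; as a rough sketch, majority voting over whole candidate solution grids looks like this (an assumption about the scheme, not the script's logic):

```python
from collections import Counter

def ensemble_vote(candidate_grids):
    """Majority vote over full solution grids from (checkpoint, permutation) pairs.

    candidate_grids: list of solution grids, each a hashable tuple of 81 ints.
    Voting on whole grids (rather than per cell) guarantees the winner is a
    grid that some ensemble member actually produced.
    """
    counts = Counter(candidate_grids)
    grid, _ = counts.most_common(1)[0]
    return grid

# Three members agree, one disagrees -> the majority grid wins
a = tuple([1] * 81)
b = tuple([2] * 81)
print(ensemble_vote([a, a, b, a]) == a)  # True
```

Ensembling helps here because different checkpoints and permutations fail on different puzzles, so a correct solution often wins the vote even when individual members guess wrong.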

## Reproduction Notes
|
| 60 |
+
|
| 61 |
+
- All models trained on a single NVIDIA GH200 GPU (102GB VRAM). The papers used 8 GPUs.
|
| 62 |
+
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
|
| 63 |
+
- The Augmented HRM gap (90.5% vs 96.9%) is attributed to single-GPU vs multi-GPU training dynamics.
|
| 64 |
+
- Optimizer: adam-atan2-pytorch v0.2.8
|