ThomasHeim committed · verified · Commit 427088e · Parent(s): aa6c32b

Create readme.md

---
license: apache-2.0
tags:
- hierarchical-reasoning-model
- sudoku
- puzzle-solving
- recursive-reasoning
- adaptive-computation
pretty_name: HRM Sudoku-Extreme Checkpoints
---

# HRM Sudoku-Extreme Checkpoints

Checkpoints for reproducing **Hierarchical Reasoning Model (HRM)** results on the Sudoku-Extreme benchmark. All models were trained on a single NVIDIA GH200 GPU.

## Models

### Original HRM (`sudoku-extreme/original-hrm/`)
- **Run name:** liberal-bee
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000 (vanilla: 1,000 puzzles, each augmented 1,000x)
- **Training:** 20,000 epochs, lr=7e-5, batch=384, 1 GPU
- **Test exact accuracy:** 53% (paper: 55% ±2%)
- **Checkpoints:** 20 checkpoints from step 2604 to step 52080

### Augmented HRM (`sudoku-extreme/augmented-hrm/`)
- **Run name:** hopeful-quetzal
- **Architecture:** HierarchicalReasoningModel_ACTV1 (~27M parameters)
- **Dataset:** sudoku-extreme-1k-aug-1000-hint (with easier puzzles mixed in)
- **Training:** 40,000 epochs, lr=1e-4, batch=768, 1 GPU
- **Peak single-checkpoint test accuracy:** 54.2% (paper: 59.9%)
- **Ensemble accuracy (10 ckpts + 9 permutations, 1000 samples):** 90.5% (paper: 96.9%)
- **Checkpoints:** 40 checkpoints from step 1302 to step 52080

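The ensemble figure above combines predictions from 10 checkpoints, each run under 9 digit relabelings. The combination step can be sketched as a per-cell majority vote over candidate solution grids (a stdlib-only illustration with toy grids; `majority_vote` is a hypothetical helper, not part of the repo):

```python
from collections import Counter

def majority_vote(grids):
    """Elementwise majority vote over candidate 81-cell Sudoku solutions.

    grids: list of equal-length tuples of digits 1-9, one per ensemble
    member (e.g. one per checkpoint x digit-permutation pair).
    Returns the consensus grid as a tuple.
    """
    return tuple(
        Counter(cells).most_common(1)[0][0]
        for cells in zip(*grids)  # iterate over cell positions
    )

# Three toy candidate grids that agree on most cells:
a = tuple([5] * 81)
b = tuple([5] * 80 + [3])
c = tuple([5] * 40 + [7] + [5] * 40)
assert majority_vote([a, b, c]) == a  # disagreements are outvoted
```

A correct cell value only needs to win a plurality across the ensemble, which is why combining many mediocre members can far exceed any single checkpoint's exact accuracy.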
## Source Papers

- **Original HRM:** [Hierarchical Reasoning Model](https://arxiv.org/abs/2506.21734) (Wang et al., 2025)
- **Augmented HRM:** [Are Your Reasoning Models Reasoning or Guessing?](https://arxiv.org/abs/2601.10679) (Ren & Liu, 2026)

## How to Evaluate

### Original HRM — Single Checkpoint
Train and check `eval/exact_accuracy` in W&B, as described in the [HRM repo](https://github.com/sapientinc/HRM).

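For reference, exact accuracy counts a puzzle as solved only when every cell of the predicted grid matches the reference solution. A minimal sketch of that metric (assumed semantics, not the repo's implementation):

```python
def exact_accuracy(predictions, solutions):
    """Fraction of puzzles whose predicted grid matches the solution exactly."""
    assert len(predictions) == len(solutions) and predictions
    solved = sum(pred == sol for pred, sol in zip(predictions, solutions))
    return solved / len(predictions)

# Toy 3-cell "grids" for illustration: second puzzle has one wrong cell,
# so it counts as unsolved even though 2 of 3 cells are right.
preds = [(1, 2, 3), (4, 5, 6)]
sols = [(1, 2, 3), (4, 5, 9)]
assert exact_accuracy(preds, sols) == 0.5
```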
### Augmented HRM — Ensemble (10 checkpoints + 9 permutations)
Using [batch_inference.py](https://github.com/renrua52/hrm-mechanistic-analysis):

```bash
# Snapshot evaluation (1000 test samples)
python batch_inference.py \
  --checkpoints "step_40362,step_41664,...,step_52080" \
  --permutes 9 --num_batch 10 --batch_size 100

# Full evaluation (422,786 test samples)
python batch_inference.py \
  --checkpoints "step_40362,step_41664,...,step_52080" \
  --permutes 9
```

## Reproduction Notes

- All models were trained on a single NVIDIA GH200 GPU (102GB VRAM); the papers used 8 GPUs.
- The Original HRM result (53%) falls within the paper's stated ±2% variance for small-sample learning.
- The Augmented HRM gap (90.5% vs 96.9%) is attributed to single-GPU vs multi-GPU training dynamics.
- Optimizer: adam-atan2-pytorch v0.2.8