HRM-MoE: Efficient Sparse Pretraining with Hierarchical Reasoning

HRM-MoE is a sparse Mixture-of-Experts extension of HRM-Text. This release is the epoch-4 pretrained 64x8 MoE checkpoint, exported from the native FSDP2 checkpoint into a single bf16 model.safetensors file.

The main point of this checkpoint is sparse activation: the model has 5.41B total parameters, but each token is routed to only the top 8 of 64 experts, for about 1.19B active parameters per token (21.9% of the total). Active compute therefore stays close to dense HRM-Text-1B, while the model draws on a much larger expert parameter pool.

This is a pre-alignment base checkpoint. It is not a chat or instruction-following assistant.

Model Structure Comparison

Dense HRM-Text XL / HRM-Text-1B-style FFN

input tokens
  -> H/L recurrent HRM blocks
  -> dense SwiGLU FFN, intermediate width 4096
  -> all dense FFN parameters are active for every token

HRM-MoE 64x8

input tokens
  -> H/L recurrent HRM blocks
  -> router over 64 SwiGLU experts
  -> top-8 experts active per token, each expert width 512
  -> active FFN width = 8 x 512 = 4096

Model	Total params	Active params / token	Active ratio	FFN path
Dense HRM-Text XL	~1.18B	~1.18B	100%	dense SwiGLU, width 4096
HRM-MoE 64x8	5.41B	~1.19B	21.9%	64 routed SwiGLU experts, top-8, width 512 each

The active FFN width is intentionally matched to the dense XL baseline, while the sparse expert pool increases total capacity.
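As an illustration of the routing step in the diagram above, the following is a minimal pure-Python sketch of top-8-of-64 expert selection for a single token. The function name `route_token`, the random logits, and the choice to apply softmax over only the selected logits are assumptions for illustration; the card does not specify how the actual router normalizes its weights.

```python
import math
import random

NUM_EXPERTS = 64    # from the model card
TOP_K = 8           # from the model card
EXPERT_WIDTH = 512  # from the model card


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_ids, weights).

    Assumption: softmax is taken over the selected logits only.
    The real hrm_text_moe router may normalize differently.
    """
    # Indices of the top_k largest router logits.
    expert_ids = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    max_logit = max(router_logits[i] for i in expert_ids)
    exps = [math.exp(router_logits[i] - max_logit) for i in expert_ids]
    total = sum(exps)
    weights = [e / total for e in exps]
    return expert_ids, weights


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
expert_ids, weights = route_token(logits)

# Eight experts of width 512 give the same active FFN width as the dense baseline.
active_ffn_width = len(expert_ids) * EXPERT_WIDTH
print(expert_ids, active_ffn_width)
```

Whatever the exact gating rule, only the selected experts run for a given token, which is why the active FFN width stays at 8 x 512 = 4096.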

Results vs Dense HRM-Text

The comparison below uses the same HRM pretraining pipeline, same sampled pretraining data, same 32-GPU setup, and 4 pretraining epochs. Values are percentages unless otherwise noted.

Benchmark	Metric	Dense HRM-Text XL e4	HRM-MoE 64x8 e4	Delta
GSM8k	acc	83.93	84.99	+1.06
MATH	acc	54.96	60.08	+5.12
DROP	em	79.45	80.86	+1.41
DROP	f1	83.06	84.53	+1.47
MMLU	acc	61.38	61.18	-0.20
ARC	acc	83.02	87.80	+4.78
HellaSwag	acc	61.96	73.89	+11.93
Winogrande	acc	71.98	73.88	+1.90
BoolQ	acc	87.25	88.75	+1.50
MMLU-Pro	acc	32.72	37.57	+4.85
AIME25	maj_pass@1	13.33	16.67	+3.34
AIME25	maj_pass@10	36.67	36.67	+0.00
AIME25	maj_pass@100	53.33	56.67	+3.34

Model Details

Field	Value
Architecture	HRM-Text with sparse MoE FFN
Checkpoint	epoch 4 pretrained checkpoint
Format	bf16 `model.safetensors`
Total parameters	5.41B
Active parameters / token	~1.19B
Non-expert parameters	~0.58B
Expert parameters	~4.83B
Hidden size	1536
Layers per H / L stack	16
Attention heads	12
H_cycles x L_cycles	2 x 3
Max sequence length	4096
Vocabulary	65,536
Position encoding	RoPE theta 10000
Normalization	Parameterless Pre-RMSNorm
Attention	Gated attention
MoE experts	64
MoE top-k	8
Expert intermediate size	512
Active FFN width	8 x 512 = 4096
MoE kernel	grouped Triton expert GEMM
Exported weights	EMA weights
dtype	bfloat16
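
The expert parameter count in the table can be roughly reproduced from the listed dimensions. The sketch below assumes that each SwiGLU expert consists of three bias-free projections (gate, up, down) and that every layer in both the H and L stacks carries one MoE FFN; neither detail is stated in this card, so treat the result as a sanity check rather than an exact accounting.

```python
HIDDEN_SIZE = 1536      # from the model card
EXPERT_WIDTH = 512      # from the model card
NUM_EXPERTS = 64        # from the model card
TOP_K = 8               # from the model card
LAYERS_PER_STACK = 16   # from the model card
NUM_STACKS = 2          # assumption: one H stack and one L stack, each layer with an MoE FFN

# Assumption: a bias-free SwiGLU expert has gate, up, and down projections.
params_per_expert = 3 * HIDDEN_SIZE * EXPERT_WIDTH
moe_layers = LAYERS_PER_STACK * NUM_STACKS

expert_params = params_per_expert * NUM_EXPERTS * moe_layers
active_expert_params = params_per_expert * TOP_K * moe_layers

print(f"expert params: {expert_params / 1e9:.2f}B")                        # 4.83B
print(f"active expert params / token: {active_expert_params / 1e9:.2f}B")  # 0.60B
```

Under these assumptions the expert pool comes to about 4.83B parameters, matching the table. Adding the roughly 0.60B of active expert parameters to the ~0.58B non-expert parameters gives about 1.18B, close to the reported ~1.19B active parameters per token; the small gap is plausibly rounding and router weights.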

Requirements

This checkpoint uses the custom hrm_text_moe architecture, which Hugging Face Transformers loads with trust_remote_code=True.

pip install --upgrade transformers safetensors accelerate

The remote-code inference path is self-contained and uses PyTorch attention plus grouped expert bmm. Hopper-class GPUs are recommended for practical speed.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Xiaoye08/HRM-MoE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is required because hrm_text_moe is a custom architecture.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

# synth,cot composite: reasoning / CoT style.
condition = "<|quad_end|><|object_ref_end|>"
prompt = f"<|im_start|>{condition}Explain why the sky is blue.<|im_end|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Mark every prompt token as part of the bidirectional PrefixLM prefix.
inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

# Greedy decoding; special tokens are kept so the condition prefix stays visible.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=False))

This mirrors the HRM-Text-1B usage pattern. The only extra flag is trust_remote_code=True, because hrm_text_moe is not yet a native Transformers architecture.

Prompt Format

HRM-Text and HRM-MoE use condition prefix tokens. Prompts should be rendered as:

<|im_start|><condition tokens>prompt text<|im_end|>

Common conditions:

direct -> <|object_ref_start|>
cot -> <|object_ref_end|>
noisy -> <|quad_start|>
synth -> <|quad_end|>

For reasoning-style prompting, synth,cot maps to <|quad_end|><|object_ref_end|>.
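
The prompt format above can be wrapped in a small helper. This is a minimal sketch: the names `CONDITION_TOKENS` and `build_prompt` are hypothetical and not part of the released code, while the token strings and the template are taken from this section.

```python
# Condition name -> prefix token, as listed in this model card.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def build_prompt(conditions, text):
    """Render <|im_start|><condition tokens>prompt text<|im_end|>.

    `conditions` is an ordered list of condition names; an unknown name
    raises KeyError instead of silently producing a malformed prompt.
    """
    prefix = "".join(CONDITION_TOKENS[name] for name in conditions)
    return f"<|im_start|>{prefix}{text}<|im_end|>"


prompt = build_prompt(["synth", "cot"], "Explain why the sky is blue.")
print(prompt)
# <|im_start|><|quad_end|><|object_ref_end|>Explain why the sky is blue.<|im_end|>
```

The resulting string is the same prompt used in the Usage example.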

PrefixLM Mask

The checkpoint was pretrained with the HRM-Text PrefixLM objective. At inference, pass token_type_ids = torch.ones_like(input_ids) for the prompt so that the whole prompt is treated as one bidirectional prefix block before autoregressive decoding begins.
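
To make the idea of a bidirectional prefix concrete, here is a minimal pure-Python sketch of a generic PrefixLM attention mask. It assumes that token_type_ids of 1 marks prefix tokens and 0 marks generated tokens, and that prefix tokens may attend to each other in both directions while all other attention is causal. How the released remote code actually builds its mask is not shown in this card, and `prefix_lm_mask` is a hypothetical name.

```python
def prefix_lm_mask(token_type_ids):
    """Return an n x n matrix where mask[i][j] is True if position i may attend to j.

    Assumption: 1 = prefix token (bidirectional among prefix tokens),
                0 = generated token (causal attention only).
    """
    n = len(token_type_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            both_prefix = token_type_ids[i] == 1 and token_type_ids[j] == 1
            mask[i][j] = causal or both_prefix
    return mask


# Three prompt tokens followed by two generated tokens.
mask = prefix_lm_mask([1, 1, 1, 0, 0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy case the first three rows can see the whole prompt, including later prompt tokens, while the last two rows see only positions up to and including their own.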

Training Snapshot

32 GPUs
4 pretraining epochs
global batch size 196,608 tokens
learning rate 2.2e-4
bfloat16 forward/backward
EMA decay 0.9999
grouped Triton MoE expert kernels
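
For readers unfamiliar with EMA weights, the sketch below shows the standard exponential moving average update with the listed decay of 0.9999, applied to a single scalar. It is a generic illustration: the card does not say whether training used decay warmup or bias correction, and `ema_update` is a hypothetical name.

```python
EMA_DECAY = 0.9999  # from the model card


def ema_update(ema_value, param_value, decay=EMA_DECAY):
    """One standard EMA step: keep `decay` of the old average, blend in the rest."""
    return decay * ema_value + (1.0 - decay) * param_value


# Toy example: the tracked parameter jumps from 0.0 to 1.0 and stays there.
ema = 0.0
for _ in range(10_000):
    ema = ema_update(ema, 1.0)

# A decay of 0.9999 averages over roughly 1 / (1 - decay) recent steps.
horizon_steps = 1.0 / (1.0 - EMA_DECAY)
print(f"ema after 10,000 steps: {ema:.4f}, horizon: about {horizon_steps:.0f} steps")
```

With this decay, the average has closed only about 63% of the gap after 10,000 steps, which illustrates how slowly the exported EMA weights follow the raw training weights.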

Limitations

Pre-alignment base checkpoint, not a chat model.
Not instruction-tuned, RLHF-trained, or safety-aligned.
English-focused pretraining mixture.
Requires trust_remote_code=True; CUDA is strongly recommended for practical inference.
Outputs may be inaccurate, biased, or unsafe.

License

Apache License 2.0.

Citation

This model is derived from HRM-Text. If you use HRM-Text or HRM-MoE, please cite:

@misc{wang2026hrmtextefficientpretrainingscaling,
      title={HRM-Text: Efficient Pretraining Beyond Scaling},
      author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},
      year={2026},
      eprint={2605.20613},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20613},
}

Upstream

HRM-Text paper: https://arxiv.org/abs/2605.20613
HRM-Text reference model: https://huggingface.co/sapientinc/HRM-Text-1B
HRM-MoE code: https://github.com/XiaoYee/HRM-MoE
