
HRM-MoE: Efficient Sparse Pretraining with Hierarchical Reasoning

HRM-MoE is a sparse Mixture-of-Experts extension of HRM-Text. This release is the epoch-4 pretrained 64x8 MoE checkpoint, exported from the native FSDP2 checkpoint into a single bf16 model.safetensors file.

The main point of this checkpoint is sparse activation. The model has 5.41B total parameters, but each token is routed to only the top 8 of 64 experts, so about 1.19B parameters are active per token (21.9% of the total). Active compute therefore stays close to dense HRM-Text-1B, while the expert pool gives the model a much larger total capacity.
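
The active-parameter figure can be roughly reproduced from the rounded split listed under Model Details (about 0.58B non-expert and 4.83B expert parameters). The sketch below assumes that every non-expert weight is used for every token and that exactly 8 of 64 equally sized experts are used per layer. The helper name `active_params_b` is illustrative and not part of the release.

```python
# Rounded figures from this model card, in billions of parameters.
TOTAL_B = 5.41
NON_EXPERT_B = 0.58   # everything outside the expert FFNs
EXPERT_B = 4.83       # the full expert pool
NUM_EXPERTS = 64
TOP_K = 8


def active_params_b(non_expert_b, expert_b, top_k, num_experts):
    """Shared weights plus the top_k / num_experts share of the expert pool."""
    return non_expert_b + expert_b * top_k / num_experts


active_b = active_params_b(NON_EXPERT_B, EXPERT_B, TOP_K, NUM_EXPERTS)
active_ratio = active_b / TOTAL_B
print(f"active ~ {active_b:.2f}B, ratio ~ {active_ratio:.1%}")
```

With these rounded inputs the estimate comes out near 1.18B and 21.9%, which agrees with the reported ~1.19B to within rounding.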

This is a pre-alignment base checkpoint. It is not a chat or instruction-following assistant.

Model Structure Comparison

Dense HRM-Text XL / HRM-Text-1B-style FFN

input tokens
  -> H/L recurrent HRM blocks
  -> dense SwiGLU FFN, intermediate width 4096
  -> all dense FFN parameters are active for every token

HRM-MoE 64x8

input tokens
  -> H/L recurrent HRM blocks
  -> router over 64 SwiGLU experts
  -> top-8 experts active per token, each expert width 512
  -> active FFN width = 8 x 512 = 4096

| Model | Total params | Active params / token | Active ratio | FFN path |
|---|---|---|---|---|
| Dense HRM-Text XL | ~1.18B | ~1.18B | 100% | dense SwiGLU, width 4096 |
| HRM-MoE 64x8 | 5.41B | ~1.19B | 21.9% | 64 routed SwiGLU experts, top-8, width 512 each |

The active FFN width is intentionally matched to the dense XL baseline, while the sparse expert pool increases total capacity.
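
To make the width matching concrete, here is a minimal pure-Python sketch of top-k routing for a single token. The gating rule (softmax over router logits, then renormalization over the selected experts) is an assumption for illustration, since this card does not specify how the router weights are computed. The function `route_top_k` and the stand-in logits are hypothetical.

```python
import math


def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


NUM_EXPERTS, TOP_K, EXPERT_WIDTH = 64, 8, 512
logits = [math.sin(i) for i in range(NUM_EXPERTS)]  # stand-in router scores
selected = route_top_k(logits, TOP_K)

# Each selected expert contributes its own intermediate width.
active_ffn_width = len(selected) * EXPERT_WIDTH
print(active_ffn_width)
```

Whatever the actual gating function is, selecting 8 experts of width 512 gives the same active width as the dense 4096-wide FFN.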

Results vs Dense HRM-Text

The comparison below uses the same HRM pretraining pipeline, same sampled pretraining data, same 32-GPU setup, and 4 pretraining epochs. Values are percentages unless otherwise noted.

| Benchmark | Metric | Dense HRM-Text XL e4 | HRM-MoE 64x8 e4 | Delta |
|---|---|---|---|---|
| GSM8k | acc | 83.93 | 84.99 | +1.06 |
| MATH | acc | 54.96 | 60.08 | +5.12 |
| DROP | em | 79.45 | 80.86 | +1.41 |
| DROP | f1 | 83.06 | 84.53 | +1.47 |
| MMLU | acc | 61.38 | 61.18 | -0.20 |
| ARC | acc | 83.02 | 87.80 | +4.78 |
| HellaSwag | acc | 61.96 | 73.89 | +11.93 |
| Winogrande | acc | 71.98 | 73.88 | +1.90 |
| BoolQ | acc | 87.25 | 88.75 | +1.50 |
| MMLU-Pro | acc | 32.72 | 37.57 | +4.85 |
| AIME25 | maj_pass@1 | 13.33 | 16.67 | +3.34 |
| AIME25 | maj_pass@10 | 36.67 | 36.67 | +0.00 |
| AIME25 | maj_pass@100 | 53.33 | 56.67 | +3.34 |

Model Details

| Field | Value |
|---|---|
| Architecture | HRM-Text with sparse MoE FFN |
| Checkpoint | epoch 4 pretrained checkpoint |
| Format | bf16 model.safetensors |
| Total parameters | 5.41B |
| Active parameters / token | ~1.19B |
| Non-expert parameters | ~0.58B |
| Expert parameters | ~4.83B |
| Hidden size | 1536 |
| Layers per H / L stack | 16 |
| Attention heads | 12 |
| H_cycles x L_cycles | 2 x 3 |
| Max sequence length | 4096 |
| Vocabulary | 65,536 |
| Position encoding | RoPE theta 10000 |
| Normalization | Parameterless Pre-RMSNorm |
| Attention | Gated attention |
| MoE experts | 64 |
| MoE top-k | 8 |
| Expert intermediate size | 512 |
| Active FFN width | 8 x 512 = 4096 |
| MoE kernel | grouped Triton expert GEMM |
| Exported weights | EMA weights |
| dtype | bfloat16 |
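
As a sanity check, the expert parameter count can be approximated from the dimensions above. The sketch assumes that each SwiGLU expert has three bias-free projections (gate, up, down) and that there are two stacks (H and L) of 16 layers, each layer holding its own 64 experts. Neither assumption is stated explicitly in this card, and `swiglu_params` is an illustrative helper.

```python
HIDDEN = 1536
EXPERT_WIDTH = 512
NUM_EXPERTS = 64
TOP_K = 8
LAYERS_PER_STACK = 16
NUM_STACKS = 2  # assumption: one H stack and one L stack, each with MoE FFNs


def swiglu_params(hidden, width):
    """Gate, up, and down projections, assuming no bias terms."""
    return 3 * hidden * width


per_expert = swiglu_params(HIDDEN, EXPERT_WIDTH)
moe_layers = LAYERS_PER_STACK * NUM_STACKS
expert_total = per_expert * NUM_EXPERTS * moe_layers
expert_active = per_expert * TOP_K * moe_layers

print(f"expert pool:    {expert_total / 1e9:.2f}B")
print(f"active experts: {expert_active / 1e9:.2f}B")
```

Under these assumptions the expert pool comes to roughly 4.83B parameters, in line with the table, and about 0.60B of them are active per token.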

Requirements

This checkpoint uses the custom hrm_text_moe architecture through Hugging Face trust_remote_code=True.

pip install --upgrade transformers safetensors accelerate

The remote-code inference path is self-contained and uses PyTorch attention plus grouped expert bmm. Hopper-class GPUs are recommended for practical speed.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Xiaoye08/HRM-MoE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

# synth,cot composite: reasoning / CoT style.
condition = "<|quad_end|><|object_ref_end|>"
prompt = f"<|im_start|>{condition}Explain why the sky is blue.<|im_end|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=False))

This mirrors the HRM-Text-1B usage pattern. The only extra flag is trust_remote_code=True, because hrm_text_moe is not yet a native Transformers architecture.

Prompt Format

HRM-Text and HRM-MoE use condition prefix tokens. Prompts should be rendered as:

<|im_start|><condition tokens>prompt text<|im_end|>

Common conditions:

  • direct -> <|object_ref_start|>
  • cot -> <|object_ref_end|>
  • noisy -> <|quad_start|>
  • synth -> <|quad_end|>

For reasoning-style prompting, synth,cot maps to <|quad_end|><|object_ref_end|>.
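
The rendering rule above can be wrapped in a small helper. The token strings come from the list of conditions in this section, while the function name `render_prompt` and its default conditions are illustrative and not part of the released code.

```python
# Condition names mapped to their prefix tokens, as listed above.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def render_prompt(text, conditions=("synth", "cot")):
    """Concatenate condition tokens in order, then wrap the prompt text."""
    prefix = "".join(CONDITION_TOKENS[name] for name in conditions)
    return f"<|im_start|>{prefix}{text}<|im_end|>"


print(render_prompt("Explain why the sky is blue."))
print(render_prompt("2 + 2 =", conditions=("direct",)))
```

Passing an unknown condition name raises a `KeyError`, which is usually preferable to silently rendering a malformed prefix.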

PrefixLM Mask

The checkpoint was pretrained with the HRM-Text PrefixLM objective. Pass token_type_ids = torch.ones_like(input_ids) for the prompt, marking it as one bidirectional prefix block before autoregressive decoding.
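
For intuition, the following sketch builds a generic PrefixLM attention mask from `token_type_ids`, assuming that 1 marks prefix tokens and 0 marks generated tokens. It illustrates the general technique only. The exact mask construction in the remote code may differ, and `prefix_lm_mask` is a hypothetical helper.

```python
def prefix_lm_mask(token_type_ids):
    """Return mask[i][j], True when position i may attend to position j.

    Prefix positions (type 1) attend to each other in both directions.
    All positions also attend causally to earlier positions.
    """
    n = len(token_type_ids)
    return [
        [
            j <= i or (token_type_ids[i] == 1 and token_type_ids[j] == 1)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three prompt tokens followed by two generated tokens.
mask = prefix_lm_mask([1, 1, 1, 0, 0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the top-left 3x3 block is fully filled (bidirectional prefix), while the rows for generated tokens are lower-triangular (causal).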

Training Snapshot

  • 32 GPUs
  • 4 pretraining epochs
  • global batch size 196,608 tokens
  • learning rate 2.2e-4
  • bfloat16 forward/backward
  • EMA decay 0.9999
  • grouped Triton MoE expert kernels
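
To give a feel for what an EMA decay of 0.9999 implies, here is a minimal sketch of the standard exponential moving average update on a toy parameter. It assumes the EMA is updated once per optimizer step, which this card does not state, and `ema_update` is an illustrative helper rather than the training code.

```python
DECAY = 0.9999
TOKENS_PER_STEP = 196_608


def ema_update(ema, params, decay=DECAY):
    """Standard EMA: keep `decay` of the old average, blend in the new value."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Toy example: the live parameter jumps from 0 to 1 and stays there.
ema = [0.0]
for _ in range(10_000):
    ema = ema_update(ema, [1.0])

# Rough averaging horizon in steps and, under the per-step assumption, in tokens.
horizon_steps = 1.0 / (1.0 - DECAY)
horizon_tokens = horizon_steps * TOKENS_PER_STEP
print(f"EMA after 10,000 steps: {ema[0]:.3f}")
print(f"horizon: ~{horizon_steps:.0f} steps, ~{horizon_tokens / 1e9:.2f}B tokens")
```

After about 10,000 steps the average has covered roughly 63% of the gap to the new value, so the exported EMA weights reflect a window on the order of 10,000 updates.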

Limitations

  • Pre-alignment base checkpoint, not a chat model.
  • Not instruction-tuned, RLHF-trained, or safety-aligned.
  • English-focused pretraining mixture.
  • Requires trust_remote_code=True; CUDA is strongly recommended for practical inference.
  • Outputs may be inaccurate, biased, or unsafe.

License

Apache License 2.0.

Citation

This model is derived from HRM-Text. If you use HRM-Text or HRM-MoE, please cite:

@misc{wang2026hrmtextefficientpretrainingscaling,
      title={HRM-Text: Efficient Pretraining Beyond Scaling},
      author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},
      year={2026},
      eprint={2605.20613},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20613},
}
