HRM-Text-0.6B

HRM-Text-0.6B is a dense HRM-Text L-size base checkpoint trained with the HRM pretraining pipeline. This release is the epoch-4 pretrained checkpoint, exported from the native FSDP2 checkpoint into a single bf16 model.safetensors file.

The model has 694.7M parameters, all of them dense, so every token activates the full parameter budget (nominally 0.6B). It is the dense active-compute baseline for the active-matched L-MoE 64x8 experiment in Xiaoye08/HRM-MoE.

This is a pre-alignment base checkpoint. It is not a chat or instruction-following assistant.

Model Structure Comparison

HRM-Text-0.6B dense L

input tokens
  -> H/L recurrent HRM blocks
  -> dense SwiGLU FFN, intermediate width 3584
  -> all dense FFN parameters are active for every token

Active-matched L-MoE 64x8

input tokens
  -> H/L recurrent HRM blocks
  -> router over 64 SwiGLU experts
  -> top-8 experts active per token, each expert width 448
  -> active FFN width = 8 x 448 = 3584

| Model | Total params | Active params / token | Active ratio | FFN path |
|---|---|---|---|---|
| HRM-Text-0.6B dense L | 694.7M | 694.7M | 100% | dense SwiGLU, width 3584 |
| L-MoE 64x8 active-matched | larger sparse pool | ~dense-L active compute | sparse | 64 routed SwiGLU experts, top-8, width 448 each |

The MoE variant matches the active FFN width of this dense L checkpoint while increasing total expert capacity.
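The width matching can be checked with a few lines of arithmetic. The sketch below is a back-of-envelope estimate, not the training code: it assumes the MoE variant shares the dense model's hidden size of 1280 and that each SwiGLU block consists of three bias-free matrices (gate, up, down). Neither assumption is stated in this card, and router weights are ignored.

```python
# Back-of-envelope FFN sizing: dense L vs. active-matched L-MoE 64x8.
# ASSUMPTIONS (not stated in the card): the MoE variant uses hidden size 1280
# like the dense model, and a SwiGLU block is three bias-free matrices.

HIDDEN = 1280        # dense hidden size; assumed for the MoE variant as well
DENSE_WIDTH = 3584   # dense FFN intermediate width
NUM_EXPERTS = 64
TOP_K = 8
EXPERT_WIDTH = 448


def swiglu_params(hidden: int, width: int) -> int:
    """Weights in one SwiGLU block: gate, up, and down projections."""
    return 3 * hidden * width


active_width = TOP_K * EXPERT_WIDTH        # FFN width used per token
total_width = NUM_EXPERTS * EXPERT_WIDTH   # FFN width stored across all experts

dense_ffn = swiglu_params(HIDDEN, DENSE_WIDTH)
moe_active_ffn = TOP_K * swiglu_params(HIDDEN, EXPERT_WIDTH)
moe_total_ffn = NUM_EXPERTS * swiglu_params(HIDDEN, EXPERT_WIDTH)

print(f"active width per token: {active_width} (dense: {DENSE_WIDTH})")
print(f"total expert width:     {total_width}")
print(f"FFN weights per layer:  dense={dense_ffn:,} "
      f"moe_active={moe_active_ffn:,} moe_total={moe_total_ffn:,}")
```

Under these assumptions the per-token FFN weights are identical in both models, while the MoE layer stores eight times as many expert weights as the dense FFN.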

Results vs Active-Matched L-MoE

Both models in the comparison below were trained on the same sampled pretraining data, on the same 32-GPU setup, for 4 pretraining epochs. Values are percentages unless otherwise noted.

| Benchmark | Metric | HRM-Text-0.6B dense e4 | L-MoE 64x8 e4 | Delta |
|---|---|---|---|---|
| GSM8k | acc | 79.61 | 83.62 | +4.01 |
| MATH | acc | 50.96 | 55.50 | +4.54 |
| DROP | em | 74.21 | 77.86 | +3.65 |
| DROP | f1 | 77.94 | 81.47 | +3.53 |
| MMLU | acc | 53.74 | 58.60 | +4.86 |
| ARC | acc | 76.37 | 82.27 | +5.90 |
| HellaSwag | acc | 51.48 | 65.58 | +14.10 |
| Winogrande | acc | 66.93 | 70.32 | +3.39 |
| BoolQ | acc | 84.56 | 86.02 | +1.46 |
| MMLU-Pro | acc | 28.24 | 30.70 | +2.46 |
| AIME25 | maj_pass@1 | 16.67 | 16.67 | +0.00 |
| AIME25 | maj_pass@10 | 26.67 | 26.67 | +0.00 |
| AIME25 | maj_pass@100 | 50.00 | 50.00 | +0.00 |

This dense checkpoint is the main 0.6B baseline. The active-matched L-MoE improves every standard benchmark and MMLU-Pro, with gains ranging from +1.46 (BoolQ) to +14.10 (HellaSwag), while the AIME25 majority-voting scores are identical for both models.
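The Delta column is simply the MoE score minus the dense score. A short sketch that recomputes it from the two score columns, with the values copied from the table above, can serve as a sanity check when new rows are added:

```python
# (dense e4, L-MoE 64x8 e4) scores in percent, copied from the results table.
scores = {
    "GSM8k acc": (79.61, 83.62),
    "MATH acc": (50.96, 55.50),
    "DROP em": (74.21, 77.86),
    "DROP f1": (77.94, 81.47),
    "MMLU acc": (53.74, 58.60),
    "ARC acc": (76.37, 82.27),
    "HellaSwag acc": (51.48, 65.58),
    "Winogrande acc": (66.93, 70.32),
    "BoolQ acc": (84.56, 86.02),
    "MMLU-Pro acc": (28.24, 30.70),
    "AIME25 maj_pass@1": (16.67, 16.67),
    "AIME25 maj_pass@10": (26.67, 26.67),
    "AIME25 maj_pass@100": (50.00, 50.00),
}

# Delta = MoE minus dense, rounded to the two decimals used in the table.
deltas = {name: round(moe - dense, 2) for name, (dense, moe) in scores.items()}

improved = [name for name, delta in deltas.items() if delta > 0]
unchanged = [name for name, delta in deltas.items() if delta == 0]

for name, delta in deltas.items():
    print(f"{name:<22} {delta:+.2f}")
print(f"{len(improved)} improved, {len(unchanged)} unchanged")
```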

Model Details

| Field | Value |
|---|---|
| Architecture | Dense HRM-Text |
| Checkpoint | epoch 4 pretrained checkpoint |
| Format | bf16 model.safetensors |
| Parameters | 694,682,880 |
| Hidden size | 1280 |
| Total dense transformer layers | 24 (12 H + 12 L) |
| Layers per H / L stack | 12 |
| Attention heads | 10 |
| H_cycles x L_cycles | 2 x 3 |
| Max sequence length | 4096 |
| Vocabulary | 65,536 |
| Position encoding | RoPE, theta 10000 |
| Normalization | Parameterless Pre-RMSNorm |
| Attention | Gated attention |
| FFN | Dense SwiGLU |
| FFN intermediate size | 3584 |
| Exported weights | EMA weights |
| dtype | bfloat16 |
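A few derived quantities follow from the table. The sketch below computes the per-head dimension, the approximate bf16 weight size, and a rough parameter estimate. The layout behind the estimate is an assumption, not something this card documents: five hidden-by-hidden attention matrices per layer (query, key, value, output, plus one gate projection), three bias-free SwiGLU matrices, untied input and output embeddings, and no weights in the norms.

```python
# Derived quantities from the Model Details table.
HIDDEN = 1280
HEADS = 10
LAYERS = 24            # 12 H + 12 L
FFN_WIDTH = 3584
VOCAB = 65_536
REPORTED_PARAMS = 694_682_880

head_dim = HIDDEN // HEADS           # per-head dimension
bf16_bytes = REPORTED_PARAMS * 2     # bfloat16 stores 2 bytes per weight

# ASSUMED layout (not documented in the card):
#   attention: q, k, v, o plus one gate projection, each HIDDEN x HIDDEN
#   FFN:       gate, up, down SwiGLU matrices, no biases
#   embedding: separate (untied) input and output matrices
#   norms:     parameterless, so they contribute nothing
attn_per_layer = 5 * HIDDEN * HIDDEN
ffn_per_layer = 3 * HIDDEN * FFN_WIDTH
embeddings = 2 * VOCAB * HIDDEN

estimate = LAYERS * (attn_per_layer + ffn_per_layer) + embeddings
gap = REPORTED_PARAMS - estimate

print(f"head_dim   = {head_dim}")
print(f"bf16 size  = {bf16_bytes / 1e9:.2f} GB")
print(f"estimate   = {estimate:,}")
print(f"reported   = {REPORTED_PARAMS:,}")
print(f"unexplained gap = {gap:,}")
```

Under these assumptions the estimate falls 1,280 weights short of the reported count. The card does not say what accounts for the remainder, so treat the layout as a plausible reading rather than a confirmed breakdown.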

Requirements

This checkpoint includes remote code for the hrm_text architecture so it can be loaded on Transformers versions that do not yet ship native HRM-Text support.

pip install --upgrade transformers safetensors accelerate

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Xiaoye08/HRM-Text-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

# Composite condition "synth,cot": reasoning / chain-of-thought style.
condition = "<|quad_end|><|object_ref_end|>"
prompt = f"<|im_start|>{condition}Explain why the sky is blue.<|im_end|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# PrefixLM: mark every prompt token as part of one bidirectional prefix block.
inputs["token_type_ids"] = torch.ones_like(inputs["input_ids"])

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=False))

Prompt Format

HRM-Text uses condition prefix tokens. Prompts should be rendered as:

<|im_start|><condition tokens>prompt text<|im_end|>

Common conditions:

  • direct -> <|object_ref_start|>
  • cot -> <|object_ref_end|>
  • noisy -> <|quad_start|>
  • synth -> <|quad_end|>

For reasoning-style prompting, synth,cot maps to <|quad_end|><|object_ref_end|>.
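The format above can be wrapped in a small helper. This is a minimal sketch: `CONDITION_TOKENS` and `render_prompt` are hypothetical names introduced here for illustration and are not part of the released code. Composite conditions are concatenated in the order given, matching the synth,cot example.

```python
# Hypothetical helper for rendering HRM-Text prompts; names are illustrative.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def render_prompt(text: str, conditions: str = "synth,cot") -> str:
    """Render a prompt as <|im_start|><condition tokens>text<|im_end|>.

    `conditions` is a comma-separated list of condition names; their tokens
    are concatenated in the order given.
    """
    names = [name.strip() for name in conditions.split(",") if name.strip()]
    unknown = [name for name in names if name not in CONDITION_TOKENS]
    if unknown:
        raise ValueError(f"unknown condition(s): {', '.join(unknown)}")
    prefix = "".join(CONDITION_TOKENS[name] for name in names)
    return f"<|im_start|>{prefix}{text}<|im_end|>"


print(render_prompt("Explain why the sky is blue."))
print(render_prompt("What is 2 + 2?", conditions="direct"))
```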

PrefixLM Mask

The checkpoint was pretrained with the HRM-Text PrefixLM objective. Pass token_type_ids = torch.ones_like(input_ids) for the prompt, marking it as one bidirectional prefix block before autoregressive decoding.
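To make the masking pattern concrete, here is a generic PrefixLM attention mask in plain Python. It illustrates the general technique only and is not taken from the checkpoint's remote code: it assumes that a token type of 1 marks prefix tokens and 0 marks generated tokens, and the function name `prefix_lm_mask` is hypothetical.

```python
def prefix_lm_mask(token_type_ids: list[int]) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    ASSUMPTION: 1 marks prefix (prompt) tokens, 0 marks generated tokens.
    Prefix tokens attend to each other in both directions; every position
    also attends causally to itself and to all earlier positions.
    """
    n = len(token_type_ids)
    return [
        [
            j <= i or (token_type_ids[i] == 1 and token_type_ids[j] == 1)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three prompt tokens followed by two generated tokens.
mask = prefix_lm_mask([1, 1, 1, 0, 0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid the top-left 3x3 block is fully filled (the bidirectional prefix), while the last two rows are lower-triangular (causal decoding over the prefix and earlier generated tokens).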

Training Snapshot

  • 32 GPUs
  • 4 pretraining epochs
  • global batch size 172,032 tokens
  • learning rate 2.5e-4
  • bfloat16 forward/backward
  • EMA decay 0.9999
  • official HRM-Text L configuration
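Two of these numbers lend themselves to a quick worked example. The sketch below splits the global batch across GPUs, assuming an even data-parallel split (the card does not describe the sharding), and shows the standard exponential-moving-average update that a decay of 0.9999 implies. `ema_update` is a hypothetical helper operating on plain lists, not the pipeline's implementation.

```python
GLOBAL_BATCH_TOKENS = 172_032
NUM_GPUS = 32
MAX_SEQ_LEN = 4096     # from Model Details
EMA_DECAY = 0.9999

# ASSUMPTION: the global batch is split evenly across GPUs.
tokens_per_gpu = GLOBAL_BATCH_TOKENS // NUM_GPUS
# Number of maximum-length sequences that fit in one global batch.
full_sequences = GLOBAL_BATCH_TOKENS // MAX_SEQ_LEN


def ema_update(ema, params, decay=EMA_DECAY):
    """Standard EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


print(f"tokens per GPU per step: {tokens_per_gpu}")
print(f"max-length sequences per global batch: {full_sequences}")

# One EMA step moves the average 0.01% of the way toward the live weights.
ema = ema_update([0.0, 1.0], [1.0, 1.0])
print(ema)
```

With a decay of 0.9999 the average is dominated by roughly the last 1 / (1 - 0.9999) = 10,000 optimizer steps, which is why the exported EMA weights change more smoothly than the live training weights.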

Limitations

  • Pre-alignment base checkpoint, not a chat model.
  • Not instruction-tuned, RLHF-trained, or safety-aligned.
  • English-focused pretraining mixture.
  • Requires trust_remote_code=True unless using a Transformers version with native hrm_text support.
  • Outputs may be inaccurate, biased, or unsafe.

License

Apache License 2.0.

Citation

This model is derived from HRM-Text. If you use HRM-Text, please cite:

@misc{wang2026hrmtextefficientpretrainingscaling,
      title={HRM-Text: Efficient Pretraining Beyond Scaling},
      author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},
      year={2026},
      eprint={2605.20613},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20613},
}
