Add files using upload-large-folder tool

b6cdd64 verified 2 days ago

7.67 kB

	---
	license: apache-2.0
	language:
	- en
	library_name: mlx
	pipeline_tag: text-generation
	tags:
	- mlx
	- apple-silicon
	- hrm
	- hierarchical-reasoning
	- prefix-lm
	- pre-alignment
	- non-chat
	- non-instruction-tuned
	base_model: sapientinc/HRM-Text-1B
	---

	# HRM-Text-1B-MLX

	MLX port of [sapientinc/HRM-Text-1B](https://huggingface.co/sapientinc/HRM-Text-1B) — Sapient Intelligence's Hierarchical Reasoning Model adapted to language modeling. Optimized for Apple Silicon with a recurrent KV cache, `mx.fast` kernels, and a clean `load`/`generate` API.

	## What this is

	A 1 B-parameter "pre-alignment" language model that replaces the standard "scale parameters, pretrain on internet-scale text" recipe with scale compute depth at fixed parameters via recurrence, train only on instruction-response pairs. Two transformer modules — H (slow / strategic) and L (fast / execution) — iterate over the same input embeddings for `H_cycles × (L_cycles + 1) = 8` stack passes per token, with additive state injection between modules.

	From the paper: 60.7 % MMLU / 81.9 % ARC-C / 82.2 % DROP / 84.5 % GSM8K / 56.2 % MATH — competitive with 2-7 B open models at 100-900× fewer training tokens and 96-432× less compute.

	> Disclaimer (carries over from upstream): This is a pre-alignment checkpoint, not a chat or instruction-following assistant. It was pre-trained on a PrefixLM objective with condition prefix tokens. No multi-turn SFT, no RLHF, no chat coating. Use it as a research substrate, not a finished assistant.

	## Install

	```bash
	pip install mlx safetensors transformers
	```

	Then drop `hrm_mlx.py` into your project (the implementation is a single file).

	## Usage

	```python
	import mlx.core as mx
	from hrm_mlx import load, generate

	model, tokenizer = load("RockTalk/HRM-Text-1B-MLX") # or a local path

	# Reasoning / chain-of-thought (default condition)
	print(generate(
	model, tokenizer,
	"Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and bakes "
	"muffins with 4 every day. She sells the remainder at $2 per egg. "
	"How much does she make daily?",
	condition="synth,cot",
	max_tokens=300,
	))

	# Direct answer (few-shot extraction / multi-choice)
	print(generate(
	model, tokenizer,
	"Q: What is the capital of Argentina?\nA:",
	condition="direct",
	max_tokens=10,
	))

	# Streaming
	for chunk in generate(
	model, tokenizer, "Explain why the sky is blue.",
	condition="synth", max_tokens=200, stream=True,
	):
	print(chunk, end="", flush=True)

	# Sampled
	print(generate(
	model, tokenizer, "Write a short poem about silicon.",
	condition="synth", temperature=0.8, top_p=0.9, max_tokens=200,
	))
	```

	## Condition modes

	HRM-Text routes between training distributions via composite condition prefix tokens. Combine with comma-separated tags (order matters):

	\| Condition \| Use for \| Notes \|
	\|---\|---\|---\|
	\| `synth,cot` \| Math, multi-step reasoning \| Most verbose, formal step-by-step with LaTeX \|
	\| `synth` \| Factual explanations \| Cleanest factual answers; boxed final answer \|
	\| `direct` \| Few-shot extraction, MCQ \| Terse, 1-5 tokens \|
	\| `cot` alone \| Free-form chain-of-thought \| Moderate verbosity \|
	\| `noisy` \| Web-crawl-style instructions \| Often terse; can leak training-data formatting \|

	Practical guidance:

	- For NLP tasks (classification, extraction, structured output), use `direct` with 2-8 few-shot examples. Zero-shot `direct` is noticeably weaker.
	- For math / reasoning, use `synth,cot`.
	- The model is not a base LM — it cannot continue raw text ("Once upon a time, …" gets interpreted as a question to answer).

	## Architecture

	```
	z_H = embed(input_ids) * embedding_scale # 1 / initializer_range ≈ 39.19
	z_L = zeros

	for h in range(H_cycles=2):
	for l in range(L_cycles=3):
	z_L = L_module(z_L + z_H) # 16-layer transformer stack
	z_H = H_module(z_H + z_L) # same architecture, separate weights

	logits = lm_head(z_H)
	```

	Per forward pass: `L_module` runs 6 times and `H_module` runs 2 times — 8 stack invocations × 16 layers = 128 transformer-layer-equivalents of compute, all at 1.18 B parameters.

	\| Field \| Value \|
	\|---\|---\|
	\| Parameters \| 1.18 B \|
	\| Hidden size \| 1536 \|
	\| Layers per stack \| 16 \|
	\| Attention heads \| 12 (MHA, head_dim 128) \|
	\| Intermediate size \| 4096 \|
	\| H_cycles × L_cycles \| 2 × 3 \|
	\| Max sequence \| 4096 \|
	\| Vocabulary \| 65,536 \|
	\| Position encoding \| RoPE (θ = 10,000) \|
	\| Activation \| SwiGLU \|
	\| Normalization \| Parameterless Pre-RMSNorm \|
	\| Attention \| Gated (sigmoid output gate) \|
	\| Objective \| Instruction-only PrefixLM (40 B unique tokens) \|

	## MLX-specific implementation notes

	- 128-slot recurrent KV cache — one slot per `(H_cycle, L_cycle \| trailing_H, layer)`, indexed as `(h * (L_cycles+1) + l) * num_layers_per_stack + layer_idx`. Each slot uses chunked-grow allocation (256-token chunks) à la `mlx-lm`.
	- `mx.fast.scaled_dot_product_attention` for both prefill and step.
	- `mx.fast.rope` with `offset` parameter so cached and new positions stay aligned without a precomputed cos/sin table.
	- `mx.fast.rms_norm` with `weight=None` (parameterless RMSNorm).
	- Fused projections — the safetensors store `gqkv_proj` (gate, q, k, v concatenated) and `gate_up_proj` (gate, up concatenated). The MLX port keeps the fused layout, splitting on dim 0 inside the forward (one big matmul beats four small ones on the Metal backend).
	- PrefixLM mask — when `token_type_ids` is all-ones on prefill (the canonical usage), we pass `mask=None` to the fast SDPA kernel and let the bidirectional attention happen implicitly. Mixed prefix/causal masks are supported via an explicit additive mask.

	## Benchmarks (M3 Ultra, bf16, greedy)

	\| Prompt \| MLX (this port) \| PyTorch MPS (with cache) \| Speedup \|
	\|---\|---\|---\|---\|
	\| Sky-blue (~15-tok prompt → 50 tok) \| 36.3 tok/s \| 19.3 tok/s \| 1.89× \|
	\| Janet GSM8K (~75-tok prompt → ~150 tok) \| 37.5 tok/s \| 14.3 tok/s \| 2.62× \|
	\| Train meeting (~50-tok prompt → 400 tok) \| 36.5 tok/s \| 14.7 tok/s \| 2.47× \|
	\| Capitals 8-shot (~115-tok prompt → 10 tok) \| 24.5 tok/s \| 11.8 tok/s \| 2.07× \|

	Numerical parity: token-for-token match with the PyTorch reference under greedy decoding for the first ~30 tokens on every prompt, occasional late-stage divergence on near-tie logits (bf16 precision limit) thereafter — both implementations are correct, just picking different tokens out of statistical ties.

	## Files

	```
	config.json — model config (unchanged from upstream)
	tokenizer.json — tokenizer (unchanged)
	tokenizer_config.json — special tokens map (EOS = <\|box_end\|>, id 11)
	model_mlx.safetensors — bf16 weights in MLX-native layout (2.2 GB)
	README.md — this file
	hrm_mlx.py — the implementation (~350 lines)
	LICENSE — Apache 2.0
	```

	## License

	[Apache License 2.0](LICENSE) — same as upstream.

	## Citation

	The original HRM-Text paper:

	```bibtex
	@misc{wang2026hrmtextefficientpretrainingscaling,
	title={HRM-Text: Efficient Pretraining Beyond Scaling},
	author={Guan Wang and Changling Liu and Chenyu Wang and Cai Zhou and Yuhao Sun and Yifei Wu and Shuai Zhen and Luca Scimeca and Yasin Abbasi Yadkori},
	year={2026},
	eprint={2605.20613},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2605.20613},
	}
	```

	## Acknowledgments

	Upstream model and weights: [Sapient Intelligence](https://huggingface.co/sapientinc).

	MLX port by [@RockTalk](https://huggingface.co/RockTalk).