Instructions to use joelhenwang/OdinNext-138M-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use joelhenwang/OdinNext-138M-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="joelhenwang/OdinNext-138M-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("joelhenwang/OdinNext-138M-Instruct", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use joelhenwang/OdinNext-138M-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "joelhenwang/OdinNext-138M-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joelhenwang/OdinNext-138M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
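
The same request can be issued from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`; `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL = "joelhenwang/OdinNext-138M-Instruct"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt, **kwargs):
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("What is the capital of France?")` returns the reply text. Passing `base_url="http://localhost:30000"` should target an SGLang server started on that port instead.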

Use Docker

docker model run hf.co/joelhenwang/OdinNext-138M-Instruct

SGLang

How to use joelhenwang/OdinNext-138M-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "joelhenwang/OdinNext-138M-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joelhenwang/OdinNext-138M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "joelhenwang/OdinNext-138M-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joelhenwang/OdinNext-138M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use joelhenwang/OdinNext-138M-Instruct with Docker Model Runner:
```
docker model run hf.co/joelhenwang/OdinNext-138M-Instruct
```

OdinNext-138M-Instruct / README.md

joelhenwang

Update README.md

77b819b verified 2 days ago

	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	base_model: joelhenwang/OdinNext-138M-Base
	tags:
	- odinnext
	- hgrn2
	- linear-attention
	- recurrent
	- instruct
	- chatml
	- amd
	- rocm
	- custom_code
	- arxiv:2404.07904
	- arxiv:2605.06546
	- arxiv:2407.12665
	- arxiv:2506.14202
	---

	# OdinNext-138M-Instruct

	A 138.4M-parameter instruction-tuned language model that replaces softmax
	self-attention with an HGRN2 gated linear recurrence. Fine-tuned from
	[OdinNext-138M-Base](https://huggingface.co/joelhenwang/OdinNext-138M-Base),
	which was pretrained from scratch on 101.6B tokens using **two AMD Ryzen AI
	MAX+ 395 (Strix Halo) mini-PCs**. Pretraining used a TST + DiffusionBlocks +
	dual-machine DDP stack that trained roughly 10-20x faster than a conventional
	end-to-end pass on the same hardware.

	This is a small model. It follows instructions and writes fluent, assistant-style
	answers (markdown, step-by-step), but its factual accuracy is limited by scale.
	Treat it as a lightweight assistant and a research artifact, not a knowledge base.

	> Uses custom Transformers code. `trust_remote_code=True` runs Python from this
	> repo — review the files or pin a commit before trusting it.
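
One way to follow the advice above is to pass a fixed commit hash as `revision`, so that later pushes to the repo cannot change the code that runs. This is a minimal sketch: `pinned_load_kwargs` and `load_pinned` are hypothetical helpers, and the commit hash is something you copy from the repo history after reviewing the files.

```python
REPO = "joelhenwang/OdinNext-138M-Instruct"


def pinned_load_kwargs(commit_sha):
    """Keyword arguments that pin weights and remote code to one commit."""
    if not commit_sha or commit_sha == "main":
        raise ValueError("pass an explicit commit hash, not a moving branch name")
    return {"trust_remote_code": True, "revision": commit_sha}


def load_pinned(commit_sha):
    """Load tokenizer and model at a fixed revision (downloads from the Hub)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = pinned_load_kwargs(commit_sha)
    tok = AutoTokenizer.from_pretrained(REPO, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(REPO, **kwargs)
    return tok, model
```

Call `load_pinned("<commit-hash-you-reviewed>")` with the full hash of the commit you inspected.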

	## Results

	Zero-shot, on three widely-reported public benchmarks. **OdinNext rows were
	measured with our own harness** (`scripts/eval_benchmarks.py`; HellaSwag = acc_norm,
	ARC = mean of Easy+Challenge acc, PIQA = acc); the other rows are **as reported by
	Axiomic Labs** on the [GPT-X2-125M](https://huggingface.co/AxiomicLabs/GPT-X2-125M)
	card, so numbers are not perfectly comparable across harnesses.

	| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
	|---|---|---|---|---|---|
	| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
	| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
	| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 600B |
	| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 4.2T |
	| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 15B |
	| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 1T |
	| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
	| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
	| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
	| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
	| This work | OdinNext-138M-Base | 33.05% | 34.29% | 58.81% | 101.6B |
	| This work | OdinNext-138M-Instruct | 32.85% | 33.14% | 59.25% | 101.6B + SFT/SeqKD |
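
For readers unfamiliar with the metric names, the sketch below shows how `acc` and `acc_norm` can differ on a multiple-choice item, and what "mean of Easy+Challenge" means. It assumes the lm-evaluation-harness convention of dividing each completion's log-likelihood by its byte length; `scripts/eval_benchmarks.py` may normalize differently. The function names and the numbers in the usage note are illustrative only.

```python
def predict(logprobs, completions, normalize):
    """Return the index of the chosen completion.

    logprobs[i] is the summed token log-likelihood of completions[i].
    With normalize=True the score is divided by the completion's byte
    length (acc_norm style); otherwise the raw sum is used (acc style).
    """
    scores = []
    for lp, text in zip(logprobs, completions):
        length = len(text.encode("utf-8")) if normalize else 1
        scores.append(lp / length)
    return max(range(len(scores)), key=scores.__getitem__)


def arc_avg(acc_easy, acc_challenge):
    """Unweighted mean of the two ARC subset accuracies."""
    return (acc_easy + acc_challenge) / 2
```

With made-up log-likelihoods `[-3.0, -6.0]` for the completions `["yes", "it certainly does"]`, `predict(..., normalize=False)` picks the short answer while `normalize=True` picks the long one, which is why the two metrics can disagree.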


	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	repo = "joelhenwang/OdinNext-138M-Instruct"
	# trust_remote_code=True executes the custom model code shipped in the repo.
	tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    repo, trust_remote_code=True, torch_dtype=torch.float16,
	).to("cuda").eval()

	# Render the conversation with the chat template and open the assistant turn.
	msgs = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
	ids = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to("cuda")
	out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.7,
	                     top_p=0.9, repetition_penalty=1.3)
	# Decode only the newly generated tokens, not the prompt.
	print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
	```

	Uses ChatML (`<|im_start|>role\n...<|im_end|>`). A `repetition_penalty`
	around 1.2-1.3 is recommended at this scale.
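
To make the prompt layout concrete, here is a minimal sketch of rendering a conversation as ChatML by hand. The newline after each `<|im_end|>` and the trailing assistant header follow the common ChatML convention and are assumptions; the chat template shipped with the tokenizer is authoritative, so prefer `apply_chat_template` in practice. `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}, ...] into a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Comparing this string with `tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)` is a quick way to check whether the repo's template matches the assumed layout.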

	## Architecture

	Decoder-only causal LM, 16 pre-norm blocks:

	```text
	x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
	x = x + sigmoid(gate_ffn) * SwiGLU2(ZCRMSNorm(x))
	```

	| Item | Value |
	|---|---|
	| Parameters | 138.4M (113.3M non-embedding) |
	| Layers / hidden / heads | 16 / 768 / 6 |
	| Per-head recurrent state | 128 x 128 |
	| FFN inner | 2,048 |
	| Vocabulary | 32,770 (custom 32K BPE + 2 ChatML tokens) |
	| Max sequence length | 2,048 |
	| Mixer | HGRN2 gated linear recurrence; RoPE (theta=100K) on even layers, position-free on odd |
	| Decoding state | fixed-size recurrent state (O(1)/token), not a growing KV cache |

	The HGRN2 state `S_t = diag(exp(g_t)) S_{t-1} + k_t (x) v_t`, where `(x)` is the
	outer product, is **constant in size w.r.t. context length** (~3 MiB fp16 at
	batch 1), unlike a Transformer KV cache that grows linearly with tokens.
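
The ~3 MiB figure can be checked from the table above, and the recurrence can be written out for one head. The sketch below is plain Python for illustration, not the repo's kernel. The KV-cache comparison assumes a hypothetical Transformer of the same depth and width with full multi-head attention, which is not a model described in this card.

```python
import math

LAYERS, HEADS, HEAD_DIM = 16, 6, 128   # from the architecture table
HIDDEN = HEADS * HEAD_DIM              # 768
FP16_BYTES = 2


def hgrn2_step(S, g, k, v):
    """One step S_t = diag(exp(g)) S_{t-1} + outer(k, v) for a single head.

    S has len(k) rows and len(v) columns; its shape never changes.
    """
    return [
        [math.exp(g[i]) * S[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


def recurrent_state_bytes():
    """Total fp16 state across all layers and heads at batch 1."""
    return LAYERS * HEADS * HEAD_DIM * HEAD_DIM * FP16_BYTES


def kv_cache_bytes(tokens):
    """fp16 K and V for the hypothetical same-width attention model."""
    return 2 * LAYERS * HIDDEN * FP16_BYTES * tokens
```

`recurrent_state_bytes()` gives 3,145,728 bytes (3 MiB) regardless of context length, while `kv_cache_bytes(2048)` gives 100,663,296 bytes (96 MiB) for the assumed attention baseline at the full 2,048-token context.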

	## Training

	### Data

	Pretraining used the Dolmino mix ([`allenai/dolma3_dolmino_mix-100B-1025`](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1025)),
	curated by dropping the synthetic and noisy partitions and keeping the natural
	text + code:

	- Excluded: all synthetic reasoning-trace subsets (Gemini / QwQ / R1 /
	OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite,
	verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
	- Kept: natural web text, code (stack-edu, cranecode; FIM markers stripped),
	math, and reference text — the mix's native proportions minus the exclusions.
	- Tokenizer: a custom 32K BPE. After tokenization this gives
	101.6B training tokens.

	Post-training data: [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
	+ [no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) (SFT), and
	synthetic ChatML distilled from LFM2.5-1.2B-Instruct (SeqKD teacher).

	### How we accelerated pretraining (the interesting part)

	Pretraining ran on two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5)
	mini-PCs (128 GB unified LPDDR5X each), linked over Thunderbolt 4, with DDP on
	the gloo backend. Three techniques compounded:

	1. TST - Token Superposition Training (bag-size 4). Early in training, every
	position is the average of 4 stochastic sub-word tokenizations of the same
	text, so the model digests ~4x the tokens per step. The bag size is annealed
	4 -> 2 -> 1 over training so the model finishes on ordinary single-token streams.
	2. DiffusionBlocks (B=4). The 16 layers are split into 4 blocks of 4 layers,
	each trained to denoise its input representation. Crucially, the blocks are
	**trained block-parallel across the two machines with essentially no gradient
	all-reduce** - Machine A owns blocks 1-2, Machine B owns blocks 3-4.
	3. Two-machine DDP over Thunderbolt 4. Unified memory means `gloo` keeps pace,
	and DiffusionBlocks' block independence hides the modest interconnect bandwidth.
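
The layer bookkeeping behind item 2 can be sketched as follows. This only illustrates how 16 layers map to 4 blocks and 2 machines; the denoising objective and the training loop are not shown, and the function names are hypothetical.

```python
NUM_LAYERS, NUM_BLOCKS = 16, 4


def block_layers(num_layers=NUM_LAYERS, num_blocks=NUM_BLOCKS):
    """Split 0-based layer indices into equal contiguous blocks."""
    size, rem = divmod(num_layers, num_blocks)
    if rem:
        raise ValueError("layers must divide evenly into blocks")
    return [list(range(b * size, (b + 1) * size)) for b in range(num_blocks)]


def machine_assignment(blocks, machines=("A", "B")):
    """Give each machine an equal contiguous run of blocks, flattened to layers."""
    per_machine = len(blocks) // len(machines)
    return {
        name: [
            layer
            for block in blocks[i * per_machine:(i + 1) * per_machine]
            for layer in block
        ]
        for i, name in enumerate(machines)
    }
```

With the defaults, machine A holds layers 0-7 (blocks 1-2) and machine B holds layers 8-15 (blocks 3-4), matching the split described above.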

	Combined, the TST + DiffusionBlocks + dual-machine phase trained **roughly
	10-20x faster** than a conventional end-to-end autoregressive pass on the *same two
	machines* (and dramatically faster than a single accelerator) - which is what made
	a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter
	standard end-to-end phase then restores ordinary left-to-right generation; the
	released base weights come from that phase (EMA, decay 0.999).

	### Optimization

	- Optimizer: NorMuon (2D weight matrices, fp16 Newton-Schulz) + AdamW (1D params / embeddings)
	- Precision: fp16 + GradScaler (bf16 is slower / unstable on gfx1151)
	- Stabilization: z-loss 1e-4, attention soft-cap 50, EMA 0.999
	- Compile: `torch.compile` (max-autotune-no-cudagraphs)
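
The stabilization terms above are named but not spelled out. The sketch below uses their common formulations as an assumption: tanh soft-capping, a squared log-partition penalty for z-loss, and a standard exponential moving average. The exact variants used in training may differ, and the function names are hypothetical.

```python
import math


def soft_cap(x, cap=50.0):
    """Smoothly limit x to at most `cap` in magnitude via cap * tanh(x / cap)."""
    return cap * math.tanh(x / cap)


def z_loss(logits, coef=1e-4):
    """Penalize the squared log-partition function log(sum(exp(logits)))."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return coef * log_z ** 2


def ema_update(ema, weight, decay=0.999):
    """Exponential moving average of a single parameter value."""
    return decay * ema + (1.0 - decay) * weight
```

Small inputs pass through `soft_cap` almost unchanged, while very large ones saturate at the cap, which keeps extreme activations from overflowing fp16.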

	### Post-training

	1. SFT (full-parameter, cross-entropy) on smol-smoltalk + no_robots.
	2. SeqKD: a second SFT pass on ~10k ChatML responses generated by
	LFM2.5-1.2B-Instruct, which teaches the small student a cleaner, more direct
	answer style.

	LiNeS layer-scaling and DPO were evaluated and dropped: at 138M, aggressive
	LiNeS removed instruction-following and DPO over-optimized into incoherence. Plain
	SFT + SeqKD gave the best behavior.
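
The data side of the SeqKD step can be sketched as below: each prompt is paired with the teacher's generated answer, and the resulting conversations are used as ordinary SFT examples. `teacher_generate` is a hypothetical callable standing in for sampling from the teacher model, and dropping empty responses is an assumption rather than a documented filter.

```python
def build_seqkd_examples(prompts, teacher_generate):
    """Pair each prompt with the teacher's response as a chat-format SFT example.

    teacher_generate: any callable mapping a prompt string to response text.
    """
    examples = []
    for prompt in prompts:
        response = teacher_generate(prompt).strip()
        if not response:  # assumed filter: skip empty teacher outputs
            continue
        examples.append([
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ])
    return examples
```

Each returned conversation can then be rendered with the tokenizer's chat template and trained on with the same cross-entropy objective as step 1.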

	## Limitations

	- Small model: limited reasoning and factual recall; it will state wrong facts
	confidently. Not for factual QA or safety-sensitive use.
	- 2,048-token context in the released inference code.
	- English-focused.
	- No RLHF / safety tuning.
	- Benchmarks above are preliminary and harness-dependent; run your own eval.

	## Citation

	```bibtex
	@misc{odinnext_138m_instruct_2026,
	  title = {OdinNext-138M-Instruct},
	  author = {Wang, Joel},
	  year = {2026},
	  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
	  note = {138M HGRN2 recurrent instruction model; TST + DiffusionBlocks +
	          dual-machine DDP pretraining on AMD Strix Halo, then SFT + SeqKD}
	}
	```

	## References

	- Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
	- Bowen Peng et al. Token Superposition Training. arXiv:2605.06546.
	- Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
	- Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.
	- Comparison numbers and card structure inspired by Axiomic Labs' GPT-X2-125M.

	Trained on AMD Strix Halo (gfx1151, RDNA 3.5), ROCm 7.13.