---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: joelhenwang/OdinNext-138M-Base
tags:
- odinnext
- hgrn2
- linear-attention
- recurrent
- instruct
- chatml
- amd
- rocm
- custom_code
- arxiv:2404.07904
- arxiv:2605.06546
- arxiv:2407.12665
- arxiv:2506.14202
---
# OdinNext-138M-Instruct
A **138.4M-parameter** instruction-tuned language model that replaces softmax
self-attention with an **HGRN2 gated linear recurrence**. Fine-tuned from
[OdinNext-138M-Base](https://huggingface.co/joelhenwang/OdinNext-138M-Base),
which was pretrained **from scratch on 101.6B tokens** on **two AMD Ryzen AI
MAX+ 395 (Strix Halo) mini-PCs**. Pretraining used a TST (Token Superposition
Training) + DiffusionBlocks + dual-machine DDP stack that trained **roughly
10-20x faster** than a conventional end-to-end pass on the same hardware.
This is a small model. It follows instructions and writes fluent, assistant-style
answers (markdown, step-by-step), but its **factual accuracy is limited by scale**.
Treat it as a lightweight assistant and a research artifact, not a knowledge base.
> Uses custom Transformers code. `trust_remote_code=True` runs Python from this
> repo — review the files or pin a commit before trusting it.
## Results
Zero-shot, on three widely-reported public benchmarks. **OdinNext rows were
measured with our own harness** (`scripts/eval_benchmarks.py`; HellaSwag = acc_norm,
ARC = mean of Easy+Challenge acc, PIQA = acc); the other rows are **as reported by
Axiomic Labs** on the [GPT-X2-125M](https://huggingface.co/AxiomicLabs/GPT-X2-125M)
card, so numbers are **not perfectly comparable across harnesses**.
| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 600B |
| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 15B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 1T |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| **This work** | **OdinNext-138M-Base** | **33.05%** | **34.29%** | **58.81%** | **101.6B** |
| **This work** | **OdinNext-138M-Instruct** | **32.85%** | **33.14%** | **59.25%** | **101.6B + SFT/SeqKD** |
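The metric definitions used for the OdinNext rows can be sketched as follows. This is a minimal illustration, not the contents of `scripts/eval_benchmarks.py`: the function names are hypothetical, dividing by the continuation's character length for `acc_norm` is an assumed (though common) convention, and the ARC mean is assumed to be unweighted.

```python
def pick_choice(loglikes, choices, normalize=False):
    """Return the index of the highest-scoring answer choice.

    loglikes[i] is the summed log-likelihood of choices[i] given the prompt.
    With normalize=True each score is divided by the choice's character
    length (assumed acc_norm convention), which reduces the bias toward
    short continuations.
    """
    scores = [
        ll / len(text) if normalize else ll
        for ll, text in zip(loglikes, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def arc_average(acc_easy, acc_challenge):
    """ARC (avg): mean of the Easy and Challenge accuracies (assumed unweighted)."""
    return (acc_easy + acc_challenge) / 2.0


# Illustrative values only, not model outputs.
choices = ["no", "it depends on the weather"]
loglikes = [-4.0, -12.5]
print(pick_choice(loglikes, choices))                  # raw acc picks index 0
print(pick_choice(loglikes, choices, normalize=True))  # acc_norm picks index 1
```

The toy example shows why the two metrics can disagree: the longer answer has a lower total log-likelihood but a higher per-character one.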
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "joelhenwang/OdinNext-138M-Instruct"

# trust_remote_code=True executes the custom modeling code shipped in the repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16,
).to("cuda").eval()

# Build a ChatML prompt that ends with the assistant header.
msgs = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
ids = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to("cuda")

out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.7,
                     top_p=0.9, repetition_penalty=1.3)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```
Uses **ChatML** (`<|im_start|>role\n...<|im_end|>`). A `repetition_penalty`
around 1.2-1.3 is recommended at this scale.
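For readers who want to see the prompt layout without loading the tokenizer, here is a minimal sketch of ChatML rendering. The helper name `format_chatml` is hypothetical, and two details are assumptions: a newline after each `<|im_end|>` and no default system prompt. The chat template shipped with the tokenizer remains authoritative.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml(
    [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
)
print(prompt)
```

In practice `tok.apply_chat_template` should be preferred, since it always matches the template the model was trained with.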
## Architecture
Decoder-only causal LM, 16 pre-norm blocks:
```text
x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn) * SwiGLU2(ZCRMSNorm(x))
```
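The two residual updates above can be sketched in plain Python for a single hidden vector. This is an illustration under stated assumptions, not the repo's modeling code: `rms_norm` is a plain RMSNorm standing in for ZCRMSNorm, the gates are treated as scalar parameters, and `mixer` / `ffn` are placeholder callables for HGRN2 and SwiGLU2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rms_norm(x, eps=1e-6):
    """Plain RMSNorm without a learned scale; a stand-in for ZCRMSNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def gated_block(x, gate_attn, gate_ffn, mixer, ffn, norm=rms_norm):
    """Apply one pre-norm block to a single hidden vector x (list of floats).

    gate_attn and gate_ffn are treated as scalar learned parameters (an
    assumption; they could equally be per-channel vectors).
    """
    a = sigmoid(gate_attn)
    y = mixer(norm(x))
    x = [xi + a * yi for xi, yi in zip(x, y)]
    f = sigmoid(gate_ffn)
    y = ffn(norm(x))
    return [xi + f * yi for xi, yi in zip(x, y)]


# Toy check with identity sublayers: a very negative gate closes the branch,
# so the block leaves the residual stream essentially unchanged.
x = [1.0, -2.0, 3.0]
identity = lambda v: v
print(gated_block(x, gate_attn=-50.0, gate_ffn=-50.0, mixer=identity, ffn=identity))
```

The sigmoid gate lets each sublayer's contribution to the residual stream be scaled smoothly between fully closed and fully open.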
| Item | Value |
|---|---|
| Parameters | 138.4M (113.3M non-embedding) |
| Layers / hidden / heads | 16 / 768 / 6 |
| Per-head recurrent state | 128 x 128 |
| FFN inner | 2,048 |
| Vocabulary | 32,770 (custom 32K BPE + 2 ChatML tokens) |
| Max sequence length | 2,048 |
| Mixer | HGRN2 gated linear recurrence; RoPE (theta=100K) on even layers, position-free on odd |
| Decoding state | **fixed-size recurrent state** (O(1)/token), not a growing KV cache |
The HGRN2 state `S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t` (where `⊗` is the outer
product) is **constant in size w.r.t. context length** (~3 MiB in fp16 at batch 1),
unlike a Transformer KV cache, which grows linearly with the number of tokens.
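The recurrence and the ~3 MiB figure can be checked with a short sketch. The update follows the formula in the text; the function names are hypothetical, and the KV-cache comparison assumes a hypothetical Transformer with the same 16 layers and 768 hidden size, full multi-head attention, and fp16 storage.

```python
import math

LAYERS, HEADS, D_K, D_V = 16, 6, 128, 128   # from the table above
BYTES_FP16 = 2


def hgrn2_step(S, g, k, v):
    """One recurrent update S_t = diag(exp(g_t)) S_{t-1} + k_t (x) v_t.

    S is a d_k x d_v state (list of rows); g and k have length d_k, v has
    length d_v. Row i of S is decayed by exp(g[i]) and then receives k[i] * v.
    """
    return [
        [math.exp(g[i]) * S[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


def recurrent_state_bytes(layers=LAYERS, heads=HEADS, d_k=D_K, d_v=D_V):
    """Total decoding state across all layers and heads, batch size 1."""
    return layers * heads * d_k * d_v * BYTES_FP16


def kv_cache_bytes(tokens, layers=LAYERS, hidden=768):
    """KV cache of a hypothetical same-shape Transformer (full MHA, fp16)."""
    return 2 * layers * hidden * BYTES_FP16 * tokens


# Tiny 2x2 example: the state keeps its shape no matter how many steps run.
S = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(1000):
    S = hgrn2_step(S, g=[-0.1, -0.5], k=[1.0, 0.5], v=[0.2, -0.3])
print(len(S), len(S[0]))

print(recurrent_state_bytes() / 2**20)   # MiB, independent of context length
print(kv_cache_bytes(2048) / 2**20)      # MiB at the 2,048-token limit
```

Under these assumptions the recurrent state comes to 3 MiB (16 x 6 x 128 x 128 x 2 bytes), while the comparison KV cache would reach 96 MiB at 2,048 tokens and keep growing beyond that.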
## Training
### Data
Pretraining used the **Dolmino mix** ([`allenai/dolma3_dolmino_mix-100B-1025`](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1025)),
curated by **dropping the synthetic and noisy partitions** and keeping the natural
text + code:
- **Excluded:** all synthetic reasoning-trace subsets (Gemini / QwQ / R1 /
OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite,
verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
- **Kept:** natural web text, code (stack-edu, cranecode; FIM markers stripped),
math, and reference text — the mix's native proportions minus the exclusions.
- **Tokenizer:** a **custom 32K BPE**. After tokenization this gives
**101.6B training tokens**.
Post-training data: [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
+ [no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) (SFT), and
synthetic ChatML distilled from **LFM2.5-1.2B-Instruct** (SeqKD teacher).
### How we accelerated pretraining (the interesting part)
Pretraining ran on **two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5)**
mini-PCs (128 GB unified LPDDR5X each), linked over **Thunderbolt 4**, with DDP on
the **gloo** backend. Three techniques compounded:
1. **TST - Token Superposition Training** (bag-size 4). Early in training, every
position is the average of **4 stochastic sub-word tokenizations** of the same
text, so the model digests **~4x the tokens per step**. The bag size is annealed
4 -> 2 -> 1 over training so the model finishes on ordinary single-token streams.
2. **DiffusionBlocks** (B=4). The 16 layers are split into 4 blocks of 4 layers,
each trained to **denoise** its input representation. Crucially, the blocks are
**trained block-parallel across the two machines with essentially no gradient
all-reduce** - Machine A owns blocks 1-2, Machine B owns blocks 3-4.
3. **Two-machine DDP** over Thunderbolt 4. Unified memory means `gloo` keeps pace,
and DiffusionBlocks' block independence hides the modest interconnect bandwidth.
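The bookkeeping behind the first two techniques can be sketched in a few lines. This is a simplified illustration, not the training code: all function names are hypothetical, how each bag of 4 tokens is formed is left outside the sketch, and the annealing boundaries are made-up fractions because the card does not state them.

```python
def superpose(embeddings, token_bag):
    """Average the embedding vectors of the tokens in one bag.

    embeddings maps token id -> vector (list of floats); token_bag holds the
    ids superposed at a single input position.
    """
    dim = len(embeddings[token_bag[0]])
    return [
        sum(embeddings[t][d] for t in token_bag) / len(token_bag)
        for d in range(dim)
    ]


def bag_size_at(progress, boundaries=(0.5, 0.8)):
    """Annealed bag size 4 -> 2 -> 1 for progress in [0, 1].

    The boundary fractions are hypothetical; the card does not state them.
    """
    if progress < boundaries[0]:
        return 4
    if progress < boundaries[1]:
        return 2
    return 1


def assign_blocks(num_layers=16, num_blocks=4, machines=("A", "B")):
    """Split layers into contiguous blocks and give each machine an equal share."""
    per_block = num_layers // num_blocks
    per_machine = num_blocks // len(machines)
    plan = {}
    for b in range(num_blocks):
        layers = list(range(b * per_block, (b + 1) * per_block))
        plan[b + 1] = {"machine": machines[b // per_machine], "layers": layers}
    return plan


emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [2.0, 2.0]}
print(superpose(emb, [0, 1, 2, 3]))   # one input vector standing for 4 tokens
print([bag_size_at(p) for p in (0.0, 0.6, 0.9)])
print(assign_blocks())
```

With the defaults, `assign_blocks` reproduces the split described above: blocks 1-2 (layers 0-7, zero-indexed) on Machine A and blocks 3-4 (layers 8-15) on Machine B.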
Combined, the **TST + DiffusionBlocks + dual-machine** phase trained **roughly
10-20x faster** than a conventional end-to-end autoregressive pass on the *same two
machines* (and dramatically faster than a single accelerator) - which is what made
a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter
**standard end-to-end phase** then restores ordinary left-to-right generation; the
released base weights come from that phase (EMA, decay 0.999).
### Optimization
- **Optimizer:** NorMuon (2D weight matrices, fp16 Newton-Schulz) + AdamW (1D params / embeddings)
- **Precision:** fp16 + GradScaler (bf16 is slower / unstable on gfx1151)
- **Stabilization:** z-loss 1e-4, attention soft-cap 50, EMA 0.999
- **Compile:** `torch.compile` (max-autotune-no-cudagraphs)
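The stabilization terms listed above have widely used textbook forms, sketched here in plain Python. Whether the training code uses exactly these formulations is an assumption; the function names are hypothetical, and only the constants (1e-4, 50, 0.999) come from the list.

```python
import math


def softcap(x, cap=50.0):
    """Smoothly bound x to (-cap, cap); near-identity for |x| << cap."""
    return cap * math.tanh(x / cap)


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * log(Z)^2, Z being the softmax normalizer."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return coeff * log_z ** 2


def ema_update(ema, weights, decay=0.999):
    """One step of an exponential moving average over parameters."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


print(softcap(1.0), softcap(1000.0))
print(z_loss([2.0, 1.0, 0.5]))
print(ema_update([0.0, 1.0], [1.0, 1.0]))
```

Soft-capping and z-loss both keep pre-softmax magnitudes bounded, which matters more than usual under fp16, where large logits can overflow.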
### Post-training
1. **SFT** (full-parameter, cross-entropy) on smol-smoltalk + no_robots.
2. **SeqKD**: a second SFT pass on ~10k ChatML responses generated by
LFM2.5-1.2B-Instruct, which teaches the small student a cleaner, more direct
answer style.
LiNeS layer-scaling and DPO were evaluated and **dropped**: at 138M, aggressive
LiNeS removed instruction-following and DPO over-optimized into incoherence. Plain
SFT + SeqKD gave the best behavior.
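The SeqKD data step amounts to pairing prompts with teacher-written responses and then running ordinary SFT on the result. A minimal sketch, assuming a hypothetical `teacher_generate` callable in place of sampling from LFM2.5-1.2B-Instruct (prompt selection and any filtering used for the real dataset are not shown):

```python
def build_seqkd_examples(prompts, teacher_generate):
    """Pair each prompt with a teacher-written response as chat messages.

    teacher_generate is a hypothetical callable (prompt -> response text)
    standing in for sampling from the teacher model.
    """
    examples = []
    for prompt in prompts:
        response = teacher_generate(prompt).strip()
        if not response:
            continue  # skip empty generations rather than train on them
        examples.append([
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ])
    return examples


# Dummy teacher for illustration; a real run would call the teacher model.
fake_teacher = lambda p: f"Answer to: {p}"
data = build_seqkd_examples(["What is 2 + 2?"], fake_teacher)
print(data[0][1]["content"])
```

Each resulting message list can be rendered with the tokenizer's chat template and trained on with the same cross-entropy objective as the first SFT pass.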
## Limitations
- **Small model:** limited reasoning and factual recall; it will state wrong facts
confidently. Not for factual QA or safety-sensitive use.
- **2,048-token context** in the released inference code.
- **English-focused.**
- **No RLHF / safety tuning.**
- Benchmarks above are preliminary and harness-dependent; run your own eval.
## Citation
```bibtex
@misc{odinnext_138m_instruct_2026,
title = {OdinNext-138M-Instruct},
author = {Wang, Joel},
year = {2026},
howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
note = {138M HGRN2 recurrent instruction model; TST + DiffusionBlocks +
dual-machine DDP pretraining on AMD Strix Halo, then SFT + SeqKD}
}
```
## References
- Zhen Qin et al. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904.
- Bowen Peng et al. **Token Superposition Training.** arXiv:2605.06546.
- Chenze Shao et al. **Patch-Level Training for Large Language Models.** arXiv:2407.12665.
- Makoto Shing et al. **DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation.** arXiv:2506.14202.
- Comparison numbers are taken from, and the card structure is inspired by, Axiomic Labs' GPT-X2-125M card.
Trained on AMD Strix Halo (gfx1151, RDNA 3.5), ROCm 7.13.