---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: joelhenwang/OdinNext-138M-Base
tags:
- odinnext
- hgrn2
- linear-attention
- recurrent
- instruct
- chatml
- amd
- rocm
- custom_code
- arxiv:2404.07904
- arxiv:2605.06546
- arxiv:2407.12665
- arxiv:2506.14202
---

# OdinNext-138M-Instruct

A **138.4M-parameter** instruction-tuned language model that replaces softmax
self-attention with an **HGRN2 gated linear recurrence**. Fine-tuned from
[OdinNext-138M-Base](https://huggingface.co/joelhenwang/OdinNext-138M-Base),
which was pretrained **from scratch on 101.6B tokens** on **two AMD Ryzen AI
MAX+ 395 (Strix Halo) mini-PCs**, using a TST + DiffusionBlocks + dual-machine
DDP stack that trained **roughly 10-20x faster** than a conventional
end-to-end pass on the same hardware.

This is a small model. It follows instructions and writes fluent, assistant-style
answers (markdown, step-by-step), but its **factual accuracy is limited by scale**.
Treat it as a lightweight assistant and a research artifact, not a knowledge base.

> Uses custom Transformers code. `trust_remote_code=True` runs Python from this
> repo, so review the files or pin a commit before trusting it.

## Results

Zero-shot, on three widely-reported public benchmarks. **OdinNext rows were
measured with our own harness** (`scripts/eval_benchmarks.py`; HellaSwag = acc_norm,
ARC = mean of Easy+Challenge acc, PIQA = acc); the other rows are **as reported by
Axiomic Labs** on the [GPT-X2-125M](https://huggingface.co/AxiomicLabs/GPT-X2-125M)
card, so numbers are **not perfectly comparable across harnesses**.

| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 600B |
| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 15B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 1T |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| **This work** | **OdinNext-138M-Base** | **33.05%** | **34.29%** | **58.81%** | **101.6B** |
| **This work** | **OdinNext-138M-Instruct** | **32.85%** | **33.14%** | **59.25%** | **101.6B + SFT/SeqKD** |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "joelhenwang/OdinNext-138M-Instruct"

# trust_remote_code=True executes the custom modeling code shipped in the repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16,
).to("cuda").eval()

# Build a ChatML prompt that ends with the assistant header.
msgs = [{"role": "user", "content": "Explain photosynthesis in two sentences."}]
ids = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to("cuda")

out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.7,
                     top_p=0.9, repetition_penalty=1.3)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```

Prompts use the **ChatML** format (`<|im_start|>role\n...<|im_end|>`). A `repetition_penalty`
around 1.2-1.3 is recommended at this scale.
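
For tooling that cannot call `apply_chat_template`, the ChatML layout described above can be approximated by hand. The sketch below is an assumption about the exact whitespace and turn separators, not a copy of this repo's chat template; `format_chatml` is a hypothetical helper name, and the tokenizer's own template remains the reference.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render chat messages as a ChatML string (illustrative helper)."""
    parts = []
    for msg in messages:
        # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

Comparing this string with `tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)` is a quick way to check whether the assumed layout matches the shipped template.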

## Architecture

Decoder-only causal LM, 16 pre-norm blocks:

```text
x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn) * SwiGLU2(ZCRMSNorm(x))
```

| Item | Value |
|---|---|
| Parameters | 138.4M (113.3M non-embedding) |
| Layers / hidden / heads | 16 / 768 / 6 |
| Per-head recurrent state | 128 x 128 |
| FFN inner | 2,048 |
| Vocabulary | 32,770 (custom 32K BPE + 2 ChatML tokens) |
| Max sequence length | 2,048 |
| Mixer | HGRN2 gated linear recurrence; RoPE (theta=100K) on even layers, position-free on odd |
| Decoding state | **fixed-size recurrent state** (O(1)/token), not a growing KV cache |
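
The table values are enough to estimate decode-time memory. The sketch below multiplies them out and compares the result with a hypothetical softmax-attention model of the same depth and width. That comparison model and its KV layout (one key and one value vector of width `hidden` per layer per token) are assumptions for illustration, not something measured for this card, and any auxiliary per-layer state beyond the 128 x 128 matrices is ignored.

```python
LAYERS, HIDDEN, HEADS = 16, 768, 6
STATE_ROWS, STATE_COLS = 128, 128  # per-head recurrent state from the table
BYTES_FP16 = 2
MIB = 1024 * 1024


def recurrent_state_bytes():
    """Total recurrent state at batch 1; independent of context length."""
    return LAYERS * HEADS * STATE_ROWS * STATE_COLS * BYTES_FP16


def kv_cache_bytes(num_tokens):
    """KV cache of a hypothetical same-shape Transformer (assumption)."""
    return 2 * LAYERS * HIDDEN * BYTES_FP16 * num_tokens


state_mib = recurrent_state_bytes() / MIB
kv_mib_2048 = kv_cache_bytes(2048) / MIB
print(f"recurrent state: {state_mib:.1f} MiB at any context length")
print(f"hypothetical KV cache at 2,048 tokens: {kv_mib_2048:.1f} MiB")
```

Under these assumptions the recurrent state is 16 x 6 x 128 x 128 x 2 bytes = 3 MiB, while the hypothetical KV cache grows by 48 KiB per token.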

The HGRN2 state `S_t = diag(exp(g_t)) S_{t-1} + k_t (x) v_t`, where `(x)` denotes the
outer product, is **constant in size w.r.t. context length** (~3 MiB in fp16 at
batch 1), unlike a Transformer KV cache that grows linearly with tokens.
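
As a minimal sketch of that recurrence for a single head, the pure-Python step below applies the per-row decay and adds the outer product. It only illustrates the state update as written above: how `g_t`, `k_t` and `v_t` are computed from the input, and how the state is read out, are not shown, and `hgrn2_style_step` is a hypothetical name rather than a function from this repo.

```python
import math


def hgrn2_style_step(state, g, k, v):
    """One step of S_t = diag(exp(g_t)) S_{t-1} + k_t (x) v_t for one head.

    state: d_k x d_v nested list; g, k: length d_k; v: length d_v.
    """
    return [
        [math.exp(g[i]) * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


# Toy dimensions (the card's heads use 128 x 128 states).
d_k, d_v = 2, 3
state = [[0.0] * d_v for _ in range(d_k)]
steps = [
    ([-0.1, -0.2], [1.0, 0.0], [1.0, 2.0, 3.0]),
    ([-0.1, -0.2], [0.0, 1.0], [4.0, 5.0, 6.0]),
    ([0.0, 0.0], [0.5, 0.5], [1.0, 1.0, 1.0]),
]
for g, k, v in steps:
    state = hgrn2_style_step(state, g, k, v)

# The state keeps its shape no matter how many tokens were consumed.
print(len(state), len(state[0]))
```

Each token costs a fixed amount of work and memory, which is the O(1)/token decoding property listed in the table.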

## Training

### Data

Pretraining used the **Dolmino mix** ([`allenai/dolma3_dolmino_mix-100B-1025`](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1025)),
curated by **dropping the synthetic and noisy partitions** and keeping the natural
text + code:

- **Excluded:** all synthetic reasoning-trace subsets (Gemini / QwQ / R1 /
  OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite,
  verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
- **Kept:** natural web text, code (stack-edu, cranecode; FIM markers stripped),
  math, and reference text, in the mix's native proportions minus the exclusions.
- **Tokenizer:** a **custom 32K BPE**. After tokenization this gives
  **101.6B training tokens**.

Post-training data: [smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk)
+ [no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) (SFT), and
synthetic ChatML distilled from **LFM2.5-1.2B-Instruct** (SeqKD teacher).

### How we accelerated pretraining (the interesting part)

Pretraining ran on **two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5)**
mini-PCs (128 GB unified LPDDR5X each), linked over **Thunderbolt 4**, with DDP on
the **gloo** backend. Three techniques compounded:

1. **TST - Token Superposition Training** (bag-size 4). Early in training, every
   position is the average of **4 stochastic sub-word tokenizations** of the same
   text, so the model digests **~4x the tokens per step**. The bag size is annealed
   4 -> 2 -> 1 over training so the model finishes on ordinary single-token streams.
2. **DiffusionBlocks** (B=4). The 16 layers are split into 4 blocks of 4 layers,
   each trained to **denoise** its input representation. Crucially, the blocks are
   **trained block-parallel across the two machines with essentially no gradient
   all-reduce**: Machine A owns blocks 1-2 and Machine B owns blocks 3-4.
3. **Two-machine DDP** over Thunderbolt 4. Unified memory means `gloo` keeps pace,
   and DiffusionBlocks' block independence hides the modest interconnect bandwidth.
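
The block layout in step 2 can be written down as plain bookkeeping. The sketch below only reproduces the split of 16 layers into 4 blocks and their assignment to the two machines; it does not implement the denoising objective or any training loop. The function names and the 0-based layer indices are assumptions made for illustration.

```python
NUM_LAYERS = 16
NUM_BLOCKS = 4
MACHINES = ["A", "B"]  # labels used in the list above


def partition_layers(num_layers, num_blocks):
    """Split layer indices into equal contiguous blocks."""
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks")
    size = num_layers // num_blocks
    return [list(range(b * size, (b + 1) * size)) for b in range(num_blocks)]


def assign_blocks(blocks, machines):
    """Give each machine an equal contiguous run of blocks."""
    per_machine = len(blocks) // len(machines)
    return {
        name: blocks[i * per_machine:(i + 1) * per_machine]
        for i, name in enumerate(machines)
    }


blocks = partition_layers(NUM_LAYERS, NUM_BLOCKS)
layout = assign_blocks(blocks, MACHINES)
for name, owned in layout.items():
    print(name, owned)
```

Because each machine only ever updates the layers it owns, the two halves can step independently, which is why the interconnect carries so little gradient traffic in this phase.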

Combined, the **TST + DiffusionBlocks + dual-machine** phase trained **roughly
10-20x faster** than a conventional end-to-end autoregressive pass on the *same two
machines* (and dramatically faster than a single accelerator), which is what made
a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter
**standard end-to-end phase** then restores ordinary left-to-right generation; the
released base weights come from that phase (EMA, decay 0.999).

### Optimization

- **Optimizer:** NorMuon (2D weight matrices, fp16 Newton-Schulz) + AdamW (1D params / embeddings)
- **Precision:** fp16 + GradScaler (bf16 is slower / unstable on gfx1151)
- **Stabilization:** z-loss 1e-4, attention soft-cap 50, EMA 0.999
- **Compile:** `torch.compile` (max-autotune-no-cudagraphs)
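
The stabilization terms can be illustrated with their commonly used scalar forms. This is a sketch under assumptions: the card gives only the coefficients (1e-4, 50, 0.999), so the exact formulation and the tensor the soft-cap is applied to may differ in the actual training code. All three function names are hypothetical.

```python
import math


def z_loss(logits, coef=1e-4):
    """Auxiliary penalty coef * (log sum_i exp(logit_i))^2 for one position."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coef * log_z ** 2


def soft_cap(x, cap=50.0):
    """Smoothly bound x to the range [-cap, cap] with a scaled tanh."""
    return cap * math.tanh(x / cap)


def ema_update(ema, param, decay=0.999):
    """Exponential moving average of a scalar weight."""
    return decay * ema + (1.0 - decay) * param


print(z_loss([0.0, 0.0]))
print(soft_cap(1000.0))
print(ema_update(1.0, 2.0))
```

The z-loss discourages the log-partition term from drifting, the soft-cap is close to the identity for small inputs and saturates near the cap, and the EMA moves 0.1% of the way toward the current weight on each update.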

### Post-training

1. **SFT** (full-parameter, cross-entropy) on smol-smoltalk + no_robots.
2. **SeqKD**: a second SFT pass on ~10k ChatML responses generated by
   LFM2.5-1.2B-Instruct, which teaches the small student a cleaner, more direct
   answer style.

LiNeS layer-scaling and DPO were evaluated and **dropped**: at 138M, aggressive
LiNeS removed instruction-following and DPO over-optimized into incoherence. Plain
SFT + SeqKD gave the best behavior.

## Limitations

- **Small model:** limited reasoning and factual recall; it will state wrong facts
  confidently. Not for factual QA or safety-sensitive use.
- **2,048-token context** in the released inference code.
- **English-focused.**
- **No RLHF / safety tuning.**
- Benchmarks above are preliminary and harness-dependent; run your own eval.

## Citation

```bibtex
@misc{odinnext_138m_instruct_2026,
  title        = {OdinNext-138M-Instruct},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
  note         = {138M HGRN2 recurrent instruction model; TST + DiffusionBlocks +
                  dual-machine DDP pretraining on AMD Strix Halo, then SFT + SeqKD}
}
```

## References

- Zhen Qin et al. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904.
- Bowen Peng et al. **Token Superposition Training.** arXiv:2605.06546.
- Chenze Shao et al. **Patch-Level Training for Large Language Models.** arXiv:2407.12665.
- Makoto Shing et al. **DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation.** arXiv:2506.14202.
- Comparison numbers and card structure inspired by Axiomic Labs' GPT-X2-125M.

Trained on AMD Strix Halo (gfx1151, RDNA 3.5), ROCm 7.13.