OdinNext-138M-Base

OdinNext is a 138M-parameter causal language model that replaces softmax self-attention with an HGRN2-style gated linear recurrence. This repository is the base pretrained model — trained from scratch on ~101.6B tokens of curated data (the Dolmino mix) on two AMD Strix Halo (gfx1151) machines.

This is a base model: it completes and continues text. It is not an instruction-tuned or chat model — no SFT, DPO, RLHF, or chat template. An instruction-tuned variant is available at joelhenwang/OdinNext-138M-Instruct.

  • Repo: joelhenwang/OdinNext-138M-Base
  • main: EMA-shadowed weights (decay 0.999), recommended.
  • live: raw training weights at the same step.
  • Context window: 2,048 tokens in the released inference code.
  • License: Apache-2.0.

Uses custom Transformers code. Loading with trust_remote_code=True executes Python from this repo. Review the files or pin a commit before trusting it.

At a glance

| Item | Value |
|---|---|
| Unique tied parameters | 138,449,696 |
| Non-embedding parameters | 113,283,872 |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,768 custom BPE tokens |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm |
| Cache type | Fixed-size recurrent state, not a growing KV cache |

Architecture

Decoder-only causal LM, 16 identical pre-norm blocks:

x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(ZCRMSNorm(x))

The HGRN2 recurrent state updates per token as:

S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t

with a per-layer state shaped [B, n_heads, head_f_dim, head_i_dim] = [B, 6, 128, 128]. This state is constant in size with respect to context length, so per-token decoding cost and cache memory are O(1) in context length, unlike a Transformer's growing KV cache.
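The update above can be traced for a single head in plain Python. This is a minimal sketch, not the repository's kernel: the helper name `hgrn2_step`, the toy 2 × 2 dimensions, and the choice that g_t and k_t index the first state axis (head_f_dim) while v_t indexes the second (head_i_dim) are assumptions made for illustration.

```python
import math

def hgrn2_step(S, q, k, v, g):
    """Apply one token of the recurrence to a single head (hypothetical helper).

    S is a d_f x d_i state stored as a list of rows; q, k, g have length d_f
    and v has length d_i. g holds log-decays (g <= 0), so exp(g) is in (0, 1].
    """
    d_f, d_i = len(k), len(v)
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
    S_next = [
        [math.exp(g[i]) * S[i][j] + k[i] * v[j] for j in range(d_i)]
        for i in range(d_f)
    ]
    # o_t = q_t S_t  (row vector times matrix)
    o = [sum(q[i] * S_next[i][j] for i in range(d_f)) for j in range(d_i)]
    return S_next, o

# Two decoding steps on a toy 2 x 2 state; the state never grows.
S = [[0.0, 0.0], [0.0, 0.0]]
S, o1 = hgrn2_step(S, q=[1.0, 1.0], k=[1.0, 0.0], v=[2.0, 3.0], g=[0.0, 0.0])
S, o2 = hgrn2_step(S, q=[1.0, 1.0], k=[0.0, 1.0], v=[1.0, 1.0],
                   g=[math.log(0.5), 0.0])
```

Each call reads and writes a state of the same shape, which is why decoding needs no per-token cache growth.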

Hybrid RoPE: even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000); odd layers are position-free. Tied embedding / LM head. No linear biases.
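The alternating schedule and the rotation itself can be sketched as follows. This is an illustration under assumptions rather than the released code: the helper names are hypothetical, and the interleaved pairing of (x[2i], x[2i+1]) is one common RoPE convention that the card does not confirm.

```python
import math

N_LAYERS = 16
ROPE_THETA = 100_000.0

def layer_uses_rope(layer_idx):
    """Even-indexed layers rotate q/k; odd-indexed layers are position-free."""
    return layer_idx % 2 == 0

def apply_rope(vec, position, theta=ROPE_THETA):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position * theta**(-2i/d).

    The interleaved pairing is an assumption; some implementations pair
    x[i] with x[i + d/2] instead.
    """
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

# Layers that apply RoPE under the schedule described above.
rope_layers = [i for i in range(N_LAYERS) if layer_uses_rope(i)]
```

Because each pair is rotated by an angle proportional to position, the dot product of a rotated q and k depends only on their position difference.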

Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16 the recurrent state is constant:

layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB

independent of generated length (the pure-PyTorch fallback promotes the scan state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache would grow linearly (≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K). This is a cache-state comparison only, not a claim about total memory or usable context.
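The figures above can be reproduced with a few lines of arithmetic. In this sketch the function names are hypothetical, and the comparison Transformer is assumed to use standard multi-head attention with the same depth (16) and hidden size (768), storing one key and one value vector per token per layer.

```python
MIB = 1024 * 1024

def recurrent_state_bytes(layers=16, heads=6, f_dim=128, i_dim=128,
                          bytes_per_elem=2, batch=1):
    """Size of the fixed recurrent state; it has no dependence on token count."""
    return batch * layers * heads * f_dim * i_dim * bytes_per_elem

def kv_cache_bytes(tokens, layers=16, hidden=768, bytes_per_elem=2, batch=1):
    """Size of an assumed full multi-head KV cache after `tokens` tokens."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return batch * layers * 2 * tokens * hidden * bytes_per_elem

for tokens in (1024, 16384):
    print(tokens, recurrent_state_bytes() / MIB, kv_cache_bytes(tokens) / MIB)
# prints:
# 1024 3.0 48.0
# 16384 3.0 768.0
```

Passing `bytes_per_elem=4` to `recurrent_state_bytes` gives the 6.0 MiB figure for the fp32 fallback path.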

Training snapshot

| Field | Value |
|---|---|
| Data | Dolmino mix (~101.6B tokens, odin-32k tokenizer) |
| Hardware | 2× AMD Strix Halo / gfx1151, ROCm 7.13 |
| Interconnect | Thunderbolt 4, DDP over gloo |
| Precision | fp16 + GradScaler |
| Optimizers | NorMuon (2D tensors) + AdamW (1D / embeddings) |
| LR | peak 8e-4, warmup, cosine decay |
| Stabilization | z-loss 1e-4, attention soft-cap 50, EMA decay 0.999 |
| Curriculum | Phase 1: Token-Superposition Training (bag-size 4) + DiffusionBlocks (block-wise) for ~24K steps; Phase 2: standard end-to-end autoregressive recovery |
| Released weights | main = ema_state_dict; live = raw online weights |

The two-phase curriculum spends most of the training budget on a block-wise DiffusionBlocks + token-superposition objective for throughput, then recovers ordinary left-to-right generation with a standard end-to-end phase. The released weights are from the end-to-end recovery phase and produce coherent continuations.
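The three stabilization terms listed in the training snapshot have widely used textbook forms, sketched below on plain Python lists. These are the common formulations and are assumed, not confirmed, to match the training code. The function names are hypothetical, and the card does not say where in the architecture the soft-cap is applied.

```python
import math

def soft_cap(x, cap=50.0):
    """Smoothly bound x to (-cap, cap) via cap * tanh(x / cap)."""
    return cap * math.tanh(x / cap)

def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log sum_i exp(logit_i)) ** 2."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return coeff * log_z ** 2

def ema_update(shadow, weights, decay=0.999):
    """shadow <- decay * shadow + (1 - decay) * weights, element-wise."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

# With decay 0.999, each step moves the shadow 0.1% of the way to the weights.
shadow = ema_update([0.0, 0.0], [1.0, -2.0])
```

With these definitions, the `main` revision corresponds to the shadow copy and `live` to the raw weights at the same step.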

Data & curation

Pretraining used the Dolmino mix (allenai/dolma3_dolmino_mix-100B-1025), curated by dropping the synthetic / noisy partitions and keeping the natural text + code:

  • Excluded: all synthetic reasoning-trace subsets (Gemini / QwQ / R1 / OpenThoughts2 / Llama-Nemotron, math- and code-meta-reasoning, omr-rewrite, verifiable GPT-4.1 / o4-mini), adult content, and OCR'd science PDFs.
  • Kept: natural web text, code (stack-edu, cranecode; FIM markers stripped), math, and reference text — the mix's native proportions minus the exclusions.
  • Tokenizer: custom 32K BPE (odin-32k); ~101.6B tokens after tokenization.

How we accelerated pretraining

Pretraining ran on two AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151 / RDNA 3.5) mini-PCs (128 GB unified LPDDR5X each), over Thunderbolt 4 with DDP on the gloo backend. Three techniques compounded:

  1. TST — Token Superposition Training (bag-size 4): each position is the mean of 4 stochastic sub-word tokenizations of the same text, so the model digests ~4× the tokens per step; the bag size anneals 4 → 2 → 1 over training.
  2. DiffusionBlocks (B=4): the 16 layers form 4 four-layer blocks trained to denoise their input, block-parallel across the two machines with essentially no gradient all-reduce (Machine A: blocks 1–2; Machine B: blocks 3–4) — ideal for a single Thunderbolt link.
  3. Two-machine DDP over TB4: unified memory lets gloo keep pace, and the block independence hides the modest interconnect bandwidth.
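The block layout in item 2 can be written out explicitly. This sketch assumes the four blocks are contiguous groups of layers in depth order and uses 0-based layer indices. The function name `partition_layers` is hypothetical and is not taken from the training scripts.

```python
def partition_layers(n_layers=16, n_blocks=4, machines=("A", "B")):
    """Split layers into equal contiguous blocks and assign blocks to machines."""
    layers_per_block = n_layers // n_blocks
    blocks = [
        list(range(b * layers_per_block, (b + 1) * layers_per_block))
        for b in range(n_blocks)
    ]
    blocks_per_machine = n_blocks // len(machines)
    return {
        name: blocks[m * blocks_per_machine:(m + 1) * blocks_per_machine]
        for m, name in enumerate(machines)
    }

placement = partition_layers()
# placement["A"] holds blocks 1-2 (layers 0-7); placement["B"] holds blocks 3-4 (layers 8-15).
```

Since each machine trains only its own blocks, little needs to cross the Thunderbolt link during this phase.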

Together this phase trained roughly 10–20× faster than a conventional end-to-end autoregressive pass on the same two machines (and far faster than a single accelerator) — which is what made a 101.6B-token pretrain feasible in days on consumer hardware. A final, shorter standard end-to-end phase then restores ordinary generation; the released weights (EMA, decay 0.999) come from it.

Results

Zero-shot, our own harness (scripts/eval_benchmarks.py; HellaSwag = acc_norm, ARC = mean of Easy + Challenge acc, PIQA = acc). Other rows are as reported by Axiomic Labs on the GPT-X2-125M card and are not perfectly comparable (different harness).

| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| This work | OdinNext-138M-Base | 33.05% | 34.29% | 58.81% | 101.6B |

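OdinNext lands in the GPT-2 / OPT / Pythia / GPT-Neo tier here: its HellaSwag and ARC scores are slightly above those models, while its PIQA score (58.81%) is below all of them except Pythia-160M. It sits below the SmolLM2 / GPT-X2 frontier, but was trained entirely on two consumer AMD mini-PCs. An instruction-tuned variant is at joelhenwang/OdinNext-138M-Instruct.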
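For readers unfamiliar with the metric names, the sketch below shows one common way multiple-choice scores are computed: acc picks the answer with the highest log-likelihood, and acc_norm first divides each log-likelihood by the answer's character length. This follows the lm-evaluation-harness convention and is assumed, not confirmed, to match scripts/eval_benchmarks.py. The helper names and the toy numbers are made up for illustration, as is reading "mean of Easy + Challenge" as an unweighted mean.

```python
def pick_acc(logliks):
    """Index of the choice with the highest raw log-likelihood."""
    return max(range(len(logliks)), key=lambda i: logliks[i])

def pick_acc_norm(logliks, choices):
    """Index of the choice with the highest length-normalized log-likelihood."""
    return max(range(len(logliks)), key=lambda i: logliks[i] / len(choices[i]))

def arc_average(acc_easy, acc_challenge):
    """Unweighted mean of the two ARC accuracies (an assumed definition)."""
    return (acc_easy + acc_challenge) / 2.0

# Toy example: the short answer wins on raw score, the long one after normalizing.
choices = ["yes", "a much longer ending"]
logliks = [-6.0, -10.0]
raw_choice = pick_acc(logliks)
norm_choice = pick_acc_norm(logliks, choices)
```

The two selectors can disagree on the same item, which is one reason scores from different harnesses are not directly comparable.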

What this model is good for

  • Text continuation and completion in English.
  • Research on compact recurrent / linear-attention LMs and fixed-state decoding.
  • A base for instruction tuning, alignment, and context extension.

Do not use it for chat or instruction following (this checkpoint is not instruction-tuned), safety-sensitive generation, or benchmark claims without running your own evaluation.

Usage

pip install "transformers>=4.46" torch safetensors

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Base"
revision = "main"  # EMA weights; pin a commit for reproducibility

# Use fp16 on GPU (the checkpoint dtype) and fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
# trust_remote_code=True executes Python from the repo; review it first.
model = AutoModelForCausalLM.from_pretrained(
    repo, revision=revision, trust_remote_code=True, torch_dtype=dtype,
).to(device).eval()

prompt = "The discovery of penicillin"
inputs = tok(prompt, return_tensors="pt").to(device)

# Keep prompt + generated tokens within the 2,048-position limit.
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
if remaining <= 0:
    raise ValueError("Prompt already fills the context window; shorten it.")

# Fall back to EOS if the tokenizer defines no pad token (a defensive assumption).
pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=min(100, remaining),
        do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.1,
        pad_token_id=pad_id, use_cache=True,
    )
print(tok.decode(out[0], skip_special_tokens=True))

Batching guidance

The recurrent scan does not apply an attention mask, so pad tokens are processed like real tokens and alter the recurrent state. For correct batched generation: avoid left padding, prefer prompts of identical token length (so that no padding is needed), and verify batched output against single-sample output before relying on it. Single-prompt generation is the safest path.
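One way to follow the same-length advice is to bucket tokenized prompts by length and run each bucket as its own batch, so no padding is introduced. The helper below is a hypothetical sketch operating on plain lists of token IDs. It is not part of the released code.

```python
from collections import defaultdict

def group_by_length(token_id_lists):
    """Map each prompt length to the indices of the prompts with that length."""
    groups = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        groups[len(ids)].append(idx)
    return dict(groups)

# Four tokenized prompts (toy IDs): two of length 3 and two of length 2.
prompts = [[5, 6, 7], [1, 2], [8, 9, 10], [3, 4]]
buckets = group_by_length(prompts)
# buckets == {3: [0, 2], 2: [1, 3]}
```

Each bucket can then be stacked into a rectangular batch without a pad token, and the outputs mapped back through the stored indices.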

Limitations

  • Base model only: no instruction tuning, alignment, or chat template.
  • No safety training: outputs can be biased, false, or incoherent.
  • Hard 2,048-token cap: recurrent state is constant, but the released RoPE cache limits cumulative positions to 2,048.
  • attention_mask ignored in the backbone; padding affects recurrent state.
  • English-focused; multilingual / code ability is uncharacterized.
  • Benchmarks above are zero-shot on our own harness and not perfectly comparable across tooling — run your own evaluation.

Revisions

  • main: EMA-shadowed weights (decay 0.999), recommended for evaluation.
  • live: raw training weights at the same step.

Pin a commit hash rather than a moving branch for reproducible experiments.

Citation

@misc{odinnext_138m_base_2026,
  title        = {OdinNext-138M-Base},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Base}},
  note         = {138M HGRN2 recurrent language-model base checkpoint}
}

References

  • Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
  • Bowen Peng et al. Efficient Pre-Training with Token Superposition. arXiv:2605.06546.
  • Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
  • Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.