OdinNext-138M-Early-Checkpoint

Early research checkpoint of OdinNext, a 138M-parameter causal language model using an HGRN2-style gated linear recurrence instead of softmax self-attention.

This is not a chat model and not a production release. It is an early pretraining checkpoint intended for architecture inspection, qualitative sampling, and continued research.

  • Repo: joelhenwang/OdinNext-138M-Early-Checkpoint
  • Recommended revision: main / EMA-shadowed weights
  • Training status: early checkpoint at step 3,259
  • Context window: 2,048 tokens in the released inference code
  • License: Apache-2.0

The model uses custom Transformers code. Loading it with trust_remote_code=True executes Python code from this repository. Only do this after reviewing the files or pinning a known commit.

At a glance

| Item | Value |
| --- | --- |
| Unique tied parameters | 138,449,696 |
| Non-embedding parameters | 113,283,872 |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,768 custom BPE tokens |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + RMSNorm-style normalization |
| Cache type | Fixed recurrent state, not a growing Transformer KV cache |

What this checkpoint is good for

Use this checkpoint for:

  • inspecting a compact recurrent/linear-attention LM implementation;
  • testing HGRN2-style recurrent decoding inside the Hugging Face generate() API;
  • studying fixed-state decoding memory behavior;
  • continuing pretraining or running controlled ablations.

Do not use it for:

  • chat, instruction following, or agentic tasks;
  • safety-sensitive output generation;
  • benchmark claims without running your own evaluation;
  • multilingual, coding, or long-context claims.

Architecture

OdinNext is a decoder-only causal LM. Each block uses a pre-norm residual layout:

x = x + sigmoid(gate_attn) * HGRN2(norm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(norm(x))

The HGRN2-style recurrent state is updated per token as:

S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t

where each layer keeps a per-batch recurrent state shaped:

[B, n_heads, head_f_dim, head_i_dim]

For this checkpoint:

n_heads    = 6
head_f_dim = 128
head_i_dim = 128
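
To make the update rule concrete, here is a minimal pure-Python sketch of one recurrent step for a single head, using tiny dimensions instead of 128 × 128. The function name `hgrn2_step`, the list-of-lists state, and the choice of non-positive log-decay values are illustrative assumptions; the released code works on batched tensors and may differ in detail.

```python
import math

def hgrn2_step(S, q, k, v, g):
    """One recurrent step for a single head (illustrative, not the released kernel).

    S: state as a list of rows, shape [f_dim][i_dim]
    q, k, g: length f_dim; v: length i_dim
    g holds log-decay values (assumed <= 0 here, so exp(g) lies in (0, 1]).
    """
    f_dim, i_dim = len(k), len(v)
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
    S_new = [
        [math.exp(g[i]) * S[i][j] + k[i] * v[j] for j in range(i_dim)]
        for i in range(f_dim)
    ]
    # o_t = q_t S_t (row vector times matrix)
    o = [sum(q[i] * S_new[i][j] for i in range(f_dim)) for j in range(i_dim)]
    return S_new, o

# Tiny example: f_dim = 2, i_dim = 2, state starts at zero.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, o1 = hgrn2_step(S0, q=[1.0, 1.0], k=[1.0, 0.0], v=[2.0, 3.0], g=[0.0, 0.0])
S2, o2 = hgrn2_step(S1, q=[1.0, 1.0], k=[0.0, 1.0], v=[1.0, 1.0],
                    g=[math.log(0.5), 0.0])
print(o1, o2)
```

The state keeps the same shape after every step, which is why decoding memory does not grow with the number of tokens processed.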

Even-numbered layers apply RoPE to q and k; odd-numbered layers are position-free. The current inference implementation still enforces a hard 2,048-token cumulative position limit because the RoPE cache is built for max_seq_len = 2048.

Important implementation details

  • The exported Hugging Face code contains only the inference path. Training-time machinery is not part of this repository.
  • past_key_values is an OdinNextCache, a list of recurrent states. It is not a Transformer KV cache.
  • attention_mask is accepted for API compatibility but ignored by the backbone. Left-padding is not supported.
  • Batched generation is safest when all prompts have the same valid length. Any pad token that is processed updates the recurrent state exactly like a real token.
  • use_cache=True is important for generation. Without it, every generation step reprocesses the full prefix.

Parameter accounting

The 138M headline is the unique tied-parameter runtime count. The input embedding and LM head are tied and should be counted once for model-capacity comparisons.

Hugging Face or file-size-derived parameter summaries may round this checkpoint near 0.2B because stored checkpoint tensors and tied runtime parameters are not always counted the same way.
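
The figures in this card can be cross-checked with a few lines of arithmetic. The sketch below confirms that the gap between the tied total and the non-embedding count is exactly one vocabulary × hidden embedding matrix. It then shows what the total would be if the LM head were counted as a separate tensor. That second step is an assumption about how a summary might arrive near 0.2B, not a confirmed description of how any tool counts.

```python
vocab_size = 32_768
hidden_size = 768
tied_total = 138_449_696       # unique tied parameters (from this card)
non_embedding = 113_283_872    # non-embedding parameters (from this card)

embedding = vocab_size * hidden_size   # one [vocab, hidden] matrix
assert tied_total - non_embedding == embedding

# Assumption: a summary that counts the LM head separately from the embedding.
untied_total = tied_total + embedding

print(f"embedding matrix : {embedding:,}")
print(f"tied total       : {tied_total:,} (~{tied_total / 1e9:.1f}B)")
print(f"untied total     : {untied_total:,} (~{untied_total / 1e9:.1f}B)")
```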

Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16, OdinNext's recurrent state size is:

layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2
= 3,145,728 bytes ≈ 3.0 MiB

That state is constant with respect to generated context length. It scales linearly with batch size and with dtype size. In the pure-PyTorch fallback path, the scan state is promoted to fp32, so the returned recurrent state can be about 6.0 MiB per sequence instead of 3.0 MiB.
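
The same calculation can be wrapped in a small helper to see how the state scales with batch size and dtype. The function name `recurrent_state_bytes` is hypothetical; the default values are the ones listed in this card.

```python
def recurrent_state_bytes(batch_size=1, layers=16, heads=6,
                          head_f_dim=128, head_i_dim=128, bytes_per_elem=2):
    """Total recurrent-state size in bytes; independent of context length."""
    return batch_size * layers * heads * head_f_dim * head_i_dim * bytes_per_elem

MIB = 1024 ** 2
fp16_state = recurrent_state_bytes()                   # fp16, batch 1
fp32_state = recurrent_state_bytes(bytes_per_elem=4)   # fp32 fallback, batch 1
batch8_fp16 = recurrent_state_bytes(batch_size=8)      # fp16, batch 8

print(fp16_state / MIB, fp32_state / MIB, batch8_fp16 / MIB)
```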

A same-depth 16-layer, d_model = 768, fp16 Transformer with full multi-head K/V cache would use approximately:

layers × 2(K,V) × hidden_size × context_tokens × bytes
= 16 × 2 × 768 × T × 2
= 49,152 × T bytes

| Context tokens | Typical Transformer KV cache | OdinNext recurrent state |
| --- | --- | --- |
| 1,024 | 48 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 4,096 | 192 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 16,384 | 768 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 65,536 | 3,072 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |

This table is a cache-state comparison only. It is not a claim about total GPU memory, throughput, benchmark quality, or usable context length. The released OdinNext code is still limited to 2,048 cumulative positions.
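
The comparison can be reproduced with the short sketch below, which assumes the same hypothetical baseline as the text: a 16-layer, d_model = 768, fp16 Transformer with a full multi-head K/V cache and batch size 1. The helper name `kv_cache_bytes` is illustrative. The last line computes the context length at which that KV cache would equal the fixed fp16 recurrent state.

```python
def kv_cache_bytes(context_tokens, layers=16, hidden_size=768, bytes_per_elem=2):
    """Full multi-head K/V cache size for a hypothetical same-depth Transformer."""
    return layers * 2 * hidden_size * context_tokens * bytes_per_elem

STATE_BYTES = 16 * 6 * 128 * 128 * 2   # OdinNext fp16 recurrent state, batch 1
MIB = 1024 ** 2

for tokens in (1_024, 4_096, 16_384, 65_536):
    print(f"{tokens:>6} tokens: KV {kv_cache_bytes(tokens) / MIB:>6.0f} MiB, "
          f"state {STATE_BYTES / MIB:.0f} MiB")

# Context length at which the KV cache matches the fixed state size.
break_even_tokens = STATE_BYTES // kv_cache_bytes(1)
print("break-even context:", break_even_tokens, "tokens")
```

Under these assumptions the two sizes match at 64 tokens; beyond that, the KV cache keeps growing linearly while the recurrent state stays fixed.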

Training snapshot

Values verified from the public config:

| Field | Value |
| --- | --- |
| _training_step | 3,259 |
| _total_tokens | 6,835,666,944 |
| _weights_source | ema_state_dict |
| torch_dtype | float16 |
| max_position_embeddings | 2,048 |

Author-reported training notes for this early checkpoint:

| Item | Value |
| --- | --- |
| Hardware | 2× AMD Strix Halo / gfx1151, ROCm stack |
| Training precision | fp16 + GradScaler |
| Optimizers | NorMuon for 2D tensors; AdamW for 1D/embed tensors |
| LR schedule | WSD, peak 8e-4, warmup 500, min LR 0.1× peak |
| Stabilization | z-loss 1e-4, attention soft-cap 50, EMA decay 0.999 |
| Curriculum | TST-style bag-size-4 phase active at this checkpoint |
| Public benchmarks | not yet provided |
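
For readers unfamiliar with a warmup-stable-decay (WSD) schedule, the sketch below shows one common form using the peak, warmup, and minimum-ratio values reported above. The lengths of the stable and decay phases and the linear decay shape are not stated in this card, so `stable_steps`, `decay_steps`, and the decay formula are hypothetical placeholders rather than the author's settings.

```python
def wsd_lr(step, peak_lr=8e-4, warmup_steps=500, min_lr_ratio=0.1,
           stable_steps=10_000, decay_steps=2_000):
    """Warmup-stable-decay learning rate (illustrative sketch).

    stable_steps and decay_steps are hypothetical; the card does not state them.
    The linear decay shape is also an assumption.
    """
    min_lr = peak_lr * min_lr_ratio
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps    # linear warmup
    decay_start = warmup_steps + stable_steps
    if step < decay_start:
        return peak_lr                                # stable plateau
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress    # linear decay to min_lr

for step in (0, 250, 499, 3_259, 11_500, 20_000):
    print(step, f"{wsd_lr(step):.2e}")
```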

Token accounting note

The public config records _total_tokens = 6,835,666,944. Do not reinterpret that as plain next-token positions from:

3,259 optimizer steps × 256 effective sequences × 2,048 tokens
= 1,708,654,592 position tokens

The 6.84B figure is about four times that position count, which matches the bag size of 4 reported for this checkpoint. It therefore appears to be token-superposition (original-token-equivalent) accounting rather than simple next-token position accounting. A full reproducibility report should define whether the total counts original text tokens, bagged targets, loss terms, or optimizer-position tokens.
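
The relationship between the two counts can be checked directly. This sketch recomputes the position-token product from the step count, effective batch, and sequence length quoted above, then compares it with the `_total_tokens` value from the public config. It only shows the ratio; it does not establish how the author's accounting actually works.

```python
optimizer_steps = 3_259
effective_sequences = 256
sequence_length = 2_048
recorded_total_tokens = 6_835_666_944   # _total_tokens from the public config

position_tokens = optimizer_steps * effective_sequences * sequence_length
ratio = recorded_total_tokens / position_tokens

print(f"position tokens : {position_tokens:,}")
print(f"recorded total  : {recorded_total_tokens:,}")
print(f"ratio           : {ratio:.4f}")
```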

TST note

The cited Token-Superposition Training paper defines TST as a two-phase method: a superposition phase that combines contiguous tokens into bags and uses a multi-hot cross-entropy objective, followed by a recovery phase that returns to ordinary next-token training.

This checkpoint is described as still being in a bag-size-4 phase, so ordinary single-stream autoregressive inference may not match the objective the weights are currently being trained on. Treat quality as preliminary until a bag-size-1 recovery checkpoint and benchmark results are published.

Usage with Transformers

Install the basics:

pip install "transformers>=4.46" torch safetensors

Optional: install flash-linear-attention if your platform supports it. Without it, the model falls back to a pure-PyTorch reference implementation that is useful for correctness and portability but slower for long prompts.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"
# For reproducible experiments, replace "main" with a specific commit hash.
revision = "main"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision=revision,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device).eval()

prompt = "The night was quiet and the streets were empty"
inputs = tok(prompt, return_tensors="pt").to(device)

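# The released code is capped at 2,048 cumulative positions.
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
if remaining <= 0:
    # generate() needs at least one new token, so fail early with a clear message.
    raise ValueError("Prompt already fills the position limit; shorten the prompt.")
max_new_tokens = min(80, remaining)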

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id,
        use_cache=True,
    )

print(tok.decode(out[0], skip_special_tokens=True))

Batching guidance

The model's recurrent scan does not apply an attention mask. For correct batched generation:

  • avoid left padding;
  • prefer same-length prompts in a batch;
  • avoid processing pad tokens as if they were real prompt tokens;
  • test batched output against single-sample output before relying on batched generation.

Single-prompt generation is the safest path for basic use.
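
One way to follow the same-length guidance is to bucket tokenized prompts by length before batching, so that no padding is needed inside a batch. The helper below is a minimal sketch: `group_by_length` and `max_batch_size` are hypothetical names, and the integer lists stand in for real tokenizer output.

```python
from collections import defaultdict

def group_by_length(token_id_lists, max_batch_size=8):
    """Yield batches of (original_index, token_ids) whose prompts share one length.

    No padding is needed inside a batch, so no pad token reaches the recurrence.
    """
    buckets = defaultdict(list)
    for index, ids in enumerate(token_id_lists):
        buckets[len(ids)].append((index, ids))
    for length in sorted(buckets):
        bucket = buckets[length]
        for start in range(0, len(bucket), max_batch_size):
            yield bucket[start:start + max_batch_size]

# Hypothetical token ids standing in for tokenizer output.
prompts = [[5, 6, 7], [1, 2], [8, 9, 10], [3, 4], [11]]
batches = list(group_by_length(prompts, max_batch_size=2))
for batch in batches:
    print([index for index, _ in batch], "length", len(batch[0][1]))
```

The original indices are kept so that outputs can be put back in input order. Batched results should still be compared against single-sample output before they are relied on.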

Known limitations

  • No instruction tuning: no SFT, DPO, RLHF, RLAIF, or chat template.
  • No safety training: outputs can be unsafe, biased, false, or incoherent.
  • Early quality: this is about 3% of the planned pretraining budget according to the original release notes.
  • No formal benchmarks yet: HellaSwag, ARC, MMLU, perplexity suites, and long-context tests are not provided here.
  • Hard 2,048-token cap: recurrent cache size is constant, but the released RoPE cache still limits positions.
  • Masking caveat: attention_mask is ignored in the backbone; padding can affect recurrent state.
  • English-focused: multilingual and code generation should be assumed weak unless tested.
  • bf16 unvalidated: fp16 is the intended inference dtype for this checkpoint; CPU fallback should use fp32 for portability.
  • Training data not fully documented in this card: treat data provenance, memorization risk, and bias profile as uncharacterized unless separately documented.

Revisions

  • main: EMA-shadowed weights from _weights_source = ema_state_dict; recommended for evaluation.
  • live: raw training weights at step 3,259, if this branch is retained.

For reproducible experiments, pin a commit hash rather than a moving branch name.

Citation

@misc{odinnext_138m_early_2026,
  title        = {OdinNext-138M-Early-Checkpoint},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
  note         = {Early HGRN2 recurrent language-model checkpoint}
}
