OdinNext-138M-Early-Checkpoint

Early research checkpoint of OdinNext, a 138M-parameter causal language model using an HGRN2-style gated linear recurrence instead of softmax self-attention.

This is not a chat model and not a production release. It is an early pretraining checkpoint intended for architecture inspection, qualitative sampling, and continued research.

Repo: joelhenwang/OdinNext-138M-Early-Checkpoint
Recommended revision: main / EMA-shadowed weights
Training status: early checkpoint at step 3,259
Context window: 2,048 tokens in the released inference code
License: Apache-2.0

The model uses custom Transformers code. Loading it with trust_remote_code=True executes Python code from this repository. Review the files first, and pin a known commit so that the code you reviewed is the code that runs.

At a glance

Item	Value
Unique tied parameters	138,449,696
Non-embedding parameters	113,283,872
Layers	16
Hidden size	768
Heads	6
Head state dims	128 × 128 per head
FFN inner size	2,048
Vocabulary	32,768 custom BPE tokens
Max sequence length	2,048
Checkpoint dtype	fp16
Architecture	HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + RMSNorm-style normalization
Cache type	Fixed recurrent state, not a growing Transformer KV cache

What this checkpoint is good for

Use this checkpoint for:

inspecting a compact recurrent/linear-attention LM implementation;
testing HGRN2-style recurrent decoding inside the Hugging Face generate() API;
studying fixed-state decoding memory behavior;
continuing pretraining or running controlled ablations.

Do not use it for:

chat, instruction following, or agentic tasks;
safety-sensitive output generation;
benchmark claims without running your own evaluation;
multilingual, coding, or long-context claims.

Architecture

OdinNext is a decoder-only causal LM. Each block uses a pre-norm residual layout:

x = x + sigmoid(gate_attn) * HGRN2(norm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(norm(x))

The HGRN2-style recurrent state is updated per token as:

S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t

where each layer keeps a per-batch recurrent state shaped:

[B, n_heads, head_f_dim, head_i_dim]

For this checkpoint:

n_heads    = 6
head_f_dim = 128
head_i_dim = 128
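The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration for a single head of a single sequence, not the repository's implementation: the function names hgrn2_step and run_sequence are hypothetical, g_t is assumed to hold log-decay values, and the real state stacks this F × I matrix over batch and heads.

```python
import math

def hgrn2_step(state, q_t, k_t, v_t, g_t):
    """Advance one head's state by one token.

    state: F x I nested list; q_t, k_t, g_t: length F; v_t: length I.
    g_t is assumed to hold log-decays, so exp(g_t[f]) scales row f.
    """
    n_f, n_i = len(k_t), len(v_t)
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer) v_t
    new_state = [
        [math.exp(g_t[f]) * state[f][i] + k_t[f] * v_t[i] for i in range(n_i)]
        for f in range(n_f)
    ]
    # o_t = q_t S_t: contract the query against the F dimension of the state.
    o_t = [sum(q_t[f] * new_state[f][i] for f in range(n_f)) for i in range(n_i)]
    return new_state, o_t

def run_sequence(qs, ks, vs, gs):
    """Scan a whole sequence from a zero state; returns (outputs, final_state)."""
    n_f, n_i = len(ks[0]), len(vs[0])
    state = [[0.0] * n_i for _ in range(n_f)]
    outputs = []
    for q_t, k_t, v_t, g_t in zip(qs, ks, vs, gs):
        state, o_t = hgrn2_step(state, q_t, k_t, v_t, g_t)
        outputs.append(o_t)
    return outputs, state
```

However many tokens run_sequence consumes, the returned state keeps the same F × I shape, which is the property the fixed-size cache relies on.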

Even-numbered layers apply RoPE to q and k; odd-numbered layers are position-free. The current inference implementation still enforces a hard 2,048-token cumulative position limit because the RoPE cache is built for max_seq_len = 2048.

Important implementation details

The exported Hugging Face code contains only the inference path. Training-time machinery is not part of this repository.
past_key_values is an OdinNextCache, a list of recurrent states. It is not a Transformer KV cache.
attention_mask is accepted for API compatibility but ignored by the backbone. Left-padding is not supported.
Batched generation is safest when all prompts have the same valid length. Any padding token that is processed updates the recurrent state just like a real token.
use_cache=True is important for generation. Without it, every generation step reprocesses the full prefix.
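The cost of generating without the cache can be estimated with a simple counting model. The sketch below assumes that, without the cache, each forward pass is fed the entire sequence so far, and that with the cache the prompt is processed once and each later pass sees one token. The function name tokens_processed is illustrative and not part of the repository.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating.

    Idealized model: ignores framework overhead and early stopping.
    """
    if new_tokens <= 0:
        return 0
    if use_cache:
        # One pass over the prompt, then one token per remaining step.
        return prompt_len + (new_tokens - 1)
    # Every step re-reads the prompt plus everything generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

if __name__ == "__main__":
    for cached in (True, False):
        print(cached, tokens_processed(100, 80, use_cache=cached))
```

Under these assumptions a 100-token prompt with 80 new tokens costs 179 positions with the cache and 11,160 without it.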

Parameter accounting

The 138M headline is the unique tied-parameter runtime count. The input embedding and LM head are tied and should be counted once for model-capacity comparisons.

Parameter summaries shown by Hugging Face, or derived from file size, may report roughly 0.2B for this checkpoint because stored checkpoint tensors and tied runtime parameters are not always counted the same way.
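The figures in this card can be cross-checked with a little arithmetic. The sketch below assumes the embedding is a plain vocabulary × hidden-size matrix with no extra terms. The "untied" total is a hypothetical illustration of what a summary would show if the embedding and LM head were counted as two separate tensors, not a statement about how the checkpoint is actually stored.

```python
VOCAB_SIZE = 32_768
HIDDEN_SIZE = 768
UNIQUE_TIED_PARAMS = 138_449_696  # headline count from this card

# One embedding matrix, shared with the LM head when weights are tied.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Everything that is not the shared embedding matrix.
non_embedding_params = UNIQUE_TIED_PARAMS - embedding_params

# Hypothetical: count the embedding and the LM head as separate tensors.
untied_total = UNIQUE_TIED_PARAMS + embedding_params

print(f"embedding:     {embedding_params:,}")
print(f"non-embedding: {non_embedding_params:,}")
print(f"tied total:    {UNIQUE_TIED_PARAMS / 1e9:.1f}B")
print(f"untied total:  {untied_total / 1e9:.1f}B")
```

The non-embedding result matches the 113,283,872 listed under At a glance, and the tied and untied totals round to 0.1B and 0.2B respectively.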

Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16, OdinNext's recurrent state size is:

layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2
= 3,145,728 bytes ≈ 3.0 MiB

That state is constant with respect to generated context length. It scales linearly with batch size and with dtype size. In the pure-PyTorch fallback path, the scan state is promoted to fp32, so the returned recurrent state can be about 6.0 MiB per sequence instead of 3.0 MiB.

A same-depth 16-layer, d_model = 768, fp16 Transformer with full multi-head K/V cache would use approximately:

layers × 2(K,V) × hidden_size × context_tokens × bytes
= 16 × 2 × 768 × T × 2
= 49,152 × T bytes (48 KiB per token)

Context tokens	Typical Transformer KV cache	OdinNext recurrent state
1,024	48 MiB	~3 MiB fp16 / ~6 MiB fp32 fallback
4,096	192 MiB	~3 MiB fp16 / ~6 MiB fp32 fallback
16,384	768 MiB	~3 MiB fp16 / ~6 MiB fp32 fallback
65,536	3,072 MiB	~3 MiB fp16 / ~6 MiB fp32 fallback

This table is a cache-state comparison only. It is not a claim about total GPU memory, throughput, benchmark quality, or usable context length. The released OdinNext code is still limited to 2,048 cumulative positions.
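The table values follow directly from the two formulas above and can be reproduced with a short script. The function names are illustrative; the defaults are the dimensions listed in this card, and the Transformer baseline is the same hypothetical 16-layer, d_model = 768 model with a full multi-head K/V cache.

```python
MIB = 1024 ** 2

def recurrent_state_bytes(layers=16, heads=6, head_f_dim=128, head_i_dim=128,
                          bytes_per_elem=2, batch=1):
    """Size of the fixed recurrent state; independent of context length."""
    return batch * layers * heads * head_f_dim * head_i_dim * bytes_per_elem

def kv_cache_bytes(context_tokens, layers=16, hidden_size=768,
                   bytes_per_elem=2, batch=1):
    """Size of a full K/V cache for the comparison Transformer."""
    return batch * layers * 2 * hidden_size * context_tokens * bytes_per_elem

if __name__ == "__main__":
    fp16_state = recurrent_state_bytes() / MIB
    fp32_state = recurrent_state_bytes(bytes_per_elem=4) / MIB
    for tokens in (1_024, 4_096, 16_384, 65_536):
        kv = kv_cache_bytes(tokens) / MIB
        print(f"{tokens:>6} tokens: KV {kv:>6.0f} MiB | "
              f"state {fp16_state:.0f} MiB fp16 / {fp32_state:.0f} MiB fp32")
```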

Training snapshot

Values verified from the public config:

Field	Value
`_training_step`	3,259
`_total_tokens`	6,835,666,944
`_weights_source`	`ema_state_dict`
`torch_dtype`	`float16`
`max_position_embeddings`	2,048

Author-reported training notes for this early checkpoint:

Item	Value
Hardware	2× AMD Strix Halo / gfx1151, ROCm stack
Training precision	fp16 + GradScaler
Optimizers	NorMuon for 2D tensors; AdamW for 1D/embed tensors
LR schedule	WSD, peak `8e-4`, warmup 500, min LR 0.1× peak
Stabilization	z-loss `1e-4`, attention soft-cap 50, EMA decay 0.999
Curriculum	TST-style bag-size-4 phase active at this checkpoint
Public benchmarks	not yet provided

Token accounting note

The public config records _total_tokens = 6,835,666,944. Do not reinterpret that as plain next-token positions from:

3,259 optimizer steps × 256 effective sequences × 2,048 tokens
= 1,708,654,592 position tokens

The 6.84B figure is almost exactly 4× that position count, which matches the bag-size-4 phase, so it appears to be token-superposition/original-token-equivalent accounting rather than simple next-token position accounting. A full reproducibility report should define whether the total counts original text tokens, bagged targets, loss terms, or optimizer-position tokens.
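The relationship between the two counts can be checked directly. The sketch below uses only the step count, batch shape, and _total_tokens value quoted in this section. Reading the ratio as a bag size is an interpretation, not something the config states.

```python
STEPS = 3_259
EFFECTIVE_SEQUENCES = 256
SEQ_LEN = 2_048
CONFIG_TOTAL_TOKENS = 6_835_666_944  # _total_tokens from the public config

# Plain next-token positions seen by the optimizer.
position_tokens = STEPS * EFFECTIVE_SEQUENCES * SEQ_LEN

# How many config-counted tokens correspond to one position.
ratio = CONFIG_TOTAL_TOKENS / position_tokens

print(f"position tokens: {position_tokens:,}")
print(f"config total / position tokens: {ratio:.4f}")
```

A ratio close to 4 is what one would expect if each position stands for a bag of four original tokens.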

TST note

The cited Token-Superposition Training paper defines TST as a two-phase method: a superposition phase that combines contiguous tokens into bags and uses a multi-hot cross-entropy objective, followed by a recovery phase that returns to ordinary next-token training.

This checkpoint is described as still being in a bag-size-4 phase. That means ordinary single-stream autoregressive inference does not necessarily match the distribution the model is currently being trained on. Treat quality as preliminary until a bag-size-1 recovery checkpoint and benchmark results are published.
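To make the idea of bags and a multi-hot objective concrete, here is a toy sketch in plain Python. It is one possible reading of the description above, not the paper's or the author's implementation: the helper names are hypothetical, a trailing partial bag is simply dropped, and the target is assumed to be a uniform distribution over the tokens in a bag.

```python
import math

def make_bags(token_ids, bag_size):
    """Group contiguous tokens into bags; a trailing partial bag is dropped."""
    usable = len(token_ids) - len(token_ids) % bag_size
    return [token_ids[i:i + bag_size] for i in range(0, usable, bag_size)]

def multi_hot_cross_entropy(logits, target_bag):
    """Cross-entropy against a uniform target over the tokens in target_bag.

    logits: one score per vocabulary entry; target_bag: list of token ids.
    """
    peak = max(logits)
    # Numerically stable log of the softmax normalizer.
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Average negative log-probability of the bag's tokens.
    return -sum(logits[t] - log_z for t in target_bag) / len(target_bag)

if __name__ == "__main__":
    bags = make_bags([5, 1, 7, 2, 3, 3, 0, 6, 4], bag_size=4)
    print(bags)
    print(multi_hot_cross_entropy([0.0] * 8, bags[1]))
```

With bag_size=1 the loss reduces to ordinary next-token cross-entropy, which corresponds to the recovery phase described above.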

Usage with Transformers

Install the basics:

pip install "transformers>=4.46" torch safetensors

Optional: install flash-linear-attention if your platform supports it. Without it, the model falls back to a pure-PyTorch reference implementation that is useful for correctness and portability but slower for long prompts.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"
# For reproducible experiments, replace "main" with a specific commit hash.
revision = "main"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision=revision,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device).eval()

prompt = "The night was quiet and the streets were empty"
inputs = tok(prompt, return_tensors="pt").to(device)

# The released code is capped at 2,048 cumulative positions.
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
if remaining <= 0:
    raise ValueError("Prompt already fills the position limit; shorten it.")
max_new_tokens = min(80, remaining)

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id,
        use_cache=True,
    )

print(tok.decode(out[0], skip_special_tokens=True))

Batching guidance

The model's recurrent scan does not apply an attention mask. For correct batched generation:

avoid left padding;
prefer same-length prompts in a batch;
avoid processing pad tokens as if they were real prompt tokens;
test batched output against single-sample output before relying on batched generation.

Single-prompt generation is the safest path for basic use.
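If batching is still needed, one way to follow the guidance above is to bucket prompts by token length so that every batch is rectangular and needs no padding. The helper below is a sketch with a hypothetical name. It works on already-tokenized prompts (lists of token ids) and leaves the actual generate() call to the caller.

```python
from collections import defaultdict

def group_by_length(tokenized_prompts):
    """Map token length -> list of (original_index, token_ids).

    Prompts in the same bucket can be stacked without any pad tokens.
    """
    buckets = defaultdict(list)
    for index, token_ids in enumerate(tokenized_prompts):
        buckets[len(token_ids)].append((index, token_ids))
    return dict(buckets)

if __name__ == "__main__":
    prompts = [[11, 12, 13], [21, 22], [31, 32, 33], [41, 42]]
    for length, items in sorted(group_by_length(prompts).items()):
        print(length, [index for index, _ in items])
```

The stored indices let results be returned in the original prompt order. Comparing a few batched outputs against single-prompt runs remains a worthwhile check.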

Known limitations

No instruction tuning: no SFT, DPO, RLHF, RLAIF, or chat template.
No safety training: outputs can be unsafe, biased, false, or incoherent.
Early quality: this checkpoint covers about 3% of the planned pretraining budget, according to the original release notes.
No formal benchmarks yet: HellaSwag, ARC, MMLU, perplexity suites, and long-context tests are not provided here.
Hard 2,048-token cap: recurrent cache size is constant, but the released RoPE cache still limits positions.
Masking caveat: attention_mask is ignored in the backbone; padding can affect recurrent state.
English-focused: multilingual and code generation should be assumed weak unless tested.
bf16 unvalidated: fp16 is the intended inference dtype for this checkpoint; CPU fallback should use fp32 for portability.
Training data not fully documented in this card: treat data provenance, memorization risk, and bias profile as uncharacterized unless separately documented.

Revisions

main: EMA-shadowed weights from _weights_source = ema_state_dict; recommended for evaluation.
live: raw training weights at step 3,259, if this branch is retained.

For reproducible experiments, pin a commit hash rather than a moving branch name.
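A branch name can be turned into a commit hash before loading. The sketch below uses HfApi.model_info from huggingface_hub, whose result carries the resolved commit in its sha attribute. The wrapper name resolve_commit and the optional api argument are illustrative additions, and the lookup needs network access.

```python
def resolve_commit(repo_id, revision="main", api=None):
    """Return the commit hash that `revision` currently points to."""
    if api is None:
        # Imported lazily so the helper can be exercised with a stand-in client.
        from huggingface_hub import HfApi
        api = HfApi()
    return api.model_info(repo_id, revision=revision).sha

if __name__ == "__main__":
    repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"
    print(resolve_commit(repo, "main"))
```

The returned hash can then be passed as revision= to from_pretrained so that later pushes to the branch do not change what gets loaded.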

Citation

@misc{odinnext_138m_early_2026,
  title        = {OdinNext-138M-Early-Checkpoint},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
  note         = {Early HGRN2 recurrent language-model checkpoint}
}

References

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904. https://arxiv.org/abs/2404.07904
Bowen Peng, Théo Gigant, Jeffrey Quesnelle. Efficient Pre-Training with Token Superposition. arXiv:2605.06546. https://arxiv.org/abs/2605.06546
Chenze Shao, Fandong Meng, Jie Zhou. Patch-Level Training for Large Language Models. arXiv:2407.12665. https://arxiv.org/abs/2407.12665
Makoto Shing, Masanori Koyama, Takuya Akiba. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202. https://arxiv.org/abs/2506.14202
Hugging Face Transformers custom-model documentation: https://huggingface.co/docs/transformers/custom_models
vLLM custom/Transformers backend documentation: https://docs.vllm.ai/en/latest/models/supported_models/
SGLang Transformers backend documentation: https://huggingface.co/docs/transformers/en/community_integrations/sglang
