OdinNext-138M-Early-Checkpoint
Early research checkpoint of OdinNext, a 138M-parameter causal language model using an HGRN2-style gated linear recurrence instead of softmax self-attention.
This is not a chat model and not a production release. It is an early pretraining checkpoint intended for architecture inspection, qualitative sampling, and continued research.
- Repo: joelhenwang/OdinNext-138M-Early-Checkpoint
- Recommended revision: main / EMA-shadowed weights
- Training status: early checkpoint at step 3,259
- Context window: 2,048 tokens in the released inference code
- License: Apache-2.0
The model uses custom Transformers code. Loading it with trust_remote_code=True executes Python code from this repository. Only do this after reviewing the files or pinning a known commit.
At a glance
| Item | Value |
|---|---|
| Unique tied parameters | 138,449,696 |
| Non-embedding parameters | 113,283,872 |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,768 custom BPE tokens |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + RMSNorm-style normalization |
| Cache type | Fixed recurrent state, not a growing Transformer KV cache |
What this checkpoint is good for
Use this checkpoint for:
- inspecting a compact recurrent/linear-attention LM implementation;
- testing HGRN2-style recurrent decoding inside the Hugging Face generate() API;
- studying fixed-state decoding memory behavior;
- continuing pretraining or running controlled ablations.
Do not use it for:
- chat, instruction following, or agentic tasks;
- safety-sensitive output generation;
- benchmark claims without running your own evaluation;
- multilingual, coding, or long-context claims.
Architecture
OdinNext is a decoder-only causal LM. Each block uses a pre-norm residual layout:
x = x + sigmoid(gate_attn) * HGRN2(norm(x))
x = x + sigmoid(gate_ffn) * SwiGLU²(norm(x))
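The gated pre-norm residual layout above can be sketched in plain Python. This is a toy illustration, not the repository's code: the scalar gates, the weight-free rms_norm helper, and the placeholder sublayers are all assumptions made for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rms_norm(x, eps=1e-6):
    # RMSNorm-style scaling without a learned weight (assumption for brevity).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def gated_residual(x, gate, sublayer):
    # x + sigmoid(gate) * sublayer(norm(x)); the gate is a scalar here (assumption).
    update = sublayer(rms_norm(x))
    scale = sigmoid(gate)
    return [xi + scale * ui for xi, ui in zip(x, update)]

def block(x, gate_attn, gate_ffn, mixer, ffn):
    x = gated_residual(x, gate_attn, mixer)  # stands in for the HGRN2 branch
    x = gated_residual(x, gate_ffn, ffn)     # stands in for the SwiGLU² branch
    return x

# Placeholder sublayers: an identity mixer and a doubling "FFN".
x = [1.0, -2.0, 3.0]
y = block(
    x,
    gate_attn=0.0,    # sigmoid(0) = 0.5, so the branch is half open
    gate_ffn=-20.0,   # sigmoid(-20) is about 2e-9, so the branch is nearly closed
    mixer=lambda h: h,
    ffn=lambda h: [2 * v for v in h],
)
print(y)
```

A strongly negative gate leaves the residual stream almost untouched, which is one way such gates can let a block start close to the identity.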
The HGRN2-style recurrent state is updated per token as:
S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t
where each layer keeps a per-batch recurrent state shaped:
[B, n_heads, head_f_dim, head_i_dim]
For this checkpoint:
n_heads = 6
head_f_dim = 128
head_i_dim = 128
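As a rough illustration of the update rule above, the following pure-Python sketch runs the recurrence for a single head on toy dimensions. The function name recurrent_step and the toy inputs are hypothetical; the released code works on batched tensors and may use fused kernels.

```python
import math

def recurrent_step(state, g, k, v, q):
    """One token update for a single head.

    state: F x I nested list (S_{t-1}); g, k, q: length-F lists; v: length-I list.
    Returns (new_state, output), where output has length I.
    """
    F, I = len(state), len(state[0])
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
    new_state = [
        [math.exp(g[f]) * state[f][i] + k[f] * v[i] for i in range(I)]
        for f in range(F)
    ]
    # o_t = q_t S_t
    output = [sum(q[f] * new_state[f][i] for f in range(F)) for i in range(I)]
    return new_state, output

# Toy sizes; the checkpoint uses 128 x 128 per head.
F, I = 2, 3
state = [[0.0] * I for _ in range(F)]
tokens = [
    # (g, k, v, q)
    ([-0.1, -0.5], [1.0, 0.0], [1.0, 2.0, 3.0], [1.0, 1.0]),
    ([-0.1, -0.5], [0.0, 1.0], [0.5, 0.5, 0.5], [1.0, 1.0]),
]
for g, k, v, q in tokens:
    state, out = recurrent_step(state, g, k, v, q)

# The state keeps its F x I shape no matter how many tokens were processed.
print(len(state), len(state[0]))
```

Because g_t is a log-decay, exp(g_t) lies in (0, 1] whenever g_t is non-positive, so older contributions fade rather than accumulate without bound.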
Even-numbered layers apply RoPE to q and k; odd-numbered layers are position-free. The current inference implementation still enforces a hard 2,048-token cumulative position limit because the RoPE cache is built for max_seq_len = 2048.
Important implementation details
- The exported Hugging Face code contains only the inference path. Training-time machinery is not part of this repository.
- past_key_values is an OdinNextCache, a list of recurrent states. It is not a Transformer KV cache.
- attention_mask is accepted for API compatibility but ignored by the backbone. Left-padding is not supported.
- Batched generation is safest when all prompts have the same valid length. Padding tokens are still tokens to the recurrence if they are processed.
- use_cache=True is important for generation. Without it, every generation step reprocesses the full prefix.
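To make the cost of disabling the cache concrete, here is a back-of-the-envelope count of token positions pushed through the model during generation. The helper tokens_processed is hypothetical, and it assumes the usual pattern of one prefill pass followed by one token per step.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed to the model while generating new_tokens."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # Prefill the prompt once, then feed one token per later step.
        return prompt_len + (new_tokens - 1)
    # Each step re-reads the prompt plus everything generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

cached = tokens_processed(100, 80, use_cache=True)
uncached = tokens_processed(100, 80, use_cache=False)
print(cached, uncached)  # 179 11160
```

The uncached count grows quadratically in the number of generated tokens, while the cached count grows linearly.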
Parameter accounting
The 138M headline is the unique tied-parameter runtime count. The input embedding and LM head are tied and should be counted once for model-capacity comparisons.
Hugging Face or file-size-derived parameter summaries may round this checkpoint near 0.2B because stored checkpoint tensors and tied runtime parameters are not always counted the same way.
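The figures in the "At a glance" table can be cross-checked with a few lines of arithmetic. The double-counted total below is only a hypothetical illustration of how a summary that counts the tied matrix twice could round to 0.2B; it is not a statement about how any particular tool counts.

```python
VOCAB_SIZE = 32_768
HIDDEN_SIZE = 768
UNIQUE_TIED_PARAMS = 138_449_696
NON_EMBEDDING_PARAMS = 113_283_872

# The tied embedding / LM-head matrix, counted once.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
assert UNIQUE_TIED_PARAMS - embedding_params == NON_EMBEDDING_PARAMS

# Hypothetical: a counter that treats the embedding and LM head as separate tensors.
double_counted = UNIQUE_TIED_PARAMS + embedding_params

print(embedding_params)                  # 25165824
print(double_counted)                    # 163615520
print(round(double_counted / 1e9, 1))    # 0.2
```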
Memory: recurrent state vs Transformer KV cache
For batch size 1 in fp16, OdinNext's recurrent state size is:
layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2
= 3,145,728 bytes ≈ 3.0 MiB
That state is constant with respect to generated context length. It scales linearly with batch size and with dtype size. In the pure-PyTorch fallback path, the scan state is promoted to fp32, so the returned recurrent state can be about 6.0 MiB per sequence instead of 3.0 MiB.
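The scaling with batch size and dtype can be checked directly. The helper recurrent_state_bytes is a hypothetical name, and its defaults are the dimensions listed for this checkpoint.

```python
MIB = 1024 ** 2

def recurrent_state_bytes(batch_size=1, bytes_per_elem=2, layers=16,
                          heads=6, head_f_dim=128, head_i_dim=128):
    """Size of the full recurrent state across all layers, in bytes."""
    return batch_size * layers * heads * head_f_dim * head_i_dim * bytes_per_elem

fp16 = recurrent_state_bytes()                   # fp16, batch size 1
fp32 = recurrent_state_bytes(bytes_per_elem=4)   # fp32 fallback path
batch8 = recurrent_state_bytes(batch_size=8)     # fp16, batch size 8

print(fp16 / MIB, fp32 / MIB, batch8 / MIB)  # 3.0 6.0 24.0
```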
A same-depth 16-layer, d_model = 768, fp16 Transformer with full multi-head K/V cache would use approximately:
layers × 2(K,V) × hidden_size × context_tokens × bytes
= 16 × 2 × 768 × T × 2
| Context tokens | Typical Transformer KV cache | OdinNext recurrent state |
|---|---|---|
| 1,024 | 48 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 4,096 | 192 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 16,384 | 768 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
| 65,536 | 3,072 MiB | ~3 MiB fp16 / ~6 MiB fp32 fallback |
This table is a cache-state comparison only. It is not a claim about total GPU memory, throughput, benchmark quality, or usable context length. The released OdinNext code is still limited to 2,048 cumulative positions.
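The table values follow from the two formulas above. The sketch below recomputes them and also finds the context length at which the hypothetical Transformer's KV cache would equal the fp16 recurrent state; kv_cache_bytes is an illustrative helper, not part of the released code.

```python
MIB = 1024 ** 2

def kv_cache_bytes(context_tokens, layers=16, hidden_size=768, bytes_per_elem=2):
    """Full multi-head K/V cache for the comparison Transformer, batch size 1."""
    return layers * 2 * hidden_size * context_tokens * bytes_per_elem

# fp16 recurrent state: layers x heads x head_f_dim x head_i_dim x bytes
STATE_BYTES = 16 * 6 * 128 * 128 * 2

rows = []
for T in (1024, 4096, 16384, 65536):
    kv = kv_cache_bytes(T)
    rows.append((T, kv // MIB, kv / STATE_BYTES))
    print(f"{T:>6,} tokens: KV {kv // MIB:>5,} MiB, "
          f"{kv / STATE_BYTES:,.0f}x the recurrent state")

# Context length at which the KV cache matches the fp16 recurrent state.
break_even_tokens = STATE_BYTES // kv_cache_bytes(1)
print(break_even_tokens)  # 64
```

Under these assumptions the fixed state is larger than the KV cache only for contexts shorter than 64 tokens.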
Training snapshot
Values verified from the public config:
| Field | Value |
|---|---|
| _training_step | 3,259 |
| _total_tokens | 6,835,666,944 |
| _weights_source | ema_state_dict |
| torch_dtype | float16 |
| max_position_embeddings | 2,048 |
Author-reported training notes for this early checkpoint:
| Item | Value |
|---|---|
| Hardware | 2× AMD Strix Halo / gfx1151, ROCm stack |
| Training precision | fp16 + GradScaler |
| Optimizers | NorMuon for 2D tensors; AdamW for 1D/embed tensors |
| LR schedule | WSD, peak 8e-4, warmup 500, min LR 0.1× peak |
| Stabilization | z-loss 1e-4, attention soft-cap 50, EMA decay 0.999 |
| Curriculum | TST-style bag-size-4 phase active at this checkpoint |
| Public benchmarks | not yet provided |
Token accounting note
The public config records _total_tokens = 6,835,666,944. Do not reinterpret that as plain next-token positions from:
3,259 optimizer steps × 256 effective sequences × 2,048 tokens
= 1,708,654,592 position tokens
The 6.84B figure appears to be token-superposition/original-token-equivalent accounting rather than simple next-token position accounting. A full reproducibility report should define whether the total counts original text tokens, bagged targets, loss terms, or optimizer-position tokens.
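The gap between the two accountings can be quantified from the numbers quoted above. The ratio comes out close to 4, which is consistent with, but does not confirm, a bag-size-4 reading of _total_tokens.

```python
TOTAL_TOKENS = 6_835_666_944   # _total_tokens from the public config
STEPS = 3_259                  # optimizer steps
EFFECTIVE_SEQUENCES = 256      # effective sequences per step, as quoted above
SEQ_LEN = 2_048                # tokens per sequence

position_tokens = STEPS * EFFECTIVE_SEQUENCES * SEQ_LEN
ratio = TOTAL_TOKENS / position_tokens

print(f"{position_tokens:,} position tokens")
print(f"config total is {ratio:.3f}x that count")
```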
TST note
The cited Token-Superposition Training paper defines TST as a two-phase method: a superposition phase that combines contiguous tokens into bags and uses a multi-hot cross-entropy objective, followed by a recovery phase that returns to ordinary next-token training.
This checkpoint is described as still being in a bag-size-4 phase. That means ordinary single-stream autoregressive inference is not necessarily the final intended training distribution. Treat quality as preliminary until a bag-size-1 recovery checkpoint and benchmark results are published.
Usage with Transformers
Install the basics:
pip install "transformers>=4.46" torch safetensors
Optional: install flash-linear-attention if your platform supports it. Without it, the model falls back to a pure-PyTorch reference implementation that is useful for correctness and portability but slower for long prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Early-Checkpoint"
# For reproducible experiments, replace "main" with a specific commit hash.
revision = "main"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision=revision,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device).eval()

prompt = "The night was quiet and the streets were empty"
inputs = tok(prompt, return_tensors="pt").to(device)

# The released code is capped at 2,048 cumulative positions.
remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
max_new_tokens = max(0, min(80, remaining))

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id,
        use_cache=True,
    )

print(tok.decode(out[0], skip_special_tokens=True))
Batching guidance
The model's recurrent scan does not apply an attention mask. For correct batched generation:
- avoid left padding;
- prefer same-length prompts in a batch;
- avoid processing pad tokens as if they were real prompt tokens;
- test batched output against single-sample output before relying on batched generation.
Single-prompt generation is the safest path for basic use.
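If batching is still needed, one way to follow the guidance above is to group prompts by exact token length so that no batch needs padding. This is a minimal sketch: the helper names are hypothetical, and the token-ID lists stand in for real tokenizer output.

```python
from collections import defaultdict

def group_by_length(token_id_lists):
    """Map prompt length -> indices of prompts with exactly that many tokens."""
    groups = defaultdict(list)
    for index, ids in enumerate(token_id_lists):
        groups[len(ids)].append(index)
    return dict(groups)

def same_length_batches(token_id_lists, max_batch_size=8):
    """Split prompts into batches whose members all share one length."""
    batches = []
    for _, indices in sorted(group_by_length(token_id_lists).items()):
        for start in range(0, len(indices), max_batch_size):
            batches.append(indices[start:start + max_batch_size])
    return batches

# Stand-in token IDs; in practice these come from the tokenizer.
prompts = [[5, 6, 7], [1, 2], [8, 9, 10], [3, 4], [11]]
batches = same_length_batches(prompts, max_batch_size=2)
print(batches)  # [[4], [1, 3], [0, 2]]
```

Each batch can then be stacked into a tensor without pad tokens. Comparing its output with single-prompt output remains a worthwhile check.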
Known limitations
- No instruction tuning: no SFT, DPO, RLHF, RLAIF, or chat template.
- No safety training: outputs can be unsafe, biased, false, or incoherent.
- Early quality: this is about 3% of the planned pretraining budget according to the original release notes.
- No formal benchmarks yet: HellaSwag, ARC, MMLU, perplexity suites, and long-context tests are not provided here.
- Hard 2,048-token cap: recurrent cache size is constant, but the released RoPE cache still limits positions.
- Masking caveat: attention_mask is ignored in the backbone; padding can affect recurrent state.
- English-focused: multilingual and code generation should be assumed weak unless tested.
- bf16 unvalidated: fp16 is the intended inference dtype for this checkpoint; CPU fallback should use fp32 for portability.
- Training data not fully documented in this card: treat data provenance, memorization risk, and bias profile as uncharacterized unless separately documented.
Revisions
- main: EMA-shadowed weights from _weights_source = ema_state_dict; recommended for evaluation.
- live: raw training weights at step 3,259, if this branch is retained.
For reproducible experiments, pin a commit hash rather than a moving branch name.
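One possible way to turn a branch name into a pinned commit is to ask the Hub which commit the branch currently points to. The sketch below assumes the huggingface_hub package is installed and uses its HfApi.model_info call; the helpers is_commit_hash, resolve_commit, and pinned_revision are hypothetical names.

```python
import re

_COMMIT_RE = re.compile(r"[0-9a-f]{40}")

def is_commit_hash(revision):
    """True if revision looks like a full 40-character hex commit hash."""
    return bool(_COMMIT_RE.fullmatch(revision))

def resolve_commit(repo_id, revision="main"):
    """Look up the commit a branch currently points to (needs network access)."""
    from huggingface_hub import HfApi  # imported lazily; assumed to be installed
    return HfApi().model_info(repo_id, revision=revision).sha

def pinned_revision(repo_id, revision="main"):
    """Return revision unchanged if it is already a commit hash, else resolve it."""
    if is_commit_hash(revision):
        return revision
    return resolve_commit(repo_id, revision)

print(is_commit_hash("main"))  # False
```

The resolved hash can be recorded once and then passed as revision= to from_pretrained, so later pushes to the branch do not change what is loaded.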
Citation
@misc{odinnext_138m_early_2026,
title = {OdinNext-138M-Early-Checkpoint},
author = {Wang, Joel},
year = {2026},
howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Early-Checkpoint}},
note = {Early HGRN2 recurrent language-model checkpoint}
}
References
- Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904. https://arxiv.org/abs/2404.07904
- Bowen Peng, Théo Gigant, Jeffrey Quesnelle. Efficient Pre-Training with Token Superposition. arXiv:2605.06546. https://arxiv.org/abs/2605.06546
- Chenze Shao, Fandong Meng, Jie Zhou. Patch-Level Training for Large Language Models. arXiv:2407.12665. https://arxiv.org/abs/2407.12665
- Makoto Shing, Masanori Koyama, Takuya Akiba. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202. https://arxiv.org/abs/2506.14202
- Hugging Face Transformers custom-model documentation: https://huggingface.co/docs/transformers/custom_models
- vLLM custom/Transformers backend documentation: https://docs.vllm.ai/en/latest/models/supported_models/
- SGLang Transformers backend documentation: https://huggingface.co/docs/transformers/en/community_integrations/sglang