joelhenwang committed on Jun 6

Commit 0ef192a (verified)

1 Parent(s): cb28e4d

OdinNext-138M-Base: EMA weights (101.6B-token dolmino base)

Files changed (10)

README.md +203 -0
_hgrn2_fallback.py +101 -0
config.json +32 -0
configuration_odinnext.py +120 -0
generation_config.json +11 -0
model.safetensors +3 -0
modeling_odinnext.py +617 -0
special_tokens_map.json +5 -0
tokenizer.json +0 -0
tokenizer_config.json +8 -0

README.md ADDED

	@@ -0,0 +1,203 @@

+---
+license: apache-2.0
+language:
+  - en
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - odinnext
+  - hgrn2
+  - linear-attention
+  - recurrent
+  - causal-lm
+  - custom_code
+  - base-model
+  - fp16
+  - amd
+  - rocm
+  - arxiv:2404.07904
+  - arxiv:2605.06546
+  - arxiv:2407.12665
+  - arxiv:2506.14202
+---
+# OdinNext-138M-Base
+**OdinNext** is a 138M-parameter causal language model that replaces softmax
+self-attention with an **HGRN2-style gated linear recurrence**. This repository
+is the **base pretrained model** — trained from scratch on ~101.6B tokens of
+curated data (the Dolmino mix) on two AMD Strix Halo (gfx1151) machines.
+This is a **base model**: it completes and continues text. It is **not** an
+instruction-tuned or chat model — no SFT, DPO, RLHF, or chat template. Those
+stages are in progress and will ship as a separate `*-Instruct` repository.
+- **Repo:** `joelhenwang/OdinNext-138M-Base`
+- **`main`:** EMA-shadowed weights (decay 0.999), recommended.
+- **`live`:** raw training weights at the same step.
+- **Context window:** 2,048 tokens in the released inference code.
+- **License:** Apache-2.0.
+> Uses custom Transformers code. Loading with `trust_remote_code=True` executes
+> Python from this repo. Review the files or pin a commit before trusting it.
+## At a glance
+| Item | Value |
+|---|---:|
+| Unique tied parameters | **138,449,696** |
+| Non-embedding parameters | **113,283,872** |
+| Layers | 16 |
+| Hidden size | 768 |
+| Heads | 6 |
+| Head state dims | 128 × 128 per head |
+| FFN inner size | 2,048 |
+| Vocabulary | 32,768 custom BPE tokens |
+| Max sequence length | 2,048 |
+| Checkpoint dtype | fp16 |
+| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm |
+| Cache type | Fixed-size recurrent state, not a growing KV cache |
+## Architecture
+Decoder-only causal LM, 16 identical pre-norm blocks:
+```text
+x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
+x = x + sigmoid(gate_ffn)  * SwiGLU²(ZCRMSNorm(x))
+```
+The HGRN2 recurrent state updates per token as:
+```text
+S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
+o_t = q_t S_t
+```
+with a per-layer state shaped `[B, n_heads, head_f_dim, head_i_dim]` =
+`[B, 6, 128, 128]`. This state is **constant in size with respect to context
+length**, giving O(1)-per-token decoding rather than a growing KV cache.
+**Hybrid RoPE:** even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000);
+odd layers are position-free. Tied embedding / LM head. No linear biases.
+## Memory: recurrent state vs Transformer KV cache
+For batch size 1 in fp16 the recurrent state is constant:
+```text
+layers × heads × head_f_dim × head_i_dim × bytes
+= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB
+```
+independent of generated length (the pure-PyTorch fallback promotes the scan
+state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache would grow
+linearly (≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K). This is a cache-state
+comparison only, not a claim about total memory or usable context.
+## Training snapshot
+| Field | Value |
+|---|---|
+| Data | Dolmino mix (~101.6B tokens, odin-32k tokenizer) |
+| Hardware | 2× AMD Strix Halo / gfx1151, ROCm 7.13 |
+| Interconnect | Thunderbolt 4, DDP over gloo |
+| Precision | fp16 + GradScaler |
+| Optimizers | NorMuon (2D tensors) + AdamW (1D / embeddings) |
+| LR | peak 8e-4, warmup, cosine decay |
+| Stabilization | z-loss 1e-4, attention soft-cap 50, EMA decay 0.999 |
+| Curriculum | Phase 1: Token-Superposition Training (bag-size 4) + DiffusionBlocks (block-wise) for ~24K steps; Phase 2: standard end-to-end autoregressive recovery |
+| Released weights | `main` = `ema_state_dict`; `live` = raw online weights |
+The two-phase curriculum trains most of the budget under a block-wise
+DiffusionBlocks + token-superposition objective for throughput, then recovers
+ordinary left-to-right generation with a standard end-to-end phase. The
+released weights are from the end-to-end recovery phase and produce coherent
+continuations.
+## What this model is good for
+- Text continuation and completion in English.
+- Research on compact recurrent / linear-attention LMs and fixed-state decoding.
+- A base for instruction tuning, alignment, and context extension.
+Do **not** use it for chat / instruction following (not tuned yet), safety-
+sensitive generation, or benchmark claims without running your own evaluation.
+## Usage
+```bash
+pip install "transformers>=4.46" torch safetensors
+```
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+repo = "joelhenwang/OdinNext-138M-Base"
+revision = "main"  # EMA weights; pin a commit for reproducibility
+device = "cuda" if torch.cuda.is_available() else "cpu"
+dtype = torch.float16 if device == "cuda" else torch.float32
+tok = AutoTokenizer.from_pretrained(repo, revision=revision)
+model = AutoModelForCausalLM.from_pretrained(
+    repo, revision=revision, trust_remote_code=True, torch_dtype=dtype,
+).to(device).eval()
+prompt = "The discovery of penicillin"
+inputs = tok(prompt, return_tensors="pt").to(device)
+remaining = model.config.max_position_embeddings - inputs.input_ids.shape[1]
+with torch.inference_mode():
+    out = model.generate(
+        **inputs,
+        max_new_tokens=max(0, min(100, remaining)),
+        do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.1,
+        pad_token_id=tok.pad_token_id, use_cache=True,
+    )
+print(tok.decode(out[0], skip_special_tokens=True))
+```
+### Batching guidance
+The recurrent scan does not apply an attention mask. For correct batched
+generation: avoid left padding, prefer same-length prompts, and verify batched
+output against single-sample output before relying on it. Single-prompt
+generation is the safest path.
+## Limitations
+- **Base model only:** no instruction tuning, alignment, or chat template.
+- **No safety training:** outputs can be biased, false, or incoherent.
+- **Hard 2,048-token cap:** recurrent state is constant, but the released RoPE
+  cache limits cumulative positions to 2,048.
+- **`attention_mask` ignored** in the backbone; padding affects recurrent state.
+- **English-focused;** multilingual / code ability is uncharacterized.
+- **Formal benchmarks not published in this card yet.** Treat quality as
+  preliminary and run your own evaluation.
+## Revisions
+- `main`: EMA-shadowed weights (decay 0.999), recommended for evaluation.
+- `live`: raw training weights at the same step.
+Pin a commit hash rather than a moving branch for reproducible experiments.
+## Citation
+```bibtex
+@misc{odinnext_138m_base_2026,
+  title        = {OdinNext-138M-Base},
+  author       = {Wang, Joel},
+  year         = {2026},
+  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Base}},
+  note         = {138M HGRN2 recurrent language-model base checkpoint}
+}
+```
+## References
+- Zhen Qin et al. **HGRN2: Gated Linear RNNs with State Expansion.** arXiv:2404.07904.
+- Bowen Peng et al. **Efficient Pre-Training with Token Superposition.** arXiv:2605.06546.
+- Chenze Shao et al. **Patch-Level Training for Large Language Models.** arXiv:2407.12665.
+- Makoto Shing et al. **DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation.** arXiv:2506.14202.
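
The cache-size figures in the README's memory section can be reproduced with a few lines of arithmetic. The sketch below is a check, not part of the released code: the constants are copied from the card, and the "same-depth Transformer" is a hypothetical baseline that stores one key and one value vector of width `d_model` per layer per token (full multi-head attention, no grouped-query sharing).

```python
# Cache-size arithmetic from the README, using the values listed in the card.
N_LAYERS = 16
N_HEADS = 6
HEAD_F_DIM = 128   # K: per-head key/forget dimension
HEAD_I_DIM = 128   # V: per-head value dimension
D_MODEL = 768
FP16_BYTES = 2
MIB = 1024 * 1024


def recurrent_state_bytes(batch: int = 1, bytes_per_elem: int = FP16_BYTES) -> int:
    """Total HGRN2 state across all layers; independent of sequence length."""
    return batch * N_LAYERS * N_HEADS * HEAD_F_DIM * HEAD_I_DIM * bytes_per_elem


def kv_cache_bytes(seq_len: int, batch: int = 1, bytes_per_elem: int = FP16_BYTES) -> int:
    """Hypothetical same-depth Transformer baseline (assumption, see lead-in):
    one K and one V vector of width D_MODEL per layer per token."""
    return batch * N_LAYERS * 2 * D_MODEL * seq_len * bytes_per_elem


if __name__ == "__main__":
    print(f"recurrent state (fp16): {recurrent_state_bytes() / MIB:.1f} MiB")
    print(f"recurrent state (fp32): {recurrent_state_bytes(bytes_per_elem=4) / MIB:.1f} MiB")
    for tokens in (1024, 16384):
        print(f"KV cache at {tokens} tokens: {kv_cache_bytes(tokens) / MIB:.0f} MiB")
```

As the card itself notes, this compares cache state only; weights and activations are not included.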

_hgrn2_fallback.py ADDED

	@@ -0,0 +1,101 @@

+# coding=utf-8
+# Copyright 2026 The OdinNext authors.
+# Licensed under the Apache License, Version 2.0.
+"""Pure-PyTorch HGRN2 recurrence — slow fallback when flash-linear-attention
+(`fla`) is unavailable.
+The `fla` library provides Triton/CUDA kernels for `chunk_gla` (chunk-wise
+parallel scan over T) and `fused_recurrent_gla` (token-by-token serial scan).
+On platforms without those kernels (CPU, non-CUDA/non-ROCm GPUs) we provide
+a reference implementation here.
+Speed: ~10-30x slower than `fla` at training shapes; comparable for
+single-token decode (since both are serial). Numerical match: bitwise on
+fp32, within fp16 noise on fp16.
+The recurrence (per head):
+    S_t = diag(exp(g_t)) @ S_{t-1} + k_t.unsqueeze(-1) @ v_t.unsqueeze(-2)
+    o_t = q_t @ S_t
+Shapes (matching `fla.ops.gla.chunk_gla`):
+    q: [B, T, H, K]   (K = head_f_dim, e.g. 128)
+    k: [B, T, H, K]
+    g: [B, T, H, K]   (already in log-space, expected to be <= 0)
+    v: [B, T, H, V]   (V = head_i_dim, e.g. 128)
+    -> o: [B, T, H, V]
+       final_state: [B, H, K, V] if output_final_state else None
+"""
+from typing import Optional, Tuple
+import torch
+def chunk_gla(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    g: torch.Tensor,
+    initial_state: Optional[torch.Tensor] = None,
+    output_final_state: bool = False,
+    **_unused,
+) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+    """Pure-PyTorch chunk_gla replacement.
+    Implements a serial (token-by-token) scan. We promote internals to fp32
+    to keep the cumulative product of decays numerically sane over long T.
+    """
+    B, T, H, K = q.shape
+    V = v.shape[-1]
+    device = q.device
+    in_dtype = q.dtype
+    # Promote scan internals to fp32 for stability (matches fla behavior).
+    q32 = q.float()
+    k32 = k.float()
+    v32 = v.float()
+    g32 = g.float()
+    if initial_state is None:
+        S = torch.zeros(B, H, K, V, device=device, dtype=torch.float32)
+    else:
+        S = initial_state.to(dtype=torch.float32)
+    out = torch.empty(B, T, H, V, device=device, dtype=torch.float32)
+    # Serial scan. exp(g_t) decays state element-wise along K.
+    # k_t outer v_t -> [B, H, K, V] additive update.
+    for t in range(T):
+        decay = g32[:, t].exp().unsqueeze(-1)              # [B, H, K, 1]
+        S = decay * S + k32[:, t].unsqueeze(-1) * v32[:, t].unsqueeze(-2)
+        # o_t = q_t (1xK) @ S (KxV) per head
+        out[:, t] = (q32[:, t].unsqueeze(-2) @ S).squeeze(-2)  # [B, H, V]
+    out = out.to(in_dtype)
+    if output_final_state:
+        return out, S
+    return out, None
+def fused_recurrent_gla(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    gk: torch.Tensor,
+    initial_state: Optional[torch.Tensor] = None,
+    output_final_state: bool = True,
+    **_unused,
+) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+    """Pure-PyTorch single-token (or short-T) recurrence.
+    `fla.ops.gla.fused_recurrent_gla` is what OdinNext.generate uses for
+    O(1) per-token decode. The signature matches: `gk` = log-decay (instead
+    of `g`). We reuse `chunk_gla` internals — they are mathematically the
+    same scan, just packaged with different defaults for kernel selection
+    in fla.
+    """
+    return chunk_gla(
+        q=q, k=k, v=v, g=gk,
+        initial_state=initial_state,
+        output_final_state=output_final_state,
+    )
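
The reason `initial_state` and `output_final_state` are enough for cached decoding is that the scan is a pure function of the previous state: running it over a prefix, then continuing from the returned state, gives the same outputs as one pass over the full sequence. The sketch below illustrates this for a single head with plain Python lists. `gla_scan` and the toy inputs are hypothetical names and values for illustration, not part of the repository or of `fla`.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def gla_scan(
    q: Matrix, k: Matrix, v: Matrix, g: Matrix, state: Optional[Matrix] = None
) -> Tuple[Matrix, Matrix]:
    """Single-head serial scan over T steps.

    S_t = diag(exp(g_t)) S_{t-1} + outer(k_t, v_t);  o_t = q_t S_t
    q, k, g are [T][K]; v is [T][V]; state is [K][V].
    """
    K, V = len(k[0]), len(v[0])
    S = [row[:] for row in state] if state is not None else [[0.0] * V for _ in range(K)]
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        for i in range(K):
            decay = math.exp(g_t[i])  # g is a log-decay, so exp(g) is in (0, 1]
            for j in range(V):
                S[i][j] = decay * S[i][j] + k_t[i] * v_t[j]
        outputs.append([sum(q_t[i] * S[i][j] for i in range(K)) for j in range(V)])
    return outputs, S


# Toy inputs: T = 4 steps, K = 2, V = 3.
Q = [[0.5, -1.0], [1.0, 0.25], [-0.5, 0.75], [0.2, 0.4]]
KEYS = [[0.3, 0.7], [0.9, 0.1], [0.5, 0.5], [0.6, 0.2]]
VALS = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0], [2.0, 0.0, 1.0], [-1.0, 1.0, 0.5]]
G = [[-0.1, -0.5], [-0.2, -0.3], [-0.05, -1.0], [-0.4, -0.1]]

# One pass over the whole sequence.
full_out, full_state = gla_scan(Q, KEYS, VALS, G)

# "Prefill" on the first three tokens, then decode the fourth from the cached state.
prefix_out, cached = gla_scan(Q[:3], KEYS[:3], VALS[:3], G[:3])
step_out, step_state = gla_scan(Q[3:], KEYS[3:], VALS[3:], G[3:], state=cached)

if __name__ == "__main__":
    print("full  last output:", full_out[-1])
    print("split last output:", step_out[-1])
```

The cached state here is a fixed `K x V` matrix no matter how long the prefix was, which is the property the README relies on when it describes a constant-size cache.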

config.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "model_type": "odinnext",
+  "architectures": [
+    "OdinNextForCausalLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_odinnext.OdinNextConfig",
+    "AutoModelForCausalLM": "modeling_odinnext.OdinNextForCausalLM"
+  },
+  "vocab_size": 32768,
+  "d_model": 768,
+  "n_layers": 16,
+  "n_heads": 6,
+  "ffn_inner": 2048,
+  "max_seq_len": 2048,
+  "rope_theta": 100000.0,
+  "tie_embeddings": true,
+  "tie_word_embeddings": true,
+  "use_cache": true,
+  "torch_dtype": "float16",
+  "bos_token_id": 0,
+  "eos_token_id": 0,
+  "pad_token_id": 1,
+  "hidden_size": 768,
+  "num_hidden_layers": 16,
+  "num_attention_heads": 6,
+  "intermediate_size": 2048,
+  "max_position_embeddings": 2048,
+  "_training_step": 5000,
+  "_total_tokens": 5243928576,
+  "_weights_source": "ema_state_dict"
+}
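
The parameter counts quoted in the README can be cross-checked against these config values. The sketch below assumes the per-block layout visible in `modeling_odinnext.py` (four bias-free attention projections, a fused gate/up projection, a down projection, three norms, two scalar gates) and assumes the LM head shares the embedding matrix, as `tie_embeddings` indicates. `count_parameters` is a hypothetical helper, not part of the repository.

```python
import json

# Architecture fields copied from the config.json hunk above.
CONFIG = json.loads(
    '{"vocab_size": 32768, "d_model": 768, "n_layers": 16,'
    ' "n_heads": 6, "ffn_inner": 2048}'
)


def count_parameters(cfg: dict) -> dict:
    """Count weights by module, following the block layout in modeling_odinnext.py."""
    d, ffn = cfg["d_model"], cfg["ffn_inner"]
    forget_dim = cfg["n_heads"] * (d // cfg["n_heads"])  # default expand_ratio

    attention = 2 * d * forget_dim + 2 * d * d  # q_proj, f_proj, then i_proj, o_proj
    feed_forward = d * (2 * ffn) + ffn * d      # w_gate_up, w_down
    norms = 3 * d                               # pre_norm, ffn_norm, g_norm
    gates = 2                                   # gate_attn, gate_ffn scalars
    per_block = attention + feed_forward + norms + gates

    non_embedding = cfg["n_layers"] * per_block + d  # plus final_norm
    embedding = cfg["vocab_size"] * d                # assumed shared with the LM head
    return {
        "non_embedding": non_embedding,
        "embedding": embedding,
        "total": non_embedding + embedding,
    }


if __name__ == "__main__":
    for name, value in count_parameters(CONFIG).items():
        print(f"{name}: {value:,}")
```

Under those assumptions the totals agree with the "At a glance" table in the README.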

configuration_odinnext.py ADDED

	@@ -0,0 +1,120 @@

+# coding=utf-8
+# Copyright 2026 The OdinNext authors.
+# Licensed under the Apache License, Version 2.0.
+"""OdinNext model configuration."""
+from transformers import PretrainedConfig
+class OdinNextConfig(PretrainedConfig):
+    r"""Configuration class for [`OdinNextForCausalLM`].
+    OdinNext is a 138M-parameter HGRN2+RoPE hybrid causal language model.
+    The architecture interleaves two layer types:
+      * Even layers (0, 2, 4, ..., 14): HGRN2 gated linear recurrence with
+        rotary position embeddings (RoPE) on q/k.
+      * Odd layers (1, 3, 5, ..., 15): the same HGRN2 recurrence WITHOUT
+        positional encoding (position-free, generalizes to any length).
+    HGRN2 gives O(T) training and O(1) per-token inference: the per-layer
+    recurrent state has a fixed size independent of context length.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 32768):
+            Vocabulary size of the OdinNext model.
+        d_model (`int`, *optional*, defaults to 768):
+            Hidden size of the residual stream.
+        n_layers (`int`, *optional*, defaults to 16):
+            Number of transformer-style blocks.
+        n_heads (`int`, *optional*, defaults to 6):
+            Number of recurrence heads. Per-head expand dim is
+            `d_model // n_heads = 128` for the default configuration.
+        ffn_inner (`int`, *optional*, defaults to 2048):
+            SwiGLU2 inner dimension.
+        max_seq_len (`int`, *optional*, defaults to 2048):
+            Maximum sequence length the RoPE cache covers. Generation past
+            this position raises an error; to extend, increase `max_seq_len`
+            and re-instantiate the model.
+        rope_theta (`float`, *optional*, defaults to 100000.0):
+            RoPE base frequency. Even layers only.
+        tie_embeddings (`bool`, *optional*, defaults to `True`):
+            Tie input embedding matrix and output LM-head weight.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            Unused at inference; recorded for parity with HF conventions.
+        bos_token_id (`int`, *optional*, defaults to 0):
+            Same as eos for this tokenizer (`<|endoftext|>`).
+        eos_token_id (`int`, *optional*, defaults to 0):
+            `<|endoftext|>` token id.
+        pad_token_id (`int`, *optional*, defaults to 1):
+            `<|pad|>` token id in the odin-32k tokenizer.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether to return per-layer recurrent states from `forward()`,
+            and whether `generate()` should consume them. The "cache" here
+            is a list of fixed-size HGRN2 states, NOT a growing KV cache.
+    Example:
+    ```python
+    >>> from transformers import AutoConfig
+    >>> config = AutoConfig.from_pretrained(
+    ...     "joelhenwang/OdinNext-138M-Base",
+    ...     trust_remote_code=True,
+    ... )
+    >>> config.d_model
+    768
+    ```
+    """
+    model_type = "odinnext"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        vocab_size: int = 32768,
+        d_model: int = 768,
+        n_layers: int = 16,
+        n_heads: int = 6,
+        ffn_inner: int = 2048,
+        max_seq_len: int = 2048,
+        rope_theta: float = 100000.0,
+        tie_embeddings: bool = True,
+        initializer_range: float = 0.02,
+        bos_token_id: int = 0,
+        eos_token_id: int = 0,
+        pad_token_id: int = 1,
+        use_cache: bool = True,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.d_model = d_model
+        self.n_layers = n_layers
+        self.n_heads = n_heads
+        self.ffn_inner = ffn_inner
+        self.max_seq_len = max_seq_len
+        self.rope_theta = rope_theta
+        self.tie_embeddings = tie_embeddings
+        self.initializer_range = initializer_range
+        self.use_cache = use_cache
+        # Common HF aliases — many libraries (lm-eval-harness, vLLM compat
+        # layers, etc.) reach for these names. Provide them as direct
+        # passthroughs so external tooling has a chance of working.
+        self.hidden_size = d_model
+        self.num_hidden_layers = n_layers
+        self.num_attention_heads = n_heads
+        self.intermediate_size = ffn_inner
+        self.max_position_embeddings = max_seq_len
+        # Strip keys we are about to pass explicitly so they don't double up
+        # via **kwargs (config.json may carry duplicates).
+        kwargs.pop("tie_word_embeddings", None)
+        kwargs.pop("bos_token_id", None)
+        kwargs.pop("eos_token_id", None)
+        kwargs.pop("pad_token_id", None)
+        super().__init__(
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            pad_token_id=pad_token_id,
+            tie_word_embeddings=tie_embeddings,
+            **kwargs,
+        )

generation_config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "bos_token_id": 0,
+  "eos_token_id": 0,
+  "pad_token_id": 1,
+  "max_new_tokens": 128,
+  "do_sample": true,
+  "temperature": 0.8,
+  "top_p": 0.95,
+  "repetition_penalty": 1.1,
+  "use_cache": true
+}
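
To make the sampling defaults above concrete, here is a minimal sketch of how `repetition_penalty`, `temperature`, and `top_p` reshape a next-token distribution. It is a simplified stand-in written with the standard library, not the Transformers implementation: the function names are hypothetical, the repetition penalty follows the common divide-if-positive, multiply-if-negative rule, and the nucleus step keeps the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`.

```python
import math
from typing import Dict, Iterable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(
    logits: List[float],
    seen_ids: Iterable[int],
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.1,
) -> Dict[int, float]:
    """Return {token_id: probability} after penalty, temperature, and top-p."""
    adjusted = list(logits)
    # 1) Repetition penalty: push already-generated tokens toward lower scores.
    for i in set(seen_ids):
        x = adjusted[i]
        adjusted[i] = x * repetition_penalty if x < 0 else x / repetition_penalty
    # 2) Temperature: values below 1 sharpen the distribution.
    probs = softmax([x / temperature for x in adjusted])
    # 3) Nucleus (top-p): keep the most likely tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 4) Renormalize over the kept tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -4.0]
    print(next_token_distribution(toy_logits, seen_ids=[0]))
```

With these toy logits the least likely token falls outside the nucleus and is never sampled, while the penalty on token 0 shifts some probability toward the others.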

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bfc1fdd190627224dedcaa0f8894b7efdcb4e8c2207fd86de2a649c1e1fa7f56
+size 276917608

modeling_odinnext.py ADDED

	@@ -0,0 +1,617 @@

+# coding=utf-8
+# Copyright 2026 The OdinNext authors.
+# Licensed under the Apache License, Version 2.0.
+"""OdinNext: 138M HGRN2+RoPE hybrid causal language model.
+This is a self-contained HuggingFace `trust_remote_code=True` port of the
+production OdinNext model used to train the 6.84B-token early checkpoint.
+The training-time machinery (DiffusionBlocks, TST, gate-absorption,
+torch.compile zone helpers) is dropped — only the inference path remains.
+Architecture summary:
+  * 16 layers, d=768, 6 heads, ffn=2048, vocab=32768.
+  * Even layers (0,2,...,14) get RoPE on q/k.
+  * Odd layers (1,3,...,15) are position-free recurrent.
+  * SwiGLU2 FFN: silu(gate)^2 * up.
+  * ZCRMSNorm normalization, gated residuals (frozen at training time).
+  * Tied input/output embeddings.
+  * HGRN2 recurrence: O(T) train, O(1) per-token decode.
+Hardware notes:
+  * Uses `flash-linear-attention` (`fla`) Triton kernels when available.
+    Falls back to a pure-PyTorch implementation (~10-30x slower) otherwise,
+    so the model loads on any backend including CPU.
+  * Trained in fp16 on AMD Strix Halo (gfx1151, RDNA 3.5, ROCm 7.13).
+    fp16 is the recommended inference dtype. bf16 was never validated on
+    this checkpoint.
+"""
+from __future__ import annotations
+import math
+from typing import List, Optional, Tuple, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PreTrainedModel
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from .configuration_odinnext import OdinNextConfig
+# ---------------------------------------------------------------------------
+# HGRN2 kernel: prefer flash-linear-attention, fall back to pure PyTorch
+# ---------------------------------------------------------------------------
+try:
+    from fla.ops.gla import chunk_gla as _chunk_gla
+    from fla.ops.gla import fused_recurrent_gla as _fused_recurrent_gla
+    # `fla.ops.gla.chunk.ChunkGLAFunction` is decorated with
+    # @torch.compiler.disable. Marking it allow_in_graph lets Dynamo treat
+    # it as an opaque leaf op, preventing graph breaks if the user does
+    # `torch.compile(model)`. Best-effort, ignored if internals shift.
+    try:
+        from fla.ops.gla.chunk import ChunkGLAFunction
+        torch.compiler.allow_in_graph(ChunkGLAFunction)
+    except Exception:
+        pass
+    _HAS_FLA = True
+except Exception:  # ImportError, missing Triton, no CUDA/ROCm, ...
+    from ._hgrn2_fallback import chunk_gla as _chunk_gla
+    from ._hgrn2_fallback import fused_recurrent_gla as _fused_recurrent_gla
+    _HAS_FLA = False
+# ---------------------------------------------------------------------------
+# Building blocks
+# ---------------------------------------------------------------------------
+class ZCRMSNorm(nn.Module):
+    """Zero-Centered RMSNorm.
+    The stored weight is initialized to 1.0 and passed to F.rms_norm directly
+    as a leaf parameter. This is mathematically equivalent to a zero-centered
+    RMSNorm that scales by `1 + gamma`, with `gamma = weight - 1`.
+    """
+    def __init__(self, dim: int, eps: float = 1e-6):
+        super().__init__()
+        self.eps = eps
+        self.weight = nn.Parameter(torch.ones(dim))
+        self._normalized_shape = (dim,)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return F.rms_norm(x, self._normalized_shape, self.weight, self.eps)
+class SwiGLU2(nn.Module):
+    """SwiGLU squared FFN: silu(gate)^2 * up -> down."""
+    def __init__(self, d_model: int, ffn_inner: int):
+        super().__init__()
+        self.w_gate_up = nn.Linear(d_model, 2 * ffn_inner, bias=False)
+        self.w_down = nn.Linear(ffn_inner, d_model, bias=False)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        gate, up = self.w_gate_up(x).chunk(2, dim=-1)
+        return self.w_down(F.silu(gate).square() * up)
+def _apply_rope(
+    x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
+) -> torch.Tensor:
+    """Apply RoPE to x[B,T,H,D] using real arithmetic.
+    cos/sin: [1, T, 1, D/2] pre-broadcast.
+    """
+    x_even = x[..., 0::2]
+    x_odd = x[..., 1::2]
+    out_even = x_even * cos - x_odd * sin
+    out_odd = x_even * sin + x_odd * cos
+    return torch.stack([out_even, out_odd], dim=-1).flatten(-2)
+class OdinNextAttention(nn.Module):
+    """HGRN2 attention with optional RoPE on q/k."""
+    def __init__(
+        self,
+        d_model: int = 768,
+        n_heads: int = 6,
+        expand_ratio: Optional[int] = None,
+        use_rope: bool = True,
+    ):
+        super().__init__()
+        self.d_model = d_model
+        self.n_heads = n_heads
+        if expand_ratio is None:
+            expand_ratio = d_model // n_heads
+        self.expand_ratio = expand_ratio
+        self.head_f_dim = expand_ratio
+        self.head_i_dim = d_model // n_heads
+        self.forget_dim = n_heads * expand_ratio
+        self.use_rope = use_rope
+        self.q_proj = nn.Linear(d_model, self.forget_dim, bias=False)
+        self.f_proj = nn.Linear(d_model, self.forget_dim, bias=False)
+        self.i_proj = nn.Linear(d_model, d_model, bias=False)
+        self.g_norm = ZCRMSNorm(d_model)
+        self.o_proj = nn.Linear(d_model, d_model, bias=False)
+    def forward(
+        self,
+        x: torch.Tensor,
+        cos: Optional[torch.Tensor] = None,
+        sin: Optional[torch.Tensor] = None,
+        recurrent_state: Optional[torch.Tensor] = None,
+        output_state: bool = False,
+        use_recurrent_kernel: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """
+        Args:
+            x: [B, T, D] hidden states.
+            cos, sin: RoPE caches if `use_rope`, else ignored.
+            recurrent_state: optional [B, H, K, V] HGRN2 state to seed the scan.
+            output_state: if True, return the final HGRN2 state alongside output.
+            use_recurrent_kernel: if True (single-token decode), call the
+                fused recurrent kernel; otherwise call chunk_gla.
+        """
+        B, T, D = x.shape
+        q = F.silu(self.q_proj(x))
+        forget_logits = self.f_proj(x)
+        g = F.logsigmoid(forget_logits)
+        k = torch.sigmoid(-forget_logits)
+        v = self.i_proj(x)
+        q = q.view(B, T, self.n_heads, self.head_f_dim)
+        k = k.view(B, T, self.n_heads, self.head_f_dim)
+        g = g.view(B, T, self.n_heads, self.head_f_dim)
+        v = v.view(B, T, self.n_heads, self.head_i_dim)
+        if self.use_rope and cos is not None:
+            q = _apply_rope(q, cos, sin)
+            k = _apply_rope(k, cos, sin)
+        if use_recurrent_kernel:
+            o, final_state = _fused_recurrent_gla(
+                q=q, k=k, v=v, gk=g,
+                initial_state=recurrent_state,
+                output_final_state=True,
+            )
+        else:
+            o, final_state = _chunk_gla(
+                q=q, k=k, v=v, g=g,
+                initial_state=recurrent_state,
+                output_final_state=output_state,
+            )
+        o = o.reshape(B, T, D)
+        o = self.g_norm(o)
+        o = self.o_proj(o)
+        if output_state:
+            return o, final_state
+        return o, None
+class OdinNextBlock(nn.Module):
+    """Pre-norm block with gated residuals.
+    Gates were absorbed and frozen at training time: `gate_attn` and
+    `gate_ffn` are stored as scalars whose `sigmoid()` ≈ 1 by the time of
+    this checkpoint. They remain in the state_dict for compatibility.
+    """
+    def __init__(
+        self,
+        d_model: int,
+        n_heads: int,
+        ffn_inner: int,
+        use_rope: bool = True,
+    ):
+        super().__init__()
+        self.pre_norm = ZCRMSNorm(d_model)
+        self.attn = OdinNextAttention(
+            d_model=d_model, n_heads=n_heads, use_rope=use_rope
+        )
+        self.ffn_norm = ZCRMSNorm(d_model)
+        self.ffn = SwiGLU2(d_model, ffn_inner)
+        self.gate_attn = nn.Parameter(torch.zeros(1))
+        self.gate_ffn = nn.Parameter(torch.zeros(1))
+    def forward(
+        self,
+        x: torch.Tensor,
+        cos: Optional[torch.Tensor] = None,
+        sin: Optional[torch.Tensor] = None,
+        recurrent_state: Optional[torch.Tensor] = None,
+        output_state: bool = False,
+        use_recurrent_kernel: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        attn_out, new_state = self.attn(
+            self.pre_norm(x),
+            cos=cos, sin=sin,
+            recurrent_state=recurrent_state,
+            output_state=output_state,
+            use_recurrent_kernel=use_recurrent_kernel,
+        )
+        x = x + torch.sigmoid(self.gate_attn) * attn_out
+        x = x + torch.sigmoid(self.gate_ffn) * self.ffn(self.ffn_norm(x))
+        return x, new_state
+# ---------------------------------------------------------------------------
+# OdinNext recurrent-state cache
+# ---------------------------------------------------------------------------
+class OdinNextCache:
+    """Container for HGRN2 recurrent states across all layers.
+    Wraps `List[Optional[Tensor]]` (one per layer, each [B, H, K, V]) with
+    just enough surface to satisfy HuggingFace `generate()`'s expectations
+    for `past_key_values`. Importantly: cache size is independent of T —
+    it is the per-layer hidden-state matrix S, not a growing K/V tape.
+    Also tracks `seen_tokens`, the number of input positions the cache has
+    consumed so far, which OdinNext uses to look up the correct RoPE
+    position offset during decode.
+    """
+    def __init__(self, n_layers: int):
+        self.n_layers = n_layers
+        self.states: List[Optional[torch.Tensor]] = [None] * n_layers
+        self.seen_tokens: int = 0
+    def __len__(self) -> int:
+        return self.n_layers
+    def __getitem__(self, idx: int) -> Optional[torch.Tensor]:
+        return self.states[idx]
+    def __setitem__(self, idx: int, value: Optional[torch.Tensor]) -> None:
+        self.states[idx] = value
+    def __iter__(self):
+        return iter(self.states)
+    def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
+        return self.seen_tokens
+    def get_max_length(self) -> Optional[int]:
+        return None  # HGRN2 has no hard cache length cap
+    def update_seen(self, n_new_tokens: int) -> None:
+        self.seen_tokens += n_new_tokens
+    def to(self, device: torch.device) -> "OdinNextCache":
+        for i, s in enumerate(self.states):
+            if s is not None:
+                self.states[i] = s.to(device)
+        return self
+# ---------------------------------------------------------------------------
+# OdinNext PreTrainedModel: HF integration
+# ---------------------------------------------------------------------------
+class OdinNextPreTrainedModel(PreTrainedModel):
+    """Base class wiring up HF infrastructure for OdinNext."""
+    config_class = OdinNextConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = False
+    _no_split_modules = ["OdinNextBlock"]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_cache_class = False  # we use our own OdinNextCache
+    def _init_weights(self, module: nn.Module) -> None:
+        """Conservative init — at inference we only need to define defaults
+        in case someone constructs an OdinNext from scratch.
+        """
+        std = getattr(self.config, "initializer_range", 0.02)
+        if isinstance(module, nn.Linear):
+            nn.init.xavier_uniform_(module.weight)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+class OdinNextModel(OdinNextPreTrainedModel):
+    """Backbone (no LM head)."""
+    def __init__(self, config: OdinNextConfig):
+        super().__init__(config)
+        self.config = config
+        self.tok_embeddings = nn.Embedding(config.vocab_size, config.d_model)
+        self.layers = nn.ModuleList([
+            OdinNextBlock(
+                d_model=config.d_model,
+                n_heads=config.n_heads,
+                ffn_inner=config.ffn_inner,
+                use_rope=(i % 2 == 0),
+            )
+            for i in range(config.n_layers)
+        ])
+        self.final_norm = ZCRMSNorm(config.d_model)
+        # RoPE caches are lazy-built on first forward. Storing them as
+        # `register_buffer(..., persistent=False)` is incompatible with
+        # `from_pretrained(low_cpu_mem_usage=True)`: HF builds the model on
+        # the meta device and only materializes tensors that appear in the
+        # checkpoint. Non-persistent buffers are NOT in the checkpoint and
+        # so end up backed by uninitialized memory after meta -> real
+        # transfer. We side-step this entirely by computing cos/sin on the
+        # first forward, cached on the model object as plain attributes.
+        self._cos_cache: Optional[torch.Tensor] = None
+        self._sin_cache: Optional[torch.Tensor] = None
+        # Skip _init_weights here — we expect to load weights from a
+        # pretrained checkpoint immediately after construction.
+    def get_input_embeddings(self) -> nn.Embedding:
+        return self.tok_embeddings
+    def set_input_embeddings(self, value: nn.Embedding) -> None:
+        self.tok_embeddings = value
+    # -----------------------------------------------------------------
+    # Forward
+    # -----------------------------------------------------------------
+    def _ensure_rope_cache(self, target_device: torch.device) -> None:
+        """Build the RoPE cos/sin caches on `target_device` if not already.
+        Cached as plain Python attributes (not buffers) to avoid HF's
+        `low_cpu_mem_usage=True` meta-device materialization issue with
+        non-persistent buffers.
+        """
+        need_build = (
+            self._cos_cache is None
+            or self._cos_cache.device != target_device
+        )
+        if not need_build:
+            return
+        head_f_dim = self.config.d_model // self.config.n_heads
+        half_dim = head_f_dim // 2
+        freqs = 1.0 / (
+            self.config.rope_theta
+            ** (
+                torch.arange(0, half_dim, dtype=torch.float32, device=target_device)
+                / half_dim
+            )
+        )
+        t = torch.arange(self.config.max_seq_len, dtype=torch.float32, device=target_device)
+        angles = torch.outer(t, freqs)
+        self._cos_cache = angles.cos()
+        self._sin_cache = angles.sin()
+    def _rope_slice(
+        self,
+        seq_len: int,
+        offset: int,
+        target_dtype: torch.dtype,
+        target_device: torch.device,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        end = offset + seq_len
+        if end > self.config.max_seq_len:
+            raise ValueError(
+                f"Sequence end {end} exceeds max_seq_len={self.config.max_seq_len}, "
+                "the size of the RoPE cache (OdinNext was trained with 2048 tokens)."
+            )
+        self._ensure_rope_cache(target_device)
+        cos = self._cos_cache[offset:end].to(dtype=target_dtype)
+        sin = self._sin_cache[offset:end].to(dtype=target_dtype)
+        cos = cos.unsqueeze(0).unsqueeze(2)  # [1, T, 1, D/2]
+        sin = sin.unsqueeze(0).unsqueeze(2)
+        return cos, sin
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        past_key_values: Optional[OdinNextCache] = None,
+        use_cache: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **_unused,
+    ) -> Tuple[torch.Tensor, Optional[OdinNextCache]]:
+        """Backbone forward.
+        Returns `(hidden_states, past_key_values)`. The LM-head wrapper
+        (`OdinNextForCausalLM`) projects to logits.
+        Note: `attention_mask` is accepted for HF API compatibility but is
+        NOT used. HGRN2 is causal by construction (the recurrence is strictly
+        forward-in-time) and cannot honor a left-padded mask. For correct
+        results with batched generation, callers must right-pad and ensure
+        all sequences in a batch have valid tokens at every position they
+        process. Single-sequence generation is unaffected.
+        """
+        if use_cache is None:
+            use_cache = self.config.use_cache
+        B, T = input_ids.shape
+        # Coerce past_key_values to our expected type first. HF generate may
+        # try to auto-instantiate a DynamicCache or pass a legacy tuple; we
+        # want strict OdinNextCache or None. Doing this before reading
+        # `seen_tokens` keeps a foreign cache from setting the RoPE offset
+        # or selecting the recurrent decode kernel.
+        if past_key_values is not None and not isinstance(past_key_values, OdinNextCache):
+            past_key_values = None
+        # Determine if we're in single-token decode mode.
+        single_step = (T == 1) and (past_key_values is not None)
+        # RoPE position offset
+        offset = past_key_values.seen_tokens if past_key_values is not None else 0
+        h = self.tok_embeddings(input_ids)
+        # Prepare RoPE caches in the embedding's dtype.
+        cos, sin = self._rope_slice(
+            seq_len=T, offset=offset,
+            target_dtype=h.dtype, target_device=h.device,
+        )
+        # Start a fresh cache only after the decode-mode and offset decisions.
+        if past_key_values is None and use_cache:
+            past_key_values = OdinNextCache(self.config.n_layers)
+        for i, layer in enumerate(self.layers):
+            prev_state = past_key_values[i] if past_key_values is not None else None
+            h, new_state = layer(
+                h,
+                cos=cos, sin=sin,
+                recurrent_state=prev_state,
+                output_state=use_cache,
+                use_recurrent_kernel=single_step,
+            )
+            if use_cache and past_key_values is not None:
+                past_key_values[i] = new_state
+        h = self.final_norm(h)
+        if past_key_values is not None:
+            past_key_values.update_seen(T)
+        return h, past_key_values
+class OdinNextForCausalLM(OdinNextPreTrainedModel):
+    """Top-level wrapper exposing logits + HF generate()."""
+    # Map tied output -> source. Newer `transformers` (>=4.45) expects a
+    # dict; older versions tolerate (and used) a list of keys. Provide the
+    # dict form which is forward-compatible.
+    _tied_weights_keys = {"lm_head.weight": "model.tok_embeddings.weight"}
+    def __init__(self, config: OdinNextConfig):
+        super().__init__(config)
+        self.model = OdinNextModel(config)
+        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
+        if config.tie_embeddings:
+            self.lm_head.weight = self.model.tok_embeddings.weight
+        self.post_init()
+    def get_input_embeddings(self) -> nn.Embedding:
+        return self.model.tok_embeddings
+    def set_input_embeddings(self, value: nn.Embedding) -> None:
+        self.model.tok_embeddings = value
+    def get_output_embeddings(self) -> nn.Linear:
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
+        self.lm_head = new_embeddings
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        past_key_values: Optional[OdinNextCache] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **_unused,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        return_dict = return_dict if return_dict is not None else True
+        hidden_states, past_key_values = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+        )
+        logits = self.lm_head(hidden_states)
+        loss = None
+        if labels is not None:
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            loss = F.cross_entropy(
+                shift_logits.view(-1, shift_logits.size(-1)).float(),
+                shift_labels.view(-1).long(),
+                ignore_index=-100,
+            )
+        if not return_dict:
+            output = (logits,) + ((past_key_values,) if past_key_values is not None else ())
+            return ((loss,) + output) if loss is not None else output
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=past_key_values,
+            hidden_states=None,
+            attentions=None,
+        )
+    # -----------------------------------------------------------------
+    # generate() integration
+    # -----------------------------------------------------------------
+    def prepare_inputs_for_generation(
+        self,
+        input_ids: torch.Tensor,
+        past_key_values: Optional[OdinNextCache] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = True,
+        **kwargs,
+    ) -> dict:
+        """Trim input_ids to only the new positions when a cache exists.
+        After the first forward, the recurrent state already encodes the
+        prompt. Subsequent calls only need to pass the most recently
+        generated token.
+        """
+        # Only trim for our own cache type; forward() discards any other
+        # cache object, so the full sequence must be passed in that case.
+        if isinstance(past_key_values, OdinNextCache) and past_key_values.seen_tokens > 0:
+            # New tokens since last call.
+            new_count = input_ids.shape[1] - past_key_values.seen_tokens
+            if new_count <= 0:
+                # generate() can occasionally call us with the same length
+                # twice (e.g., assisted-decoding paths). Default to feeding
+                # the last token only.
+                input_ids = input_ids[:, -1:]
+            else:
+                input_ids = input_ids[:, -new_count:]
+        return {
+            "input_ids": input_ids,
+            "past_key_values": past_key_values,
+            "attention_mask": attention_mask,
+            "use_cache": use_cache,
+        }
+    def _reorder_cache(
+        self, past_key_values: OdinNextCache, beam_idx: torch.Tensor
+    ) -> OdinNextCache:
+        """Beam-search support: reorder per-layer states along the batch axis."""
+        for i, state in enumerate(past_key_values.states):
+            if state is not None:
+                past_key_values.states[i] = state.index_select(0, beam_idx.to(state.device))
+        return past_key_values
+    @staticmethod
+    def _supports_default_dynamic_cache() -> bool:
+        return False
+# Re-export for convenience
+__all__ = [
+    "OdinNextConfig",
+    "OdinNextModel",
+    "OdinNextForCausalLM",
+    "OdinNextCache",
+]
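
The interplay between `prepare_inputs_for_generation` and the `seen_tokens` counter in the file above can be traced without loading the model. The following is a minimal pure-Python sketch, assuming token sequences are plain lists rather than tensors. `CounterCache`, `trim_new_tokens`, and `fake_forward` are hypothetical stand-ins for illustration and are not part of the repository.

```python
from typing import List, Optional


class CounterCache:
    """Hypothetical stand-in for OdinNextCache that tracks only seen_tokens."""

    def __init__(self) -> None:
        self.seen_tokens = 0

    def update_seen(self, n_new_tokens: int) -> None:
        self.seen_tokens += n_new_tokens


def trim_new_tokens(
    input_ids: List[List[int]], cache: Optional[CounterCache]
) -> List[List[int]]:
    """Keep only the positions the cache has not consumed yet."""
    if cache is None or cache.seen_tokens == 0:
        return input_ids  # prefill: feed the whole prompt
    new_count = len(input_ids[0]) - cache.seen_tokens
    if new_count <= 0:
        new_count = 1  # repeated call at the same length: feed the last token
    return [row[-new_count:] for row in input_ids]


def fake_forward(step_ids: List[List[int]], cache: CounterCache) -> int:
    """Pretend forward pass: report the RoPE offset used, then advance the cache."""
    offset = cache.seen_tokens
    cache.update_seen(len(step_ids[0]))
    return offset


cache = CounterCache()
prompt = [[11, 12, 13, 14]]

# Prefill: empty cache, so the full prompt goes through at offset 0.
prefill_ids = trim_new_tokens(prompt, cache)
prefill_offset = fake_forward(prefill_ids, cache)

# Decode: generate() hands back the grown sequence; only the new token is fed.
grown = [prompt[0] + [15]]
decode_ids = trim_new_tokens(grown, cache)
decode_offset = fake_forward(decode_ids, cache)
```

The decode step sees a single token at an offset equal to the prompt length, which is the condition under which the backbone's `single_step` flag selects the recurrent kernel.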

special_tokens_map.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "<|pad|>"
+}
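
Because `<|pad|>` is a separate token from `<|endoftext|>`, padding can be masked out of the loss without hiding end-of-text targets. The sketch below right-pads a batch and builds labels that use `-100` at pad positions, which matches the `ignore_index=-100` in the model's cross-entropy call. The numeric ids `EOS_ID` and `PAD_ID` are hypothetical placeholders; the real ids come from `tokenizer.json`.

```python
from typing import List

IGNORE_INDEX = -100  # matches ignore_index in the model's cross-entropy call
EOS_ID = 0  # hypothetical id for <|endoftext|>
PAD_ID = 1  # hypothetical id for <|pad|>


def right_pad(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every row on the right to the length of the longest row."""
    width = max(len(row) for row in batch)
    return [row + [pad_id] * (width - len(row)) for row in batch]


def make_labels(padded: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Copy input ids as labels, masking pad positions out of the loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in padded]


batch = [[5, 6, 7, EOS_ID], [8, EOS_ID]]
input_ids = right_pad(batch)
labels = make_labels(input_ids)
```

The labels are not shifted here because the model's `forward` shifts logits and labels internally before computing the loss.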

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "model_max_length": 2048,
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "<|pad|>",
+  "clean_up_tokenization_spaces": false
+}
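
The `model_max_length` of 2048 lines up with the RoPE bound in the modeling code, which raises `ValueError` once `offset + seq_len` exceeds `max_seq_len`. A small helper can clamp a generation request before it reaches that check. This is a conservative sketch that assumes the config's `max_seq_len` equals the tokenizer's `model_max_length`; `clamp_max_new_tokens` is a hypothetical name.

```python
MODEL_MAX_LENGTH = 2048  # model_max_length from tokenizer_config.json


def clamp_max_new_tokens(
    requested: int, prompt_len: int, max_len: int = MODEL_MAX_LENGTH
) -> int:
    """Limit new tokens so prompt plus generation stays within max_len positions."""
    if prompt_len > max_len:
        raise ValueError(f"prompt has {prompt_len} tokens, limit is {max_len}")
    return min(requested, max_len - prompt_len)


short_prompt_budget = clamp_max_new_tokens(512, prompt_len=100)
long_prompt_budget = clamp_max_new_tokens(512, prompt_len=2000)
```

The returned value can be passed as `max_new_tokens` to `generate()`, or as `max_tokens` in an OpenAI-compatible completion request.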