OdinNext-138M-Instruct / _hgrn2_fallback.py

joelhenwang

OdinNext-138M-Instruct: SFT + LFM2.5 SeqKD

8d6741e verified 2 days ago

3.36 kB

	# coding=utf-8
	# Copyright 2026 The OdinNext authors.
	# Licensed under the Apache License, Version 2.0.
	"""Pure-PyTorch HGRN2 recurrence — slow fallback when flash-linear-attention
	(`fla`) is unavailable.

	The `fla` library provides Triton/CUDA kernels for `chunk_gla` (chunk-wise
	parallel scan over T) and `fused_recurrent_gla` (token-by-token serial scan).
	On platforms without those kernels (CPU, non-CUDA/non-ROCm GPUs) we provide
	a reference implementation here.

	Speed: ~10-30x slower than `fla` at training shapes; comparable for
	single-token decode (since both are serial). Numerical match: bitwise on
	fp32, within fp16 noise on fp16.

	The recurrence (per head):
	    S_t = diag(exp(g_t)) @ S_{t-1} + k_t.unsqueeze(-1) @ v_t.unsqueeze(-2)
	    o_t = q_t @ S_t

	Shapes (matching `fla.ops.gla.chunk_gla`):
	    q: [B, T, H, K]   (K = head_f_dim, e.g. 128)
	    k: [B, T, H, K]
	    g: [B, T, H, K]   (already in log-space, expected to be <= 0)
	    v: [B, T, H, V]   (V = head_i_dim, e.g. 128)
	    -> o: [B, T, H, V]
	       final_state: [B, H, K, V] if output_final_state else None
	"""

	from typing import Optional, Tuple

	import torch


	def chunk_gla(
	    q: torch.Tensor,
	    k: torch.Tensor,
	    v: torch.Tensor,
	    g: torch.Tensor,
	    initial_state: Optional[torch.Tensor] = None,
	    output_final_state: bool = False,
	    **_unused,
	) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
	    """Pure-PyTorch chunk_gla replacement.

	    Implements a serial (token-by-token) scan. We promote internals to fp32
	    to keep the cumulative product of decays numerically sane over long T.
	    """
	    B, T, H, K = q.shape
	    V = v.shape[-1]
	    device = q.device
	    in_dtype = q.dtype

	    # Promote scan internals to fp32 for stability (matches fla behavior).
	    q32 = q.float()
	    k32 = k.float()
	    v32 = v.float()
	    g32 = g.float()

	    if initial_state is None:
	        S = torch.zeros(B, H, K, V, device=device, dtype=torch.float32)
	    else:
	        S = initial_state.to(dtype=torch.float32)

	    out = torch.empty(B, T, H, V, device=device, dtype=torch.float32)

	    # Serial scan. exp(g_t) decays state element-wise along K.
	    # k_t outer v_t -> [B, H, K, V] additive update.
	    for t in range(T):
	        decay = g32[:, t].exp().unsqueeze(-1)  # [B, H, K, 1]
	        S = decay * S + k32[:, t].unsqueeze(-1) * v32[:, t].unsqueeze(-2)
	        # o_t = q_t (1xK) @ S (KxV) per head
	        out[:, t] = (q32[:, t].unsqueeze(-2) @ S).squeeze(-2)  # [B, H, V]

	    out = out.to(in_dtype)
	    if output_final_state:
	        return out, S
	    return out, None


	def fused_recurrent_gla(
	    q: torch.Tensor,
	    k: torch.Tensor,
	    v: torch.Tensor,
	    gk: torch.Tensor,
	    initial_state: Optional[torch.Tensor] = None,
	    output_final_state: bool = True,
	    **_unused,
	) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
	    """Pure-PyTorch single-token (or short-T) recurrence.

	    `fla.ops.gla.fused_recurrent_gla` is what OdinNext.generate uses for
	    O(1) per-token decode. The signature matches: `gk` = log-decay (instead
	    of `g`). We reuse `chunk_gla` internals: they are mathematically the
	    same scan, just packaged with different defaults for kernel selection
	    in fla.
	    """
	    return chunk_gla(
	        q=q, k=k, v=v, g=gk,
	        initial_state=initial_state,
	        output_final_state=output_final_state,
	    )
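
The recurrence in the module docstring can be traced by hand on a tiny input. The following is a minimal sketch and not part of the repository: `gla_scan_single_head` is a hypothetical helper written with plain Python lists (one batch element, one head, no `torch`). It mirrors the serial scan above and checks that a prefill followed by a one-token decode from the saved state reproduces the full-sequence result. The input values are made up for illustration.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def gla_scan_single_head(
    q: Matrix,  # [T][K]
    k: Matrix,  # [T][K]
    v: Matrix,  # [T][V]
    g: Matrix,  # [T][K], log-decay, expected to be <= 0
    initial_state: Optional[Matrix] = None,  # [K][V]
) -> Tuple[Matrix, Matrix]:
    """Hypothetical list-based scan for one batch element and one head."""
    K, V = len(k[0]), len(v[0])
    if initial_state is None:
        S = [[0.0] * V for _ in range(K)]
    else:
        # Copy so the caller's state is left untouched.
        S = [row[:] for row in initial_state]

    out = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        # S_t = diag(exp(g_t)) @ S_{t-1} + outer(k_t, v_t)
        S = [
            [math.exp(g_t[i]) * S[i][j] + k_t[i] * v_t[j] for j in range(V)]
            for i in range(K)
        ]
        # o_t = q_t @ S_t
        out.append([sum(q_t[i] * S[i][j] for i in range(K)) for j in range(V)])
    return out, S


# Tiny illustrative input: T=3, K=2, V=2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.5, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
g = [[0.0, 0.0], [math.log(0.5), 0.0], [0.0, math.log(0.25)]]

# One pass over the whole sequence.
full_out, full_state = gla_scan_single_head(q, k, v, g)

# Prefill on the first two tokens, then decode the third from the saved state.
prefix_out, prefix_state = gla_scan_single_head(q[:2], k[:2], v[:2], g[:2])
step_out, step_state = gla_scan_single_head(
    q[2:], k[2:], v[2:], g[2:], initial_state=prefix_state
)

print(full_out[-1])
print(step_out[0])
```

With these inputs the last output row works out to approximately `[3.0, 2.5]`, and the decode step started from `prefix_state` gives the same row. That hand-off is the property the `initial_state` / `output_final_state` pair relies on for per-token decoding.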