
GPT-S2-5M

GPT-S2-5M is the latest entry in our GPT-S small-model family, now based on the T-X4 architecture with the all-new XSA refresh gate. It was trained from scratch.

It combines RoPE, RMSNorm, SwiGLU, and exclusive grouped-query attention (XSA). The refresh gate re-injects the original token embedding into the residual stream, conditioned on the (detached) attention output, to counteract the diluted token identities common in deep XSA models.

Architecture

Component          Details
Position encoding  RoPE, theta = 2,500
Normalization      RMSNorm (eps 1e-6)
Feed-forward       SwiGLU
Attention          Exclusive grouped-query attention (XSA), 6 query heads / 2 KV heads
Refresh            XSA refresh gate on layers 5 and 9 (1-indexed; indices 4 and 8 in the config), kernel 9
Embeddings         Weight tied
Context length     512 tokens
Parameters         5,384,258

Config

vocab_size       = 4,096
hidden_size      = 192
num_layers       = 9
num_heads        = 6
num_kv_heads     = 2
head_dim         = 32
intermediate     = 672
block_size       = 512
rope_theta       = 2,500
inject_layers    = [4, 8]
refresh_kernel   = 9
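
As a sanity check on the config above, the following sketch tallies parameters from the listed dimensions. It assumes details the card does not state: no bias terms anywhere, two RMSNorm weights per block plus one final norm, and per refresh gate three square projections, a bias-free depthwise conv, three RMSNorm weights, and one scalar alpha. Under those assumptions the tally lands on the reported parameter count, but treat it as one plausible accounting rather than the official breakdown.

```python
# Config values copied from the model card.
vocab_size, hidden, layers = 4096, 192, 9
heads, kv_heads, head_dim = 6, 2, 32
intermediate, kernel, n_gates = 672, 9, 2


def count_params() -> int:
    # Token embedding; the LM head is weight tied, so it is counted once.
    embed = vocab_size * hidden
    # Attention: q and o use all query heads, k and v use the 2 KV heads.
    attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
    # SwiGLU: gate, up, and down projections (assumed bias-free).
    mlp = 3 * hidden * intermediate
    # Assumed: two RMSNorm weight vectors per block.
    block = attn + mlp + 2 * hidden
    # Assumed gate layout: gate/value/out projections, depthwise conv,
    # three RMSNorm weight vectors, and the scalar alpha.
    gate = 3 * hidden * hidden + hidden * kernel + 3 * hidden + 1
    # Assumed: one final RMSNorm before the LM head.
    return embed + layers * block + n_gates * gate + hidden


total = count_params()
print(f"{total:,}")  # 5,384,258
```

Roughly 4% of the total (225,794 parameters) sits in the two refresh gates under this accounting.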

The XSA refresh gate

The keystone of the T-X4 architecture is its injection layers. In these layers, after the attention residual add, the block applies:

a     = RMSNorm(attn_out.detach())              # read attention as a signal
e     = RMSNorm(e0)                              # original token embedding
gate  = gate_proj(a) + causal_depthwise_conv(a) # kernel-9, depthwise
value = value_proj(e)
z     = RMSNorm(out_proj(SiLU(gate) * value))
x     = x + alpha * z                            # alpha is a learned scalar

e0 is the token embedding from the input layer, re-injected at every gate. The depthwise conv is strictly causal, so the gate is compatible with KV-cache generation (the conv state is carried alongside the attention cache).
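
The pseudocode above can be sketched as a PyTorch module. This is a minimal illustration, not the released implementation: the class and attribute names (`RefreshGate`, `norm_a`, `conv`, and so on) are hypothetical, all layers are assumed bias-free, the initial value of `alpha` is a guess, and causality is obtained here by left-padding the sequence before an unpadded depthwise `Conv1d`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class RefreshGate(nn.Module):
    """Hypothetical sketch of the refresh gate described above."""

    def __init__(self, dim: int = 192, kernel: int = 9):
        super().__init__()
        self.kernel = kernel
        self.norm_a, self.norm_e, self.norm_z = RMSNorm(dim), RMSNorm(dim), RMSNorm(dim)
        self.gate_proj = nn.Linear(dim, dim, bias=False)
        self.value_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        # groups=dim makes the conv depthwise: one kernel-9 filter per channel.
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, bias=False)
        self.alpha = nn.Parameter(torch.ones(()))  # init value is an assumption

    def forward(self, x, attn_out, e0):
        # All inputs have shape (batch, seq, dim).
        a = self.norm_a(attn_out.detach())  # no gradient flows back into attention
        e = self.norm_e(e0)
        # Left-pad by kernel-1 so position t only sees positions <= t.
        a_padded = F.pad(a.transpose(1, 2), (self.kernel - 1, 0))
        gate = self.gate_proj(a) + self.conv(a_padded).transpose(1, 2)
        value = self.value_proj(e)
        z = self.norm_z(self.out_proj(F.silu(gate) * value))
        return x + self.alpha * z


gate = RefreshGate()
x, attn_out, e0 = (torch.randn(2, 16, 192) for _ in range(3))
print(gate(x, attn_out, e0).shape)  # torch.Size([2, 16, 192])
```

With a kernel of 9, incremental decoding under this layout would need only the last 8 normalized attention outputs per gate layer as conv state, which is why that state can sit next to the KV cache.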

Benchmarks

Zero-shot, evaluated in bf16 with an internal harness modeled on EleutherAI/lm-eval-harness; normalized accuracy where available.

HellaSwag  ARC-Easy  ARC-Challenge  PIQA    Arithmark-2
27.87%     33.92%    22.87%         57.56%  28.04%

Comparison

GPT-S2-5M compared against other small base models on the same evaluation suite. It achieves the highest average score in the sub-10M-parameter category on the Open SLM Leaderboard.

#  Model                       Params  HellaSwag  ARC-Easy  ARC-Challenge  PIQA    Arithmark-2
1  GPT-S2-5M (Axiomic Labs)    5.4M    27.87%     33.92%    22.87%         57.56%  28.04%
2  SLM-10M (Liodon AI)         10M     27.40%     35.52%    23.46%         57.07%  26.40%
3  GPT-S-5M (Axiomic Labs)     5.2M    27.46%     33.21%    21.16%         57.24%  27.12%
4  michel-nano-v2 (finnianx)   9.9M    27.36%     36.07%    21.84%         56.96%  25.04%
5  michel-nano (finnianx)      6M      27.12%     33.29%    22.70%         54.79%  26.00%
6  Tenete-8M (Harley ML)       8M      26.75%     31.69%    21.84%         55.66%  26.72%
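
To make the average-score comparison concrete, the sketch below computes an unweighted mean over the five tasks listed in the table. The unweighted mean is an assumption: the leaderboard may aggregate differently (other tasks, weights, or rounding), so the resulting order need not match the # column.

```python
# Scores copied from the comparison table, in the order:
# HellaSwag, ARC-Easy, ARC-Challenge, PIQA, Arithmark-2.
scores = {
    "GPT-S2-5M": [27.87, 33.92, 22.87, 57.56, 28.04],
    "SLM-10M": [27.40, 35.52, 23.46, 57.07, 26.40],
    "GPT-S-5M": [27.46, 33.21, 21.16, 57.24, 27.12],
    "michel-nano-v2": [27.36, 36.07, 21.84, 56.96, 25.04],
    "michel-nano": [27.12, 33.29, 22.70, 54.79, 26.00],
    "Tenete-8M": [26.75, 31.69, 21.84, 55.66, 26.72],
}

# Unweighted mean per model (an assumed aggregate, see the note above).
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for name, avg in ranked:
    print(f"{name:<16} {avg:.2f}%")
```

Under this aggregate GPT-S2-5M averages about 34.05%, just ahead of SLM-10M at about 33.97%.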

Training

Hyperparameter               Value
Optimizers                   AdamW (embeddings / LM head / norms / conv) + Muon (2D hidden weights)
Adam betas                   0.9 / 0.95
Adam peak LR                 2.5e-3
Muon peak LR                 0.03
Muon momentum                0.95
Weight decay                 0.01
Minimum LR                   0
LR schedule                  Warmup-stable-decay
Warmup steps                 1,500
Decay start                  85% of training
Training tokens              75B
Total batch size             262,144 tokens
Microbatch                   128 × 512 tokens
Gradient accumulation steps  4
Gradient clipping            1.0
Precision                    bfloat16 autocast
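
The batch arithmetic and the schedule in the table can be sketched as follows. The function name `wsd_lr` is hypothetical, and several details are assumptions the table does not state: linear warmup, linear decay to the minimum LR, a single process (so total batch = microbatch × accumulation steps), and 75B taken as an exact token count.

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_steps=1_500, decay_start_frac=0.85, min_lr=0.0):
    """Warmup-stable-decay learning rate at a given optimizer step."""
    decay_start = int(decay_start_frac * total_steps)
    if step < warmup_steps:
        # Linear warmup (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return peak_lr
    # Linear decay to min_lr (assumed shape).
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return min_lr + (peak_lr - min_lr) * (1.0 - progress)


# 128 sequences x 512 tokens x 4 accumulation steps per optimizer step.
tokens_per_step = 128 * 512 * 4
total_steps = 75_000_000_000 // tokens_per_step

print(tokens_per_step)  # 262144
print(total_steps)      # 286102

for step in (0, 1_500, 200_000, 270_000, total_steps):
    adam_lr = wsd_lr(step, total_steps, peak_lr=2.5e-3)
    muon_lr = wsd_lr(step, total_steps, peak_lr=0.03)
    print(step, adam_lr, muon_lr)
```

The same schedule shape is applied here to both optimizers with their respective peak learning rates; whether the original run shares one schedule across AdamW and Muon is not stated.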

The model was trained on a mixture of filtered web text (DCLM) and synthetic "finephrase" data (FAQ / math / table / tutorial).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-S2-5M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Limitations

This is a small base language model. It is not instruction tuned, has limited factual capacity, and uses a 512-token context window.
