GPT-S2-5M

GPT-S2-5M is the latest entry in our GPT-S small-model family, now based on the T-X4 architecture with the all-new XSA refresh gate. It was trained from scratch.

It combines RoPE, RMSNorm, SwiGLU, and exclusive grouped-query attention (XSA). The refresh gate re-injects the original token embedding into the residual stream, conditioned on the (detached) attention output, to counteract the diluted token identities common in deep XSA models.

Architecture

Component	Details
Position encoding	RoPE, theta = 2,500
Normalization	RMSNorm (eps 1e-6)
Feed-forward	SwiGLU
Attention	Exclusive grouped-query attention (XSA), 6 query heads / 2 KV heads
Refresh	XSA refresh gate on layers 5 and 9 (1-indexed; inject_layers = [4, 8] in the config), kernel 9
Embeddings	Weight tied
Context length	512 tokens
Parameters	5,384,258
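
To make the position-encoding row concrete, the sketch below computes rotary frequencies for theta = 2,500 and head_dim = 32 (both from this card) using the standard RoPE formula theta^(-2i/d). The helper name `rope_inv_freq` is illustrative, and the model's own implementation may differ in details such as how dimensions are paired.

```python
import math

def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

inv_freq = rope_inv_freq(head_dim=32, theta=2500.0)

# Wavelength (in tokens) of each rotary pair: 2 * pi / frequency.
wavelengths = [2 * math.pi / f for f in inv_freq]

print(len(inv_freq))              # number of rotary pairs per head
print(round(wavelengths[0], 2))   # fastest pair, in tokens
print(round(wavelengths[-1]))     # slowest pair, in tokens
```

Under this formula the slowest pair has a wavelength of several thousand tokens, longer than the 512-token context, so every position in the window gets a distinct rotation.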

Config

vocab_size       = 4,096
hidden_size      = 192
num_layers       = 9
num_heads        = 6
num_kv_heads     = 2
head_dim         = 32
intermediate     = 672
block_size       = 512
rope_theta       = 2,500
inject_layers    = [4, 8]
refresh_kernel   = 9
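
As a sanity check on the config, the parameter total can be recomputed by hand. The sketch below assumes no bias terms anywhere, a learned weight on every RMSNorm, two norms per block plus one final norm, and a refresh gate made of three square projections, one depthwise kernel, three norms, and a scalar alpha (the layout described in the refresh gate section). These are assumptions about the layout, not a dump of the checkpoint.

```python
def count_params(vocab=4096, d=192, layers=9, heads=6, kv_heads=2,
                 head_dim=32, inter=672, gates=2, kernel=9) -> int:
    # Assumption: no biases; every RMSNorm carries a d-sized weight.
    embed = vocab * d                      # tied with the LM head, counted once
    attn = d * heads * head_dim            # q_proj
    attn += 2 * d * kv_heads * head_dim    # k_proj and v_proj (grouped-query)
    attn += heads * head_dim * d           # o_proj
    mlp = 3 * d * inter                    # SwiGLU: gate, up, down
    block = attn + mlp + 2 * d             # plus two RMSNorm weights
    gate = 3 * d * d                       # gate_proj, value_proj, out_proj
    gate += d * kernel                     # depthwise conv: one filter per channel
    gate += 3 * d + 1                      # three RMSNorm weights + scalar alpha
    return embed + layers * block + gates * gate + d  # + final RMSNorm

print(count_params())  # 5384258
```

Under these assumptions the total equals the reported 5,384,258, which is consistent with the described layout but does not prove it.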

The XSA refresh gate

The keystone of the T-X4 architecture is the injection layers. In these layers, after the attention residual add, the block applies:

a     = RMSNorm(attn_out.detach())              # read attention as a signal
e     = RMSNorm(e0)                              # original token embedding
gate  = gate_proj(a) + causal_depthwise_conv(a) # kernel-9, depthwise
value = value_proj(e)
z     = RMSNorm(out_proj(SiLU(gate) * value))
x     = x + alpha * z                            # alpha is a learned scalar

e0 is the token embedding from the input layer, re-injected at every gate. The depthwise conv is strictly causal, so the gate is compatible with KV-cache generation (the conv state is carried alongside the attention cache).
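
The cache-compatibility claim can be illustrated in plain Python. The sketch below is not the model's code: `causal_depthwise_conv` and `ConvState` are hypothetical names, and the state layout (a rolling buffer of the last K inputs) is one assumed way to carry the conv state between decode steps. It checks that token-by-token streaming reproduces the full-sequence result.

```python
import random
from collections import deque

def causal_depthwise_conv(x, w):
    """x: [T][C] inputs, w: [C][K] per-channel filters. Output t sees only x[<=t]."""
    K = len(w[0])
    out = []
    for t in range(len(x)):
        row = []
        for c in range(len(w)):
            acc = 0.0
            for k in range(K):
                src = t - (K - 1) + k      # positions before 0 act as zero padding
                if src >= 0:
                    acc += w[c][k] * x[src][c]
            row.append(acc)
        out.append(row)
    return out

class ConvState:
    """Rolling buffer of the last K inputs, carried between decode steps."""
    def __init__(self, channels, kernel):
        self.buf = deque([[0.0] * channels for _ in range(kernel)], maxlen=kernel)

    def step(self, x_t, w):
        self.buf.append(x_t)               # the oldest entry is dropped
        out = []
        for c in range(len(w)):
            acc = 0.0
            for k in range(len(w[c])):
                acc += w[c][k] * self.buf[k][c]
            out.append(acc)
        return out

random.seed(0)
C, K, T = 4, 9, 20                         # small C for the demo; kernel 9 as in the card
w = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(C)]
x = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(T)]

full = causal_depthwise_conv(x, w)
state = ConvState(C, K)
stream = [state.step(x_t, w) for x_t in x]

max_diff = max(abs(a - b) for fr, sr in zip(full, stream) for a, b in zip(fr, sr))
print(max_diff < 1e-12)
```

Because each output depends only on the current and previous K - 1 inputs, the state a decoder must keep per gate is fixed-size, independent of sequence length.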

Benchmarks

Zero-shot, evaluated in bf16 with an internal harness modeled on EleutherAI/lm-eval-harness; normalized accuracy where available.

HellaSwag	ARC-Easy	ARC-Challenge	PIQA	Arithmark-2
27.87%	33.92%	22.87%	57.56%	28.04%

Comparison

GPT-S2-5M against other small base models on the same evaluation suite. It achieves the highest average score in the sub-10M category on the Open SLM Leaderboard.

#	Model	Params	HellaSwag	ARC-Easy	ARC-Challenge	PIQA	Arithmark-2
1	GPT-S2-5M (Axiomic Labs)	5.4M	27.87%	33.92%	22.87%	57.56%	28.04%
2	SLM-10M (Liodon AI)	10M	27.40%	35.52%	23.46%	57.07%	26.40%
3	GPT-S-5M (Axiomic Labs)	5.2M	27.46%	33.21%	21.16%	57.24%	27.12%
4	michel-nano-v2 (finnianx)	9.9M	27.36%	36.07%	21.84%	56.96%	25.04%
5	michel-nano (finnianx)	6M	27.12%	33.29%	22.70%	54.79%	26.00%
6	Tenete-8M (Harley ML)	8M	26.75%	31.69%	21.84%	55.66%	26.72%
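
The average-score claim can be checked directly from the table. The sketch below uses a plain unweighted mean over the five benchmarks, which is an assumption; the leaderboard's own aggregate is not specified here.

```python
# Scores copied from the comparison table, in column order:
# HellaSwag, ARC-Easy, ARC-Challenge, PIQA, Arithmark-2.
scores = {
    "GPT-S2-5M":      [27.87, 33.92, 22.87, 57.56, 28.04],
    "SLM-10M":        [27.40, 35.52, 23.46, 57.07, 26.40],
    "GPT-S-5M":       [27.46, 33.21, 21.16, 57.24, 27.12],
    "michel-nano-v2": [27.36, 36.07, 21.84, 56.96, 25.04],
    "michel-nano":    [27.12, 33.29, 22.70, 54.79, 26.00],
    "Tenete-8M":      [26.75, 31.69, 21.84, 55.66, 26.72],
}

# Unweighted mean per model (assumed aggregate).
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

for name, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {mean:.2f}")

best = max(means, key=means.get)
```

Under a plain mean GPT-S2-5M comes out on top, while GPT-S-5M and michel-nano-v2 swap places relative to the # column, so the leaderboard presumably ranks on a different aggregate.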

Training

Hyperparameter	Value
Optimizers	AdamW (embeddings / LM head / norms / conv) + Muon (2D hidden weights)
Adam betas	0.9 / 0.95
Adam peak LR	2.5e-3
Muon peak LR	0.03
Muon momentum	0.95
Weight decay	0.01
Minimum LR	0
LR schedule	Warmup-stable-decay
Warmup steps	1,500
Decay start	85% of training
Training tokens	75B
Total batch size	262,144 tokens
Microbatch	128 × 512 tokens
Gradient accumulation steps	4
Gradient clipping	1.0
Precision	bfloat16 autocast
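
The batch and schedule rows fit together as follows. This is a sketch under stated assumptions: the step count is simply tokens divided by batch size, the microbatch arithmetic assumes a single data-parallel rank, and both the warmup and the decay are taken to be linear (the table only says "warmup-stable-decay"). The function name `wsd_lr` is illustrative.

```python
TOTAL_TOKENS = 75_000_000_000
BATCH_TOKENS = 262_144
MICROBATCH_TOKENS = 128 * 512      # sequences x tokens per sequence
GRAD_ACCUM = 4
WARMUP_STEPS = 1_500
DECAY_START_FRAC = 0.85

# One optimizer step = microbatch x accumulation (assuming a single rank).
tokens_per_step = MICROBATCH_TOKENS * GRAD_ACCUM

total_steps = TOTAL_TOKENS // BATCH_TOKENS
decay_start = int(DECAY_START_FRAC * total_steps)

def wsd_lr(step: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Warmup-stable-decay; the linear warmup and decay shapes are assumptions."""
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * min(1.0, frac)

print(tokens_per_step == BATCH_TOKENS)   # True
print(total_steps)                       # 286102
print(wsd_lr(100_000, peak_lr=2.5e-3))   # stable phase: AdamW peak LR
print(wsd_lr(total_steps, peak_lr=0.03)) # end of training: Muon LR at the minimum
```

The same schedule shape is applied here to both optimizers with their own peak values; whether the actual run shares one schedule between AdamW and Muon is not stated in the table.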

Trained on a mixture of filtered web text (DCLM) and synthetic "finephrase" data (FAQ / math / table / tutorial).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-S2-5M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Limitations

This is a small base language model. It is not instruction tuned, has limited factual capacity, and uses a 512-token context window.
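
Because of the 512-token window, long prompts need trimming before generation. The helper below is a hypothetical sketch of one simple policy, keeping the most recent tokens so that prompt plus generated tokens fit in the window; `fit_to_context` is not part of the model's API.

```python
def fit_to_context(token_ids, max_new_tokens, block_size=512):
    """Keep the most recent tokens so prompt + generation fits in block_size."""
    budget = block_size - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than block_size")
    return token_ids[-budget:]

prompt_ids = list(range(1000))     # stand-in for real tokenizer output
kept = fit_to_context(prompt_ids, max_new_tokens=120)
print(len(kept), kept[0])          # 392 608
```

Prompts shorter than the budget pass through unchanged. Dropping the oldest tokens is only one option; which part of a long prompt to keep depends on the task.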
