HebrewGPT-1B-AdamW 🇮🇱 (Ablation)

HebrewGPT-1B-AdamW is an ablation variant of HebrewGPT-1B trained with the AdamW optimizer instead of Muon. All other training conditions (architecture, data, hardware, and hyperparameters) are identical. This ablation shows that the Muon optimizer achieves an 8.5% lower validation BPB than AdamW at the 1B-parameter scale (25.89 vs 28.09 at the best checkpoint).

This model is provided for research comparison purposes. For the best-performing Hebrew language model, use HebrewGPT-1B.

Model Description

This model has the exact same architecture as HebrewGPT-1B:

| Parameter | Value |
|---|---|
| Parameters | 1.08B |
| Hidden size (WIDTH) | 2048 |
| Layers (DEPTH) | 20 |
| Attention heads | 16 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=5504) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 2,048 tokens |
| Weight tying | Yes |
| Precision | bfloat16 |
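As a sanity check, the 1.08B figure can be reproduced from the dimensions in the table above. This is a back-of-the-envelope sketch: it assumes no bias terms and ignores the RMSNorm gains, which are negligible at this scale.

```python
# Rough parameter count from the architecture table (assumes no biases;
# RMSNorm gains are negligible at this scale).
width, depth = 2048, 20
vocab, intermediate = 32000, 5504

embedding = vocab * width                 # tied with the output head, counted once
attention = 4 * width * width             # Q, K, V and output projections
swiglu = 3 * width * intermediate         # gate, up and down projections
per_layer = attention + swiglu

total = embedding + depth * per_layer
print(f"{total / 1e9:.2f}B parameters")   # ≈ 1.08B, matching the table
```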

Training Details

What's Different

  • Optimizer: AdamW (replacing Muon)
  • Everything else is identical to HebrewGPT-1B

Training

  • Optimizer: AdamW + Lookahead (k=5, α=0.6) + SWA + 4 cosine cycles
  • Dropout: 0.1
  • Data: 2.48B tokens from 12 Hebrew datasets (same as primary model)
  • Hardware: 8× NVIDIA H100 80GB GPUs
  • Training time: ~8 hours
  • Steps: 11,904
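The Lookahead wrapper used here (k=5, α=0.6) keeps a slow copy of the weights and pulls it toward the fast (inner-optimizer) weights every k steps. Below is a minimal pure-Python sketch of that update rule around plain SGD on a scalar toy objective; the actual run wraps AdamW over full weight tensors, and the quadratic objective here is illustrative, not from the training code.

```python
# Lookahead(k=5, alpha=0.6) sketch around a scalar SGD inner loop.
def lookahead_sgd(w0, grad_fn, lr=0.1, k=5, alpha=0.6, steps=50):
    slow = fast = w0
    for step in range(1, steps + 1):
        fast -= lr * grad_fn(fast)           # inner (fast) optimizer step
        if step % k == 0:                    # every k steps...
            slow += alpha * (fast - slow)    # ...pull slow weights toward fast
            fast = slow                      # and restart fast from slow
    return slow

# Toy objective: f(w) = (w - 3)^2, gradient 2(w - 3); minimum at w = 3.
w = lookahead_sgd(w0=0.0, grad_fn=lambda w: 2.0 * (w - 3.0))
print(round(w, 3))
```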

Evaluation Results

Comparison: Muon vs AdamW

| Metric | HebrewGPT-1B (Muon) | HebrewGPT-1B-AdamW (this) | Δ |
|---|---|---|---|
| Validation BPB (best ckpt) | 25.89 | 28.09 | +8.5% worse |
| Validation BPB (snapshot) | — | 31.29 | — |
| Validation BPB (SWA) | 25.89 | 31.73 | +22.6% worse |
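The Δ column follows directly from the BPB numbers: each gap is the relative difference against the Muon baseline. A small sketch of that arithmetic:

```python
# Relative BPB gap vs the Muon baseline (lower BPB is better).
muon_bpb, adamw_bpb = 25.89, 28.09
gap = (adamw_bpb - muon_bpb) / muon_bpb * 100
print(f"AdamW is {gap:.1f}% worse")        # → AdamW is 8.5% worse

swa_gap = (31.73 - 25.89) / 25.89 * 100
print(f"With SWA: {swa_gap:.1f}% worse")   # → With SWA: 22.6% worse
```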

Key Finding

Muon achieves an 8.5% lower best-checkpoint validation BPB than AdamW at 1B scale (25.89 vs 28.09). The gap widens to 22.6% under SWA, suggesting Muon finds flatter minima that average better across checkpoints.

This is a notable result for optimizer research: Muon, originally designed for smaller models, scales effectively to 1B parameters and outperforms the established AdamW optimizer on Hebrew language modeling.
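SWA itself is just a running average of weights sampled along the training trajectory. A minimal sketch of the averaging step, operating on plain floats for clarity (the real implementation averages full weight tensors per parameter):

```python
# Stochastic Weight Averaging: incremental mean over sampled checkpoints.
def swa_average(checkpoints):
    swa, n = 0.0, 0
    for w in checkpoints:
        swa = (swa * n + w) / (n + 1)   # incremental mean update
        n += 1
    return swa

result = swa_average([2.9, 3.1, 3.0, 3.2])
print(result)  # mean of the sampled weights
```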

Usage

โš ๏ธ Custom Architecture: This model uses the same custom architecture as HebrewGPT-1B. See the primary model repo for the full model class definition.

```python
import torch
import sentencepiece as spm

# Use the same generate.py from HebrewGPT-1B
from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000, width=2048, depth=20,
    n_heads=16, head_dim=128, max_seq_len=2048, dropout=0.0,
)
model = HebrewGPT(config)

# Checkpoints may store the weights under a "model" key
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
if isinstance(state_dict, dict) and "model" in state_dict:
    state_dict = state_dict["model"]
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "בראשית ברא אלהים את"  # "In the beginning God created..."
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```

Limitations

  • Same limitations as HebrewGPT-1B
  • Lower quality than the primary Muon-trained model
  • Provided for ablation/research purposes only

Citation

@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}

Acknowledgments

  • Loki: AI research assistant (Amazon Bedrock on OpenClaw)
  • Andrej Karpathy: For the autoresearch framework
