# HebrewGPT-1B-AdamW 🇮🇱 (Ablation)
HebrewGPT-1B-AdamW is an ablation variant of HebrewGPT-1B trained with the AdamW optimizer instead of Muon. All other training conditions (architecture, data, hardware, and hyperparameters) are identical. This model demonstrates that the Muon optimizer provides an 8.5% improvement in best-checkpoint validation BPB over AdamW at the 1B-parameter scale.
This model is provided for research comparison purposes. For the best-performing Hebrew language model, use HebrewGPT-1B.
- Paper: Hebrew Language Model Research via Agentic AI
- GitHub: AgenticResearcher
- Primary model: HebrewGPT-1B (Muon optimizer)
## Model Description
This model has the exact same architecture as HebrewGPT-1B:
| Parameter | Value |
|---|---|
| Parameters | 1.08B |
| Hidden size (WIDTH) | 2048 |
| Layers (DEPTH) | 20 |
| Attention heads | 16 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=5504) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 2,048 tokens |
| Weight tying | Yes |
| Precision | bfloat16 |
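The 1.08B parameter count in the table can be reproduced from the architectural hyperparameters above. A minimal sketch, assuming standard Q/K/V/O attention projections, a three-matrix SwiGLU MLP, and two RMSNorms per layer (consistent with the table, but the exact layer layout is an assumption, not confirmed by the repo):

```python
# Estimate total parameters from the table above (assumptions: tied
# input/output embeddings counted once, Q/K/V/O projections of size
# width x width, three-matrix SwiGLU MLP, one scale vector per RMSNorm).
VOCAB, WIDTH, DEPTH, INTERMEDIATE = 32_000, 2048, 20, 5504

embedding = VOCAB * WIDTH                      # tied, counted once
attention = 4 * WIDTH * WIDTH                  # Q, K, V, O projections
mlp = 3 * WIDTH * INTERMEDIATE                 # gate, up, down matrices
norms = 2 * WIDTH                              # two RMSNorms per layer

per_layer = attention + mlp + norms
total = embedding + DEPTH * per_layer + WIDTH  # + final RMSNorm

print(f"{total / 1e9:.2f}B parameters")        # -> 1.08B parameters
```

The estimate lands within rounding distance of the stated 1.08B, which supports the tied-embedding and SwiGLU dimensions listed in the table.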
## Training Details

### What's Different
- Optimizer: AdamW (replacing Muon)
- Everything else is identical to HebrewGPT-1B
### Training
- Optimizer: AdamW + Lookahead(k=5, α=0.6) + SWA + 4 cosine LR cycles
- Dropout: 0.1
- Data: 2.48B tokens from 12 Hebrew datasets (same as primary model)
- Hardware: 8× NVIDIA H100 80GB GPUs
- Training time: ~8 hours
- Steps: 11,904
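The Lookahead wrapper in the recipe maintains a second, slowly moving copy of the weights: every k inner optimizer steps, the slow weights move a fraction α toward the fast weights, and the fast weights are reset to them. A minimal pure-Python sketch of that update rule on a single scalar weight (illustrative only; the actual training used a full optimizer implementation):

```python
# Lookahead(k=5, alpha=0.6): after every k fast-optimizer steps,
# slow <- slow + alpha * (fast - slow), then fast is reset to slow.
K, ALPHA = 5, 0.6

def lookahead_run(fast_updates, k=K, alpha=ALPHA, w0=0.0):
    """Apply a sequence of fast-weight deltas with Lookahead synchronization."""
    fast = slow = w0
    for step, delta in enumerate(fast_updates, start=1):
        fast += delta                      # inner (fast) optimizer step
        if step % k == 0:                  # synchronization point
            slow += alpha * (fast - slow)  # slow weights interpolate
            fast = slow                    # fast weights reset to slow
    return fast, slow

# Ten unit-sized steps: after step 5, slow = 0.6 * 5 = 3.0 and fast resets;
# after step 10, fast reaches 8.0 pre-sync and slow = 3.0 + 0.6 * 5 = 6.0.
fast, slow = lookahead_run([1.0] * 10)
print(fast, slow)  # -> 6.0 6.0
```

The interpolation damps oscillations of the fast optimizer, which is one reason Lookahead is often paired with SWA-style averaging as in this recipe.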
## Evaluation Results

### Comparison: Muon vs AdamW
| Metric | HebrewGPT-1B (Muon) | HebrewGPT-1B-AdamW (this model) | Δ |
|---|---|---|---|
| Validation BPB (best ckpt) | 25.89 | 28.09 | +8.5% worse |
| Validation BPB (snapshot) | – | 31.29 | – |
| Validation BPB (SWA) | 25.89 | 31.73 | +22.6% worse |
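The Δ column can be reproduced directly from the BPB values, taken relative to the Muon baseline (lower BPB is better):

```python
# Relative BPB gap of AdamW vs the Muon baseline from the table above.
muon_best = 25.89
adamw_best, adamw_swa = 28.09, 31.73

def gap_pct(adamw_bpb, baseline=muon_best):
    """Percentage by which an AdamW BPB exceeds the Muon baseline."""
    return (adamw_bpb - baseline) / baseline * 100

print(f"best ckpt: +{gap_pct(adamw_best):.1f}% worse")  # -> +8.5% worse
print(f"SWA:       +{gap_pct(adamw_swa):.1f}% worse")   # -> +22.6% worse
```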
### Key Finding
Muon provides an 8.5% advantage over AdamW at 1B scale when comparing best-checkpoint BPB (25.89 vs 28.09). The gap widens further with SWA (+22.6%), suggesting Muon finds flatter, more SWA-compatible minima.
This is a significant finding for the optimizer community: Muon, originally designed for smaller models, scales effectively to 1B parameters and outperforms the established AdamW optimizer on Hebrew language modeling.
## Usage

⚠️ Custom Architecture: This model uses the same custom architecture as HebrewGPT-1B. See the primary model repo for the full model class definition.
```python
import torch
import sentencepiece as spm

# Use the same generate.py from HebrewGPT-1B
from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000, width=2048, depth=20,
    n_heads=16, head_dim=128, max_seq_len=2048, dropout=0.0,
)
model = HebrewGPT(config)

# Checkpoints may wrap the weights in a {"model": ...} dict
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
if isinstance(state_dict, dict) and "model" in state_dict:
    state_dict = state_dict["model"]
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "בראשית ברא אלוהים את"
input_ids = torch.tensor([sp.Encode(prompt)])
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```
## Limitations
- Same limitations as HebrewGPT-1B
- Lower quality than the primary Muon-trained model
- Provided for ablation/research purposes only
## Citation

```bibtex
@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
## Acknowledgments

- Loki – AI research assistant (Amazon Bedrock on OpenClaw)
- Andrej Karpathy – for the autoresearch framework
## Contact
- Author: Ronnen Slasky (ronnen@slasky.com)
- GitHub: fatherRonnen/AgenticResearcher