HebrewGPT-1B-AdamW 🇮🇱 (Ablation)

HebrewGPT-1B-AdamW is an ablation variant of HebrewGPT-1B trained with the AdamW optimizer instead of Muon. All other training conditions (architecture, data, hardware, and hyperparameters) are identical. This ablation shows that the Muon optimizer achieves an 8.5% lower validation BPB than AdamW at the 1B-parameter scale (25.89 vs 28.09 at the best checkpoint).

This model is provided for research comparison purposes. For the best-performing Hebrew language model, use HebrewGPT-1B.

Model Description

This model has the exact same architecture as HebrewGPT-1B:

| Parameter | Value |
|---|---|
| Parameters | 1.08B |
| Hidden size (WIDTH) | 2048 |
| Layers (DEPTH) | 20 |
| Attention heads | 16 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=5504) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 2,048 tokens |
| Weight tying | Yes |
| Precision | bfloat16 |
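As a sanity check, the 1.08B figure can be reproduced from the dimensions in the table above. This is a back-of-the-envelope sketch: it assumes no bias terms and ignores the RMSNorm gains, which are negligible at this scale.

```python
# Rough parameter count from the architecture table (assumes no biases;
# RMSNorm gains are negligible at this scale).
width, depth = 2048, 20
vocab, intermediate = 32000, 5504

embedding = vocab * width                 # tied with the output head, counted once
attention = 4 * width * width             # Q, K, V and output projections
swiglu = 3 * width * intermediate         # gate, up and down projections
per_layer = attention + swiglu

total = embedding + depth * per_layer
print(f"{total / 1e9:.2f}B parameters")   # ≈ 1.08B, matching the table
```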

Training Details

What's Different

  • Optimizer: AdamW (replacing Muon)
  • Everything else is identical to HebrewGPT-1B

Training

  • Optimizer: AdamW + Lookahead (k=5, α=0.6) + SWA + 4 cosine cycles
  • Dropout: 0.1
  • Data: 2.48B tokens from 12 Hebrew datasets (same as primary model)
  • Hardware: 8× NVIDIA H100 80GB GPUs
  • Training time: ~8 hours
  • Steps: 11,904
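The Lookahead wrapper used here (k=5, α=0.6) keeps a slow copy of the weights and pulls it toward the fast (inner-optimizer) weights every k steps. Below is a minimal pure-Python sketch of that update rule around plain SGD on a scalar toy objective; the actual run wraps AdamW over full weight tensors, and the quadratic objective here is illustrative, not from the training code.

```python
# Lookahead(k=5, alpha=0.6) sketch around a scalar SGD inner loop.
def lookahead_sgd(w0, grad_fn, lr=0.1, k=5, alpha=0.6, steps=50):
    slow = fast = w0
    for step in range(1, steps + 1):
        fast -= lr * grad_fn(fast)           # inner (fast) optimizer step
        if step % k == 0:                    # every k steps...
            slow += alpha * (fast - slow)    # ...pull slow weights toward fast
            fast = slow                      # and restart fast from slow
    return slow

# Toy objective: f(w) = (w - 3)^2, gradient 2(w - 3); minimum at w = 3.
w = lookahead_sgd(w0=0.0, grad_fn=lambda w: 2.0 * (w - 3.0))
print(round(w, 3))
```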

Evaluation Results

Comparison: Muon vs AdamW

| Metric | HebrewGPT-1B (Muon) | HebrewGPT-1B-AdamW (this) | Δ |
|---|---|---|---|
| Validation BPB (best ckpt) | 25.89 | 28.09 | +8.5% worse |
| Validation BPB (snapshot) | — | 31.29 | — |
| Validation BPB (SWA) | 25.89 | 31.73 | +22.6% worse |
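The Δ column follows directly from the BPB numbers: each gap is the relative difference against the Muon baseline. A small sketch of that arithmetic:

```python
# Relative BPB gap vs the Muon baseline (lower BPB is better).
muon_bpb, adamw_bpb = 25.89, 28.09
gap = (adamw_bpb - muon_bpb) / muon_bpb * 100
print(f"AdamW is {gap:.1f}% worse")        # → AdamW is 8.5% worse

swa_gap = (31.73 - 25.89) / 25.89 * 100
print(f"With SWA: {swa_gap:.1f}% worse")   # → With SWA: 22.6% worse
```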

Key Finding

Muon achieves an 8.5% lower best-checkpoint validation BPB than AdamW at 1B scale (25.89 vs 28.09). The gap widens to 22.6% under SWA, suggesting Muon finds flatter minima that average better across checkpoints.

This is a notable result for optimizer research: Muon, originally designed for smaller models, scales effectively to 1B parameters and outperforms the established AdamW optimizer on Hebrew language modeling.
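SWA itself is just a running average of weights sampled along the training trajectory. A minimal sketch of the averaging step, operating on plain floats for clarity (the real implementation averages full weight tensors per parameter):

```python
# Stochastic Weight Averaging: incremental mean over sampled checkpoints.
def swa_average(checkpoints):
    swa, n = 0.0, 0
    for w in checkpoints:
        swa = (swa * n + w) / (n + 1)   # incremental mean update
        n += 1
    return swa

result = swa_average([2.9, 3.1, 3.0, 3.2])
print(result)  # mean of the sampled weights
```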

Usage

โš ๏ธ Custom Architecture: This model uses the same custom architecture as HebrewGPT-1B. See the primary model repo for the full model class definition.

```python
import torch
import sentencepiece as spm

# Use the same generate.py from HebrewGPT-1B
from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000, width=2048, depth=20,
    n_heads=16, head_dim=128, max_seq_len=2048, dropout=0.0,
)
model = HebrewGPT(config)

# Checkpoints may store the weights under a "model" key
state_dict = torch.load("best.pt", map_location="cpu", weights_only=True)
if isinstance(state_dict, dict) and "model" in state_dict:
    state_dict = state_dict["model"]
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "בראשית ברא אלהים את"  # "In the beginning God created..."
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```

Limitations

  • Same limitations as HebrewGPT-1B
  • Lower quality than the primary Muon-trained model
  • Provided for ablation/research purposes only

Citation

@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}

Acknowledgments

  • Loki: AI research assistant (Amazon Bedrock on OpenClaw)
  • Andrej Karpathy: For the autoresearch framework
