# HebrewGPT-296M 🇮🇱
HebrewGPT-296M is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of HebrewGPT-1B. Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and stochastic weight averaging (SWA), it demonstrates strong Hebrew language understanding despite its compact size.
This model achieves 39.6% Top-1 and 68.4% Top-5 token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.
- Paper: Hebrew Language Model Research via Agentic AI
- GitHub: AgenticResearcher
- Larger model: HebrewGPT-1B (1.08B parameters)
## Model Description
| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding shared with output head) |
| Precision | bfloat16 |
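The RoPE variant listed above (interleaved pairing, θ=10000) can be sketched in a few lines of plain Python. This is an illustrative stand-alone version, not the code from `generate.py`:

```python
import math

def rope_interleaved(x, pos, theta=10000.0):
    """Apply rotary position embedding to one head vector `x` (even length d)
    at position `pos`, using the interleaved pairing
    (x[0], x[1]), (x[2], x[3]), ..."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        # Frequency for this pair: theta^(-i/d); earlier dims rotate fastest.
        freq = theta ** (-i / d)
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[i], x[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out
```

At position 0 every angle is zero, so the vector passes through unchanged; since each pair is rotated, not scaled, RoPE preserves vector norms at every position.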
## Architecture
Same design principles as HebrewGPT-1B but scaled down:
- SwiGLU MLP with hidden dim = 4096
- RoPE with interleaved pattern
- RMSNorm pre-norm architecture
- Weight tying between embedding and output head
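As a concrete illustration of two of the block components listed above, here is a minimal stdlib sketch of RMSNorm and a SwiGLU MLP. It uses toy list-based matrices purely for clarity and is not the model's actual tensor code:

```python
import math

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: scale by 1/RMS(x); no mean subtraction, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(v):
    """SiLU / swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU MLP: down( silu(x @ W_gate) * (x @ W_up) ).
    Weight matrices are given as lists of columns in this toy sketch."""
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(col, v)) for col in w]
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)
```

In the real model the hidden (`w_gate`/`w_up`) dimension is 4096 and the input/output dimension is the model width, 1536.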
## Training Details

### Optimizer

- Muon optimizer + Lookahead (k=5, α=0.6) + Stochastic Weight Averaging (SWA)
- Cosine annealing with warm restarts
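The Lookahead rule (k=5, α=0.6) is easy to demonstrate in isolation. The sketch below wraps plain SGD as a stand-in for Muon, purely to illustrate the fast/slow weight schedule; it is not the training code:

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.6, steps=50):
    """Lookahead wrapper around plain SGD (a stand-in for the inner optimizer).
    The fast weight takes k inner steps; then the slow weight moves a fraction
    alpha toward it, and the fast weight is reset to the slow weight."""
    slow = fast = w0
    for step in range(1, steps + 1):
        fast -= lr * grad_fn(fast)        # inner (fast) optimizer step
        if step % k == 0:                 # synchronize every k steps
            slow += alpha * (fast - slow)
            fast = slow
    return slow

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = lookahead_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The slow weights act as a built-in average that damps oscillations of the inner optimizer; SWA then averages checkpoints along the cosine-annealing trajectory on top of this.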
### Data
- ~1 billion tokens from Hebrew Wikipedia
### Hardware

- 4× NVIDIA A10G GPUs
- Training time: Several hours
## Evaluation Results

### Overall Metrics
| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |
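All of these metrics derive from per-token logits. A minimal stdlib sketch of the per-token computation (corpus perplexity is then exp of the mean NLL, and top-k accuracy is the mean of the hit flags):

```python
import math

def eval_step(logits, target, ks=(1, 5, 10)):
    """Per-token eval: negative log-likelihood (nats) plus top-k hit flags."""
    m = max(logits)                                  # for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    nll = -(logits[target] - m - math.log(total))    # -log softmax(target)
    ranked = sorted(range(len(logits)), key=lambda i: -logits[i])
    hits = {k: target in ranked[:k] for k in ks}
    return nll, hits
```

For example, per-token perplexity for one step is `math.exp(nll)`; averaging NLL over the validation set and exponentiating gives the 31.40 reported above.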
### Comparison Across Model Sizes
| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| HebrewGPT-296M (this) | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |
Note: The 296M model shows higher token accuracy on its evaluation set (Wikipedia-focused), while the 1B model was trained on more diverse data and has lower perplexity overall.
## Usage

⚠️ Custom Architecture: This model uses a custom architecture. See `generate.py` for the full model class definition.
### Quick Start
```python
import torch
import sentencepiece as spm
from generate import HebrewGPT, ModelConfig

# Model hyperparameters; these must match the table above.
config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)

# Load the SWA-averaged checkpoint.
model = HebrewGPT(config)
state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Tokenize a Hebrew prompt and generate a continuation.
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
prompt = "ירושלים היא עיר"  # "Jerusalem is a city"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```
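If you prefer to control decoding yourself rather than call `model.generate`, a temperature-sampling loop can be sketched independently of the model class. Here `next_logits` is a hypothetical stand-in for a forward pass returning next-token logits:

```python
import math
import random

def sample_loop(next_logits, prompt_ids, max_new_tokens=20,
                temperature=0.8, seed=0):
    """Minimal temperature-sampling loop; temperature=0 means greedy argmax.
    `next_logits(ids)` stands in for a model forward pass."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        if temperature == 0:
            nxt = max(range(len(logits)), key=logits.__getitem__)
        else:
            scaled = [z / temperature for z in logits]
            m = max(scaled)                          # numerical stability
            probs = [math.exp(z - m) for z in scaled]
            total = sum(probs)
            r, acc, nxt = rng.random() * total, 0.0, 0
            for i, p in enumerate(probs):            # inverse-CDF sampling
                acc += p
                if acc >= r:
                    nxt = i
                    break
        ids.append(nxt)
    return ids

# Toy "model" that always favors token 1, decoded greedily.
demo = sample_loop(lambda ids: [0.0, 5.0, 0.0], [2],
                   max_new_tokens=3, temperature=0)
```

Lower temperatures sharpen the distribution toward the argmax; temperature=0.8 matches the command-line example below.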
### Command Line
```bash
python generate.py \
  --model_path swa_best.pt \
  --prompt "ירושלים היא עיר" \
  --width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
  --max_tokens 100 --temperature 0.8
```
## Limitations
- Hebrew-only: Trained exclusively on Hebrew Wikipedia text
- Short context: Limited to 512 tokens (vs 2048 for the 1B model)
- Wikipedia-focused: Training data is primarily encyclopedic, so the model may struggle with conversational or legal text
- No instruction tuning: Base language model only
- Custom architecture: Requires the provided model class to load
- No safety filtering: May generate inappropriate or incorrect content
## Citation

```bibtex
@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
## Acknowledgments

- Loki: AI research assistant (Amazon Bedrock on OpenClaw)
- Andrej Karpathy: for the autoresearch framework
## Contact
- Author: Ronnen Slasky (ronnen@slasky.com)
- GitHub: fatherRonnen/AgenticResearcher