---
language:
- he
license: apache-2.0
tags:
- hebrew
- gpt
- causal-lm
- hebrew-nlp
- muon-optimizer
- sentencepiece
- rope
- swiglu
datasets:
- hebrew-wikipedia
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-296M
  results:
  - task:
      type: text-generation
      name: Language Modeling
    metrics:
    - name: Perplexity
      type: perplexity
      value: 31.40
    - name: Top-1 Accuracy
      type: accuracy
      value: 39.6
    - name: Top-5 Accuracy
      type: accuracy
      value: 68.4
---

# HebrewGPT-296M 🇮🇱

**HebrewGPT-296M** is a 296 million parameter autoregressive Hebrew language model, the smaller sibling of [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B). Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and SWA, it demonstrates strong Hebrew language understanding despite its compact size.

This model achieves **39.6% Top-1** and **68.4% Top-5** token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.

- 📄 **Paper**: [Hebrew Language Model Research via Agentic AI](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 📈 **Larger model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (1.08B parameters)

## Model Description

| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |
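
Interleaved RoPE, as listed in the table, rotates adjacent pairs of each head vector by position-dependent angles derived from θ=10000. The sketch below illustrates the idea in NumPy; it is an assumption about the layout for explanatory purposes, not the model's actual code (the exact convention lives in `generate.py`):

```python
import numpy as np

def rope_interleaved(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """Rotate adjacent pairs (x[0],x[1]), (x[2],x[3]), ... of one head vector
    by position-dependent angles -- the 'interleaved' RoPE layout."""
    d = x.shape[-1]
    # one frequency per pair: theta^(-2i/d) for i = 0 .. d/2 - 1
    freqs = theta ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

q = np.random.randn(128)           # one head vector (head_dim = 128)
q_rot = rope_interleaved(q, pos=7)
print(np.allclose(np.linalg.norm(q), np.linalg.norm(q_rot)))  # prints True
```

Because each pair undergoes a pure rotation, the vector norm is preserved; only the relative angle between query and key vectors at different positions changes.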

### Architecture

Same design principles as HebrewGPT-1B but scaled down:

- **SwiGLU MLP** with hidden dim = 4096
- **RoPE** with interleaved pattern
- **RMSNorm** pre-norm architecture
- **Weight tying** between embedding and output head
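
The SwiGLU MLP computes `down(SiLU(gate(x)) * up(x))`. Here is a minimal NumPy sketch at the card's dimensions (1536 → 4096 → 1536); the weight names and random initialization are illustrative assumptions, not the checkpoint's parameters:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU / swish: x * sigmoid(x)

rng = np.random.default_rng(0)
width, inter = 1536, 4096
W_gate = rng.standard_normal((width, inter)) * 0.02  # illustrative init
W_up   = rng.standard_normal((width, inter)) * 0.02
W_down = rng.standard_normal((inter, width)) * 0.02

def swiglu_mlp(x):
    # gated MLP: the elementwise SiLU gate modulates the "up" projection
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

h = rng.standard_normal((4, width))  # a batch of 4 hidden states
print(swiglu_mlp(h).shape)           # prints (4, 1536)
```

The gate is what distinguishes SwiGLU from a plain two-layer MLP: the network learns which channels of the up projection to pass through at each token.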

## Training Details

### Optimizer

- **Muon** optimizer + **Lookahead** (k=5, α=0.6) + **Stochastic Weight Averaging (SWA)**
- Cosine annealing with warm restarts
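
Lookahead with k=5, α=0.6 keeps a slow copy of the weights: every k inner-optimizer steps, the slow weights move a fraction α toward the fast weights, and the fast weights restart from the slow copy. The toy sketch below uses plain SGD as the inner optimizer in place of Muon (whose orthogonalized update would not fit in a few lines), so it shows only the Lookahead wrapper logic:

```python
def lookahead_sgd(grad_fn, w, lr=0.1, k=5, alpha=0.6, steps=50):
    """Minimize a 1-D objective with SGD wrapped in Lookahead (k, alpha)."""
    slow = w
    for step in range(1, steps + 1):
        w -= lr * grad_fn(w)            # fast (inner optimizer) update
        if step % k == 0:               # every k steps: synchronize
            slow += alpha * (w - slow)  # slow weights interpolate toward fast
            w = slow                    # fast weights restart from slow
    return slow

# toy objective f(w) = (w - 3)^2 with gradient 2(w - 3)
w_final = lookahead_sgd(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)  # converges near 3.0
```

The interpolation damps the oscillations of the fast optimizer, which is why Lookahead is often combined with aggressive inner optimizers.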

### Data

- ~1 billion tokens from **Hebrew Wikipedia**

### Hardware

- **GPUs**: 4× NVIDIA A10G
- **Training time**: several hours

## Evaluation Results

### Overall Metrics

| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |
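
For context, perplexity and cross-entropy are two views of the same loss: PPL is the exponential of the per-token cross-entropy in nats, so PPL 31.40 corresponds to about log2(31.40) ≈ 4.97 bits per token. BPB (4.42) normalizes the same loss per byte rather than per token, so the two differ by the tokenizer's average bytes per token. A quick check of the conversion:

```python
import math

ppl = 31.40
nats_per_token = math.log(ppl)   # per-token cross-entropy in nats
bits_per_token = math.log2(ppl)  # the same loss expressed in bits
print(round(nats_per_token, 2))  # prints 3.45
print(round(bits_per_token, 2))  # prints 4.97
```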

### Comparison Across Model Sizes

| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| **HebrewGPT-296M (this)** | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |

*Note: The 296M model shows higher token accuracy on its evaluation set (Wikipedia-focused), while the 1B model was trained on more diverse data and has lower perplexity overall.*

## Usage

> ⚠️ **Custom Architecture**: This model uses a custom architecture. See [`generate.py`](generate.py) for the full model class definition.

### Quick Start

```python
import torch
import sentencepiece as spm
from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)
model = HebrewGPT(config)

state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"  # "Jerusalem is a city"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```
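
`generate` samples tokens autoregressively, and the `--temperature` flag (0.8 in the command-line example) controls how sharply the next-token distribution is peaked. The sketch below shows the standard temperature-sampling step over toy logits; it is an illustration of the common technique, not the exact logic in `generate.py`:

```python
import math, random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample one token id from raw logits after temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]         # softmax(logits / T)
    r, acc = rng.random(), 0.0
    for token_id, p in enumerate(probs):      # inverse-CDF sampling
        acc += p
        if r < acc:
            return token_id
    return len(probs) - 1

# lower temperature concentrates probability mass on the top logit
random.seed(0)
print(sample_with_temperature([2.0, 1.0, 0.1], temperature=0.8))
```

Temperatures below 1.0 make output more deterministic; above 1.0, more diverse but less coherent.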

### Command Line

```bash
python generate.py \
    --model_path swa_best.pt \
    --prompt "ירושלים היא עיר" \
    --width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
    --max_tokens 100 --temperature 0.8
```

## Limitations

- **Hebrew-only**: Trained exclusively on Hebrew Wikipedia text
- **Short context**: Limited to 512 tokens (vs. 2048 for the 1B model)
- **Wikipedia-focused**: Training data is primarily encyclopedic; the model may struggle with conversational or legal text
- **No instruction tuning**: Base language model only
- **Custom architecture**: Requires the provided model class to load
- **No safety filtering**: May generate inappropriate or incorrect content

## Citation

```bibtex
@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```

## Acknowledgments

- **Loki**: AI research assistant (Amazon Bedrock on OpenClaw)
- **Andrej Karpathy**: for the autoresearch framework

## Contact

- **Author**: Ronnen Slasky (ronnen@slasky.com)
- **GitHub**: [fatherRonnen/AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)