---
language:
- he
license: apache-2.0
tags:
- hebrew
- gpt
- causal-lm
- hebrew-nlp
- muon-optimizer
- sentencepiece
- rope
- swiglu
datasets:
- hebrew-wikipedia
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-296M
  results:
  - task:
      type: text-generation
      name: Language Modeling
    metrics:
    - name: Perplexity
      type: perplexity
      value: 31.40
    - name: Top-1 Accuracy
      type: accuracy
      value: 39.6
    - name: Top-5 Accuracy
      type: accuracy
      value: 68.4
---

# HebrewGPT-296M 🇮🇱

**HebrewGPT-296M** is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B). Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and SWA, it demonstrates strong Hebrew language understanding despite its compact size.

This model achieves **39.6% Top-1** and **68.4% Top-5** next-token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.
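For reference, perplexity and top-k token accuracy can both be derived from a model's raw logits. The sketch below is illustrative only (the helper `eval_metrics` is not part of this repository's evaluation code); it assumes PyTorch and shows perplexity as the exponential of the mean cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def eval_metrics(logits: torch.Tensor, targets: torch.Tensor, k: int = 5):
    """Compute perplexity and top-k next-token accuracy from logits.

    logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)
    """
    # Perplexity = exp(mean token-level cross-entropy)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    perplexity = loss.exp().item()
    # A prediction counts as a top-k hit if the target token is
    # among the k highest-scoring logits at that position.
    topk = logits.topk(k, dim=-1).indices             # (batch, seq, k)
    hits = (topk == targets.unsqueeze(-1)).any(-1)    # (batch, seq)
    return perplexity, hits.float().mean().item()

# Toy check with random logits over a tiny vocabulary
logits = torch.randn(2, 8, 32)
targets = torch.randint(0, 32, (2, 8))
ppl, top5 = eval_metrics(logits, targets)
```

On random logits the numbers are meaningless; the point is only the relationship between the loss, the reported perplexity, and the top-k accuracy figures in the tables below.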
- 📄 **Paper**: [Hebrew Language Model Research via Agentic AI](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 🏆 **Larger model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (1.08B parameters)

## Model Description

| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |

### Architecture

Same design principles as HebrewGPT-1B, but scaled down:

- **SwiGLU MLP** with hidden dim = 4096
- **RoPE** with interleaved pattern
- **RMSNorm** pre-norm architecture
- **Weight tying** between embedding and output head

## Training Details

### Optimizer

- **Muon** optimizer + **Lookahead** (k=5, α=0.6) + **Stochastic Weight Averaging (SWA)**
- Cosine annealing with warm restarts

### Data

- ~1 billion tokens from **Hebrew Wikipedia**

### Hardware

- **Hardware**: 4× NVIDIA A10G GPUs
- **Training time**: Several hours

## Evaluation Results

### Overall Metrics

| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |

### Comparison Across Model Sizes

| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| **HebrewGPT-296M (this)** | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |

*Note: The 296M model shows higher token accuracy on its Wikipedia-focused evaluation set, while the 1B model was trained on more diverse data and has lower
perplexity overall.*

## Usage

> ⚠️ **Custom Architecture**: This model uses a custom architecture. See [`generate.py`](generate.py) for the full model class definition.

### Quick Start

```python
import torch
import sentencepiece as spm

from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)

model = HebrewGPT(config)
state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```

### Command Line

```bash
python generate.py \
    --model_path swa_best.pt \
    --prompt "ירושלים היא עיר" \
    --width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
    --max_tokens 100 --temperature 0.8
```

## Limitations

- **Hebrew-only**: Trained exclusively on Hebrew Wikipedia text
- **Short context**: Limited to 512 tokens (vs. 2048 for the 1B model)
- **Wikipedia-focused**: Training data is primarily encyclopedic, so the model may struggle with conversational or legal text
- **No instruction tuning**: Base language model only
- **Custom architecture**: Requires the provided model class to load
- **No safety filtering**: May generate inappropriate or incorrect content

## Citation

```bibtex
@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```

## Acknowledgments

- **Loki**: AI research assistant (Amazon Bedrock on OpenClaw)
- **Andrej Karpathy**: For the autoresearch framework

## Contact

- **Author**: Ronnen Slasky (ronnen@slasky.com)
- **GitHub**: [fatherRonnen/AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)