---
language:
- he
license: apache-2.0
tags:
- hebrew
- gpt
- causal-lm
- hebrew-nlp
- muon-optimizer
- sentencepiece
- rope
- swiglu
datasets:
- hebrew-wikipedia
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-296M
results:
- task:
type: text-generation
name: Language Modeling
metrics:
- name: Perplexity
type: perplexity
value: 31.40
- name: Top-1 Accuracy
type: accuracy
value: 39.6
- name: Top-5 Accuracy
type: accuracy
value: 68.4
---
# HebrewGPT-296M 🇮🇱
**HebrewGPT-296M** is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B). Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and Stochastic Weight Averaging (SWA), it demonstrates strong Hebrew language understanding despite its compact size.
This model achieves **39.6% Top-1** and **68.4% Top-5** token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.
- 📄 **Paper**: [Hebrew Language Model Research via Agentic AI](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 🏆 **Larger model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (1.08B parameters)
## Model Description
| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |
### Architecture
Same design principles as HebrewGPT-1B but scaled down:
- **SwiGLU MLP** with hidden dim = 4096
- **RoPE** with interleaved pattern
- **RMSNorm** pre-norm architecture
- **Weight tying** between embedding and output head
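For reference, these components can be sketched in PyTorch as follows. This is an illustrative sketch under the table's hyperparameters, not the actual `generate.py` implementation; class and parameter names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root-mean-square of the features (no mean centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, width: int = 1536, intermediate: int = 4096):
        super().__init__()
        self.gate = nn.Linear(width, intermediate, bias=False)
        self.up = nn.Linear(width, intermediate, bias=False)
        self.down = nn.Linear(intermediate, width, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def rope_interleaved(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Apply interleaved RoPE to x of shape (..., seq_len, head_dim): dimensions
    (2i, 2i+1) form a pair rotated by angle position * theta**(-2i/head_dim)."""
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = theta ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]   # interleaved even/odd pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Note that interleaved RoPE is a pure rotation of dimension pairs: position 0 is left unchanged and per-position vector norms are preserved.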
## Training Details
### Optimizer
- **Muon** optimizer + **Lookahead** (k=5, α=0.6) + **Stochastic Weight Averaging (SWA)**
- Cosine annealing with warm restarts
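Muon itself (orthogonalized momentum updates) is too involved for a short snippet, but the Lookahead wrapper is simple: after every k inner-optimizer steps, the slow weights move toward the fast weights by a factor α. A minimal sketch around an arbitrary inner optimizer (illustrative, not the actual training code):

```python
import torch

class Lookahead:
    """Lookahead wrapper: slow += alpha * (fast - slow) every k inner steps."""
    def __init__(self, inner: torch.optim.Optimizer, k: int = 5, alpha: float = 0.6):
        self.inner, self.k, self.alpha = inner, k, alpha
        self.step_count = 0
        # Snapshot of the slow weights, one tensor per parameter.
        self.slow = [
            [p.detach().clone() for p in group["params"]]
            for group in inner.param_groups
        ]

    def step(self):
        self.inner.step()  # fast weights take one inner-optimizer step
        self.step_count += 1
        if self.step_count % self.k == 0:
            # Pull slow weights toward fast weights, then sync fast to slow.
            for group, slow_group in zip(self.inner.param_groups, self.slow):
                for p, slow in zip(group["params"], slow_group):
                    slow += self.alpha * (p.detach() - slow)
                    p.data.copy_(slow)

    def zero_grad(self):
        self.inner.zero_grad()
```

Used like any optimizer: `opt = Lookahead(torch.optim.SGD(model.parameters(), lr=0.1))`, then the usual `zero_grad()` / `backward()` / `step()` loop.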
### Data
- ~1 billion tokens from **Hebrew Wikipedia**
### Hardware
- **GPUs**: 4× NVIDIA A10G
- **Training time**: several hours
## Evaluation Results
### Overall Metrics
| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |
### Comparison Across Model Sizes
| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| **HebrewGPT-296M (this)** | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |
*Note: The 296M model shows higher token accuracy on its evaluation set (Wikipedia-focused), while the 1B model was trained on more diverse data and has lower perplexity overall.*
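Top-k token accuracy as reported above counts how often the reference next token appears among the model's k highest-scoring predictions. A generic sketch of the metric (illustrative; `logits` and `targets` would come from any held-out batch):

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of positions where the target token is among the top-k logits.

    logits: (num_positions, vocab_size), targets: (num_positions,)
    """
    topk = logits.topk(k, dim=-1).indices           # (num_positions, k)
    hits = (topk == targets.unsqueeze(-1)).any(-1)  # (num_positions,)
    return hits.float().mean().item()
```

With k=1 this reduces to plain argmax accuracy; k=5 and k=10 give the other two rows of the table.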
## Usage
> โš ๏ธ **Custom Architecture**: This model uses a custom architecture. See [`generate.py`](generate.py) for the full model class definition.
### Quick Start
```python
import torch
import sentencepiece as spm
from generate import HebrewGPT, ModelConfig
config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)
model = HebrewGPT(config)

# Load the SWA-averaged checkpoint
state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Hebrew-native SentencePiece tokenizer (32k vocab)
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"  # "Jerusalem is a city"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```
### Command Line
```bash
python generate.py \
--model_path swa_best.pt \
  --prompt "ירושלים היא עיר" \
--width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
--max_tokens 100 --temperature 0.8
```
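The `--temperature` flag rescales the logits before sampling; values below 1.0 sharpen the next-token distribution. A minimal sketch of one sampling step (illustrative, not the `generate.py` code):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    """Sample one token id from temperature-scaled logits of shape (vocab_size,)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```

Lower temperatures make the highest-probability tokens dominate; temperature 1.0 samples from the raw model distribution.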
## Limitations
- **Hebrew-only**: Trained exclusively on Hebrew Wikipedia text
- **Short context**: Limited to 512 tokens (vs 2048 for the 1B model)
- **Wikipedia-focused**: Training data is primarily encyclopedic; the model may struggle with conversational or legal text
- **No instruction tuning**: Base language model only
- **Custom architecture**: Requires the provided model class to load
- **No safety filtering**: May generate inappropriate or incorrect content
## Citation
```bibtex
@article{slasky2025hebrewgpt,
title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
author={Slasky, Ronnen},
year={2025},
url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
## Acknowledgments
- **Loki**: AI research assistant (Amazon Bedrock on OpenClaw)
- **Andrej Karpathy**: For the autoresearch framework
## Contact
- **Author**: Ronnen Slasky (ronnen@slasky.com)
- **GitHub**: [fatherRonnen/AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)