---
language:
- he
license: apache-2.0
tags:
- hebrew
- gpt
- causal-lm
- hebrew-nlp
- muon-optimizer
- sentencepiece
- rope
- swiglu
datasets:
- hebrew-wikipedia
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-296M
  results:
  - task:
      type: text-generation
      name: Language Modeling
    metrics:
    - name: Perplexity
      type: perplexity
      value: 31.40
    - name: Top-1 Accuracy
      type: accuracy
      value: 39.6
    - name: Top-5 Accuracy
      type: accuracy
      value: 68.4
---

# HebrewGPT-296M 🇮🇱

**HebrewGPT-296M** is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B). Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and SWA, it demonstrates strong Hebrew language understanding despite its compact size.

This model achieves **39.6% Top-1** and **68.4% Top-5** next-token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.
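For reference, perplexity and top-k token accuracy can both be derived from a model's raw logits. The sketch below is illustrative only (the helper `eval_metrics` is not part of this repository's evaluation code); it assumes PyTorch and shows perplexity as the exponential of the mean cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def eval_metrics(logits: torch.Tensor, targets: torch.Tensor, k: int = 5):
    """Compute perplexity and top-k next-token accuracy from logits.

    logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)
    """
    # Perplexity = exp(mean token-level cross-entropy)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    perplexity = loss.exp().item()
    # A prediction counts as a top-k hit if the target token is
    # among the k highest-scoring logits at that position.
    topk = logits.topk(k, dim=-1).indices             # (batch, seq, k)
    hits = (topk == targets.unsqueeze(-1)).any(-1)    # (batch, seq)
    return perplexity, hits.float().mean().item()

# Toy check with random logits over a tiny vocabulary
logits = torch.randn(2, 8, 32)
targets = torch.randint(0, 32, (2, 8))
ppl, top5 = eval_metrics(logits, targets)
```

On random logits the numbers are meaningless; the point is only the relationship between the loss, the reported perplexity, and the top-k accuracy figures in the tables below.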
- 📄 **Paper**: [Hebrew Language Model Research via Agentic AI](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 🏆 **Larger model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (1.08B parameters)

## Model Description

| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |

### Architecture

Same design principles as HebrewGPT-1B, but scaled down:

- **SwiGLU MLP** with hidden dim = 4096
- **RoPE** with interleaved pattern
- **RMSNorm** pre-norm architecture
- **Weight tying** between embedding and output head

## Training Details

### Optimizer

- **Muon** optimizer + **Lookahead** (k=5, α=0.6) + **Stochastic Weight Averaging (SWA)**
- Cosine annealing with warm restarts

### Data

- ~1 billion tokens from **Hebrew Wikipedia**

### Hardware

- **Hardware**: 4× NVIDIA A10G GPUs
- **Training time**: Several hours

## Evaluation Results

### Overall Metrics

| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |

### Comparison Across Model Sizes

| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| **HebrewGPT-296M (this)** | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |

*Note: The 296M model shows higher token accuracy on its Wikipedia-focused evaluation set, while the 1B model was trained on more diverse data and has lower
perplexity overall.*

## Usage

> ⚠️ **Custom Architecture**: This model uses a custom architecture. See [`generate.py`](generate.py) for the full model class definition.

### Quick Start

```python
import torch
import sentencepiece as spm

from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)

model = HebrewGPT(config)
state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```

### Command Line

```bash
python generate.py \
    --model_path swa_best.pt \
    --prompt "ירושלים היא עיר" \
    --width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
    --max_tokens 100 --temperature 0.8
```

## Limitations

- **Hebrew-only**: Trained exclusively on Hebrew Wikipedia text
- **Short context**: Limited to 512 tokens (vs. 2048 for the 1B model)
- **Wikipedia-focused**: Training data is primarily encyclopedic, so the model may struggle with conversational or legal text
- **No instruction tuning**: Base language model only
- **Custom architecture**: Requires the provided model class to load
- **No safety filtering**: May generate inappropriate or incorrect content

## Citation

```bibtex
@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```

## Acknowledgments

- **Loki**: AI research assistant (Amazon Bedrock on OpenClaw)
- **Andrej Karpathy**: For the autoresearch framework

## Contact

- **Author**: Ronnen Slasky (ronnen@slasky.com)
- **GitHub**: [fatherRonnen/AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)