HebrewGPT-296M 🇮🇱

HebrewGPT-296M is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of HebrewGPT-1B. Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and SWA, it demonstrates strong Hebrew language understanding despite its compact size.

This model achieves 39.6% Top-1 and 68.4% Top-5 token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.

Model Description

| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ = 10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |

Architecture

Same design principles as HebrewGPT-1B but scaled down:

  • SwiGLU MLP with hidden dim = 4096
  • RoPE with interleaved pattern
  • RMSNorm pre-norm architecture
  • Weight tying between embedding and output head
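
The SwiGLU MLP named above can be sketched in a few lines. This is a minimal numpy illustration of the standard SwiGLU formulation (gate and up projections, SiLU gating, down projection), not the repository's actual implementation; the weight names are hypothetical:

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # x: (seq, width); w_gate/w_up: (width, intermediate); w_down: (intermediate, width)
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Shapes matching this model: width=1536, intermediate_size=4096
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1536))
w_gate = rng.standard_normal((1536, 4096)) * 0.02
w_up = rng.standard_normal((1536, 4096)) * 0.02
w_down = rng.standard_normal((4096, 1536)) * 0.02
y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 1536)
```

Note that SwiGLU uses three weight matrices instead of the two in a classic GELU MLP, which is why the intermediate size (4096) is smaller than the usual 4× width.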

Training Details

Optimizer

  • Muon optimizer + Lookahead (k=5, α=0.6) + Stochastic Weight Averaging (SWA)
  • Cosine annealing with warm restarts
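
The Lookahead wrapper's update rule is simple to state: after every k inner-optimizer steps, the slow weights move a fraction α toward the fast weights, and the fast weights restart from the slow ones. A pure-Python sketch with the k=5, α=0.6 settings above (the real training loop wraps Muon; the fixed-size inner step here is a stand-in):

```python
def lookahead_sync(slow, fast, alpha=0.6):
    # slow <- slow + alpha * (fast - slow); fast restarts from slow
    slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    fast = list(slow)
    return slow, fast

slow, fast = [0.0], [0.0]
for step in range(1, 11):
    fast = [f - 0.1 for f in fast]   # stand-in for one inner (Muon) step
    if step % 5 == 0:                # sync every k = 5 steps
        slow, fast = lookahead_sync(slow, fast, alpha=0.6)
print(slow, fast)                    # both end near -0.6
```

The slow weights trail the fast trajectory, which damps oscillations from the inner optimizer.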

Data

  • ~1 billion tokens from Hebrew Wikipedia

Hardware

  • 4× NVIDIA A10G GPUs
  • Training time: several hours

Evaluation Results

Overall Metrics

| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |
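
Both perplexity and Top-k accuracy are derived from the model's next-token distribution: perplexity is the exponential of the mean negative log-likelihood, and Top-k accuracy asks whether the target token is among the k highest-scoring logits. A minimal numpy sketch of these computations (illustrative only, not the project's evaluation script):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_accuracy(logits, targets, k):
    # logits: (n, vocab); targets: (n,)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k largest logits
    return float(np.mean([t in row for t, row in zip(targets, topk)]))

def perplexity(logits, targets):
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(np.exp(nll.mean()))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.2, 3.0, 0.5]])
targets = np.array([0, 2])
print(top_k_accuracy(logits, targets, k=1))  # 0.5: only the first target is the argmax
print(top_k_accuracy(logits, targets, k=2))  # 1.0: both targets are in the top 2
```
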

Comparison Across Model Sizes

| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| HebrewGPT-296M (this model) | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |

Note: The 296M model shows higher token accuracy on its evaluation set (Wikipedia-focused), while the 1B model was trained on more diverse data and has lower perplexity overall.

Usage

โš ๏ธ Custom Architecture: This model uses a custom architecture. See generate.py for the full model class definition.

Quick Start

import torch
import sentencepiece as spm
from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)
model = HebrewGPT(config)

state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))

Command Line

python generate.py \
    --model_path swa_best.pt \
    --prompt "ירושלים היא עיר" \
    --width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
    --max_tokens 100 --temperature 0.8
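
The --temperature flag controls how sharply the sampler concentrates on high-probability tokens: logits are divided by the temperature before the softmax, so values below 1.0 (like the 0.8 above) make generation more focused, while values above 1.0 flatten the distribution. A minimal numpy sketch of temperature-scaled sampling (illustrative; generate.py may implement sampling differently):

```python
import numpy as np

def sample_token(logits, temperature=0.8, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature          # T < 1 sharpens, T > 1 flattens
    scaled = scaled - scaled.max()         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([4.0, 2.0, 0.5])
token = sample_token(logits, temperature=0.8, rng=np.random.default_rng(0))
print(token)
```

As the temperature approaches zero, sampling converges to greedy decoding (always taking the argmax).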

Limitations

  • Hebrew-only: Trained exclusively on Hebrew Wikipedia text
  • Short context: Limited to 512 tokens (vs 2048 for the 1B model)
  • Wikipedia-focused: Training data is primarily encyclopedic; the model may struggle with conversational or legal text
  • No instruction tuning: Base language model only
  • Custom architecture: Requires the provided model class to load
  • No safety filtering: May generate inappropriate or incorrect content

Citation

@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}

Acknowledgments

  • Loki: AI research assistant (Amazon Bedrock on OpenClaw)
  • Andrej Karpathy: for the autoresearch framework
