HebrewGPT-296M 🇮🇱

HebrewGPT-296M is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of HebrewGPT-1B. Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and SWA, it demonstrates strong Hebrew language understanding despite its compact size.

This model achieves 39.6% Top-1 and 68.4% Top-5 token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.

Model Description

| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ = 10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |

Architecture

Same design principles as HebrewGPT-1B but scaled down:

  • SwiGLU MLP with hidden dim = 4096
  • RoPE with interleaved pattern
  • RMSNorm pre-norm architecture
  • Weight tying between embedding and output head
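
The SwiGLU MLP named above can be sketched in a few lines. This is a minimal numpy illustration of the standard SwiGLU formulation (gate and up projections, SiLU gating, down projection), not the repository's actual implementation; the weight names are hypothetical:

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # x: (seq, width); w_gate/w_up: (width, intermediate); w_down: (intermediate, width)
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Shapes matching this model: width=1536, intermediate_size=4096
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1536))
w_gate = rng.standard_normal((1536, 4096)) * 0.02
w_up = rng.standard_normal((1536, 4096)) * 0.02
w_down = rng.standard_normal((4096, 1536)) * 0.02
y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (4, 1536)
```

Note that SwiGLU uses three weight matrices instead of the two in a classic GELU MLP, which is why the intermediate size (4096) is smaller than the usual 4× width.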

Training Details

Optimizer

  • Muon optimizer + Lookahead (k=5, α=0.6) + Stochastic Weight Averaging (SWA)
  • Cosine annealing with warm restarts
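
The Lookahead wrapper's update rule is simple to state: after every k inner-optimizer steps, the slow weights move a fraction α toward the fast weights, and the fast weights restart from the slow ones. A pure-Python sketch with the k=5, α=0.6 settings above (the real training loop wraps Muon; the fixed-size inner step here is a stand-in):

```python
def lookahead_sync(slow, fast, alpha=0.6):
    # slow <- slow + alpha * (fast - slow); fast restarts from slow
    slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    fast = list(slow)
    return slow, fast

slow, fast = [0.0], [0.0]
for step in range(1, 11):
    fast = [f - 0.1 for f in fast]   # stand-in for one inner (Muon) step
    if step % 5 == 0:                # sync every k = 5 steps
        slow, fast = lookahead_sync(slow, fast, alpha=0.6)
print(slow, fast)                    # both end near -0.6
```

The slow weights trail the fast trajectory, which damps oscillations from the inner optimizer.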

Data

  • ~1 billion tokens from Hebrew Wikipedia

Hardware

  • 4× NVIDIA A10G GPUs
  • Training time: several hours

Evaluation Results

Overall Metrics

| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |
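
Both perplexity and Top-k accuracy are derived from the model's next-token distribution: perplexity is the exponential of the mean negative log-likelihood, and Top-k accuracy asks whether the target token is among the k highest-scoring logits. A minimal numpy sketch of these computations (illustrative only, not the project's evaluation script):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_accuracy(logits, targets, k):
    # logits: (n, vocab); targets: (n,)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k largest logits
    return float(np.mean([t in row for t, row in zip(targets, topk)]))

def perplexity(logits, targets):
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(np.exp(nll.mean()))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.2, 3.0, 0.5]])
targets = np.array([0, 2])
print(top_k_accuracy(logits, targets, k=1))  # 0.5: only the first target is the argmax
print(top_k_accuracy(logits, targets, k=2))  # 1.0: both targets are in the top 2
```
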

Comparison Across Model Sizes

| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| HebrewGPT-296M (this model) | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |

Note: The 296M model shows higher token accuracy on its evaluation set (Wikipedia-focused), while the 1B model was trained on more diverse data and has lower perplexity overall.

Usage

โš ๏ธ Custom Architecture: This model uses a custom architecture. See generate.py for the full model class definition.

Quick Start

import torch
import sentencepiece as spm
from generate import HebrewGPT, ModelConfig

config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)
model = HebrewGPT(config)

state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))

Command Line

python generate.py \
    --model_path swa_best.pt \
    --prompt "ירושלים היא עיר" \
    --width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
    --max_tokens 100 --temperature 0.8
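
The --temperature flag controls how sharply the sampler concentrates on high-probability tokens: logits are divided by the temperature before the softmax, so values below 1.0 (like the 0.8 above) make generation more focused, while values above 1.0 flatten the distribution. A minimal numpy sketch of temperature-scaled sampling (illustrative; generate.py may implement sampling differently):

```python
import numpy as np

def sample_token(logits, temperature=0.8, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature          # T < 1 sharpens, T > 1 flattens
    scaled = scaled - scaled.max()         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([4.0, 2.0, 0.5])
token = sample_token(logits, temperature=0.8, rng=np.random.default_rng(0))
print(token)
```

As the temperature approaches zero, sampling converges to greedy decoding (always taking the argmax).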

Limitations

  • Hebrew-only: Trained exclusively on Hebrew Wikipedia text
  • Short context: Limited to 512 tokens (vs 2048 for the 1B model)
  • Wikipedia-focused: Training data is primarily encyclopedic; the model may struggle with conversational or legal text
  • No instruction tuning: Base language model only
  • Custom architecture: Requires the provided model class to load
  • No safety filtering: May generate inappropriate or incorrect content

Citation

@article{slasky2025hebrewgpt,
  title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
  author={Slasky, Ronnen},
  year={2025},
  url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}

Acknowledgments

  • Loki: AI research assistant (Amazon Bedrock on OpenClaw)
  • Andrej Karpathy: for the autoresearch framework
