---
language:
- he
license: apache-2.0
tags:
- hebrew
- gpt
- causal-lm
- hebrew-nlp
- muon-optimizer
- sentencepiece
- rope
- swiglu
datasets:
- hebrew-wikipedia
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-296M
results:
- task:
type: text-generation
name: Language Modeling
metrics:
- name: Perplexity
type: perplexity
value: 31.40
- name: Top-1 Accuracy
type: accuracy
value: 39.6
- name: Top-5 Accuracy
type: accuracy
value: 68.4
---
# HebrewGPT-296M 🇮🇱
**HebrewGPT-296M** is a 296-million-parameter autoregressive Hebrew language model, the smaller sibling of [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B). Trained on 1 billion tokens of Hebrew Wikipedia using the Muon optimizer with Lookahead and Stochastic Weight Averaging (SWA), it demonstrates strong Hebrew language understanding despite its compact size.
This model achieves **39.6% Top-1** and **68.4% Top-5** token prediction accuracy, making it suitable for research, prototyping, and resource-constrained Hebrew NLP applications.
- 📄 **Paper**: [Hebrew Language Model Research via Agentic AI](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 🏆 **Larger model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (1.08B parameters)
## Model Description
| Parameter | Value |
|---|---|
| Parameters | 296M |
| Hidden size (WIDTH) | 1536 |
| Layers (DEPTH) | 10 |
| Attention heads | 12 |
| Head dimension | 128 |
| MLP type | SwiGLU (intermediate_size=4096) |
| Positional encoding | RoPE (interleaved, θ=10000) |
| Normalization | RMSNorm |
| Vocabulary | 32,000 (Hebrew-native SentencePiece BPE) |
| Context length | 512 tokens |
| Weight tying | Yes (embedding ↔ output head) |
| Precision | bfloat16 |
### Architecture
Same design principles as HebrewGPT-1B but scaled down:
- **SwiGLU MLP** with hidden dim = 4096
- **RoPE** with interleaved pattern
- **RMSNorm** pre-norm architecture
- **Weight tying** between embedding and output head
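For reference, these components can be sketched in PyTorch as follows. This is an illustrative sketch under the table's hyperparameters, not the actual `generate.py` implementation; class and parameter names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root-mean-square of the features (no mean centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, width: int = 1536, intermediate: int = 4096):
        super().__init__()
        self.gate = nn.Linear(width, intermediate, bias=False)
        self.up = nn.Linear(width, intermediate, bias=False)
        self.down = nn.Linear(intermediate, width, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def rope_interleaved(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Apply interleaved RoPE to x of shape (..., seq_len, head_dim): dimensions
    (2i, 2i+1) form a pair rotated by angle position * theta**(-2i/head_dim)."""
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = theta ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]   # interleaved even/odd pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Note that interleaved RoPE is a pure rotation of dimension pairs: position 0 is left unchanged and per-position vector norms are preserved.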
## Training Details
### Optimizer
- **Muon** optimizer + **Lookahead** (k=5, α=0.6) + **Stochastic Weight Averaging (SWA)**
- Cosine annealing with warm restarts
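Muon itself (orthogonalized momentum updates) is too involved for a short snippet, but the Lookahead wrapper is simple: after every k inner-optimizer steps, the slow weights move toward the fast weights by a factor α. A minimal sketch around an arbitrary inner optimizer (illustrative, not the actual training code):

```python
import torch

class Lookahead:
    """Lookahead wrapper: slow += alpha * (fast - slow) every k inner steps."""
    def __init__(self, inner: torch.optim.Optimizer, k: int = 5, alpha: float = 0.6):
        self.inner, self.k, self.alpha = inner, k, alpha
        self.step_count = 0
        # Snapshot of the slow weights, one tensor per parameter.
        self.slow = [
            [p.detach().clone() for p in group["params"]]
            for group in inner.param_groups
        ]

    def step(self):
        self.inner.step()  # fast weights take one inner-optimizer step
        self.step_count += 1
        if self.step_count % self.k == 0:
            # Pull slow weights toward fast weights, then sync fast to slow.
            for group, slow_group in zip(self.inner.param_groups, self.slow):
                for p, slow in zip(group["params"], slow_group):
                    slow += self.alpha * (p.detach() - slow)
                    p.data.copy_(slow)

    def zero_grad(self):
        self.inner.zero_grad()
```

Used like any optimizer: `opt = Lookahead(torch.optim.SGD(model.parameters(), lr=0.1))`, then the usual `zero_grad()` / `backward()` / `step()` loop.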
### Data
- ~1 billion tokens from **Hebrew Wikipedia**
### Hardware
- **GPUs**: 4× NVIDIA A10G
- **Training time**: several hours
## Evaluation Results
### Overall Metrics
| Metric | Value |
|---|---|
| Validation BPB (SWA) | 4.42 |
| Perplexity | 31.40 |
| Top-1 Token Accuracy | 39.6% |
| Top-5 Token Accuracy | 68.4% |
| Top-10 Token Accuracy | 78.9% |
### Comparison Across Model Sizes
| Model | Params | Data | Top-1 | Top-5 | Top-10 | PPL |
|---|---|---|---|---|---|---|
| **HebrewGPT-296M (this)** | 296M | 1B tokens | 39.6% | 68.4% | 78.9% | 31.40 |
| HebrewGPT-1B | 1.08B | 2.48B tokens | 38.4% | 56.1% | 63.6% | 29.75 |
*Note: The 296M model shows higher token accuracy on its evaluation set (Wikipedia-focused), while the 1B model was trained on more diverse data and has lower perplexity overall.*
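Top-k token accuracy as reported above counts how often the reference next token appears among the model's k highest-scoring predictions. A generic sketch of the metric (illustrative; `logits` and `targets` would come from any held-out batch):

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """Fraction of positions where the target token is among the top-k logits.

    logits: (num_positions, vocab_size), targets: (num_positions,)
    """
    topk = logits.topk(k, dim=-1).indices           # (num_positions, k)
    hits = (topk == targets.unsqueeze(-1)).any(-1)  # (num_positions,)
    return hits.float().mean().item()
```

With k=1 this reduces to plain argmax accuracy; k=5 and k=10 give the other two rows of the table.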
## Usage
> โš ๏ธ **Custom Architecture**: This model uses a custom architecture. See [`generate.py`](generate.py) for the full model class definition.
### Quick Start
```python
import torch
import sentencepiece as spm
from generate import HebrewGPT, ModelConfig
config = ModelConfig(
    vocab_size=32000,
    width=1536,
    depth=10,
    n_heads=12,
    head_dim=128,
    max_seq_len=512,
    dropout=0.0,
)
model = HebrewGPT(config)

# Load the SWA-averaged checkpoint
state_dict = torch.load("swa_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Hebrew-native SentencePiece tokenizer (32k vocab)
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

prompt = "ירושלים היא עיר"  # "Jerusalem is a city"
input_ids = torch.tensor([sp.Encode(prompt)])
output = model.generate(input_ids, max_new_tokens=100)
print(sp.Decode(output[0].tolist()))
```
### Command Line
```bash
python generate.py \
--model_path swa_best.pt \
  --prompt "ירושלים היא עיר" \
--width 1536 --depth 10 --n_heads 12 --max_seq_len 512 \
--max_tokens 100 --temperature 0.8
```
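The `--temperature` flag rescales the logits before sampling; values below 1.0 sharpen the next-token distribution. A minimal sketch of one sampling step (illustrative, not the `generate.py` code):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    """Sample one token id from temperature-scaled logits of shape (vocab_size,)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```

Lower temperatures make the highest-probability tokens dominate; temperature 1.0 samples from the raw model distribution.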
## Limitations
- **Hebrew-only**: Trained exclusively on Hebrew Wikipedia text
- **Short context**: Limited to 512 tokens (vs 2048 for the 1B model)
- **Wikipedia-focused**: Training data is primarily encyclopedic; the model may struggle with conversational or legal text
- **No instruction tuning**: Base language model only
- **Custom architecture**: Requires the provided model class to load
- **No safety filtering**: May generate inappropriate or incorrect content
## Citation
```bibtex
@article{slasky2025hebrewgpt,
title={Hebrew Language Model Research via Agentic AI: Training HebrewGPT from Scratch},
author={Slasky, Ronnen},
year={2025},
url={https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
## Acknowledgments
- **Loki**: AI research assistant (Amazon Bedrock on OpenClaw)
- **Andrej Karpathy**: For the autoresearch framework
## Contact
- **Author**: Ronnen Slasky (ronnen@slasky.com)
- **GitHub**: [fatherRonnen/AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)