# HebrewGPT-1B-v2
A 1.08-billion-parameter Hebrew language model. Version 2 adds continued pre-training on 7.8B additional Hebrew tokens from HeDC4 (Hebrew Digital Corpus).
## Model Details
| Property | Value |
|---|---|
| Parameters | 1.08B |
| Base Model | HebrewGPT-1B |
| Architecture | Custom Mamba-Transformer hybrid (interleaved RoPE + Mamba SSM, SwiGLU MLP) |
| Total Training Tokens | 17.6B (9.8B original + 7.8B HeDC4) |
| Best Val Loss | 3.3937 (BPB 4.90) |
| Training | 15,000 steps on a p4d.24xlarge (8×A100 40GB), 14.3 hours |
| Context Length | 2,048 tokens |
| Tokenizer | SentencePiece BPE, 32K vocab |
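The table reports both a validation loss and a bits-per-byte (BPB) figure. The card does not state its exact conversion, but assuming BPB is the per-token cross-entropy converted from nats to bits with a bytes-per-token ratio near 1 (an assumption, not stated above), the two numbers are consistent:

```python
import math

# Hypothetical helper: convert per-token cross-entropy (nats) to bits-per-byte.
# The bytes_per_token ratio is an assumption; ~1.0 reproduces the reported 4.90.
def nats_to_bpb(loss_nats: float, bytes_per_token: float = 1.0) -> float:
    return loss_nats / (math.log(2) * bytes_per_token)

bpb = nats_to_bpb(3.3937)
print(round(bpb, 2))  # 4.9
```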
## What's New in v2
HebrewGPT-1B-v2 extends the original model with continued pre-training on HeDC4, a diverse modern Hebrew web corpus. This nearly doubles the total training data from 9.8B to 17.6B tokens, adding:
- Modern Hebrew news and web content
- Broader vocabulary coverage
- More diverse writing styles and topics
## Evaluation (Base Model, No SFT)
| Task | v1 (9.8B tokens) | v2 (17.6B tokens) |
|---|---|---|
| SNLI | 50.0% | 35.0% |
| QA | 20.0% | 60.0% |
| Sentiment | 33.3% | 33.3% |
| Trivia | 13.3% | 13.3% |
| Average | 29.2% | 35.4% |
Note: these are base-model results without instruction tuning. QA improved substantially (+40 pp), suggesting stronger reading comprehension from the additional pre-training data, while SNLI regressed (−15 pp) and Sentiment and Trivia were unchanged. An instruction-tuned version (SFT on v2) is in progress.
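The reported v2 average can be reproduced directly from the per-task scores in the table above:

```python
# Per-task v2 scores (%), taken from the evaluation table above.
scores_v2 = {"SNLI": 35.0, "QA": 60.0, "Sentiment": 33.3, "Trivia": 13.3}

# Unweighted mean over the four tasks, as reported in the "Average" row.
avg = sum(scores_v2.values()) / len(scores_v2)
print(round(avg, 1))  # 35.4
```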
## Pre-Training Data
### Phase 1: Original (9.8B tokens)
Hebrew Wikipedia (12%), Supreme Court (22%), Ben Yehuda (23%), C4 Hebrew (20%), CC100 (19%), Task data (4%)
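Under the mixture percentages listed above, the approximate per-source token counts for the 9.8B-token Phase 1 budget can be derived (these counts are computed here, not reported in the card):

```python
# Phase 1 mixture fractions, from the list above; total budget is 9.8B tokens.
total = 9.8e9
mix = {"Hebrew Wikipedia": 0.12, "Supreme Court": 0.22, "Ben Yehuda": 0.23,
       "C4 Hebrew": 0.20, "CC100": 0.19, "Task data": 0.04}

# The fractions should cover the full budget.
assert abs(sum(mix.values()) - 1.0) < 1e-9

for source, frac in mix.items():
    print(f"{source}: ~{frac * total / 1e9:.2f}B tokens")
# e.g. Ben Yehuda: ~2.25B tokens
```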
### Phase 2: Continued Pre-Training (7.8B tokens)
HeDC4 (Hebrew Digital Corpus for the 21st Century)
## Related Models
- HebrewGPT-1B - Base model (v1)
- HebrewGPT-1B-Instruct - SFT on v1
- HebrewGPT-1B-AdamW - AdamW ablation
- HebrewGPT-296M - Smaller model
## Infrastructure
Trained on Amazon Bedrock (research orchestration) + AWS EC2. Cost: ~$73 (14.3h x $5.12/hr spot).
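The ~$73 figure follows directly from the stated runtime and spot price:

```python
# Sanity-check the reported training cost: hours x spot price, from the note above.
hours, spot_rate = 14.3, 5.12
cost = hours * spot_rate
print(round(cost))  # 73
```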