# HebrewGPT-1B-v2
A 1.08-billion-parameter Hebrew language model. Version 2 adds continued pre-training on 7.8B additional Hebrew tokens from HeDC4 (Hebrew Digital Corpus).
## Model Details
| Property | Value |
|---|---|
| Parameters | 1.08B |
| Base Model | HebrewGPT-1B |
| Architecture | Custom Mamba-Transformer hybrid (interleaved RoPE + Mamba SSM, SwiGLU MLP) |
| Total Training Tokens | 17.6B (9.8B original + 7.8B HeDC4) |
| Best Val Loss | 3.3937 (BPB 4.90) |
| Training | 15,000 steps on a p4d.24xlarge (8×A100 40GB), 14.3 hours |
| Context Length | 2,048 tokens |
| Tokenizer | SentencePiece BPE, 32K vocab |
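The table reports both a validation loss and a bits-per-byte (BPB) figure. The card does not state its exact conversion, but assuming BPB is the per-token cross-entropy converted from nats to bits with a bytes-per-token ratio near 1 (an assumption, not stated above), the two numbers are consistent:

```python
import math

# Hypothetical helper: convert per-token cross-entropy (nats) to bits-per-byte.
# The bytes_per_token ratio is an assumption; ~1.0 reproduces the reported 4.90.
def nats_to_bpb(loss_nats: float, bytes_per_token: float = 1.0) -> float:
    return loss_nats / (math.log(2) * bytes_per_token)

bpb = nats_to_bpb(3.3937)
print(round(bpb, 2))  # 4.9
```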
## What's New in v2
HebrewGPT-1B-v2 extends the original model with continued pre-training on HeDC4, a diverse modern Hebrew web corpus. This nearly doubles the total training data from 9.8B to 17.6B tokens, adding:
- Modern Hebrew news and web content
- Broader vocabulary coverage
- More diverse writing styles and topics
## Evaluation (Base Model, No SFT)
| Task | v1 (9.8B tokens) | v2 (17.6B tokens) |
|---|---|---|
| SNLI | 50.0% | 35.0% |
| QA | 20.0% | 60.0% |
| Sentiment | 33.3% | 33.3% |
| Trivia | 13.3% | 13.3% |
| Average | 29.2% | 35.4% |
Note: these are base-model results without instruction tuning. QA improved substantially (+40 pp), suggesting stronger reading comprehension from the additional pre-training data, while SNLI regressed (−15 pp) and Sentiment and Trivia were unchanged. An instruction-tuned version (SFT on v2) is in progress.
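The reported v2 average can be reproduced directly from the per-task scores in the table above:

```python
# Per-task v2 scores (%), taken from the evaluation table above.
scores_v2 = {"SNLI": 35.0, "QA": 60.0, "Sentiment": 33.3, "Trivia": 13.3}

# Unweighted mean over the four tasks, as reported in the "Average" row.
avg = sum(scores_v2.values()) / len(scores_v2)
print(round(avg, 1))  # 35.4
```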
## Pre-Training Data
### Phase 1: Original (9.8B tokens)
Hebrew Wikipedia (12%), Supreme Court (22%), Ben Yehuda (23%), C4 Hebrew (20%), CC100 (19%), Task data (4%)
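Under the mixture percentages listed above, the approximate per-source token counts for the 9.8B-token Phase 1 budget can be derived (these counts are computed here, not reported in the card):

```python
# Phase 1 mixture fractions, from the list above; total budget is 9.8B tokens.
total = 9.8e9
mix = {"Hebrew Wikipedia": 0.12, "Supreme Court": 0.22, "Ben Yehuda": 0.23,
       "C4 Hebrew": 0.20, "CC100": 0.19, "Task data": 0.04}

# The fractions should cover the full budget.
assert abs(sum(mix.values()) - 1.0) < 1e-9

for source, frac in mix.items():
    print(f"{source}: ~{frac * total / 1e9:.2f}B tokens")
# e.g. Ben Yehuda: ~2.25B tokens
```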
### Phase 2: Continued Pre-Training (7.8B tokens)
HeDC4 (Hebrew Digital Corpus for the 21st Century)
## Related Models
- HebrewGPT-1B - Base model (v1)
- HebrewGPT-1B-Instruct - SFT on v1
- HebrewGPT-1B-AdamW - AdamW ablation
- HebrewGPT-296M - Smaller model
## Infrastructure
Trained on Amazon Bedrock (research orchestration) + AWS EC2. Cost: ~$73 (14.3h x $5.12/hr spot).
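The ~$73 figure follows directly from the stated runtime and spot price:

```python
# Sanity-check the reported training cost: hours x spot price, from the note above.
hours, spot_rate = 14.3, 5.12
cost = hours * spot_rate
print(round(cost))  # 73
```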