HebrewGPT-1B-v2

A 1.08-billion-parameter Hebrew language model. Version 2 continues pre-training of HebrewGPT-1B on 7.8B additional Hebrew tokens from HeDC4 (Hebrew Digital Corpus).
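
A minimal loading sketch, assuming the checkpoint ships on the Hugging Face Hub together with its custom modeling code (the hybrid architecture below is not a stock transformers class); the repo id is a placeholder:

```python
# Hypothetical usage sketch: the repo id is a placeholder, and loading assumes
# the custom hybrid architecture ships with the checkpoint (trust_remote_code).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-org/HebrewGPT-1B-v2"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

prompt = "בירת ישראל היא"  # "The capital of Israel is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```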

Model Details

| Property | Value |
|---|---|
| Parameters | 1.08B |
| Base Model | HebrewGPT-1B |
| Architecture | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM layers, SwiGLU MLP) |
| Total Training Tokens | 17.6B (9.8B original + 7.8B HeDC4) |
| Best Val Loss | 3.3937 nats (≈4.90 bits/token, i.e. 3.3937 / ln 2) |
| Training | 15,000 steps on a p4d.24xlarge (8× A100 40GB), 14.3 hours |
| Context Length | 2,048 tokens |
| Tokenizer | SentencePiece BPE, 32K vocab |
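
In the architecture above, attention layers with rotary position embeddings alternate with Mamba state-space mixers, each followed by a SwiGLU MLP. The sketch below is a schematic of that interleaving pattern only: the width, depth, head count, and attention-to-Mamba ratio are placeholders (not published here), and the Mamba mixer is taken from the open-source mamba_ssm package rather than this repo.

```python
# Schematic only: dimensions and layer ratio below are placeholders, not the
# actual HebrewGPT-1B configuration. Requires: pip install torch mamba-ssm
import torch
import torch.nn as nn
import torch.nn.functional as F
from mamba_ssm import Mamba  # reference Mamba SSM implementation (CUDA)


def apply_rope(x, base=10000.0):
    """Rotary position embedding on x of shape (batch, heads, seq, head_dim)."""
    b, h, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = torch.arange(s, device=x.device, dtype=x.dtype)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class RoPEAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = (
            t.reshape(b, s, self.heads, self.head_dim).transpose(1, 2)
            for t in self.qkv(x).chunk(3, dim=-1)
        )
        q, k = apply_rope(q), apply_rope(k)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))


class HybridBlock(nn.Module):
    """Pre-norm block: one mixer (attention or Mamba) + one SwiGLU MLP."""

    def __init__(self, dim, heads, use_mamba):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = Mamba(d_model=dim) if use_mamba else RoPEAttention(dim, heads)
        self.mlp = SwiGLU(dim, 4 * dim)  # 4x expansion is a placeholder

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))


# Alternate attention and Mamba layers; a 1:1 ratio is assumed here.
dim, heads, depth = 2048, 16, 16  # placeholder sizes, roughly 1B-scale
blocks = nn.ModuleList(
    HybridBlock(dim, heads, use_mamba=(i % 2 == 1)) for i in range(depth)
)
```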

What is new in v2

HebrewGPT-1B-v2 extends the original model with continued pre-training on HeDC4, a diverse modern Hebrew web corpus. This nearly doubles the total training data from 9.8B to 17.6B tokens, adding:

  • Modern Hebrew news and web content
  • Broader vocabulary coverage
  • More diverse writing styles and topics

Evaluation (Base Model, no SFT)

| Task | v1 (9.8B tokens) | v2 (17.6B tokens) |
|---|---|---|
| SNLI | 50.0% | 35.0% |
| QA | 20.0% | 60.0% |
| Sentiment | 33.3% | 33.3% |
| Trivia | 13.3% | 13.3% |
| Average | 29.2% | 35.4% |

Note: These are base-model results without instruction tuning. QA improved sharply (+40 percentage points), suggesting better reading comprehension from the additional pre-training data, while SNLI regressed from 50.0% to 35.0%. An instruction-tuned version (SFT on v2) is in progress.
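
For context, one common way to score a base model on multiple-choice tasks such as SNLI or sentiment is to compare the log-likelihood the model assigns to each candidate answer. The sketch below illustrates that recipe; it is not necessarily the harness behind the numbers above.

```python
# Sketch of log-likelihood scoring for multiple-choice evaluation; this is a
# common recipe, not necessarily the exact harness used for the table above.
import torch
import torch.nn.functional as F


@torch.no_grad()
def choice_logprob(model, tokenizer, prompt, choice):
    """Sum of log-probs the model assigns to `choice` following `prompt`.

    Assumes the tokenizer does not append an EOS token; length
    normalization (dividing by the number of choice tokens) is a
    common variant.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    logits = model(full_ids).logits  # (1, seq, vocab)
    # Positions prompt_len-1 .. seq-2 predict the choice tokens.
    logprobs = F.log_softmax(logits[0, prompt_len - 1 : -1], dim=-1)
    choice_ids = full_ids[0, prompt_len:]
    return logprobs.gather(1, choice_ids[:, None]).sum().item()


def predict(model, tokenizer, prompt, choices):
    """Pick the candidate answer the model finds most likely."""
    return max(choices, key=lambda c: choice_logprob(model, tokenizer, prompt, c))
```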

Pre-Training Data

Phase 1: Original (9.8B tokens)

Hebrew Wikipedia (12%), Supreme Court (22%), Ben Yehuda (23%), C4 Hebrew (20%), CC100 (19%), Task data (4%)

Phase 2: Continued Pre-Training (7.8B tokens)

HeDC4 — Hebrew Digital Corpus for the 21st Century
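
Since v2 is continued pre-training of the same model, the 32K SentencePiece tokenizer is presumably carried over unchanged for the HeDC4 data. A minimal tokenization sketch with the sentencepiece library; the .model filename is a placeholder:

```python
import sentencepiece as spm

# "hebrewgpt_32k.model" is a placeholder filename for the 32K BPE model.
sp = spm.SentencePieceProcessor(model_file="hebrewgpt_32k.model")

text = "בית המשפט העליון"  # "the Supreme Court"
ids = sp.encode(text, out_type=int)     # token ids
pieces = sp.encode(text, out_type=str)  # subword pieces
print(len(ids), pieces)
assert sp.decode(ids) == text  # SentencePiece round-trips losslessly
```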

Related Models

  • HebrewGPT-1B: the original base model (v1, trained on 9.8B tokens)
  • An instruction-tuned version (SFT on v2) is in progress

Infrastructure

Trained on AWS EC2, with Amazon Bedrock used for research orchestration. Cost: ~$73 (14.3 h × $5.12/hr spot).
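
As a sanity check, the cost and throughput follow directly from the figures above, assuming the full 15,000 steps were spent on the 7.8B-token continuation:

```python
# Back-of-the-envelope check of the cost and throughput figures. Assumes all
# 15,000 steps were spent on the 7.8B-token HeDC4 continuation phase.
hours, spot_rate = 14.3, 5.12            # p4d.24xlarge spot $/hr (from above)
tokens, steps, gpus, ctx = 7.8e9, 15_000, 8, 2048

print(f"cost:          ${hours * spot_rate:,.0f}")               # ~$73
print(f"tokens/step:   {tokens / steps:,.0f}")                   # ~520,000
print(f"seqs/step:     {tokens / steps / ctx:,.0f}")             # ~254
print(f"tokens/s/GPU:  {tokens / (hours * 3600) / gpus:,.0f}")   # ~18,900
```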
