---
language:
- he
license: apache-2.0
tags:
- hebrew
- continued-pretraining
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
---

# HebrewGPT-1B-v2

A **1.08 billion parameter** Hebrew language model. Version 2 continues pre-training of [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 7.8B additional Hebrew tokens from HeDC4, the Hebrew Digital Corpus for the 21st Century.

## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 1.08B |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention and Mamba SSM layers, SwiGLU MLP) |
| **Total Training Tokens** | 17.6B (9.8B original + 7.8B HeDC4) |
| **Best Validation Loss** | 3.3937 (4.90 BPB) |
| **Training** | 15,000 steps, p4d.24xlarge (8× A100 40 GB), 14.3 hours |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 32K vocabulary |

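A minimal, untested generation sketch is below. The repo id and the need for `trust_remote_code=True` are assumptions: the id mirrors the v1 naming, and the custom Mamba-Transformer hybrid presumably loads through custom modeling code shipped with the repository.

```python
# Minimal generation sketch (untested). Assumptions: the repo id follows the
# v1 naming ("Slasky/HebrewGPT-1B-v2"), and the repo ships custom modeling and
# tokenizer code for the Mamba-Transformer hybrid, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Slasky/HebrewGPT-1B-v2"  # assumed, mirrors the v1 repo name

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

prompt = "בירת ישראל היא"  # "The capital of Israel is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.9,
        temperature=0.7,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
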
## What is new in v2

HebrewGPT-1B-v2 extends the original model with continued pre-training on HeDC4, a diverse modern Hebrew web corpus. This nearly doubles the total training data from 9.8B to 17.6B tokens, adding:

- Modern Hebrew news and web content
- Broader vocabulary coverage
- More diverse writing styles and topics

## Evaluation (Base Model, no SFT)

| Task | v1 (9.8B tokens) | v2 (17.6B tokens) |
|------|------------------|-------------------|
| SNLI | 50.0% | 35.0% |
| QA | 20.0% | 60.0% |
| Sentiment | 33.3% | 33.3% |
| Trivia | 13.3% | 13.3% |
| **Average** | **29.2%** | **35.4%** |

Note: These are base-model results without instruction tuning. QA improved significantly (+40 pp), suggesting better reading comprehension gained from the additional pre-training data. An instruction-tuned version (SFT on v2) is in progress.

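The evaluation harness itself is not part of this card. As an illustration only, a common way to score a base (non-instruction-tuned) model on multiple-choice tasks such as SNLI and sentiment is length-normalized log-likelihood over the candidate answers; the sketch below assumes that approach and is not necessarily how the numbers above were produced.

```python
# Illustration only: length-normalized log-likelihood scoring for
# multiple-choice tasks. Assumed methodology, not necessarily the harness
# used for the table above.
import torch
import torch.nn.functional as F


def answer_logprob(model, tokenizer, prompt: str, answer: str) -> float:
    """Mean log-probability of the answer tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    n_answer = full_ids.shape[1] - prompt_ids.shape[1]  # approximate answer length
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits  # [1, seq_len, vocab]
    # Position i predicts token i+1, so shift logits and targets by one.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, -n_answer:].mean().item()


def predict(model, tokenizer, prompt: str, choices: list[str]) -> str:
    """Return the candidate answer with the highest score."""
    return max(choices, key=lambda c: answer_logprob(model, tokenizer, prompt, c))
```
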
## Pre-Training Data

### Phase 1: Original (9.8B tokens)

Hebrew Wikipedia (12%), Supreme Court (22%), Ben Yehuda (23%), C4 Hebrew (20%), CC100 (19%), Task data (4%)

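The same mixture, written out as sampling weights (illustrative only; the source names and config format are hypothetical, not the actual training configuration):

```python
# Hypothetical rendering of the Phase 1 sampling mixture listed above.
# The keys and the config format are illustrative, not the real training config.
phase1_mixture = {
    "hebrew_wikipedia": 0.12,
    "supreme_court": 0.22,
    "ben_yehuda": 0.23,
    "c4_hebrew": 0.20,
    "cc100_hebrew": 0.19,
    "task_data": 0.04,
}
assert abs(sum(phase1_mixture.values()) - 1.0) < 1e-9  # weights cover 100% of tokens
```
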
### Phase 2: Continued Pre-Training (7.8B tokens)

HeDC4, the Hebrew Digital Corpus for the 21st Century

## Related Models

- [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) - Base model (v1)
- [HebrewGPT-1B-Instruct](https://huggingface.co/Slasky/HebrewGPT-1B-Instruct) - SFT on v1
- [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) - AdamW ablation
- [HebrewGPT-296M](https://huggingface.co/Slasky/HebrewGPT-296M) - Smaller model

## Infrastructure

Trained on AWS EC2 (p4d.24xlarge spot), with research orchestration via Amazon Bedrock. Cost: ~$73 (14.3 h × $5.12/hr spot).