ronnengmail committed on
Commit 9204e29 · verified · 1 Parent(s): 455fc5b

Initial model card with eval results

Files changed (1)
  1. README.md +69 -0
README.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ language:
+ - he
+ license: apache-2.0
+ tags:
+ - hebrew
+ - continued-pretraining
+ - language-model
+ - text-generation
+ - mamba
+ - transformer
+ pipeline_tag: text-generation
+ ---
+
+ # HebrewGPT-1B-v2
+
+ A **1.08 billion parameter** Hebrew language model. Version 2 continues pre-training of the original HebrewGPT-1B on 7.8B additional Hebrew tokens from HeDC4 (Hebrew Digital Corpus for the 21st Century).
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | **Parameters** | 1.08B |
+ | **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) |
+ | **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM layers, SwiGLU MLP) |
+ | **Total Training Tokens** | 17.6B (9.8B original + 7.8B HeDC4) |
+ | **Best Val Loss** | 3.3937 (BPB 4.90) |
+ | **Training** | 15,000 steps, p4d.24xlarge (8× A100 40GB), 14.3 hours |
+ | **Context Length** | 2,048 tokens |
+ | **Tokenizer** | SentencePiece BPE, 32K vocab |
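+
+ The 4.90 figure alongside the validation loss matches the loss converted from nats to bits per token (3.3937 / ln 2 ≈ 4.90); if "BPB" is instead meant as bits per byte, the conversion would also depend on the tokenizer's tokens-per-byte ratio, which the card does not state. A quick check of the nats-to-bits reading:
+
+ ```python
+ import math
+
+ val_loss_nats = 3.3937  # reported best validation loss, assumed to be in nats per token
+ print(val_loss_nats / math.log(2))  # ≈ 4.896, consistent with the reported 4.90
+ ```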
+
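+ ## Usage
+
+ A minimal generation sketch. The repository id `Slasky/HebrewGPT-1B-v2` and loading through `transformers` with `trust_remote_code=True` are assumptions based on the related models listed below, not confirmed by this card; check the repository files for the actual loading instructions.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo_id = "Slasky/HebrewGPT-1B-v2"  # assumed repo id
+
+ # Custom Mamba-Transformer hybrid: likely needs trust_remote_code for the modeling code.
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
+
+ # Base model (no instruction tuning): prompt it as plain text to be continued.
+ prompt = "ירושלים היא"  # "Jerusalem is"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+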
+ ## What is new in v2
+
+ HebrewGPT-1B-v2 extends the original model with continued pre-training on HeDC4, a diverse modern Hebrew web corpus. This nearly doubles the total training data from 9.8B to 17.6B tokens, adding:
+
+ - Modern Hebrew news and web content
+ - Broader vocabulary coverage
+ - More diverse writing styles and topics
+
+ ## Evaluation (Base Model, no SFT)
+
+ | Task | v1 (9.8B tokens) | v2 (17.6B tokens) |
+ |------|-------------------|-------------------|
+ | SNLI | 50.0% | 35.0% |
+ | QA | 20.0% | 60.0% |
+ | Sentiment | 33.3% | 33.3% |
+ | Trivia | 13.3% | 13.3% |
+ | **Average** | **29.2%** | **35.4%** |
+
+ Note: These are base-model results without instruction tuning. QA improved significantly (+40 percentage points), suggesting better reading comprehension from the additional pre-training data, while SNLI declined. An instruction-tuned version (SFT on v2) is in progress.
+
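+ The card does not specify the evaluation protocol for these tasks. As an illustration only (not the authors' harness), a common way to score a base model on multiple-choice tasks such as SNLI or sentiment is to pick the candidate answer with the highest log-likelihood under the model; a minimal sketch, reusing the assumed repo id from the usage example above:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo_id = "Slasky/HebrewGPT-1B-v2"  # assumed repo id
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
+ model.eval()
+
+ def option_logprob(prompt: str, option: str) -> float:
+     """Total log-probability the model assigns to `option` as a continuation of `prompt`."""
+     # Assumes tokenizing prompt + option keeps the prompt tokens as a prefix,
+     # which is usually (but not always) the case for SentencePiece BPE.
+     prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
+     full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
+     with torch.no_grad():
+         logits = model(full_ids).logits
+     log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..N-1
+     targets = full_ids[:, 1:]
+     token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
+     return token_lp[0, prompt_len - 1:].sum().item()  # sum only over the option tokens
+
+ def pick_answer(prompt: str, options: list[str]) -> str:
+     return max(options, key=lambda o: option_logprob(prompt, o))
+ ```
+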
+
+ ## Pre-Training Data
+
+ ### Phase 1: Original (9.8B tokens)
+ Hebrew Wikipedia (12%), Supreme Court (22%), Ben Yehuda (23%), C4 Hebrew (20%), CC100 (19%), Task data (4%)
+
+ ### Phase 2: Continued Pre-Training (7.8B tokens)
+ HeDC4: Hebrew Digital Corpus for the 21st Century
+
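+ The Phase 1 percentages describe the composition of the corpus; the card does not say how batches were sampled. As a hypothetical illustration only, one way such a weighted mix is often realized is to draw each training document's source according to those proportions:
+
+ ```python
+ import random
+
+ # Phase 1 mixture weights from the card (fractions of the 9.8B-token corpus).
+ MIXTURE = {
+     "hebrew_wikipedia": 0.12,
+     "supreme_court": 0.22,
+     "ben_yehuda": 0.23,
+     "c4_hebrew": 0.20,
+     "cc100": 0.19,
+     "task_data": 0.04,
+ }
+
+ rng = random.Random(0)
+ sources, weights = zip(*MIXTURE.items())
+
+ def sample_source() -> str:
+     """Pick which corpus the next training document is drawn from, per the mixture weights."""
+     return rng.choices(sources, weights=weights, k=1)[0]
+
+ print([sample_source() for _ in range(5)])
+ ```
+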
+ ## Related Models
+
+ - [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) - Base model (v1)
+ - [HebrewGPT-1B-Instruct](https://huggingface.co/Slasky/HebrewGPT-1B-Instruct) - SFT on v1
+ - [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) - AdamW ablation
+ - [HebrewGPT-296M](https://huggingface.co/Slasky/HebrewGPT-296M) - Smaller model
+
+ ## Infrastructure
+
+ Trained on AWS EC2 (p4d.24xlarge spot), with Amazon Bedrock used for research orchestration. Cost: ~$73 (14.3 h × $5.12/hr spot).