---
license: apache-2.0
language:
  - he
  - ar
  - en
  - fa
tags:
  - multilingual
  - hebrew
  - arabic
  - farsi
  - persian
  - semitic
  - gpt
  - causal-lm
  - low-resource
  - efficient-training
datasets:
  - CulturaX
  - OSCAR
  - CC-100
  - allenai/dolma
model-index:
  - name: SemiticGPT-3B
    results:
      - task:
          type: text-generation
        dataset:
          type: facebook/belebele
          name: Belebele
        metrics:
          - type: accuracy
            name: English
            value: 31.8
          - type: accuracy
            name: Hebrew
            value: 27.0
          - type: accuracy
            name: Arabic
            value: 28.4
          - type: accuracy
            name: Farsi
            value: 28.2
---

# SemiticGPT-3B 🌍

A 3.04 billion parameter multilingual language model trained from scratch for **Hebrew, Arabic, English, and Farsi**: four languages spanning three scripts (Latin, Hebrew, Arabic).

## Highlights

- **3.04B parameters** trained from scratch on ~50B tokens
- **Custom 32K multilingual BPE tokenizer** optimized for script-diverse languages
- **Hebrew-anchored design**: Hebrew as primary low-resource target with cross-lingual transfer
- **Budget-efficient**: Trained on a single p4de.24xlarge
- **SFT variant included**: Instruction-tuned with multilingual supervised data

## Model Variants

| Variant | File | Size | Description |
|---------|------|------|-------------|
| Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data |

## Architecture

- **Type**: GPT-2 style decoder-only transformer
- **Parameters**: 3.04B
- **Layers**: 32
- **Hidden dim**: 2560
- **Attention heads**: 32
- **Vocabulary**: 32,000 (custom multilingual BPE)
- **Context length**: 2048 tokens
- **Tokenizer**: SentencePiece BPE trained on a balanced multilingual corpus
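
The hyperparameters above can be collected into a small configuration object. This is a minimal sketch: the class name `SemiticGPTConfig` and its field names are hypothetical and may not match the model class in `training_scripts/`; only the numeric values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemiticGPTConfig:
    """Hypothetical config holder; field names are illustrative only."""
    n_layers: int = 32
    hidden_dim: int = 2560
    n_heads: int = 32
    vocab_size: int = 32_000
    context_length: int = 2048

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the hidden dimension.
        if self.hidden_dim % self.n_heads != 0:
            raise ValueError("hidden_dim must be divisible by n_heads")
        return self.hidden_dim // self.n_heads


config = SemiticGPTConfig()
print(config.head_dim)  # 2560 // 32 = 80
```

With these values, each of the 32 attention heads operates on an 80-dimensional slice of the hidden state.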

## Training Data

Pretrained on ~50B tokens from:
- **CulturaX** (Hebrew, Arabic, Farsi, English)
- **OSCAR** (multilingual web crawl)
- **CC-100** (Common Crawl monolingual)
- **Dolma** (English high-quality)

The language distribution is weighted toward Hebrew as the anchor language.

## Tokenizer

Custom 32K vocabulary trained on a balanced multilingual corpus:

| Language | Fertility (tokens/word) |
|----------|------------------------|
| Hebrew | 1.75 BPB (best) |
| Farsi | 3.14 BPB |
| Arabic | 3.73 BPB |
| English | 3.83 BPB |

The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
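
Fertility, as named in the table header, is the average number of tokens a tokenizer produces per word. The sketch below shows one way such a figure could be measured; it is not the evaluation script used for the table. The function name `fertility`, the whitespace-based word count, and the `toy_encode` stand-in are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Sequence


def fertility(texts: Iterable[str], encode: Callable[[str], Sequence]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


# Stand-in encoder so the example runs without the tokenizer file:
# it splits every word into chunks of at most 3 characters.
def toy_encode(text: str) -> list:
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


print(fertility(["shalom olam", "hello world"], toy_encode))
```

To measure the real tokenizer, pass the `encode` method of a `sentencepiece.SentencePieceProcessor` loaded from `tokenizer/multilingual_32k.model` in place of `toy_encode`, and use a held-out text sample per language.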

## Benchmark Results

### Belebele (reading comprehension, 4-way multiple choice)

| Language | Accuracy |
|----------|----------|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| **Overall** | **28.9%** |

*Note: The random baseline is 25%, so per-language scores are roughly 2 to 7 points above chance. This is a 3B model trained on a budget.*
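
The overall figure is consistent with an unweighted mean of the four per-language accuracies. The short check below assumes equal weighting across languages, which the results table does not state explicitly, and also prints each language's margin over the 25% chance level.

```python
accuracy = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}
random_baseline = 25.0  # 4-way multiple choice

# Unweighted (macro) average over languages; the weighting is an assumption.
overall = sum(accuracy.values()) / len(accuracy)
print(f"overall: {overall:.2f}%")

for language, score in accuracy.items():
    print(f"{language}: {score - random_baseline:+.1f} points vs. chance")
```

The mean comes out at about 28.85, in line with the reported 28.9%.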

### SFT Generation Quality

- **Hebrew**: 🔥 Excellent: fluent, factual responses in domain-specific Hebrew
- **English**: Coherent, factual
- **Farsi**: Good, coherent
- **Arabic**: Weak (data quality issue: machine-translated Alpaca)

## Training Details

### Pretraining
- **Hardware**: 1× p4de.24xlarge (8× A100 80GB)
- **Framework**: PyTorch FSDP
- **Steps**: 20,000
- **Batch size**: 512K tokens
- **Learning rate**: 3e-4 (cosine decay)
- **Optimizer**: AdamW
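
A cosine-decay schedule with the peak learning rate and step count listed above could look like the following. This is a sketch, not the schedule from `train_multilingual_3b_fsdp.py`: the warmup length and the minimum learning rate are assumptions, since neither is given here.

```python
import math

PEAK_LR = 3e-4        # from the list above
TOTAL_STEPS = 20_000  # from the list above
WARMUP_STEPS = 1_000  # assumption: not stated in the source
MIN_LR = 3e-5         # assumption: not stated in the source


def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 1_000, 10_500, 20_000):
    print(step, f"{lr_at(step):.2e}")
```

The rate peaks at the end of warmup, passes the midpoint between `PEAK_LR` and `MIN_LR` halfway through the decay, and reaches `MIN_LR` at the final step.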


### SFT
- **Hardware**: 1× g6e.xlarge (L40S 48GB)
- **Steps**: 4,000 (best val_loss at step 1,600: 2.1164)
- **Data**: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca

## Files

```
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt          # Pretrained base model
│   └── sft_model.pt           # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model # SentencePiece tokenizer
│   └── multilingual_32k.vocab # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
```

## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")

# Load model (custom architecture; see training_scripts/)
# The model uses a custom GPT implementation, not HuggingFace AutoModel
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
# See train_multilingual_3b_fsdp.py for model class definition
```
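
Checkpoints saved by training scripts are often dictionaries that wrap the weights together with optimizer state or step counters. The layout of `best_model.pt` is not documented here, so the helper below is a sketch that probes a few commonly used key names; `extract_state_dict`, the candidate keys, and the toy parameter name are assumptions, not part of the repository.

```python
from typing import Any, Mapping

# Assumed wrapper keys; the real checkpoint may use different ones.
CANDIDATE_KEYS = ("model_state_dict", "state_dict", "model")


def extract_state_dict(checkpoint: Any) -> Mapping[str, Any]:
    """Return the weights mapping from a loaded checkpoint object."""
    if not isinstance(checkpoint, Mapping):
        raise TypeError(f"expected a mapping, got {type(checkpoint).__name__}")
    for key in CANDIDATE_KEYS:
        inner = checkpoint.get(key)
        if isinstance(inner, Mapping):
            return inner
    # No wrapper key found: assume the checkpoint is already the weights mapping.
    return checkpoint


# Toy stand-in for the object returned by torch.load(...), so the example
# runs without the checkpoint file. The parameter name is hypothetical.
toy_checkpoint = {"step": 20_000, "model_state_dict": {"wte.weight": [0.0]}}
print(list(extract_state_dict(toy_checkpoint)))
```

In practice, pass the object returned by `torch.load(...)` to the helper, then call `load_state_dict` on an instance of the model class defined in `train_multilingual_3b_fsdp.py`.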

## Known Limitations

- **Arabic generation is weak** due to machine-translated SFT data. Native Arabic instruction data would significantly improve this.
- **Small scale**: 3B parameters is modest by current standards. This is an efficiency-focused research model.
- **Custom architecture**: Not directly compatible with HuggingFace AutoModel; it requires the training script's model class.
- **Benchmark scores are baseline-level**: The model is designed for research into efficient multilingual pretraining, not benchmark competition.

## Citation

```bibtex
@misc{slasky2026semiticgpt,
  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/SemiticGPT}
}
```

## License

Apache 2.0