---
license: apache-2.0
language:
- he
- ar
- en
- fa
tags:
- multilingual
- hebrew
- arabic
- farsi
- persian
- semitic
- gpt
- causal-lm
- low-resource
- efficient-training
datasets:
- CulturaX
- OSCAR
- CC-100
- allenai/dolma
model-index:
- name: SemiticGPT-3B
  results:
  - task:
      type: text-generation
    dataset:
      type: facebook/belebele
      name: Belebele
    metrics:
    - type: accuracy
      name: English
      value: 31.8
    - type: accuracy
      name: Hebrew
      value: 27.0
    - type: accuracy
      name: Arabic
      value: 28.4
    - type: accuracy
      name: Farsi
      value: 28.2
---
# SemiticGPT-3B
A 3.04-billion-parameter multilingual language model trained from scratch for **Hebrew, Arabic, English, and Farsi**: four languages spanning three scripts (Latin, Hebrew, Arabic).
## Highlights
- **3.04B parameters** trained from scratch on ~50B tokens
- **Custom 32K multilingual BPE tokenizer** optimized for script-diverse languages
- **Hebrew-anchored design**: Hebrew as primary low-resource target with cross-lingual transfer
- **Budget-efficient**: Trained on a single p4de.24xlarge
- **SFT variant included**: Instruction-tuned with multilingual supervised data
## Model Variants
| Variant | File | Size | Description |
|---------|------|------|-------------|
| Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data |
## Architecture
- **Type**: GPT-2 style decoder-only transformer
- **Parameters**: 3.04B
- **Layers**: 32
- **Hidden dim**: 2560
- **Attention heads**: 32
- **Vocabulary**: 32,000 (custom multilingual BPE)
- **Context length**: 2048 tokens
- **Tokenizer**: SentencePiece BPE trained on balanced multilingual corpus
## Training Data
Pretrained on ~50B tokens from:
- **CulturaX** (Hebrew, Arabic, Farsi, English)
- **OSCAR** (multilingual web crawl)
- **CC-100** (Common Crawl monolingual)
- **Dolma** (English high-quality)
The language distribution is weighted toward Hebrew as the anchor language.
## Tokenizer
Custom 32K vocabulary trained on a balanced multilingual corpus:
| Language | Fertility (tokens/word) |
|----------|------------------------|
| Hebrew | 1.75 BPB (best) |
| Farsi | 3.14 BPB |
| Arabic | 3.73 BPB |
| English | 3.83 BPB |
The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
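The table header names fertility, meaning the average number of tokens produced per whitespace-separated word. A minimal sketch of that measurement is shown below. The `fertility` helper and the toy `char_pair_tokenizer` are hypothetical stand-ins and are not part of this repository; with the real tokenizer you would pass a function that wraps the SentencePiece processor's encode call instead.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def char_pair_tokenizer(text: str) -> List[str]:
    """Toy stand-in tokenizer (assumption): splits each word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


sample = ["שלום עולם", "hello world"]
print(fertility(sample, char_pair_tokenizer))
```

Lower fertility means fewer tokens per word, so more text fits in the 2048-token context.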
## Benchmark Results
### Belebele (reading comprehension, 4-way multiple choice)
| Language | Accuracy |
|----------|----------|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| **Overall** | **28.9%** |
*Note: Random baseline is 25%. This is a 3B model trained on a budget; performance is competitive relative to its scale.*
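As a quick check on the table above, the sketch below recomputes the overall score and each language's margin over the 25% random baseline. It assumes the overall figure is the unweighted mean of the four per-language accuracies, rounded half-up to one decimal; the card does not state how it was aggregated. `overall_accuracy` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-language accuracies (percent), copied from the table above.
accuracy = {
    "English": Decimal("31.8"),
    "Hebrew": Decimal("27.0"),
    "Arabic": Decimal("28.4"),
    "Farsi": Decimal("28.2"),
}
RANDOM_BASELINE = Decimal("25")  # 4-way multiple choice


def overall_accuracy(scores: dict) -> Decimal:
    """Unweighted mean across languages (assumption), rounded to one decimal."""
    mean = sum(scores.values(), Decimal(0)) / len(scores)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


overall = overall_accuracy(accuracy)
margins = {lang: acc - RANDOM_BASELINE for lang, acc in accuracy.items()}

print("overall:", overall)
for lang, margin in margins.items():
    print(f"{lang}: +{margin} points over random")
```

Decimal arithmetic is used so that the half-way value 28.85 rounds predictably instead of depending on binary floating-point representation.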
### SFT Generation Quality
- **Hebrew**: Excellent; fluent, factual responses in domain-specific Hebrew
- **English**: Coherent, factual
- **Farsi**: Good, coherent
- **Arabic**: Weak (data quality issue: machine-translated Alpaca)
## Training Details
### Pretraining
- **Hardware**: 1× p4de.24xlarge (8× A100 80GB)
- **Framework**: PyTorch FSDP
- **Steps**: 20,000
- **Batch size**: 512K tokens
- **Learning rate**: 3e-4 (cosine decay)
- **Optimizer**: AdamW
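For readers unfamiliar with the schedule, here is a minimal sketch of a cosine-decay learning rate using the peak value and step count listed above. The floor of zero and the absence of warmup are assumptions, since the card states neither; `cosine_lr` is a hypothetical helper and may differ from what `train_multilingual_3b_fsdp.py` actually does.

```python
import math

PEAK_LR = 3e-4        # from the card
TOTAL_STEPS = 20_000  # from the card
MIN_LR = 0.0          # assumption: the card does not state a floor


def cosine_lr(step: int,
              peak_lr: float = PEAK_LR,
              total_steps: int = TOTAL_STEPS,
              min_lr: float = MIN_LR) -> float:
    """Learning rate at `step`, decaying from peak_lr to min_lr along a half cosine."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule's range
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 5_000, 10_000, 15_000, 20_000):
    print(s, f"{cosine_lr(s):.2e}")
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at the final step.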
### SFT
- **Hardware**: 1× g6e.xlarge (L40S 48GB)
- **Steps**: 4,000 (best val_loss at step 1,600: 2.1164)
- **Data**: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca
## Files
```
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt           # Pretrained base model
│   └── sft_model.pt            # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model  # SentencePiece tokenizer
│   └── multilingual_32k.vocab  # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
```
## Usage
```python
import torch
import sentencepiece as spm
# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")
# Load model (custom architecture; see training_scripts/)
# The model uses a custom GPT implementation, not HuggingFace AutoModel
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
# See train_multilingual_3b_fsdp.py for model class definition
```
## Known Limitations
- **Arabic generation is weak** due to machine-translated SFT data. Native Arabic instruction data would significantly improve this.
- **Small scale**: 3B parameters is modest by current standards. This is an efficiency-focused research model.
- **Custom architecture**: Not directly compatible with HuggingFace AutoModel; it requires the training script's model class.
- **Benchmark scores are baseline-level**: The model is designed for research into efficient multilingual pretraining, not benchmark competition.
## Citation
```bibtex
@misc{slasky2026semiticgpt,
title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
author={Slasky, Ronnen},
year={2026},
url={https://huggingface.co/Slasky/SemiticGPT}
}
```
## License
Apache 2.0