---
license: apache-2.0
language:
- he
- ar
- en
- fa
tags:
- multilingual
- hebrew
- arabic
- farsi
- persian
- semitic
- gpt
- causal-lm
- low-resource
- efficient-training
datasets:
- CulturaX
- OSCAR
- CC-100
- allenai/dolma
model-index:
- name: SemiticGPT-3B
  results:
  - task:
      type: text-generation
    dataset:
      type: facebook/belebele
      name: Belebele
    metrics:
    - type: accuracy
      name: English
      value: 31.8
    - type: accuracy
      name: Hebrew
      value: 27.0
    - type: accuracy
      name: Arabic
      value: 28.4
    - type: accuracy
      name: Farsi
      value: 28.2
---
# SemiticGPT-3B 🌍
A 3.04-billion-parameter multilingual language model trained from scratch for **Hebrew, Arabic, English, and Farsi**: four languages spanning three scripts (Latin, Hebrew, Arabic).
## Highlights
- **3.04B parameters** trained from scratch on ~50B tokens
- **Custom 32K multilingual BPE tokenizer** optimized for script-diverse languages
- **Hebrew-anchored design**: Hebrew as primary low-resource target with cross-lingual transfer
- **Budget-efficient**: Trained on a single p4de.24xlarge
- **SFT variant included**: Instruction-tuned with multilingual supervised data
## Model Variants
| Variant | File | Size | Description |
|---------|------|------|-------------|
| Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data |
## Architecture
- **Type**: GPT-2 style decoder-only transformer
- **Parameters**: 3.04B
- **Layers**: 32
- **Hidden dim**: 2560
- **Attention heads**: 32
- **Vocabulary**: 32,000 (custom multilingual BPE)
- **Context length**: 2048 tokens
- **Tokenizer**: SentencePiece BPE trained on balanced multilingual corpus
## Training Data
Pretrained on ~50B tokens from:
- **CulturaX** (Hebrew, Arabic, Farsi, English)
- **OSCAR** (multilingual web crawl)
- **CC-100** (Common Crawl monolingual)
- **Dolma** (English high-quality)
The language distribution is weighted toward Hebrew, the anchor language.
## Tokenizer
Custom 32K vocabulary trained on a balanced multilingual corpus:
| Language | Fertility (tokens/word) |
|----------|------------------------|
| Hebrew | 1.75 BPB (best) |
| Farsi | 3.14 BPB |
| Arabic | 3.73 BPB |
| English | 3.83 BPB |
The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
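
For readers who want to measure fertility on their own text, here is a minimal sketch that computes it as named in the table header: tokens per whitespace-delimited word. The `fertility` helper and the character-level `char_encode` stand-in are illustrative assumptions, not part of this repository, and whitespace splitting is only one possible definition of a "word".

```python
from typing import Callable, Iterable, Sequence


def fertility(encode: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def char_encode(text: str) -> list:
    """Hypothetical stand-in tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


sample = ["shalom olam", "hello world"]
print(fertility(char_encode, sample))  # 20 characters over 4 words -> 5.0
```

To measure the real tokenizer, pass the `encode` method of a loaded `SentencePieceProcessor` in place of `char_encode`, and use a held-out text sample for each language.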
## Benchmark Results
### Belebele (reading comprehension, 4-way multiple choice)
| Language | Accuracy |
|----------|----------|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| **Overall** | **28.9%** |
*Note: The random baseline is 25%. This is a 3B model trained on a budget, with competitive performance relative to its scale.*
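
The Overall row can be checked against the per-language scores. The sketch below assumes Overall is the unweighted mean of the four languages (an assumption; the averaging method is not stated here) and also reports each language's margin over the 25% random baseline.

```python
# Per-language Belebele accuracies from the table above (percent).
scores = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}
RANDOM_BASELINE = 25.0  # 4-way multiple choice

# Assumption: "Overall" is the unweighted (macro) mean across languages.
overall = sum(scores.values()) / len(scores)
margins = {lang: acc - RANDOM_BASELINE for lang, acc in scores.items()}

print(f"overall: {overall:.2f}%")
for lang, margin in margins.items():
    print(f"{lang}: +{margin:.1f} points over chance")
```

Under that assumption the mean is about 28.85, consistent with the reported 28.9% after rounding.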
### SFT Generation Quality
- **Hebrew**: 🔥 Excellent: fluent, factual responses in domain-specific Hebrew
- **English**: Coherent, factual
- **Farsi**: Good, coherent
- **Arabic**: Weak (data quality issue: machine-translated Alpaca)
## Training Details
### Pretraining
- **Hardware**: 1× p4de.24xlarge (8× A100 80GB)
- **Framework**: PyTorch FSDP
- **Steps**: 20,000
- **Batch size**: 512K tokens
- **Learning rate**: 3e-4 (cosine decay)
- **Optimizer**: AdamW
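
As an illustration of the learning-rate setting above, here is a minimal sketch of a cosine decay schedule with a 3e-4 peak over 20,000 steps. The warmup length and final learning rate are not given in this README, so `warmup_steps=0` and `min_lr=0.0` are placeholder assumptions; the actual schedule is in `training_scripts/train_multilingual_3b_fsdp.py`.

```python
import math


def cosine_lr(step, max_steps=20_000, peak_lr=3e-4, warmup_steps=0, min_lr=0.0):
    """Cosine-decayed learning rate with optional linear warmup.

    warmup_steps and min_lr are assumed defaults, not values from the training run.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp up to peak_lr over the warmup period.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 15_000, 20_000):
    print(step, f"{cosine_lr(step):.2e}")
```

With these defaults the rate starts at the peak, passes through half the peak at the midpoint, and reaches zero at the final step.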
### SFT
- **Hardware**: 1× g6e.xlarge (L40S 48GB)
- **Steps**: 4,000 (best val_loss at step 1,600: 2.1164)
- **Data**: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca
## Files
```
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt              # Pretrained base model
│   └── sft_model.pt               # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model     # SentencePiece tokenizer
│   └── multilingual_32k.vocab     # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
```
## Usage
```python
import torch
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")

# Load the checkpoint. The model uses a custom GPT implementation, not
# HuggingFace AutoModel; see training_scripts/train_multilingual_3b_fsdp.py
# for the model class definition.
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
```
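
Because the layout of the checkpoint object is not documented here, it can help to inspect it before building the model. The following is a minimal sketch, assuming the checkpoint is either a state dict or a dict that nests one under a key such as `model`. The candidate key names, the `FakeTensor` stand-in, and the parameter names in the demo are all hypothetical, chosen so the example runs without downloading the weights.

```python
def extract_state_dict(checkpoint, candidate_keys=("model", "model_state_dict", "state_dict")):
    """Return the name -> tensor mapping inside a checkpoint.

    The candidate key names are guesses; check the training script for the real layout.
    """
    if isinstance(checkpoint, dict):
        for key in candidate_keys:
            if isinstance(checkpoint.get(key), dict):
                return checkpoint[key]
    return checkpoint


def count_parameters(state_dict):
    """Sum numel() over every tensor-like value in the mapping."""
    return sum(t.numel() for t in state_dict.values() if hasattr(t, "numel"))


class FakeTensor:
    """Stand-in exposing numel() like torch.Tensor, for illustration only."""

    def __init__(self, *shape):
        self.shape = shape

    def numel(self):
        n = 1
        for dim in self.shape:
            n *= dim
        return n


# Hypothetical checkpoint layout and parameter names.
demo = {
    "step": 20000,
    "model": {"wte.weight": FakeTensor(32000, 2560), "ln_f.weight": FakeTensor(2560)},
}
state = extract_state_dict(demo)
print(sorted(state)[:5], count_parameters(state))
```

With the real file, pass the object returned by `torch.load` to `extract_state_dict`. Checkpoints saved from wrapped models may carry a prefix on parameter names, which would need to be stripped before calling `load_state_dict`.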
## Known Limitations
- **Arabic generation is weak** due to machine-translated SFT data. Native Arabic instruction data would likely improve this significantly.
- **Small scale**: 3B parameters is modest by current standards. This is an efficiency-focused research model.
- **Custom architecture**: Not directly compatible with HuggingFace AutoModel; it requires the model class from the training script.
- **Benchmark scores are baseline-level**: The model is designed for research into efficient multilingual pretraining, not benchmark competition.
## Citation
```bibtex
@misc{slasky2026semiticgpt,
title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
author={Slasky, Ronnen},
year={2026},
url={https://huggingface.co/Slasky/SemiticGPT}
}
```
## License
Apache 2.0