| --- |
| license: apache-2.0 |
| language: |
| - he |
| - ar |
| - en |
| - fa |
| tags: |
| - multilingual |
| - hebrew |
| - arabic |
| - farsi |
| - persian |
| - semitic |
| - gpt |
| - causal-lm |
| - low-resource |
| - efficient-training |
| datasets: |
| - CulturaX |
| - OSCAR |
| - CC-100 |
| - allenai/dolma |
| model-index: |
| - name: SemiticGPT-3B |
|   results: |
|   - task: |
|       type: text-generation |
|     dataset: |
|       type: facebook/belebele |
|       name: Belebele |
|     metrics: |
|     - type: accuracy |
|       name: English |
|       value: 31.8 |
|     - type: accuracy |
|       name: Hebrew |
|       value: 27.0 |
|     - type: accuracy |
|       name: Arabic |
|       value: 28.4 |
|     - type: accuracy |
|       name: Farsi |
|       value: 28.2 |
| --- |
| |
| # SemiticGPT-3B |
|
|
| A 3.04-billion-parameter multilingual language model trained from scratch for **Hebrew, Arabic, English, and Farsi**: four languages spanning three scripts (Latin, Hebrew, Arabic). |
|
|
| ## Highlights |
|
|
| - **3.04B parameters** trained from scratch on ~50B tokens |
| - **Custom 32K multilingual BPE tokenizer** optimized for script-diverse languages |
| - **Hebrew-anchored design**: Hebrew as primary low-resource target with cross-lingual transfer |
| - **Budget-efficient**: Pretrained on a single p4de.24xlarge instance (8× A100 80GB) |
| - **SFT variant included**: Instruction-tuned with multilingual supervised data |
|
|
| ## Model Variants |
|
|
| | Variant | File | Size | Description | |
| |---------|------|------|-------------| |
| | Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) | |
| | SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data | |
|
|
| ## Architecture |
|
|
| - **Type**: GPT-2 style decoder-only transformer |
| - **Parameters**: 3.04B |
| - **Layers**: 32 |
| - **Hidden dim**: 2560 |
| - **Attention heads**: 32 |
| - **Vocabulary**: 32,000 (custom multilingual BPE) |
| - **Context length**: 2048 tokens |
| - **Tokenizer**: SentencePiece BPE trained on a balanced multilingual corpus |
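
The hyperparameters above can be collected into a small config object. The sketch below is illustrative only: `GPTConfig` and `estimate_params` are hypothetical names, and the 4x MLP ratio and tied output head are textbook GPT-2 assumptions, not values confirmed by this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    """Hyperparameters copied from the Architecture list above."""
    n_layer: int = 32
    d_model: int = 2560
    n_head: int = 32
    vocab_size: int = 32_000
    context_len: int = 2048

    @property
    def head_dim(self) -> int:
        # Per-head width; d_model must divide evenly across the heads.
        assert self.d_model % self.n_head == 0
        return self.d_model // self.n_head


def estimate_params(cfg: GPTConfig, mlp_ratio: int = 4, tied_head: bool = True) -> int:
    """Count weight matrices of a textbook GPT-2 stack (biases and LayerNorms ignored).

    mlp_ratio and tied_head are assumptions, not values from this model card.
    """
    attn = 4 * cfg.d_model ** 2             # Q, K, V and output projections
    mlp = 2 * mlp_ratio * cfg.d_model ** 2  # up- and down-projection
    blocks = cfg.n_layer * (attn + mlp)
    embed = cfg.vocab_size * cfg.d_model + cfg.context_len * cfg.d_model
    head = 0 if tied_head else cfg.vocab_size * cfg.d_model
    return blocks + embed + head


cfg = GPTConfig()
print(cfg.head_dim)
print(f"{estimate_params(cfg) / 1e9:.2f}B")
```

Under these textbook assumptions the estimate comes to roughly 2.60B, below the reported 3.04B. The difference would come from details this card does not list, so treat the model class in `training_scripts/` as authoritative.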
|
|
| ## Training Data |
|
|
| Pretrained on ~50B tokens from: |
| - **CulturaX** (Hebrew, Arabic, Farsi, English) |
| - **OSCAR** (multilingual web crawl) |
| - **CC-100** (Common Crawl monolingual) |
| - **Dolma** (English high-quality) |
|
|
| The language distribution is weighted toward Hebrew as the anchor language. |
|
|
| ## Tokenizer |
|
|
| Custom 32K vocabulary trained on a balanced multilingual corpus: |
|
|
| | Language | Fertility (tokens/word) | |
| |----------|------------------------| |
| | Hebrew | 1.75 BPB (best) | |
| | Farsi | 3.14 BPB | |
| | Arabic | 3.73 BPB | |
| | English | 3.83 BPB | |
|
|
| The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers. |
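
The table header defines fertility as tokens per word. A minimal sketch of how that ratio could be measured follows, assuming whitespace-delimited words. The `fertility` helper and the character-level stand-in tokenizer are hypothetical and are not part of this repository.

```python
from typing import Callable, Iterable, Sequence


def fertility(texts: Iterable[str], encode: Callable[[str], Sequence]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


# Stand-in tokenizer for illustration only: one token per non-space character.
def char_encode(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]


sample = ["שלום עולם", "hello world"]
print(fertility(sample, char_encode))  # 18 tokens / 4 words = 4.5
```

To measure the real tokenizer, pass the `encode` method of a loaded `SentencePieceProcessor` in place of `char_encode`, and use a held-out text sample per language.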
|
|
| ## Benchmark Results |
|
|
| ### Belebele (reading comprehension, 4-way multiple choice) |
|
|
| | Language | Accuracy | |
| |----------|----------| |
| | English | 31.8% | |
| | Hebrew | 27.0% | |
| | Arabic | 28.4% | |
| | Farsi | 28.2% | |
| | **Overall** | **28.9%** | |
|
|
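| *Note: Random baseline is 25%, so every score here is within about 7 points of chance. This is a 3B model trained on a limited budget, and the results should be read as a baseline rather than as competitive numbers.* |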
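
The Overall row is consistent with an unweighted mean of the four per-language accuracies. The short check below recomputes it and the margin over the 25% chance level. It assumes the mean is unweighted, which the card does not state explicitly.

```python
# Per-language Belebele accuracy (%) copied from the table above.
scores = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}
CHANCE = 25.0  # 4-way multiple choice

# Unweighted (macro) average across languages.
macro_avg = sum(scores.values()) / len(scores)
print(f"macro average: {macro_avg:.2f}%")

# Margin of each language over random guessing.
for lang, acc in scores.items():
    print(f"{lang}: {acc - CHANCE:+.1f} points over chance")
```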
|
|
| ### SFT Generation Quality |
|
|
| - **Hebrew**: Excellent; fluent, factual responses in domain-specific Hebrew |
| - **English**: Coherent, factual |
| - **Farsi**: Good, coherent |
| - **Arabic**: Weak (data quality issue: machine-translated Alpaca) |
|
|
| ## Training Details |
|
|
| ### Pretraining |
| - **Hardware**: 1× p4de.24xlarge (8× A100 80GB) |
| - **Framework**: PyTorch FSDP |
| - **Steps**: 20,000 |
| - **Batch size**: 512K tokens |
| - **Learning rate**: 3e-4 (cosine decay) |
| - **Optimizer**: AdamW |
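
A cosine-decay schedule with the peak learning rate and step count listed above could look like the sketch below. The warmup length and the final learning rate are assumptions chosen for illustration; the card does not state them, and `train_multilingual_3b_fsdp.py` may do this differently.

```python
import math

PEAK_LR = 3e-4        # from the list above
TOTAL_STEPS = 20_000  # from the list above
WARMUP_STEPS = 1_000  # assumption: not stated in this card
MIN_LR = 3e-5         # assumption: not stated in this card


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at MIN_LR past the final step
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 10_000, 20_000):
    print(step, f"{lr_at(step):.2e}")
```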
|
|
|
|
| ### SFT |
| - **Hardware**: 1× g6e.xlarge (L40S 48GB) |
| - **Steps**: 4,000 (best val_loss at step 1,600: 2.1164) |
| - **Data**: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca |
| |
| ## Files |
| |
| ``` |
| SemiticGPT/ |
| ├── checkpoints/ |
| │   ├── best_model.pt           # Pretrained base model |
| │   └── sft_model.pt            # SFT instruction-tuned model |
| ├── tokenizer/ |
| │   ├── multilingual_32k.model  # SentencePiece tokenizer |
| │   └── multilingual_32k.vocab  # Vocabulary file |
| ├── eval/ |
| │   ├── belebele_3b_results.json |
| │   └── belebele_3b.log |
| ├── training_scripts/ |
| │   ├── train_multilingual_3b_fsdp.py |
| │   ├── train_sft_3b.py |
| │   └── prepare_sft_data_v2.py |
| └── README.md |
| ``` |
| |
| ## Usage |
| |
| ```python |
| import torch |
| import sentencepiece as spm |
| |
| # Load tokenizer |
| sp = spm.SentencePieceProcessor() |
| sp.load("tokenizer/multilingual_32k.model") |
|
|
| # Load model (custom architecture; see training_scripts/) |
| # The model uses a custom GPT implementation, not HuggingFace AutoModel |
| checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu") |
| # See train_multilingual_3b_fsdp.py for model class definition |
| ``` |
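
The snippet above stops at `torch.load`, and the internal layout of the checkpoint is not documented here. The sketch below shows one defensive way to pull a plain state dict out of a loaded checkpoint before calling `load_state_dict` on the model class from the training script. The wrapper keys and name prefixes are common PyTorch conventions assumed for illustration, not confirmed for these files, and the demo uses a plain dict with made-up parameter names.

```python
from typing import Any, Mapping

# Assumptions: typical wrapper keys and module prefixes, not confirmed for this repo.
CANDIDATE_KEYS = ("model", "model_state_dict", "state_dict")
LEADING_PREFIXES = ("module.", "_orig_mod.")  # DDP and torch.compile wrappers
FSDP_MARKER = "_fsdp_wrapped_module."         # can appear anywhere in a name


def clean_name(name: str) -> str:
    """Remove wrapper-induced pieces from a parameter name."""
    name = name.replace(FSDP_MARKER, "")
    for prefix in LEADING_PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
    return name


def extract_state_dict(checkpoint: Mapping[str, Any]) -> dict[str, Any]:
    """Return a parameter-name -> value mapping with wrapper prefixes removed."""
    state = checkpoint
    for key in CANDIDATE_KEYS:
        if key in checkpoint and isinstance(checkpoint[key], Mapping):
            state = checkpoint[key]  # checkpoint bundles weights with metadata
            break
    return {clean_name(name): value for name, value in state.items()}


# Demo with a stand-in checkpoint (hypothetical keys and parameter names).
fake_checkpoint = {
    "step": 20_000,
    "model": {
        "module.wte.weight": "tensor-0",
        "_fsdp_wrapped_module.blocks.0.attn.weight": "tensor-1",
    },
}
print(sorted(extract_state_dict(fake_checkpoint)))
```

Printing the keys of the real checkpoint first (`print(checkpoint.keys())`) is the quickest way to see whether any of these assumptions apply.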
| |
| ## Known Limitations |
| |
| - **Arabic generation is weak** due to machine-translated SFT data. Native Arabic instruction data would significantly improve this. |
| - **Small scale**: 3B parameters is modest by current standards. This is an efficiency-focused research model. |
| - **Custom architecture**: Not directly compatible with HuggingFace AutoModel; requires the training script's model class. |
| - **Benchmark scores are baseline-level**: The model is designed for research into efficient multilingual pretraining, not benchmark competition. |
| |
| ## Citation |
| |
| ```bibtex |
| @misc{slasky2026semiticgpt, |
| title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages}, |
| author={Slasky, Ronnen}, |
| year={2026}, |
| url={https://huggingface.co/Slasky/SemiticGPT} |
| } |
| ``` |
| |
| ## License |
| |
| Apache 2.0 |
| |