---
license: apache-2.0
language:
  - he
  - ar
  - en
  - fa
tags:
  - multilingual
  - hebrew
  - arabic
  - farsi
  - persian
  - semitic
  - gpt
  - causal-lm
  - low-resource
  - efficient-training
datasets:
  - CulturaX
  - OSCAR
  - CC-100
  - allenai/dolma
model-index:
  - name: SemiticGPT-3B
    results:
      - task:
          type: text-generation
        dataset:
          type: facebook/belebele
          name: Belebele
        metrics:
          - type: accuracy
            name: English
            value: 31.8
          - type: accuracy
            name: Hebrew
            value: 27.0
          - type: accuracy
            name: Arabic
            value: 28.4
          - type: accuracy
            name: Farsi
            value: 28.2
---

# SemiticGPT-3B 🌍

A 3.04 billion parameter multilingual language model trained from scratch for **Hebrew, Arabic, English, and Farsi** — four languages spanning three scripts (Latin, Hebrew, Arabic).

## Highlights

- **3.04B parameters** trained from scratch on ~50B tokens
- **Custom 32K multilingual BPE tokenizer** optimized for script-diverse languages
- **Hebrew-anchored design**: Hebrew is the primary low-resource target, with cross-lingual transfer
- **Budget-efficient**: Trained on a single p4de.24xlarge instance (8× A100 80GB)
- **SFT variant included**: Instruction-tuned with multilingual supervised data

## Model Variants

| Variant | File | Size | Description |
|---------|------|------|-------------|
| Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data |

## Architecture

- **Type**: GPT-2 style decoder-only transformer
- **Parameters**: 3.04B
- **Layers**: 32
- **Hidden dim**: 2560
- **Attention heads**: 32
- **Vocabulary**: 32,000 (custom multilingual BPE)
- **Context length**: 2048 tokens
- **Tokenizer**: SentencePiece BPE trained on a balanced multilingual corpus
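
The hyperparameters above can be collected into a small config object. This is a minimal sketch: the `GPTConfig` class and its field names are hypothetical and are not taken from the training scripts; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    """Hypothetical config; values copied from the Architecture list."""
    n_layer: int = 32
    n_embd: int = 2560
    n_head: int = 32
    vocab_size: int = 32_000
    block_size: int = 2048  # context length in tokens

    def __post_init__(self):
        # Multi-head attention splits the hidden dim evenly across heads.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_dim(self) -> int:
        return self.n_embd // self.n_head


cfg = GPTConfig()
print(cfg.head_dim)  # 2560 // 32 = 80
```

With the listed values, each attention head works on an 80-dimensional slice of the hidden state.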

## Training Data

Pretrained on ~50B tokens from:
- **CulturaX** (Hebrew, Arabic, Farsi, English)
- **OSCAR** (multilingual web crawl)
- **CC-100** (Common Crawl monolingual)
- **Dolma** (English high-quality)

The language distribution is weighted toward Hebrew as the anchor language.

## Tokenizer

Custom 32K vocabulary trained on a balanced multilingual corpus:

| Language | Fertility (tokens/word) |
|----------|------------------------|
| Hebrew | 1.75 BPB (best) |
| Farsi | 3.14 BPB |
| Arabic | 3.73 BPB |
| English | 3.83 BPB |

The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
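
Fertility, in the tokens-per-word sense used in the table header, can be measured with a few lines of code. The sketch below is illustrative only: `fertility` and `toy_encode` are hypothetical helpers, words are assumed to be whitespace-separated, and a toy tokenizer stands in for the real one so the example runs without the model file.

```python
from typing import Callable, Iterable, List


def fertility(encode: Callable[[str], List], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_encode(text: str) -> List[str]:
    """Toy stand-in tokenizer: cuts each word into chunks of at most 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = ["shalom olam", "hello world"]
print(fertility(toy_encode, sample))  # 8 tokens / 4 words = 2.0
```

To measure the real tokenizer, pass `sp.encode` from the Usage section in place of `toy_encode`, along with held-out text for each language.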

## Benchmark Results

### Belebele (reading comprehension, 4-way multiple choice)

| Language | Accuracy |
|----------|----------|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| **Overall** | **28.9%** |

*Note: Random baseline is 25%, so all four scores sit within 7 points of chance. This is a 3B model trained on a limited budget; see Known Limitations.*
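
To gauge how far each score sits from the 25% chance level, a normal approximation to the binomial gives a rough z-score per language. This is a sketch under one loud assumption: it takes 900 questions per language (the size of the full Belebele set), which this card does not state for the evaluation run.

```python
import math

N_QUESTIONS = 900  # assumption: full Belebele set per language
CHANCE = 0.25      # 4-way multiple choice

# Accuracies from the table above, as fractions.
accuracy = {"English": 0.318, "Hebrew": 0.270, "Arabic": 0.284, "Farsi": 0.282}

# Standard error of the accuracy if the model were guessing at random.
se = math.sqrt(CHANCE * (1 - CHANCE) / N_QUESTIONS)


def z_score(acc: float) -> float:
    """Number of standard errors between an accuracy and the chance level."""
    return (acc - CHANCE) / se


for lang, acc in accuracy.items():
    print(f"{lang}: {acc:.1%}  z = {z_score(acc):.2f}")

# Unweighted mean over the four languages (the "Overall" row).
macro_avg = sum(accuracy.values()) / len(accuracy)
print(f"macro average: {macro_avg:.4f}")
```

Under that assumption, English is clearly above chance (z of about 4.7), Arabic and Farsi are modestly above it (z a little over 2), and Hebrew (z of about 1.4) is not distinguishable from guessing at the usual thresholds.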

### SFT Generation Quality

- **Hebrew**: 🔥 Excellent — fluent, factual responses in domain-specific Hebrew
- **English**: Coherent, factual
- **Farsi**: Good, coherent
- **Arabic**: Weak (data quality issue — machine-translated Alpaca)

## Training Details

### Pretraining
- **Hardware**: 1× p4de.24xlarge (8× A100 80GB)
- **Framework**: PyTorch FSDP
- **Steps**: 20,000
- **Batch size**: 512K tokens
- **Learning rate**: 3e-4 (cosine decay)
- **Optimizer**: AdamW
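
A cosine-decay schedule with the listed peak learning rate and step count might look like the sketch below. The warmup length and the final learning rate are assumptions, since the card states neither; `lr_at` is a hypothetical helper, not a function from the training scripts.

```python
import math

PEAK_LR = 3e-4        # from the list above
TOTAL_STEPS = 20_000  # from the list above
WARMUP_STEPS = 1_000  # assumption: not stated in this card
MIN_LR = 3e-5         # assumption: not stated in this card


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at MIN_LR past the last step
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 1_000, 10_500, 20_000):
    print(step, f"{lr_at(step):.2e}")
```

In a PyTorch training loop, a function like this would typically be applied by setting `param_group["lr"]` on the AdamW optimizer before each step.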


### SFT
- **Hardware**: 1× g6e.xlarge (L40S 48GB)
- **Steps**: 4,000 (best val_loss at step 1,600: 2.1164)
- **Data**: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca

## Files

```
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt          # Pretrained base model
│   └── sft_model.pt           # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model # SentencePiece tokenizer
│   └── multilingual_32k.vocab # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
```

## Usage

```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file="tokenizer/multilingual_32k.model")

# Load the checkpoint onto the CPU. The model uses a custom GPT implementation,
# not HuggingFace AutoModel; the model class is defined in
# training_scripts/train_multilingual_3b_fsdp.py.
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
```
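
Checkpoints saved from FSDP training sometimes nest the weights under a wrapper key or carry wrapper prefixes in parameter names. The helpers below are a hedged sketch for normalizing such a checkpoint before calling `load_state_dict`. The wrapper keys and prefixes are common conventions, not confirmed for this repository's files, and both function names are hypothetical.

```python
from typing import Any, Dict

# Assumptions: common conventions, not verified against this repository.
CANDIDATE_KEYS = ("model", "model_state_dict", "state_dict")
LEADING_PREFIXES = ("module.", "_orig_mod.")  # added by DDP / torch.compile
FSDP_INFIX = "_fsdp_wrapped_module."          # added by FSDP wrapping


def extract_state_dict(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Return the weights dict, whether or not it is nested under a wrapper key."""
    for key in CANDIDATE_KEYS:
        if key in checkpoint and isinstance(checkpoint[key], dict):
            return checkpoint[key]
    return checkpoint


def strip_prefixes(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Drop wrapper-added name fragments so keys match the plain model class."""
    cleaned = {}
    for name, value in state_dict.items():
        name = name.replace(FSDP_INFIX, "")
        for prefix in LEADING_PREFIXES:
            if name.startswith(prefix):
                name = name[len(prefix):]
        cleaned[name] = value
    return cleaned


# Toy checkpoint with placeholder values instead of tensors.
toy = {
    "step": 20000,
    "model": {
        "module.wte.weight": 1,
        "blocks.0._fsdp_wrapped_module.attn.weight": 2,
    },
}
print(strip_prefixes(extract_state_dict(toy)))
```

With the real file, `strip_prefixes(extract_state_dict(checkpoint))` would produce the dictionary to pass to `load_state_dict` on an instance of the model class from the training script. Printing `checkpoint.keys()` first shows which layout actually applies.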

## Known Limitations

- **Arabic generation is weak** because the Arabic SFT data is machine-translated Alpaca. Native Arabic instruction data would likely improve this.
- **Small scale**: 3B parameters is modest by current standards. This is an efficiency-focused research model.
- **Custom architecture**: Not directly compatible with HuggingFace AutoModel — requires the training script's model class.
- **Benchmark scores are baseline-level**: The model is designed for research into efficient multilingual pretraining, not benchmark competition.

## Citation

```bibtex
@misc{slasky2026semiticgpt,
  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/SemiticGPT}
}
```

## License

Apache 2.0