	---
	license: apache-2.0
	language:
	- he
	- ar
	- en
	- fa
	tags:
	- multilingual
	- hebrew
	- arabic
	- farsi
	- persian
	- semitic
	- gpt
	- causal-lm
	- low-resource
	- efficient-training
	datasets:
	- CulturaX
	- OSCAR
	- CC-100
	- allenai/dolma
	model-index:
	- name: SemiticGPT-3B
	  results:
	  - task:
	      type: text-generation
	    dataset:
	      type: facebook/belebele
	      name: Belebele
	    metrics:
	    - type: accuracy
	      name: English
	      value: 31.8
	    - type: accuracy
	      name: Hebrew
	      value: 27.0
	    - type: accuracy
	      name: Arabic
	      value: 28.4
	    - type: accuracy
	      name: Farsi
	      value: 28.2
	---

	# SemiticGPT-3B 🌍

	A 3.04 billion parameter multilingual language model trained from scratch for Hebrew, Arabic, English, and Farsi — four languages spanning three scripts (Latin, Hebrew, Arabic).

	## Highlights

	- 3.04B parameters trained from scratch on ~50B tokens
	- Custom 32K multilingual BPE tokenizer optimized for script-diverse languages
	- Hebrew-anchored design: Hebrew as primary low-resource target with cross-lingual transfer
	- Budget-efficient: Trained on a single p4de.24xlarge
	- SFT variant included: Instruction-tuned with multilingual supervised data

	## Model Variants

	| Variant | File | Size | Description |
	|---------|------|------|-------------|
	| Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) |
	| SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data |

	## Architecture

	- Type: GPT-2 style decoder-only transformer
	- Parameters: 3.04B
	- Layers: 32
	- Hidden dim: 2560
	- Attention heads: 32
	- Vocabulary: 32,000 (custom multilingual BPE)
	- Context length: 2048 tokens
	- Tokenizer: SentencePiece BPE trained on balanced multilingual corpus
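
As a rough sanity check on these dimensions, the sketch below counts parameters for a textbook GPT-2 layout. The 4x MLP width, learned position embeddings, bias terms, and tied input/output embeddings are assumptions, not details confirmed by this card, and `gpt2_param_count` is a hypothetical helper rather than part of the repository.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_ctx, mlp_ratio=4, tied=True):
    """Parameter count for a textbook GPT-2 layout (assumed, not read from the repo)."""
    # Attention: QKV projection plus output projection, with biases
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP: two linear layers with hidden width mlp_ratio * d_model, with biases
    mlp = 2 * mlp_ratio * d_model * d_model + (mlp_ratio + 1) * d_model
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * 2 * d_model
    per_layer = attn + mlp + norms
    # Token embeddings plus learned position embeddings
    embed = vocab_size * d_model + n_ctx * d_model
    # Separate output matrix only when embeddings are not tied
    head = 0 if tied else vocab_size * d_model
    final_norm = 2 * d_model
    return n_layer * per_layer + embed + head + final_norm


total = gpt2_param_count(n_layer=32, d_model=2560, vocab_size=32_000, n_ctx=2048)
head_dim = 2560 // 32  # dimensions per attention head
print(f"{total / 1e9:.2f}B parameters, head_dim={head_dim}")
```

Under these assumptions the estimate comes to about 2.6B, so the reported 3.04B presumably reflects details not listed here (for example a wider MLP or untied output embeddings). The model class in `training_scripts/` remains the reference.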

	## Training Data

	Pretrained on ~50B tokens from:
	- CulturaX (Hebrew, Arabic, Farsi, English)
	- OSCAR (multilingual web crawl)
	- CC-100 (Common Crawl monolingual)
	- Dolma (English high-quality)

	The language distribution is weighted toward Hebrew, the anchor language.

	## Tokenizer

	Custom 32K vocabulary trained on a balanced multilingual corpus:

	| Language | Fertility (tokens/word) |
	|----------|------------------------|
	| Hebrew | 1.75 BPB (best) |
	| Farsi | 3.14 BPB |
	| Arabic | 3.73 BPB |
	| English | 3.83 BPB |

	The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
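
The table header names fertility (tokens per word) while its values carry a BPB suffix. The sketch below covers the fertility reading only: a minimal way to measure tokens per whitespace-separated word. The `fertility` helper and the toy encoder are hypothetical and stand in for the real tokenizer so the example runs without the model file.

```python
def fertility(texts, encode):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(encode(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_encode(text):
    """Toy stand-in encoder: splits each word into chunks of at most 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = ["שלום עולם", "hello world"]
print(fertility(sample, toy_encode))
```

With the repository tokenizer, `encode` could be the `encode_as_ids` method of a loaded `SentencePieceProcessor`. Whitespace splitting is itself an assumption about what counts as a word, and it treats all four languages the same way.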

	## Benchmark Results

	### Belebele (reading comprehension, 4-way multiple choice)

	| Language | Accuracy |
	|----------|----------|
	| English | 31.8% |
	| Hebrew | 27.0% |
	| Arabic | 28.4% |
	| Farsi | 28.2% |
	| Overall | 28.9% |

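	Note: The random baseline for 4-way multiple choice is 25%, so these scores are between 2.0 and 6.8 points above chance. This is a 3B model trained on a limited budget, and the results should be read as baseline-level.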
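
To put the distance from the 25% baseline in context, the sketch below expresses each score in standard errors above chance. It assumes 900 questions per language (the standard Belebele test set size) and a simple binomial model. `z_above_chance` is a hypothetical helper, not part of the evaluation scripts.

```python
import math

N_QUESTIONS = 900   # assumed: Belebele questions per language
CHANCE = 0.25       # 4-way multiple choice

scores = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}


def z_above_chance(acc_percent, n=N_QUESTIONS, chance=CHANCE):
    """Number of standard errors between an accuracy and the chance rate."""
    se = math.sqrt(chance * (1 - chance) / n)  # binomial standard error under chance
    return (acc_percent / 100 - chance) / se


for lang, acc in scores.items():
    print(f"{lang}: {acc - 25:.1f} points above chance, z = {z_above_chance(acc):.1f}")

# Unweighted mean of the four per-language scores
overall = sum(scores.values()) / len(scores)
```

With 900 questions the standard error under chance is about 1.4 points, so the Hebrew score sits within roughly 1.4 standard errors of chance, while the English score is more than 4 standard errors above it. The unweighted mean of the four scores is 28.85, consistent with the reported overall of 28.9%.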

	### SFT Generation Quality

	- Hebrew: 🔥 Excellent — fluent, factual responses in domain-specific Hebrew
	- English: Coherent, factual
	- Farsi: Good, coherent
	- Arabic: Weak (data quality issue — machine-translated Alpaca)

	## Training Details

	### Pretraining
	- Hardware: 1× p4de.24xlarge (8× A100 80GB)
	- Framework: PyTorch FSDP
	- Steps: 20,000
	- Batch size: 512K tokens
	- Learning rate: 3e-4 (cosine decay)
	- Optimizer: AdamW
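
The learning-rate entry above can be sketched as a schedule over the 20,000 steps. This is a minimal illustration, not the training script's code: the warmup length and the final learning rate are assumptions, since the card only states the 3e-4 peak and cosine decay.

```python
import math


def cosine_lr(step, max_steps=20_000, peak_lr=3e-4, warmup_steps=1_000, min_lr=3e-5):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`.

    warmup_steps and min_lr are illustrative assumptions, not values from this card.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for s in (0, 1_000, 10_500, 20_000):
    print(s, f"{cosine_lr(s):.2e}")
```

The rate reaches the 3e-4 peak at the end of warmup, passes the midpoint between peak and floor halfway through the decay phase, and ends at the assumed floor at step 20,000.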


	### SFT
	- Hardware: 1× g6e.xlarge (L40S 48GB)
	- Steps: 4,000 (best val_loss at step 1,600: 2.1164)
	- Data: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca
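
For intuition, the best validation loss can be converted to a perplexity. The conversion below assumes the reported loss is a mean cross-entropy in nats per token (the PyTorch default), which the card does not state explicitly.

```python
import math

best_val_loss = 2.1164                # reported at SFT step 1,600
perplexity = math.exp(best_val_loss)  # valid only if the loss is mean nats per token
print(f"perplexity ~ {perplexity:.2f}")
```

Under that assumption the best checkpoint has a validation perplexity of about 8.3 over the 32K vocabulary. Since the best loss occurs at step 1,600 of 4,000, later steps did not improve on it.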

	## Files

	```
	SemiticGPT/
	├── checkpoints/
	│   ├── best_model.pt               # Pretrained base model
	│   └── sft_model.pt                # SFT instruction-tuned model
	├── tokenizer/
	│   ├── multilingual_32k.model      # SentencePiece tokenizer
	│   └── multilingual_32k.vocab      # Vocabulary file
	├── eval/
	│   ├── belebele_3b_results.json
	│   └── belebele_3b.log
	├── training_scripts/
	│   ├── train_multilingual_3b_fsdp.py
	│   ├── train_sft_3b.py
	│   └── prepare_sft_data_v2.py
	└── README.md
	```

	## Usage

	```python
	import torch
	import sentencepiece as spm

	# Load the SentencePiece tokenizer
	sp = spm.SentencePieceProcessor()
	sp.load("tokenizer/multilingual_32k.model")

	# Load the checkpoint on CPU. The model is a custom GPT implementation,
	# not a HuggingFace AutoModel, so the weights must be loaded into the model
	# class defined in training_scripts/train_multilingual_3b_fsdp.py.
	checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
	```
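
Checkpoints saved from wrapped modules (DDP, FSDP, `torch.compile`) sometimes carry wrapper names such as `module`, `_fsdp_wrapped_module`, or `_orig_mod` in their keys. Whether `best_model.pt` does is not stated here, so the helper below is only a hedged sketch for the case where `load_state_dict` reports unexpected keys. The function name and the example keys are hypothetical.

```python
# Wrapper names that PyTorch wrappers may insert into state-dict keys
WRAPPER_NAMES = {"module", "_fsdp_wrapped_module", "_orig_mod"}


def strip_wrapper_prefixes(state_dict):
    """Return a copy of `state_dict` with wrapper name components removed from keys."""
    cleaned = {}
    for key, value in state_dict.items():
        # Drop any dotted component that is a known wrapper name
        parts = [p for p in key.split(".") if p not in WRAPPER_NAMES]
        cleaned[".".join(parts)] = value
    return cleaned


# Illustrative keys only; the real checkpoint layout may differ
example = {
    "_fsdp_wrapped_module.blocks.0.attn.weight": 1,
    "module.lm_head.weight": 2,
}
print(strip_wrapper_prefixes(example))
```

This would also strip a genuine submodule that happens to be named `module`, so it is worth comparing the cleaned keys against the model class's own `state_dict()` keys before loading.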

	## Known Limitations

	- Arabic generation is weak due to machine-translated SFT data. Native Arabic instruction data would significantly improve this.
	- Small scale: 3B parameters is modest by current standards. This is an efficiency-focused research model.
	- Custom architecture: Not directly compatible with HuggingFace AutoModel — requires the training script's model class.
	- Benchmark scores are baseline-level: The model is designed for research into efficient multilingual pretraining, not benchmark competition.

	## Citation

	```bibtex
	@misc{slasky2026semiticgpt,
	  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
	  author={Slasky, Ronnen},
	  year={2026},
	  url={https://huggingface.co/Slasky/SemiticGPT}
	}
	```

	## License

	Apache 2.0