# SARF: Balanced Bilingual Tokenizer

SARF (صرف, "morphology") is a parity-aware MYTE tokenizer trained on balanced Arabic-English data.
It combines character-count parity sampling, Morfessor-based morphological segmentation, and BPE to achieve balanced representation across Arabic and English.
## Key Metrics

| Metric | Value |
|---|---|
| Vocabulary size | 72,195 (65,796 BPE + 6,399 PUA morpheme tokens) |
| Arabic chars/token | 2.80 |
| English chars/token | 2.77 |
| Arabic tokens/word | 2.08 |
| English tokens/word | 1.77 |
| AR/EN parity ratio | 1.01 (near-perfect balance) |
| PUA atomicity | 100% (all 6,399 codes → single tokens) |
| Training data | 16B characters (balanced 50/50 AR/EN) |
| BPE merges | 65,536 (parity-aware) |
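
The headline numbers above are internally consistent and easy to sanity-check: the parity ratio is just Arabic chars/token divided by English chars/token, and the vocabulary is the sum of BPE and PUA tokens. A quick check (values taken directly from the table):

```python
# Sanity-check the reported metrics (all numbers from the table above).
ar_chars_per_token = 2.80  # Arabic compression rate
en_chars_per_token = 2.77  # English compression rate

# AR/EN parity ratio: how close the two languages' compression rates are.
parity_ratio = ar_chars_per_token / en_chars_per_token
print(round(parity_ratio, 2))  # 1.01 -> near-perfect balance

# Total vocabulary = BPE tokens + PUA morpheme tokens.
vocab_size = 65_796 + 6_399
print(vocab_size)  # 72195
```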
## Pipeline Phases

- Lexicon Extraction – extract word lexicons from the Arabic and English corpora
- Morfessor Training – train unsupervised morphological segmentation models
- Morpheme Mapping – map morphemes to PUA Unicode characters (U+E000–U+F8FF)
- Text Preprocessing – rewrite training text using morpheme-aware byte encoding
- BPE Training – train BPE merges on the preprocessed text (parity-aware character-count sampling)
- Tokenizer Assembly – build the final tokenizer with vocabulary and special tokens
- Evaluation – validate PUA atomicity, compression rates, and Arabic/English parity
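
The Morpheme Mapping and Text Preprocessing phases can be sketched as follows. This is a minimal illustration, not the project's actual scripts: it assigns each morpheme a Private Use Area codepoint starting at U+E000 (as `morf_map.json` does for the 6,399 real morphemes) and rewrites text with a simple greedy longest-match pass; the helper names are invented for the example.

```python
# Sketch of morpheme -> PUA mapping. PUA range U+E000..U+F8FF gives
# 6,400 codepoints, enough for the 6,399 morphemes in morf_map.json.
PUA_START, PUA_END = 0xE000, 0xF8FF

def build_morf_map(morphemes):
    if len(morphemes) > PUA_END - PUA_START + 1:
        raise ValueError("more morphemes than available PUA codepoints")
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}

def rewrite(text, morf_map, morphemes_by_length):
    # Greedy longest-match rewrite of known morphemes into PUA characters;
    # characters not covered by any morpheme pass through unchanged.
    out, i = [], 0
    while i < len(text):
        for m in morphemes_by_length:
            if text.startswith(m, i):
                out.append(morf_map[m])
                i += len(m)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

morf_map = build_morf_map(["ing", "un", "happi"])
morphemes = sorted(morf_map, key=len, reverse=True)  # longest match first
print(rewrite("unhappiness", morf_map, morphemes))
```

After this rewrite, BPE never sees the original morpheme bytes, so each mapped morpheme can survive as a single atomic token.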
## Files

### lexicons/

| File | Size | Source Path | Description |
|---|---|---|---|
| lexicon_ar.txt | 16.7 MB | /workspace/smctm/morfessor_models/lexicon_ar.txt | Arabic word lexicon extracted from training corpus |
| lexicon_en.txt | 53.3 MB | /workspace/smctm/morfessor_models/lexicon_en.txt | English word lexicon extracted from training corpus |
### morfessor/

| File | Size | Source Path | Description |
|---|---|---|---|
| morf_map.json | 136.3 KB | /workspace/smctm/morfessor_models/morf_map.json | Morpheme-to-PUA Unicode character mapping (6,399 morphemes) |
| morfessor_ar.bin | 4.0 MB | /workspace/smctm/morfessor_models/morfessor_ar.bin | Trained Morfessor model for Arabic morphological segmentation |
| morfessor_en.bin | 1.8 MB | /workspace/smctm/morfessor_models/morfessor_en.bin | Trained Morfessor model for English morphological segmentation |
### training/

| File | Size | Source Path | Description |
|---|---|---|---|
| merges.txt | 1.2 MB | /root/.cache/deeplatent/tokenizer_128k_v4/merges.txt | BPE merge rules (65,536 merges, parity-aware) |
| phase1_stats.json | 217.0 B | /root/.cache/deeplatent/tokenizer_128k_v4/phase1_stats.json | Statistics from Phase 1 text preprocessing |
| train.ar | 12.8 GB | /root/.cache/deeplatent/tokenizer_128k_v4/train.ar | MYTE-preprocessed Arabic training text (PUA-encoded) |
| train.en | 9.1 GB | /root/.cache/deeplatent/tokenizer_128k_v4/train.en | MYTE-preprocessed English training text |
### tokenizer/

| File | Size | Source Path | Description |
|---|---|---|---|
| special_tokens_map.json | 686.0 B | /root/.cache/deeplatent/tokenizer_128k_v4/special_tokens_map.json | Special tokens mapping |
| tokenizer.json | 10.1 MB | /root/.cache/deeplatent/tokenizer_128k_v4/tokenizer.json | Final HuggingFace tokenizer (42,239 tokens = 35,840 BPE + 6,399 PUA) |
| tokenizer_config.json | 1.0 MB | /root/.cache/deeplatent/tokenizer_128k_v4/tokenizer_config.json | HuggingFace tokenizer configuration |
| vocab.json | 2.3 MB | /root/.cache/deeplatent/tokenizer_128k_v4/vocab.json | Vocabulary mapping (token string → token ID) |
## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer/tokenizer.json"
)

# For Arabic text, preprocess through MYTE first:
from scripts.rewrite_bytes import ByteRewriter

rewriter = ByteRewriter("morfessor/morf_map.json")
preprocessed = rewriter.rewrite_text("مرحبا بالعالم")  # "Hello, world"
tokens = tokenizer.encode(preprocessed)
```
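
The Evaluation phase's PUA-atomicity metric can be reproduced with a generic helper. This is an illustrative sketch, not one of the released scripts: `pua_atomicity` and `toy_encode` are invented names, and the toy encoder merely stands in for the real tokenizer's `encode`.

```python
# Illustrative check that every PUA morpheme character encodes to exactly
# one token (the "100% PUA atomicity" metric). `encode` is any callable
# mapping a string to a list of token ids, e.g. tokenizer.encode.
def pua_atomicity(encode, pua_chars):
    atomic = sum(1 for c in pua_chars if len(encode(c)) == 1)
    return atomic / len(pua_chars)

# Toy encoder standing in for the real tokenizer: PUA characters are
# atomic; everything else falls back to one token per UTF-8 byte.
def toy_encode(text):
    ids = []
    for ch in text:
        if 0xE000 <= ord(ch) <= 0xF8FF:
            ids.append(ord(ch))
        else:
            ids.extend(ch.encode("utf-8"))
    return ids

pua_chars = [chr(0xE000 + i) for i in range(6_399)]
print(pua_atomicity(toy_encode, pua_chars))  # 1.0 for the toy encoder
```

With the real tokenizer, a ratio below 1.0 would mean some morpheme codes are being split, defeating the point of the PUA rewrite.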
## Training

Trained using the parity-aware pipeline in DeepLatent.
See `speedrun.sh` and `scripts/tok_train_myte.py` for details.
## Training Environment

- Machine: /workspace/smctm
- Tokenizer output: /root/.cache/deeplatent/tokenizer_parity
- Morfessor models: ./morfessor_models
- Training time: ~5.5 hours (328.9 minutes)