SARF – Balanced Bilingual Tokenizer

SARF (صرف, "morphology") is a parity-aware MYTE tokenizer trained on balanced Arabic-English data.

It combines character-count parity sampling, Morfessor-based morphological segmentation, and BPE to achieve balanced representation across Arabic and English.

Key Metrics

| Metric | Value |
|---|---|
| Vocabulary size | 72,195 (65,796 BPE + 6,399 PUA morpheme tokens) |
| Arabic chars/token | 2.80 |
| English chars/token | 2.77 |
| Arabic tokens/word | 2.08 |
| English tokens/word | 1.77 |
| AR/EN parity ratio | 1.01 (near-perfect balance) |
| PUA atomicity | 100% (all 6,399 codes → single tokens) |
| Training data | 16B characters (balanced 50/50 AR/EN) |
| BPE merges | 65,536 (parity-aware) |
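
The parity ratio reported above is simply the ratio of the two chars/token figures; a quick sanity check of that arithmetic:

```python
# Parity ratio = Arabic compression / English compression,
# using the chars/token values from the metrics table.
ar_chars_per_token = 2.80
en_chars_per_token = 2.77

parity = ar_chars_per_token / en_chars_per_token
print(round(parity, 2))  # 1.01 — a value of 1.0 would be exact AR/EN balance
```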

Pipeline Phases

  1. Lexicon Extraction โ€” Extract word lexicons from Arabic and English corpora
  2. Morfessor Training โ€” Train unsupervised morphological segmentation models
  3. Morpheme Mapping — Map morphemes to PUA Unicode characters (U+E000–U+F8FF)
  4. Text Preprocessing โ€” Rewrite training text using morpheme-aware byte encoding
  5. BPE Training โ€” Train BPE merges on preprocessed text (parity-aware character-count sampling)
  6. Tokenizer Assembly โ€” Build final tokenizer with vocabulary and special tokens
  7. Evaluation โ€” Validate PUA atomicity, compression rates, and Arabic/English parity
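
The core idea of Phase 3 can be sketched as follows. This is a hypothetical illustration (helper names are not from the repo): each morpheme is assigned its own Private Use Area code point, so it passes through BPE as one atomic symbol.

```python
# Illustrative sketch of morpheme-to-PUA mapping (Phase 3).
# The real pipeline stores this mapping in morfessor/morf_map.json.
PUA_START = 0xE000  # first code point of the Unicode Private Use Area

def build_pua_map(morphemes):
    """Assign one PUA code point per morpheme, in list order."""
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}

morph_map = build_pua_map(["ing", "re", "al"])
assert morph_map["ing"] == "\ue000"  # first morpheme -> U+E000
```

Because each morpheme becomes a single character, BPE never splits it, which is what the 100% PUA-atomicity metric verifies.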

Files

lexicons/

| File | Size | Source Path | Description |
|---|---|---|---|
| lexicon_ar.txt | 16.7 MB | /workspace/smctm/morfessor_models/lexicon_ar.txt | Arabic word lexicon extracted from training corpus |
| lexicon_en.txt | 53.3 MB | /workspace/smctm/morfessor_models/lexicon_en.txt | English word lexicon extracted from training corpus |

morfessor/

| File | Size | Source Path | Description |
|---|---|---|---|
| morf_map.json | 136.3 KB | /workspace/smctm/morfessor_models/morf_map.json | Morpheme-to-PUA Unicode character mapping (6,399 morphemes) |
| morfessor_ar.bin | 4.0 MB | /workspace/smctm/morfessor_models/morfessor_ar.bin | Trained Morfessor model for Arabic morphological segmentation |
| morfessor_en.bin | 1.8 MB | /workspace/smctm/morfessor_models/morfessor_en.bin | Trained Morfessor model for English morphological segmentation |

training/

| File | Size | Source Path | Description |
|---|---|---|---|
| merges.txt | 1.2 MB | /root/.cache/deeplatent/tokenizer_128k_v4/merges.txt | BPE merge rules (65,536 merges, parity-aware) |
| phase1_stats.json | 217.0 B | /root/.cache/deeplatent/tokenizer_128k_v4/phase1_stats.json | Statistics from Phase 1 text preprocessing |
| train.ar | 12.8 GB | /root/.cache/deeplatent/tokenizer_128k_v4/train.ar | MYTE-preprocessed Arabic training text (PUA-encoded) |
| train.en | 9.1 GB | /root/.cache/deeplatent/tokenizer_128k_v4/train.en | MYTE-preprocessed English training text |

tokenizer/

| File | Size | Source Path | Description |
|---|---|---|---|
| special_tokens_map.json | 686.0 B | /root/.cache/deeplatent/tokenizer_128k_v4/special_tokens_map.json | Special tokens mapping |
| tokenizer.json | 10.1 MB | /root/.cache/deeplatent/tokenizer_128k_v4/tokenizer.json | Final HuggingFace tokenizer (42,239 tokens = 35,840 BPE + 6,399 PUA) |
| tokenizer_config.json | 1.0 MB | /root/.cache/deeplatent/tokenizer_128k_v4/tokenizer_config.json | HuggingFace tokenizer configuration |
| vocab.json | 2.3 MB | /root/.cache/deeplatent/tokenizer_128k_v4/vocab.json | Vocabulary mapping (token string → token ID) |

Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer/tokenizer.json"
)

# Arabic text must be preprocessed through MYTE first:
from scripts.rewrite_bytes import ByteRewriter

rewriter = ByteRewriter("morfessor/morf_map.json")
preprocessed = rewriter.rewrite_text("مرحبا بالعالم")  # "Hello, world"
tokens = tokenizer.encode(preprocessed)
```
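
Conceptually, the rewriting step replaces known morphemes with their PUA characters before encoding, and the same mapping inverts the substitution after decoding. A minimal, self-contained stand-in (the mapping and function names here are illustrative, not the repo's actual ByteRewriter API):

```python
# Toy morpheme map: one Arabic morpheme -> one PUA character.
morph_map = {"مرحبا": "\ue000"}

def rewrite(text, mapping):
    """Replace each morpheme with its atomic PUA character."""
    for morpheme, pua in mapping.items():
        text = text.replace(morpheme, pua)
    return text

def restore(text, mapping):
    """Invert the substitution after decoding."""
    for morpheme, pua in mapping.items():
        text = text.replace(pua, morpheme)
    return text

s = rewrite("مرحبا بالعالم", morph_map)
assert restore(s, morph_map) == "مرحبا بالعالم"  # round trip is lossless
```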

Training

Trained using the parity-aware pipeline in DeepLatent. See speedrun.sh and scripts/tok_train_myte.py for details.

Training Environment

  • Machine: /workspace/smctm
  • Tokenizer output: /root/.cache/deeplatent/tokenizer_parity
  • Morfessor models: ./morfessor_models
  • Training time: ~5.5 hours (328.9 minutes)