SARF – Balanced Bilingual Tokenizer

SARF (صرف, "morphology") is a parity-aware MYTE tokenizer trained on balanced Arabic-English data.

It combines character-count parity sampling, Morfessor-based morphological segmentation, and BPE to achieve balanced representation across Arabic and English.

Key Metrics

| Metric | Value |
|---|---|
| Vocabulary size | 72,195 (65,796 BPE + 6,399 PUA morpheme tokens) |
| Arabic chars/token | 2.80 |
| English chars/token | 2.77 |
| Arabic tokens/word | 2.08 |
| English tokens/word | 1.77 |
| AR/EN parity ratio | 1.01 (near-perfect balance) |
| PUA atomicity | 100% (all 6,399 codes → single tokens) |
| Training data | 16B characters (balanced 50/50 AR/EN) |
| BPE merges | 65,536 (parity-aware) |
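
The parity ratio reported above is simply the ratio of the two chars/token figures; a quick sanity check of that arithmetic:

```python
# Parity ratio = Arabic compression / English compression,
# using the chars/token values from the metrics table.
ar_chars_per_token = 2.80
en_chars_per_token = 2.77

parity = ar_chars_per_token / en_chars_per_token
print(round(parity, 2))  # 1.01 — a value of 1.0 would be exact AR/EN balance
```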

Pipeline Phases

  1. Lexicon Extraction โ€” Extract word lexicons from Arabic and English corpora
  2. Morfessor Training โ€” Train unsupervised morphological segmentation models
  3. Morpheme Mapping — Map morphemes to PUA Unicode characters (U+E000–U+F8FF)
  4. Text Preprocessing โ€” Rewrite training text using morpheme-aware byte encoding
  5. BPE Training โ€” Train BPE merges on preprocessed text (parity-aware character-count sampling)
  6. Tokenizer Assembly โ€” Build final tokenizer with vocabulary and special tokens
  7. Evaluation โ€” Validate PUA atomicity, compression rates, and Arabic/English parity
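
The core idea of Phase 3 can be sketched as follows. This is a hypothetical illustration (helper names are not from the repo): each morpheme is assigned its own Private Use Area code point, so it passes through BPE as one atomic symbol.

```python
# Illustrative sketch of morpheme-to-PUA mapping (Phase 3).
# The real pipeline stores this mapping in morfessor/morf_map.json.
PUA_START = 0xE000  # first code point of the Unicode Private Use Area

def build_pua_map(morphemes):
    """Assign one PUA code point per morpheme, in list order."""
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}

morph_map = build_pua_map(["ing", "re", "al"])
assert morph_map["ing"] == "\ue000"  # first morpheme -> U+E000
```

Because each morpheme becomes a single character, BPE never splits it, which is what the 100% PUA-atomicity metric verifies.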

Files

lexicons/

| File | Size | Source Path | Description |
|---|---|---|---|
| lexicon_ar.txt | 16.7 MB | /workspace/smctm/morfessor_models/lexicon_ar.txt | Arabic word lexicon extracted from training corpus |
| lexicon_en.txt | 53.3 MB | /workspace/smctm/morfessor_models/lexicon_en.txt | English word lexicon extracted from training corpus |

morfessor/

| File | Size | Source Path | Description |
|---|---|---|---|
| morf_map.json | 136.3 KB | /workspace/smctm/morfessor_models/morf_map.json | Morpheme-to-PUA Unicode character mapping (6,399 morphemes) |
| morfessor_ar.bin | 4.0 MB | /workspace/smctm/morfessor_models/morfessor_ar.bin | Trained Morfessor model for Arabic morphological segmentation |
| morfessor_en.bin | 1.8 MB | /workspace/smctm/morfessor_models/morfessor_en.bin | Trained Morfessor model for English morphological segmentation |

training/

| File | Size | Source Path | Description |
|---|---|---|---|
| merges.txt | 1.2 MB | /root/.cache/deeplatent/tokenizer_128k_v4/merges.txt | BPE merge rules (65,536 merges, parity-aware) |
| phase1_stats.json | 217.0 B | /root/.cache/deeplatent/tokenizer_128k_v4/phase1_stats.json | Statistics from Phase 1 text preprocessing |
| train.ar | 12.8 GB | /root/.cache/deeplatent/tokenizer_128k_v4/train.ar | MYTE-preprocessed Arabic training text (PUA-encoded) |
| train.en | 9.1 GB | /root/.cache/deeplatent/tokenizer_128k_v4/train.en | MYTE-preprocessed English training text |

tokenizer/

| File | Size | Source Path | Description |
|---|---|---|---|
| special_tokens_map.json | 686.0 B | /root/.cache/deeplatent/tokenizer_128k_v4/special_tokens_map.json | Special tokens mapping |
| tokenizer.json | 10.1 MB | /root/.cache/deeplatent/tokenizer_128k_v4/tokenizer.json | Final HuggingFace tokenizer (42,239 tokens = 35,840 BPE + 6,399 PUA) |
| tokenizer_config.json | 1.0 MB | /root/.cache/deeplatent/tokenizer_128k_v4/tokenizer_config.json | HuggingFace tokenizer configuration |
| vocab.json | 2.3 MB | /root/.cache/deeplatent/tokenizer_128k_v4/vocab.json | Vocabulary mapping (token string → token ID) |

Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer/tokenizer.json"
)

# Arabic text must be preprocessed through MYTE first:
from scripts.rewrite_bytes import ByteRewriter

rewriter = ByteRewriter("morfessor/morf_map.json")
preprocessed = rewriter.rewrite_text("مرحبا بالعالم")  # "Hello, world"
tokens = tokenizer.encode(preprocessed)
```
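
Conceptually, the rewriting step replaces known morphemes with their PUA characters before encoding, and the same mapping inverts the substitution after decoding. A minimal, self-contained stand-in (the mapping and function names here are illustrative, not the repo's actual ByteRewriter API):

```python
# Toy morpheme map: one Arabic morpheme -> one PUA character.
morph_map = {"مرحبا": "\ue000"}

def rewrite(text, mapping):
    """Replace each morpheme with its atomic PUA character."""
    for morpheme, pua in mapping.items():
        text = text.replace(morpheme, pua)
    return text

def restore(text, mapping):
    """Invert the substitution after decoding."""
    for morpheme, pua in mapping.items():
        text = text.replace(pua, morpheme)
    return text

s = rewrite("مرحبا بالعالم", morph_map)
assert restore(s, morph_map) == "مرحبا بالعالم"  # round trip is lossless
```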

Training

Trained using the parity-aware pipeline in DeepLatent. See speedrun.sh and scripts/tok_train_myte.py for details.

Training Environment

  • Machine: /workspace/smctm
  • Tokenizer output: /root/.cache/deeplatent/tokenizer_parity
  • Morfessor models: ./morfessor_models
  • Training time: ~5.5 hours (328.9 minutes)