# SARF Tokenizer - Parity Benchmark Results
## Overview

SARF (Semantically-Aware Robust Foundational) tokenizers use MYTE morphological preprocessing and are optimized for Arabic-English parity.
**Winner: SARF-88k-plus** - best parity (1.0162, closest to perfect 1.0)
## 5-Run Averaged Benchmark Results
**Metric:** Parity = AR_chars/token ÷ EN_chars/token (1.0 = perfect balance)

**Methodology:** 5 runs × 5,000 samples/run, randomly sampled
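The metric can be sketched in a few lines (illustrative helpers assuming a simple `encode` callable; the released `benchmark_script.py` may differ in sampling and averaging details):

```python
# Illustrative computation of the parity metric described above.
def chars_per_token(encode, texts):
    """Average characters per token over a sample of texts."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

def parity(encode, ar_texts, en_texts):
    """AR chars/token divided by EN chars/token; 1.0 = perfect balance."""
    return chars_per_token(encode, ar_texts) / chars_per_token(encode, en_texts)
```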
| Rank | Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | ±Std |
|---|---|---|---|---|---|---|---|
| 1 | **SARF-88k-plus** 🏆 | 88,097 | 2.409 | 2.166 | 2.288 | 1.0162 | 0.0073 |
| 2 | SARF-115k-plus | 115,398 | 2.249 | 2.140 | 2.195 | 1.0632 | 0.0082 |
| 3 | Gemma-3-4B | 262,145 | 2.429 | 1.328 | 1.878 | 0.9308 | 0.0067 |
| 4 | Fanar-1-9B | 128,256 | 2.409 | 1.356 | 1.882 | 0.9230 | 0.0065 |
| 5 | Command-R-Arabic | 255,033 | 2.449 | 1.330 | 1.889 | 0.9104 | 0.0063 |
| 6 | GPT-4o | 200,019 | 2.394 | 1.434 | 1.914 | 0.8768 | 0.0059 |
| 7 | SARF-65k-v2 | 64,603 | 2.287 | 1.568 | 1.928 | 0.8669 | 0.0062 |
| 8 | SARF-65k | 64,688 | 2.289 | 1.535 | 1.912 | 0.8547 | 0.0064 |
| 9 | Qwen3-4B | 151,669 | 2.502 | 1.500 | 2.001 | 0.8396 | 0.0057 |
## Key Findings
- SARF-88k-plus achieves best parity (1.0162) - closest to perfect 1.0
- SARF tokenizers occupy top 2 positions for parity
- Mainstream tokenizers (Gemma, Fanar, GPT-4o) have parity 0.88-0.93 (Arabic under-represented relative to English)
- SARF-65k variants have lower parity but better fertility (fewer tokens/word)
## Choosing the Right Tokenizer
| Priority | Recommendation |
|---|---|
| Perfect AR/EN balance | SARF-88k-plus (parity 1.02) |
| Arabic-heavy workloads | SARF-115k-plus (parity 1.06) |
| Smallest vocab + good fertility | SARF-65k-v2 (65K vocab, fert 1.93) |
| Production compatibility | Gemma-3-4B or Fanar-1-9B |
## Reproducibility

All tokenizer files and benchmark scripts are provided for full reproducibility.
### Directory Structure

```
├── SARF-88k-plus/              # 🏆 BEST PARITY
│   ├── tokenizer.json
│   ├── vocab.json
│   ├── merges.txt
│   └── morf_map.supp.json
├── SARF-65k-v2/                # Best small vocab
│   ├── tokenizer.json
│   └── morf_map.basic.json
├── benchmark_5runs_final.json  # Full benchmark data
├── benchmark_script.py         # Reproducibility script
└── README.md
```
## Usage

```python
from transformers import PreTrainedTokenizerFast
from huggingface_hub import hf_hub_download
import json

# Download SARF-88k-plus (best parity)
repo = 'almaghrabima/myte-parity-sweep'
tokenizer_path = hf_hub_download(repo, 'SARF-88k-plus/tokenizer.json')
morf_map_path = hf_hub_download(repo, 'SARF-88k-plus/morf_map.supp.json')

# Load tokenizer
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)

# Load morpheme map for MYTE preprocessing
with open(morf_map_path) as f:
    morf_map = json.load(f)

# Simple rewriter: replace morphemes with their PUA characters,
# longest morphemes first so substrings do not shadow longer matches
def rewrite(text, morf_map):
    for morph, pua in sorted(morf_map.items(), key=lambda x: -len(x[0])):
        text = text.replace(morph, pua)
    return text

# Encode: MYTE-rewrite the text, then tokenize
def encode(text):
    return tokenizer.encode(rewrite(text, morf_map), add_special_tokens=False)

# Test
print(encode('مرحبا بالعالم'))  # Arabic: "Hello world"
print(encode('Hello world'))    # English
```
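Because the rewriter maps morphemes into Private Use Area codepoints, decoded text contains PUA characters until the map is inverted. A minimal round-trip sketch, using a toy morpheme map for illustration (the real map is loaded from `morf_map.supp.json`):

```python
# Toy morpheme map; real maps use PUA codepoints starting around U+E000.
morf_map = {"ing": "\ue000", "pre": "\ue001"}
inverse_map = {pua: morph for morph, pua in morf_map.items()}

def rewrite(text, morf_map):
    # Longest morphemes first so shorter substrings do not clobber them
    for morph, pua in sorted(morf_map.items(), key=lambda x: -len(x[0])):
        text = text.replace(morph, pua)
    return text

def restore(text, inverse_map):
    # Undo the PUA substitution after tokenizer.decode()
    for pua, morph in inverse_map.items():
        text = text.replace(pua, morph)
    return text

original = "preheating"
assert restore(rewrite(original, morf_map), inverse_map) == original
```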
## Training Details
- Data: 16B characters (50% Arabic, 50% English)
- Method: Parity-Aware BPE with MYTE morphological preprocessing
- Morfessor: Arabic morphological segmentation
- PUA Mapping: Morphemes → Private Use Area Unicode characters
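The morpheme-to-PUA assignment can be illustrated as follows (hypothetical codepoint allocation; the actual mapping ships in the `morf_map.*.json` files):

```python
# Hypothetical allocation: give each Morfessor-produced morpheme a
# consecutive Private Use Area codepoint starting at U+E000.
PUA_START = 0xE000
morphemes = ["ال", "ون", "ات"]  # example Arabic affixes
morf_map = {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}
```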
## Citation

```bibtex
@software{sarf_tokenizer,
  title={SARF: Semantically-Aware Robust Foundational Tokenizer},
  author={Al Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/myte-parity-sweep}
}
```