
SARF Tokenizer - Parity Benchmark Results

Overview

SARF (Semantically-Aware Robust Foundational) tokenizers use MYTE morphological preprocessing and are optimized for Arabic-English tokenization parity.

Winner: SARF-88k-plus - Best parity (1.0162, closest to perfect 1.0)

5-Run Averaged Benchmark Results

Metric: Parity = AR chars/token ÷ EN chars/token (1.0 = perfect balance)

Methodology: 5 runs × 5,000 randomly sampled texts per run
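The parity metric can be computed for any tokenizer as a ratio of per-language compression rates. A minimal sketch, using whitespace splitting as a toy stand-in for a trained tokenizer (the function names and sample texts are illustrative, not from the benchmark):

```python
def chars_per_token(texts, tokenize):
    """Average number of characters consumed per emitted token over a corpus."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

def parity(ar_texts, en_texts, tokenize):
    """Parity = AR chars/token ÷ EN chars/token; 1.0 means both languages
    compress equally well under the same tokenizer."""
    return chars_per_token(ar_texts, tokenize) / chars_per_token(en_texts, tokenize)

# Toy stand-in: whitespace split instead of a trained BPE tokenizer
ar_sample = ["مرحبا بالعالم"]
en_sample = ["Hello world"]
print(parity(ar_sample, en_sample, str.split))
```

With a real tokenizer, `tokenize` would be `lambda t: tokenizer.encode(t, add_special_tokens=False)`; values below 1.0 mean Arabic gets fewer characters per token, i.e. more tokens per character than English.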

| Rank | Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | ±Std |
|------|-----------|-------|---------|---------|----------|--------|------|
| 1 | 🏆 **SARF-88k-plus** | 88,097 | 2.409 | 2.166 | 2.288 | 1.0162 | 0.0073 |
| 2 | SARF-115k-plus | 115,398 | 2.249 | 2.140 | 2.195 | 1.0632 | 0.0082 |
| 3 | Gemma-3-4B | 262,145 | 2.429 | 1.328 | 1.878 | 0.9308 | 0.0067 |
| 4 | Fanar-1-9B | 128,256 | 2.409 | 1.356 | 1.882 | 0.9230 | 0.0065 |
| 5 | Command-R-Arabic | 255,033 | 2.449 | 1.330 | 1.889 | 0.9104 | 0.0063 |
| 6 | GPT-4o | 200,019 | 2.394 | 1.434 | 1.914 | 0.8768 | 0.0059 |
| 7 | SARF-65k-v2 | 64,603 | 2.287 | 1.568 | 1.928 | 0.8669 | 0.0062 |
| 8 | SARF-65k | 64,688 | 2.289 | 1.535 | 1.912 | 0.8547 | 0.0064 |
| 9 | Qwen3-4B | 151,669 | 2.502 | 1.500 | 2.001 | 0.8396 | 0.0057 |

Key Findings

  1. SARF-88k-plus achieves best parity (1.0162) - closest to perfect 1.0
  2. SARF tokenizers occupy top 2 positions for parity
  3. Commercial tokenizers (Gemma, Fanar, GPT-4o) have parity 0.88-0.93 (Arabic under-represented)
  4. SARF-65k variants have lower parity but better fertility (fewer tokens/word)

Choosing the Right Tokenizer

| Priority | Recommendation |
|----------|----------------|
| Perfect AR/EN balance | SARF-88k-plus (parity 1.02) |
| Arabic-heavy workloads | SARF-115k-plus (parity 1.06) |
| Smallest vocab + good fertility | SARF-65k-v2 (65K vocab, fert 1.93) |
| Production compatibility | Gemma-3-4B or Fanar-1-9B |

Reproducibility

All tokenizer files and benchmark scripts are provided for full reproducibility.

Directory Structure

├── SARF-88k-plus/                # 🏆 BEST PARITY
│   ├── tokenizer.json
│   ├── vocab.json
│   ├── merges.txt
│   └── morf_map.supp.json
├── SARF-65k-v2/                  # Best small vocab
│   ├── tokenizer.json
│   └── morf_map.basic.json
├── benchmark_5runs_final.json    # Full benchmark data
├── benchmark_script.py           # Reproducibility script
└── README.md

Usage

from transformers import PreTrainedTokenizerFast
from huggingface_hub import hf_hub_download
import json

# Download SARF-88k-plus (best parity)
repo = 'almaghrabima/myte-parity-sweep'
tokenizer_path = hf_hub_download(repo, 'SARF-88k-plus/tokenizer.json')
morf_map_path = hf_hub_download(repo, 'SARF-88k-plus/morf_map.supp.json')

# Load tokenizer
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)

# Load morpheme map for MYTE preprocessing
with open(morf_map_path) as f:
    morf_map = json.load(f)

# Rewrite morphemes into their PUA placeholders before tokenization
def rewrite(text, morf_map):
    # Longest-first so longer morphemes take precedence over their prefixes
    for morph, pua in sorted(morf_map.items(), key=lambda x: -len(x[0])):
        text = text.replace(morph, pua)
    return text

# Encode
def encode(text):
    return tokenizer.encode(rewrite(text, morf_map), add_special_tokens=False)

# Test
print(encode('مرحبا بالعالم'))  # Arabic
print(encode('Hello world'))    # English
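Because MYTE preprocessing rewrites morphemes into PUA characters before encoding, text recovered by `tokenizer.decode` still contains those placeholders; restoring the surface form requires inverting the map. A minimal round-trip sketch, using a toy two-entry map (the real entries come from `morf_map.supp.json`):

```python
# Toy morpheme -> PUA map; the real map is loaded from morf_map.supp.json
toy_map = {"ال": "\ue000", "ing": "\ue001"}

def rewrite(text, morf_map):
    # Longest-first, mirroring the encoder above
    for morph, pua in sorted(morf_map.items(), key=lambda x: -len(x[0])):
        text = text.replace(morph, pua)
    return text

def restore(text, morf_map):
    # Replace each PUA placeholder with its original morpheme
    for morph, pua in morf_map.items():
        text = text.replace(pua, morph)
    return text

assert restore(rewrite("walking", toy_map), toy_map) == "walking"
```

Each PUA code point maps back to exactly one morpheme, so restoration is a simple reverse substitution with no ordering concerns.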

Training Details

  • Data: 16B characters (50% Arabic, 50% English)
  • Method: Parity-Aware BPE with MYTE morphological preprocessing
  • Morfessor: Arabic morphological segmentation
  • PUA Mapping: Morphemes → Private Use Area Unicode characters
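The morpheme → PUA mapping in the last bullet amounts to assigning each segment a fresh code point from the Basic Multilingual Plane Private Use Area (U+E000–U+F8FF). A minimal sketch, with an illustrative segment list rather than the actual SARF morpheme inventory:

```python
PUA_START = 0xE000
PUA_END = 0xF8FF  # end of the BMP Private Use Area

def build_pua_map(morphemes):
    """Assign each morpheme a unique Private Use Area character."""
    if len(morphemes) > PUA_END - PUA_START + 1:
        raise ValueError("more morphemes than available BMP PUA code points")
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}

# Illustrative Morfessor-style segments, not the real inventory
segments = ["ال", "ون", "ات", "ing", "pre"]
pua_map = build_pua_map(segments)  # "ال" -> U+E000, "ون" -> U+E001, ...
```

Because PUA code points carry no standard semantics, the BPE trainer treats each morpheme as a single atomic symbol, which is what lets morphologically segmented Arabic compete with English on chars/token.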

Citation

@software{sarf_tokenizer,
  title={SARF: Semantically-Aware Robust Foundational Tokenizer},
  author={Al Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/myte-parity-sweep}
}