
SARF Tokenizer - Parity Benchmark Results

Overview

SARF (Semantically-Aware Robust Foundational) tokenizers use MYTE morphological preprocessing and are optimized for Arabic-English tokenization parity.

Winner: SARF-88k-plus - Best parity (1.0162, closest to perfect 1.0)

5-Run Averaged Benchmark Results

Metric: Parity = AR chars/token ÷ EN chars/token (1.0 = perfect balance)

Methodology: 5 runs × 5,000 randomly sampled texts per run
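The parity metric can be computed for any tokenizer as a ratio of per-language compression rates. A minimal sketch, using whitespace splitting as a toy stand-in for a trained tokenizer (the function names and sample texts are illustrative, not from the benchmark):

```python
def chars_per_token(texts, tokenize):
    """Average number of characters consumed per emitted token over a corpus."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

def parity(ar_texts, en_texts, tokenize):
    """Parity = AR chars/token ÷ EN chars/token; 1.0 means both languages
    compress equally well under the same tokenizer."""
    return chars_per_token(ar_texts, tokenize) / chars_per_token(en_texts, tokenize)

# Toy stand-in: whitespace split instead of a trained BPE tokenizer
ar_sample = ["مرحبا بالعالم"]
en_sample = ["Hello world"]
print(parity(ar_sample, en_sample, str.split))
```

With a real tokenizer, `tokenize` would be `lambda t: tokenizer.encode(t, add_special_tokens=False)`; values below 1.0 mean Arabic gets fewer characters per token, i.e. more tokens per character than English.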

| Rank | Tokenizer | Vocab | AR Fert | EN Fert | Avg Fert | Parity | ±Std |
|------|-----------|-------|---------|---------|----------|--------|------|
| 1 | 🏆 **SARF-88k-plus** | 88,097 | 2.409 | 2.166 | 2.288 | 1.0162 | 0.0073 |
| 2 | SARF-115k-plus | 115,398 | 2.249 | 2.140 | 2.195 | 1.0632 | 0.0082 |
| 3 | Gemma-3-4B | 262,145 | 2.429 | 1.328 | 1.878 | 0.9308 | 0.0067 |
| 4 | Fanar-1-9B | 128,256 | 2.409 | 1.356 | 1.882 | 0.9230 | 0.0065 |
| 5 | Command-R-Arabic | 255,033 | 2.449 | 1.330 | 1.889 | 0.9104 | 0.0063 |
| 6 | GPT-4o | 200,019 | 2.394 | 1.434 | 1.914 | 0.8768 | 0.0059 |
| 7 | SARF-65k-v2 | 64,603 | 2.287 | 1.568 | 1.928 | 0.8669 | 0.0062 |
| 8 | SARF-65k | 64,688 | 2.289 | 1.535 | 1.912 | 0.8547 | 0.0064 |
| 9 | Qwen3-4B | 151,669 | 2.502 | 1.500 | 2.001 | 0.8396 | 0.0057 |

Key Findings

  1. SARF-88k-plus achieves best parity (1.0162) - closest to perfect 1.0
  2. SARF tokenizers occupy top 2 positions for parity
  3. Commercial tokenizers (Gemma, Fanar, GPT-4o) have parity 0.88-0.93 (Arabic under-represented)
  4. SARF-65k variants have lower parity but better fertility (fewer tokens/word)

Choosing the Right Tokenizer

| Priority | Recommendation |
|----------|----------------|
| Perfect AR/EN balance | SARF-88k-plus (parity 1.02) |
| Arabic-heavy workloads | SARF-115k-plus (parity 1.06) |
| Smallest vocab + good fertility | SARF-65k-v2 (65K vocab, fert 1.93) |
| Production compatibility | Gemma-3-4B or Fanar-1-9B |

Reproducibility

All tokenizer files and benchmark scripts are provided for full reproducibility.

Directory Structure

├── SARF-88k-plus/                # 🏆 BEST PARITY
│   ├── tokenizer.json
│   ├── vocab.json
│   ├── merges.txt
│   └── morf_map.supp.json
├── SARF-65k-v2/                  # Best small vocab
│   ├── tokenizer.json
│   └── morf_map.basic.json
├── benchmark_5runs_final.json    # Full benchmark data
├── benchmark_script.py           # Reproducibility script
└── README.md

Usage

from transformers import PreTrainedTokenizerFast
from huggingface_hub import hf_hub_download
import json

# Download SARF-88k-plus (best parity)
repo = 'almaghrabima/myte-parity-sweep'
tokenizer_path = hf_hub_download(repo, 'SARF-88k-plus/tokenizer.json')
morf_map_path = hf_hub_download(repo, 'SARF-88k-plus/morf_map.supp.json')

# Load tokenizer
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)

# Load morpheme map for MYTE preprocessing
with open(morf_map_path) as f:
    morf_map = json.load(f)

# Rewrite morphemes into their PUA placeholders before tokenization
def rewrite(text, morf_map):
    # Longest-first so longer morphemes take precedence over their prefixes
    for morph, pua in sorted(morf_map.items(), key=lambda x: -len(x[0])):
        text = text.replace(morph, pua)
    return text

# Encode
def encode(text):
    return tokenizer.encode(rewrite(text, morf_map), add_special_tokens=False)

# Test
print(encode('مرحبا بالعالم'))  # Arabic
print(encode('Hello world'))    # English
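Because MYTE preprocessing rewrites morphemes into PUA characters before encoding, text recovered by `tokenizer.decode` still contains those placeholders; restoring the surface form requires inverting the map. A minimal round-trip sketch, using a toy two-entry map (the real entries come from `morf_map.supp.json`):

```python
# Toy morpheme -> PUA map; the real map is loaded from morf_map.supp.json
toy_map = {"ال": "\ue000", "ing": "\ue001"}

def rewrite(text, morf_map):
    # Longest-first, mirroring the encoder above
    for morph, pua in sorted(morf_map.items(), key=lambda x: -len(x[0])):
        text = text.replace(morph, pua)
    return text

def restore(text, morf_map):
    # Replace each PUA placeholder with its original morpheme
    for morph, pua in morf_map.items():
        text = text.replace(pua, morph)
    return text

assert restore(rewrite("walking", toy_map), toy_map) == "walking"
```

Each PUA code point maps back to exactly one morpheme, so restoration is a simple reverse substitution with no ordering concerns.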

Training Details

  • Data: 16B characters (50% Arabic, 50% English)
  • Method: Parity-Aware BPE with MYTE morphological preprocessing
  • Morfessor: Arabic morphological segmentation
  • PUA Mapping: Morphemes → Private Use Area Unicode characters
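The morpheme → PUA mapping in the last bullet amounts to assigning each segment a fresh code point from the Basic Multilingual Plane Private Use Area (U+E000–U+F8FF). A minimal sketch, with an illustrative segment list rather than the actual SARF morpheme inventory:

```python
PUA_START = 0xE000
PUA_END = 0xF8FF  # end of the BMP Private Use Area

def build_pua_map(morphemes):
    """Assign each morpheme a unique Private Use Area character."""
    if len(morphemes) > PUA_END - PUA_START + 1:
        raise ValueError("more morphemes than available BMP PUA code points")
    return {m: chr(PUA_START + i) for i, m in enumerate(morphemes)}

# Illustrative Morfessor-style segments, not the real inventory
segments = ["ال", "ون", "ات", "ing", "pre"]
pua_map = build_pua_map(segments)  # "ال" -> U+E000, "ون" -> U+E001, ...
```

Because PUA code points carry no standard semantics, the BPE trainer treats each morpheme as a single atomic symbol, which is what lets morphologically segmented Arabic compete with English on chars/token.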

Citation

@software{sarf_tokenizer,
  title={SARF: Semantically-Aware Robust Foundational Tokenizer},
  author={Al Maghrabima},
  year={2025},
  url={https://huggingface.co/almaghrabima/myte-parity-sweep}
}