Hugging Face
almaghrabima/SARFTokenizer
Tags: Arabic, English, tokenizers, tokenizer, sarf, bilingual, arabic, english, math, code, sentencepiece-style
License: cc-by-nc-4.0
Branch: main (6.92 MB)
1 contributor
History: 61 commits
Latest commit: almaghrabima, "Add 'lower/higher is better' captions under the two charts" (8446f73, verified), about 2 months ago
File · Size · Status · Last commit · Updated
.gitattributes · 1.78 kB · - · Add characters per 1M tokens chart · about 2 months ago
BENCHMARK.md · 1.74 kB · Safe · Promote v0.3.1 to main (4-domain SOTA at 100k vocab) · 2 months ago
FAIR_BENCHMARK.md · 5.67 kB · Safe · Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% · 2 months ago
README.md · 26.7 kB · - · Add 'lower/higher is better' captions under the two charts · about 2 months ago
api_cost_comparison_per_1m_tokens.png · 139 kB · xet · Add API cost comparison per 1M tokens chart · about 2 months ago
bench_results.json · 12.5 kB · Safe · Promote v0.3.1 to main (4-domain SOTA at 100k vocab) · 2 months ago
benchmark_results.json · 2.97 kB · Safe · v0.2: Unigram LM at 65k vocab - beats GPT-4o on AR and EN at 1/3 vocab · 2 months ago
benchmark_results_2026flagships.json · 883 Bytes · Safe · README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships · 2 months ago
characters_per_1m_tokens.png · 128 kB · xet · Add characters per 1M tokens chart · about 2 months ago
fair_benchmark_results.json · 4.8 kB · Safe · Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% · 2 months ago
special_tokens_map.json · 449 Bytes · Safe · Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) · 2 months ago
tokenizer.json · 6.6 MB · Safe · Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) · 2 months ago
tokenizer_config.json · 571 Bytes · Safe · Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) · 2 months ago