Add 'lower/higher is better' captions under the two charts 8446f73 verified almaghrabima committed on May 11
Update README with cost/compression analysis and embed two charts da39581 verified almaghrabima committed on May 11
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) cf3eebc verified almaghrabima committed on Apr 25
Promote v0.3.1 to main (4-domain SOTA at 100k vocab) f6c72b6 verified almaghrabima committed on Apr 25
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 6479992 verified almaghrabima committed on Apr 23
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% e4a23ce verified almaghrabima committed on Apr 23
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 3afc36b verified almaghrabima committed on Apr 23
Link to public SARFTokenizer-benchmark-eval dataset for reproducibility b776880 verified almaghrabima committed on Apr 22
README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships 5d4f9d7 verified almaghrabima committed on Apr 22
README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships 15b06b6 verified almaghrabima committed on Apr 22
Update README: Colab-ready code, benchmark, troubleshooting 579b42f verified almaghrabima committed on Apr 22
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab d33a32a verified almaghrabima committed on Apr 22
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 9608763 verified almaghrabima committed on Apr 22
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 1c23cd4 verified almaghrabima committed on Apr 22
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab edcc327 verified almaghrabima committed on Apr 22
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 92c3237 verified almaghrabima committed on Apr 22
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 3c1bd73 verified almaghrabima committed on Apr 22
Drop Fanar-2-Diwan from benchmark (unfair Arabic-maximalist peer) 85df039 verified almaghrabima committed on Apr 21
Drop Fanar-2-Diwan from benchmark (unfair Arabic-maximalist peer) 5785206 verified almaghrabima committed on Apr 21
Add benchmark_results.json: v0.1 benchmark vs 11 tokenizers 73dc8b5 verified almaghrabima committed on Apr 21
Upload morfessor_models/morf_map_reverse.json with huggingface_hub 2772bff verified almaghrabima committed on Feb 8
Upload morfessor_models/morf_map.json with huggingface_hub 5fee9e3 verified almaghrabima committed on Feb 8
Upload morfessor_models/morfessor_en.bin with huggingface_hub 77f39b7 verified almaghrabima committed on Feb 8
Upload morfessor_models/morfessor_ar.bin with huggingface_hub 7309795 verified almaghrabima committed on Feb 8
Update benchmark results with new tokenizers (Falcon-H1, ALLaM, Hala, Mistral) 55db6a1 verified almaghrabima committed on Feb 4
Upload benchmark_parallel_results.json with huggingface_hub a88f8a7 verified almaghrabima committed on Feb 4
Upload test_comprehensive_results.json with huggingface_hub 1e8911f verified almaghrabima committed on Feb 4