---
title: Arabic Tokenizer Arena
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: "5.9.1"
app_file: app.py
pinned: false
---
# 🏛️ Arabic Tokenizer Arena Pro

Advanced research & production platform for Arabic tokenization analysis.

## Features

- 📊 **Comprehensive Metrics**: Fertility, compression, STRR, OOV rate, and more
- 🔍 **Arabic-Specific Analysis**: Dialect support, diacritic preservation
- ⚖️ **Side-by-Side Comparison**: Compare multiple tokenizers instantly
- 🎨 **Beautiful Visualization**: Token-by-token display with IDs
- 🏆 **Leaderboard**: Evaluate on real HuggingFace Arabic datasets
- 🌍 **Multi-Variant Support**: MSA, dialectal, and Classical Arabic
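Diacritic handling of the kind listed above can be sketched in a few lines of standard-library Python. The helper name below is illustrative, not the actual API of `utils.py`: Arabic short-vowel marks (harakat) are Unicode nonspacing combining marks, so they can be detected with `unicodedata`.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics (harakat) while keeping base letters intact.

    Harakat such as fatha (U+064E), damma (U+064F), and kasra (U+0650)
    carry the Unicode category "Mn" (nonspacing mark), so filtering on
    that category strips them without touching the consonant skeleton.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

vocalized = "كَتَبَ"  # "kataba" with fatha marks on each letter
print(strip_diacritics(vocalized))  # كتب
```

A tokenizer that preserves diacritics will produce different token streams for the vocalized and stripped forms, which is exactly what the diacritic-preservation analysis compares.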
## Project Structure

```
arabic_tokenizer_arena/
├── app.py                 # Main Gradio application
├── config.py              # Tokenizer registry & dataset configs
├── tokenizer_manager.py   # Tokenizer loading & caching
├── analysis.py            # Tokenization analysis functions
├── leaderboard.py         # Leaderboard with HF datasets
├── ui_components.py       # HTML generation
├── styles.py              # CSS styles
├── utils.py               # Arabic text utilities
├── requirements.txt       # Dependencies
└── README.md              # This file
```
## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Local Development

```bash
python app.py
```

### HuggingFace Spaces

1. Upload all `.py` files to your Space
2. Add an `HF_TOKEN` secret if using gated models
3. The app will start automatically
## Available Tokenizers

### Arabic BERT Models

- AraBERT v2 (AUB MIND Lab)
- CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
- MARBERT & ARBERT (UBC NLP)

### Arabic LLMs

- Jais 13B/30B (Inception/MBZUAI)
- SILMA 9B (SILMA AI)
- Fanar 9B (QCRI)
- Yehia 7B (Navid AI)
- Atlas-Chat (MBZUAI Paris)

### Arabic Tokenizers

- Aranizer PBE/SP 32K/86K (RIOTU Lab)

### Multilingual Models

- Qwen 2.5 (Alibaba)
- Gemma 2 (Google)
- Mistral (Mistral AI)
- XLM-RoBERTa (Meta)
## Leaderboard Datasets

| Dataset | Source | Category |
|---------|--------|----------|
| ArabicMMLU | MBZUAI | MSA Benchmark |
| ArSenTD-LEV | ramybaly | Levantine Dialect |
| ATHAR | mohamed-khalil | Classical Arabic |
| ARCD | arcd | QA Dataset |
| Ashaar | arbml | Poetry |
| Hadith | gurgutan | Religious |
| Arabic Sentiment | arbml | Social Media |
| SANAD | arbml | News |
## Metrics

- **Fertility**: Tokens per word (lower = better, 1.0 ideal)
- **Compression**: Bytes per token (higher = better)
- **STRR**: Single Token Retention Rate (higher = better)
- **OOV Rate**: Out-of-vocabulary percentage (lower = better)
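The definitions above can be sketched as a small self-contained function. This is an illustration of the metric formulas, not the app's actual `analysis.py` API: it takes each word paired with the tokens a tokenizer produced for it, plus a vocabulary set, and returns the four scores.

```python
def tokenization_metrics(
    per_word: list[tuple[str, list[str]]], vocab: set[str]
) -> dict[str, float]:
    """Compute the four metrics for one text.

    per_word: (word, tokens_for_that_word) pairs, in reading order.
    vocab:    the tokenizer's known-token set, used for the OOV rate.
    """
    words = [w for w, _ in per_word]
    tokens = [t for _, ts in per_word for t in ts]
    total_bytes = len(" ".join(words).encode("utf-8"))
    return {
        # tokens per word; 1.0 means every word stayed whole
        "fertility": len(tokens) / len(words),
        # UTF-8 bytes of the original text per emitted token
        "compression": total_bytes / len(tokens),
        # Single Token Retention Rate: share of words kept as one token
        "strr": sum(1 for _, ts in per_word if len(ts) == 1) / len(words),
        # share of emitted tokens that fall outside the vocabulary
        "oov_rate": sum(1 for t in tokens if t not in vocab) / len(tokens),
    }

# Toy example: "كتب" stays whole, "المكتبة" splits into two subwords.
scores = tokenization_metrics(
    [("كتب", ["كتب"]), ("المكتبة", ["ال", "##مكتبة"])],
    vocab={"كتب", "ال", "##مكتبة"},
)
print(scores)  # fertility 1.5, compression 7.0, strr 0.5, oov_rate 0.0
```

Note that fertility penalizes aggressive subword splitting (common for Arabic morphology in multilingual vocabularies), while compression rewards it only when tokens cover many bytes, so the two pull in opposite directions and are best read together.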
## License

MIT License

## Contributing

Contributions welcome! Please open an issue or PR.

---

Built with ❤️ for the Arabic NLP community