---
title: Arabic Tokenizer Arena
emoji: 🏟️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: "5.9.1"
app_file: app.py
pinned: false
---
# 🏟️ Arabic Tokenizer Arena Pro
Advanced research & production platform for Arabic tokenization analysis.
## Features
- 📊 **Comprehensive Metrics**: Fertility, compression, STRR, OOV rate, and more
- 🌍 **Arabic-Specific Analysis**: Dialect support, diacritic preservation
- ⚖️ **Side-by-Side Comparison**: Compare multiple tokenizers instantly
- 🎨 **Beautiful Visualization**: Token-by-token display with IDs
- 🏆 **Leaderboard**: Evaluate on real HuggingFace Arabic datasets
- 📖 **Multi-Variant Support**: MSA, dialectal, and Classical Arabic
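
Diacritic preservation can be illustrated with a rough round-trip check: count the tashkeel marks in a text, pass the text through a tokenizer's encode/decode cycle, and see how many marks survive. The sketch below is only an illustration; `count_diacritics` and `diacritic_preservation` are hypothetical names rather than functions from `utils.py`, and the `roundtrip` callable stands in for whatever encode/decode path a given tokenizer exposes.

```python
from typing import Callable


def count_diacritics(text: str) -> int:
    """Count Arabic tashkeel marks (U+064B..U+0652) plus superscript alef (U+0670)."""
    return sum(1 for ch in text if "\u064B" <= ch <= "\u0652" or ch == "\u0670")


def strip_diacritics(text: str) -> str:
    """Remove the same marks; mimics a tokenizer that normalizes them away."""
    return "".join(ch for ch in text if count_diacritics(ch) == 0)


def diacritic_preservation(text: str, roundtrip: Callable[[str], str]) -> float:
    """Fraction of diacritics still present after `roundtrip` (1.0 if none to lose)."""
    original = count_diacritics(text)
    if original == 0:
        return 1.0
    kept = count_diacritics(roundtrip(text))
    return min(kept, original) / original


# The word "marhaban" with full diacritics: مَرْحَبًا
sample = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"
print(diacritic_preservation(sample, lambda t: t))       # identity round trip
print(diacritic_preservation(sample, strip_diacritics))  # lossy round trip
```

Comparing counts rather than positions keeps the check cheap, but it will not notice a mark that moved to a different letter.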
## Project Structure
```
arabic_tokenizer_arena/
├── app.py                  # Main Gradio application
├── config.py               # Tokenizer registry & dataset configs
├── tokenizer_manager.py    # Tokenizer loading & caching
├── analysis.py             # Tokenization analysis functions
├── leaderboard.py          # Leaderboard with HF datasets
├── ui_components.py        # HTML generation
├── styles.py               # CSS styles
├── utils.py                # Arabic text utilities
├── requirements.txt        # Dependencies
└── README.md               # This file
```
## Installation
```bash
pip install -r requirements.txt
```
## Usage
### Local Development
```bash
python app.py
```
### HuggingFace Spaces
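1. Upload all `.py` files, `requirements.txt`, and this `README.md` to your Space (the YAML header at the top configures it)
2. Add an `HF_TOKEN` secret in the Space settings if you use gated models
3. The app starts automatically once the build finishes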
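
Spaces expose secrets to the running app as environment variables, so the `HF_TOKEN` from step 2 can be read with `os.environ`. The following is a minimal sketch of how a loader might pick it up and cache tokenizers. `auth_kwargs` and `load_tokenizer` are hypothetical helpers, not the actual contents of `tokenizer_manager.py`, and the `token` keyword assumes a recent `transformers` release (older versions used `use_auth_token`).

```python
import os
from functools import lru_cache
from typing import Dict, Mapping, Optional


def auth_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return {'token': ...} when HF_TOKEN is set and non-empty, else {}."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return {"token": token} if token else {}


@lru_cache(maxsize=None)
def load_tokenizer(model_id: str):
    """Load a tokenizer once per model ID and reuse the cached instance."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id, **auth_kwargs())
```

Without the secret, `auth_kwargs()` returns an empty dict and public models load as usual; only gated repositories need the token.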
## Available Tokenizers
### Arabic BERT Models
- AraBERT v2 (AUB MIND Lab)
- CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
- MARBERT & ARBERT (UBC NLP)
### Arabic LLMs
- Jais 13B/30B (Inception/MBZUAI)
- SILMA 9B (SILMA AI)
- Fanar 9B (QCRI)
- Yehia 7B (Navid AI)
- Atlas-Chat (MBZUAI Paris)
### Arabic Tokenizers
- Aranizer PBE/SP 32K/86K (RIOTU Lab)
### Multilingual Models
- Qwen 2.5 (Alibaba)
- Gemma 2 (Google)
- Mistral (Mistral AI)
- XLM-RoBERTa (Meta)
## Leaderboard Datasets
| Dataset | Source | Category |
|---------|--------|----------|
| ArabicMMLU | MBZUAI | MSA Benchmark |
| ArSenTD-LEV | ramybaly | Levantine Dialect |
| ATHAR | mohamed-khalil | Classical Arabic |
| ARCD | arcd | QA Dataset |
| Ashaar | arbml | Poetry |
| Hadith | gurgutan | Religious |
| Arabic Sentiment | arbml | Social Media |
| SANAD | arbml | News |
## Metrics
- **Fertility**: Tokens per word (lower = better, 1.0 ideal)
- **Compression**: Bytes per token (higher = better)
- **STRR**: Single Token Retention Rate (higher = better)
- **OOV Rate**: Out-of-vocabulary percentage (lower = better)
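
As a rough illustration of these definitions, the sketch below computes all four metrics for a single text. It makes several assumptions that may differ from `analysis.py`: words are split on whitespace, each word is tokenized on its own, bytes are counted over the words only, STRR is taken as the share of words that stay a single token, and OOV rate is returned as a fraction (multiply by 100 for a percentage). `tokenization_metrics` and `toy_tokenize` are hypothetical names.

```python
from typing import Callable, Dict, List


def tokenization_metrics(
    text: str,
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Dict[str, float]:
    """Compute fertility, compression, STRR and OOV rate for one text."""
    words = text.split()  # whitespace words; an assumption of this sketch
    if not words:
        raise ValueError("text must contain at least one word")
    per_word = [tokenize(word) for word in words]
    total_tokens = sum(len(tokens) for tokens in per_word)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    total_bytes = sum(len(word.encode("utf-8")) for word in words)
    single = sum(1 for tokens in per_word if len(tokens) == 1)
    unknown = sum(tokens.count(unk_token) for tokens in per_word)
    return {
        "fertility": total_tokens / len(words),    # tokens per word
        "compression": total_bytes / total_tokens,  # UTF-8 bytes per token
        "strr": single / len(words),                # words kept as one token
        "oov_rate": unknown / total_tokens,         # share of unknown tokens
    }


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer: cut each word into chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


print(tokenization_metrics("اللغة العربية جميلة", toy_tokenize))
```

To score a real tokenizer, pass a callable that maps a word to its token strings, for example the `tokenize` method of a loaded HuggingFace tokenizer, and set `unk_token` to that tokenizer's unknown-token string.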
## License
MIT License
## Contributing
Contributions welcome! Please open an issue or PR.
---
Built with ❤️ for the Arabic NLP community