---
title: Tokenizers Languages
emoji: 📉
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.4.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Comparing LLM tokenizers in multiple languages
---

# All languages are NOT created (tokenized) equal! 🌐

A Gradio app that compares tokenized text length across languages for various LLM tokenizers.

For some tokenizers, a message in one language can take 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).

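
To get a feel for where such gaps come from, here is a minimal sketch that uses UTF-8 byte count as a stand-in for a byte-level tokenizer with no learned merges. This is an assumption for illustration only, not one of the tokenizers in the app: real BPE vocabularies merge frequent byte sequences, so actual ratios differ. The helper names (`byte_token_count`, `token_ratio`) and the sample strings are hypothetical.

```python
def byte_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte, no merges."""
    return len(text.encode("utf-8"))


def token_ratio(text_a: str, text_b: str, count=byte_token_count) -> float:
    """How many times more tokens text_b needs than text_a."""
    return count(text_b) / count(text_a)


# Hypothetical sample strings: short greetings, not taken from the dataset.
english = "hello"
burmese = "မင်္ဂလာပါ"

# ASCII letters take 1 byte each in UTF-8; Burmese characters take 3 bytes each.
print(f"English: {byte_token_count(english)} tokens")
print(f"Burmese: {byte_token_count(burmese)} tokens")
print(f"Ratio:   {token_ratio(english, burmese):.1f}x")
```

Passing a different `count` function (for example, one wrapping a real tokenizer's encode call) would let the same ratio be computed for any tokenizer.
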
> 📺 Live version available at [hf.co/spaces/jgalego/tokenizers-languages](https://hf.co/spaces/jgalego/tokenizers-languages)
>
> 🙏 Adapted, modified, and updated from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)

## Features ✨

- **Interactive Tokenizer Comparison**: Select from 16 different tokenizers, including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
- **Multi-Language Analysis**: Compare tokenization across 51 languages from the Amazon MASSIVE dataset
- **Visual Analytics**:
  - Token distribution plots with customizable histograms
  - Median token length metrics for selected languages
  - Bar charts showing the languages with the shortest and longest token counts
  - Random example texts with token counts
- **Real-time Updates**: Dynamic visualizations that update as you change selections

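
The median and shortest/longest metrics listed above can be sketched roughly as follows. This is not the app's actual code: the function names, the `(language, text)` record shape, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def median_tokens_by_language(records, count_tokens):
    """Group token counts by language and return the median per language."""
    counts = defaultdict(list)
    for language, text in records:
        counts[language].append(count_tokens(text))
    return {language: median(values) for language, values in counts.items()}


def rank_languages(medians, n=3):
    """Return the n languages with the lowest and the n with the highest medians."""
    ordered = sorted(medians, key=medians.get)
    return ordered[:n], ordered[-n:][::-1]


# Toy records (hypothetical); a whitespace split stands in for a real tokenizer.
records = [
    ("en", "set an alarm"),
    ("en", "play some jazz music"),
    ("en", "what is the weather today"),
    ("de", "stelle einen wecker"),
    ("de", "spiele jazz"),
]
medians = median_tokens_by_language(records, lambda text: len(text.split()))
shortest, longest = rank_languages(medians, n=1)
print(medians, shortest, longest)
```

The median is less sensitive than the mean to a handful of unusually long sentences, which is one plausible reason to prefer it for this kind of comparison.
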
## Data Source 💾

The data comes from the validation set of the [Amazon MASSIVE](https://huggingface.co/datasets/AmazonScience/massive) dataset, which consists of 2,033 short sentences and phrases translated into 51 different languages.

Learn more from [Amazon's blog post](https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding).

## Getting Started 🚀

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

The app will be available at `http://localhost:7860`.