---
title: Tokenizers Languages
emoji: 📉
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.4.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Comparing LLM tokenizers in multiple languages
---
# All languages are NOT created (tokenized) equal! 🌐
Gradio app that compares the tokenization length for different languages across various LLM tokenizers.
For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
> 📺 Live version available at [hf.co/spaces/jgalego/tokenizers-languages](https://hf.co/spaces/jgalego/tokenizers-languages)
> 🙏 Adapted, modified and updated from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
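
To make the comparison concrete, here is a minimal sketch of how a token-count ratio between two comparable messages could be computed. It is not the app's code: `utf8_byte_tokenizer`, `token_count`, and `token_ratio` are hypothetical names, and the one-token-per-UTF-8-byte tokenizer is only a stand-in so the example runs without downloading a model. Real subword tokenizers will give different numbers.

```python
from typing import Callable, Sequence


def utf8_byte_tokenizer(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def token_count(text: str, tokenize: Callable[[str], Sequence]) -> int:
    """Number of tokens the given tokenizer produces for the text."""
    return len(tokenize(text))


def token_ratio(text_a: str, text_b: str, tokenize: Callable[[str], Sequence]) -> float:
    """How many times more tokens text_a needs than text_b."""
    return token_count(text_a, tokenize) / token_count(text_b, tokenize)


# The same word in two scripts: "Myanmar" in Latin and in Burmese script.
english = "Myanmar"
burmese = "\u1019\u103c\u1014\u103a\u1019\u102c"  # မြန်မာ

print(english, token_count(english, utf8_byte_tokenizer))
print(burmese, token_count(burmese, utf8_byte_tokenizer))
print("ratio:", round(token_ratio(burmese, english, utf8_byte_tokenizer), 2))
```

Any callable that maps a string to a sequence of tokens can be passed in place of the stand-in, for example the `encode` method of a real tokenizer object.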

## Features ✨
- **Interactive Tokenizer Comparison**: Select from 16 different tokenizers including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
- **Multi-Language Analysis**: Compare tokenization across 51 languages from the Amazon Massive dataset
- **Visual Analytics**:
  - Token distribution plots with customizable histograms
  - Median token length metrics for selected languages
  - Bar charts showing languages with shortest/longest token counts
  - Random example texts with token counts
- **Real-time Updates**: Dynamic visualizations that update as you change selections
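
The median metric and the shortest/longest ranking listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's implementation: `median_token_lengths` and `extremes` are hypothetical helpers, and the per-language token counts are made-up sample values rather than results from the dataset.

```python
from statistics import median


def median_token_lengths(token_counts: dict[str, list[int]]) -> dict[str, float]:
    """Median token count per language."""
    return {lang: median(counts) for lang, counts in token_counts.items()}


def extremes(medians: dict[str, float], n: int = 3):
    """Return the n languages with the shortest and the n with the longest medians."""
    ranked = sorted(medians.items(), key=lambda item: item[1])
    shortest = ranked[:n]
    longest = ranked[-n:][::-1]  # largest median first
    return shortest, longest


# Made-up token counts per language code (illustrative values only).
sample_counts = {
    "en-US": [5, 7, 6],
    "pt-PT": [6, 9, 8],
    "my-MM": [40, 55, 48, 60],
}

medians = median_token_lengths(sample_counts)
shortest, longest = extremes(medians, n=1)
print("medians:", medians)
print("shortest:", shortest)
print("longest:", longest)
```

The median is used here because it is less sensitive than the mean to a few unusually long sentences.
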
## Data Source 💾
The data is from the validation set of the [Amazon Massive](https://huggingface.co/datasets/AmazonScience/massive) dataset, consisting of 2,033 short sentences and phrases translated into 51 different languages.
Learn more from [Amazon's blog post](https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding).
## Getting Started 🚀
```bash
# Install dependencies
pip install -r requirements.txt
# Run the app
python app.py
```
By default, the app will be available at `http://localhost:7860`.