metadata
title: Tokenizers Languages
emoji: 📉
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.4.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Comparing LLM tokenizers in multiple languages
All languages are NOT created (tokenized) equal! 🌐
A Gradio app that compares how many tokens the same content takes in different languages across various LLM tokenizers.
For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
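One intuition for where such gaps come from can be sketched in a few lines. The snippet below is only an illustration, not how the app counts tokens: it assumes a crude stand-in "tokenizer" that emits one token per UTF-8 byte, roughly what a byte-level tokenizer falls back to when it has learned no merges for a script. The helper names `byte_token_count` and `token_ratio` and the sample strings are hypothetical and do not come from the app or its dataset.

```python
def byte_token_count(text: str) -> int:
    """Count UTF-8 bytes, a crude proxy for a byte-level tokenizer with no merges."""
    return len(text.encode("utf-8"))


def token_ratio(text_a: str, text_b: str, count=byte_token_count) -> float:
    """Return how many times more tokens text_b needs than text_a."""
    return count(text_b) / count(text_a)


# Illustrative strings only (not taken from the app's dataset).
english = "hello"
burmese = "မင်္ဂလာပါ"

# Latin letters take 1 byte each in UTF-8; Burmese code points take 3 bytes each.
print("English:", byte_token_count(english))
print("Burmese:", byte_token_count(burmese))
print("Ratio:", round(token_ratio(english, burmese), 1))
```

Real tokenizers merge frequent byte sequences into single tokens, so the actual ratio depends on how well each language was represented in the tokenizer's training data. Passing a different `count` callable (for example, one wrapping a real tokenizer) would let the same comparison run against actual token counts.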
📺 Live version available at hf.co/spaces/jgalego/tokenizers-languages
🙏 Adapted and updated from "All languages are NOT created (tokenized) equal"
Features ✨
- Interactive Tokenizer Comparison: Select from 16 different tokenizers including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
- Multi-Language Analysis: Compare tokenization across 51 languages from the Amazon MASSIVE dataset
- Visual Analytics:
  - Token distribution plots with customizable histograms
  - Median token length metrics for selected languages
  - Bar charts showing languages with shortest/longest token counts
  - Random example texts with token counts
- Real-time Updates: Dynamic visualizations that update as you change selections
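The median and shortest/longest metrics listed above can be sketched as follows. This is a minimal sketch, not the code in `app.py`: it assumes the per-sentence token counts are already available as a dictionary keyed by language code, and the function names and the numbers in `example_counts` are hypothetical placeholders rather than measured results.

```python
from statistics import median


def median_token_lengths(token_counts: dict[str, list[int]]) -> dict[str, float]:
    """Map each language to the median token count over its sentences."""
    return {lang: median(counts) for lang, counts in token_counts.items()}


def shortest_and_longest(medians: dict[str, float], n: int = 2):
    """Return the n languages with the lowest and the n with the highest medians."""
    ranked = sorted(medians.items(), key=lambda item: item[1])
    return ranked[:n], ranked[-n:]


# Placeholder values for illustration only (not real measurements).
example_counts = {
    "en-US": [5, 7, 9],
    "my-MM": [40, 55, 70],
    "de-DE": [6, 8, 12],
    "ja-JP": [9, 11, 14],
}

medians = median_token_lengths(example_counts)
shortest, longest = shortest_and_longest(medians)
print("Shortest:", shortest)
print("Longest:", longest)
```

The median is used here rather than the mean because a few unusually long sentences would otherwise skew the per-language figure.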
Data Source 💾
The data comes from the validation set of the Amazon MASSIVE dataset, which consists of 2,033 short sentences and phrases translated into 51 languages.
Learn more from Amazon's blog post.
Getting Started 🚀
# Install dependencies
pip install -r requirements.txt
# Run the app
python app.py
The app will then be available at http://localhost:7860 (Gradio's default port).
