metadata

title: Tokenizers Languages
emoji: 📉
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.4.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Comparing LLM tokenizers in multiple languages

All languages are NOT created (tokenized) equal! 🌐

A Gradio app that compares how many tokens the same text takes in different languages across various LLM tokenizers.

For some tokenizers, a message in one language can take 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
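
Part of this gap can be seen without running any tokenizer. Many LLM tokenizers are byte-level BPE models: they start from UTF-8 bytes and merge frequent byte sequences, so text in a script that gets few merges stays close to its raw byte count. The sketch below is not how the app measures anything; it only counts UTF-8 bytes as a rough stand-in for that worst case. The `utf8_cost` helper and the sample greetings are illustrative assumptions, not part of the app or the dataset.

```python
def utf8_cost(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Illustrative greetings, not taken from the dataset.
english = "hello"
# Burmese "mingalaba", written with escapes so the code points are explicit.
burmese = "\u1019\u1004\u103a\u1039\u1002\u101c\u102c\u1015\u102b"

for name, text in [("English", english), ("Burmese", burmese)]:
    chars, nbytes = utf8_cost(text)
    print(f"{name}: {chars} code points, {nbytes} UTF-8 bytes")
```

Each Burmese code point here takes 3 bytes in UTF-8, while each ASCII letter takes 1, so a tokenizer with few Burmese merges starts from a much longer sequence. Real tokenizers will differ from this byte count, which is what the app lets you explore.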

📺 Live version available at hf.co/spaces/jgalego/tokenizers-languages

🙏 Adapted and updated from "All languages are NOT created (tokenized) equal"

Features ✨

- Interactive Tokenizer Comparison: Select from 16 different tokenizers including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
- Multi-Language Analysis: Compare tokenization across 51 languages from the Amazon Massive dataset
- Visual Analytics:
  - Token distribution plots with customizable histograms
  - Median token length metrics for selected languages
  - Bar charts showing languages with shortest/longest token counts
  - Random example texts with token counts
- Real-time Updates: Dynamic visualizations that update as you change selections
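
The median and shortest/longest metrics listed above can be sketched in a few lines. This is a minimal illustration, assuming token counts are already available as one list per language; the function names and the numbers in `toy_counts` are made up for the example and are not taken from the app or measured with any tokenizer.

```python
from statistics import median


def median_token_lengths(token_counts):
    """Map each language to the median of its per-sentence token counts."""
    return {lang: median(counts) for lang, counts in token_counts.items()}


def extremes(medians, n=1):
    """Return (shortest, longest): the n languages with the lowest and highest medians."""
    ranked = sorted(medians, key=medians.get)
    return ranked[:n], ranked[-n:]


# Made-up token counts for illustration only; not real measurements.
toy_counts = {
    "en-US": [7, 9, 8, 10],
    "pt-PT": [9, 12, 11, 10],
    "my-MM": [60, 75, 82, 58],
}

medians = median_token_lengths(toy_counts)
shortest, longest = extremes(medians)
print(medians)
print("shortest:", shortest, "longest:", longest)
```

The median is used here rather than the mean because a few very long sentences would otherwise dominate the comparison.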

Data Source 💾

The data comes from the validation set of the Amazon Massive dataset, which consists of 2,033 short sentences and phrases, each available in 51 different languages.

Learn more from Amazon's blog post.
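
As a rough sketch of how the validation sentences could be pulled out of a per-language file: the code below assumes a JSON Lines layout with a `partition` field (where `"dev"` marks the validation split) and an `utt` field holding the sentence text. Those field names and values are assumptions about the public release and should be checked against the actual files; the app itself may load the data differently. The sample records are invented.

```python
import json


def validation_utterances(jsonl_lines):
    """Collect the 'utt' text of every record whose partition is 'dev'."""
    utterances = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("partition") == "dev":
            utterances.append(record["utt"])
    return utterances


# Hypothetical records in the assumed layout; the text is invented.
sample = [
    '{"id": "0", "locale": "en-US", "partition": "train", "utt": "set an alarm"}',
    '{"id": "1", "locale": "en-US", "partition": "dev", "utt": "play some jazz"}',
    '{"id": "2", "locale": "en-US", "partition": "dev", "utt": "what time is it"}',
]
print(validation_utterances(sample))
```

With a real file, the same function could be fed an open file handle, since iterating over it yields one line at a time.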

Getting Started 🚀

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py

The app will be available at http://localhost:7860