Update docs
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 *.csv filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -11,6 +11,43 @@ license: apache-2.0
 short_description: Comparing LLM tokenizers in multiple languages
 ---
 
-
-
-
+# All languages are NOT created (tokenized) equal! 🌐
+
+Gradio app that compares the tokenization length for different languages across various LLM tokenizers.
+
+For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
+
+> 📺 Live version available at [hf.co/spaces/jgalego/tokenizers-languages](https://hf.co/spaces/jgalego/tokenizers-languages)
+
+> 🙏 Adapted, modified and updated from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+
+
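One driver of that 10-20x gap is visible before any model-specific tokenizer even runs: byte-level BPE tokenizers start from UTF-8 bytes, and Myanmar-script code points cost 3 bytes each versus 1 for ASCII. A minimal offline sketch (the Burmese sample is an illustrative greeting, not taken from the dataset):

```python
# UTF-8 byte counts put non-Latin scripts at a disadvantage before
# byte-level BPE tokenization even starts.
samples = {
    "English": "Hello, how are you?",
    "Burmese": "မင်္ဂလာပါ",  # illustrative greeting, not from the dataset
}

for language, text in samples.items():
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    print(f"{language}: {chars} chars -> {nbytes} UTF-8 bytes")
```

Every ASCII character is a single byte, while every code point in the Myanmar Unicode block encodes to three, so the byte-level vocabulary has to work three times as hard before merges begin.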
+
+## Features ✨
+
+- **Interactive Tokenizer Comparison**: Select from 16 different tokenizers including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
+- **Multi-Language Analysis**: Compare tokenization across 51 languages from the Amazon Massive dataset
+- **Visual Analytics**:
+  - Token distribution plots with customizable histograms
+  - Median token length metrics for selected languages
+  - Bar charts showing languages with shortest/longest token counts
+  - Random example texts with token counts
+- **Real-time Updates**: Dynamic visualizations that update as you change selections
+
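The "Median token length metrics" feature above reduces to a per-language median over token counts; a minimal sketch with made-up numbers (the counts are illustrative, not from the dataset):

```python
from statistics import median

# Hypothetical token counts per sentence, keyed by language.
token_counts = {
    "English": [7, 9, 6, 11, 8],
    "Burmese": [41, 63, 38, 72, 55],
}

medians = {lang: median(counts) for lang, counts in token_counts.items()}
for lang, m in medians.items():
    print(f"{lang}: median {m} tokens")  # English: 8, Burmese: 55
```

The median is preferable to the mean here because sentence-length outliers would otherwise dominate a per-language comparison.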
+## Data Source 💾
+
+The data is from the validation set of the [Amazon Massive](https://huggingface.co/datasets/AmazonScience/massive) dataset, consisting of 2,033 short sentences and phrases translated into 51 different languages.
+
+Learn more from [Amazon's blog post](https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding).
+
+## Getting Started 🚀
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Run the app
+python app.py
+```
+
+The app will be available at `http://localhost:7860`.
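The comparison the app performs reduces to "encode each sentence, count the tokens". A minimal offline sketch; `ToyTokenizer` is a stand-in assumption for a real tokenizer object (e.g. one loaded via `transformers.AutoTokenizer.from_pretrained`), which exposes the same `encode` method:

```python
class ToyTokenizer:
    """Stand-in for a real tokenizer; splits on whitespace only."""

    def encode(self, text: str) -> list[int]:
        # Real tokenizers return token ids; dummy indices suffice for counting.
        return list(range(len(text.split())))


def token_lengths(tokenizer, sentences: list[str]) -> list[int]:
    """Token count per sentence -- the quantity compared across languages."""
    return [len(tokenizer.encode(s)) for s in sentences]


# Example utterances in the style of the Massive dataset.
sentences = ["wake me up at nine am on friday", "turn off the lights"]
print(token_lengths(ToyTokenizer(), sentences))  # [8, 4]
```

Swapping in a real subword tokenizer changes only the `encode` call; the counting and aggregation logic stays the same.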
app.png ADDED (stored via Git LFS)
app.py CHANGED
@@ -121,8 +121,7 @@ with gr.Blocks(title="Tokenizer Language Comparison") as demo:
 (e.g. try English vs. Burmese).
 
 This is part of a larger project of measuring inequality in NLP.
-See the original article
-(https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+See the original article 'All languages are NOT created (tokenized) equal'
 on [Art Fish Intelligence](https://www.artfish.ai/).
 """
 )