Update docs
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 *.csv filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -11,6 +11,43 @@ license: apache-2.0
 short_description: Comparing LLM tokenizers in multiple languages
 ---
 
-
-
-
+# All languages are NOT created (tokenized) equal! 🌐
+
+Gradio app that compares the tokenization length for different languages across various LLM tokenizers.
+
+For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
+
+> 📺 Live version available at [hf.co/spaces/jgalego/tokenizers-languages](https://hf.co/spaces/jgalego/tokenizers-languages)
+
+> 🙏 Adapted, modified and updated from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+
+
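One driver of that 10-20x gap is visible before any model-specific tokenizer even runs: byte-level BPE tokenizers start from UTF-8 bytes, and Myanmar-script code points cost 3 bytes each versus 1 for ASCII. A minimal offline sketch (the Burmese sample is an illustrative greeting, not taken from the dataset):

```python
# UTF-8 byte counts put non-Latin scripts at a disadvantage before
# byte-level BPE tokenization even starts.
samples = {
    "English": "Hello, how are you?",
    "Burmese": "မင်္ဂလာပါ",  # illustrative greeting, not from the dataset
}

for language, text in samples.items():
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    print(f"{language}: {chars} chars -> {nbytes} UTF-8 bytes")
```

Every ASCII character is a single byte, while every code point in the Myanmar Unicode block encodes to three, so the byte-level vocabulary has to work three times as hard before merges begin.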
+
+## Features ✨
+
+- **Interactive Tokenizer Comparison**: Select from 16 different tokenizers including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
+- **Multi-Language Analysis**: Compare tokenization across 51 languages from the Amazon Massive dataset
+- **Visual Analytics**:
+  - Token distribution plots with customizable histograms
+  - Median token length metrics for selected languages
+  - Bar charts showing languages with shortest/longest token counts
+  - Random example texts with token counts
+- **Real-time Updates**: Dynamic visualizations that update as you change selections
+
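The "Median token length metrics" feature above reduces to a per-language median over token counts; a minimal sketch with made-up numbers (the counts are illustrative, not from the dataset):

```python
from statistics import median

# Hypothetical token counts per sentence, keyed by language.
token_counts = {
    "English": [7, 9, 6, 11, 8],
    "Burmese": [41, 63, 38, 72, 55],
}

medians = {lang: median(counts) for lang, counts in token_counts.items()}
for lang, m in medians.items():
    print(f"{lang}: median {m} tokens")  # English: 8, Burmese: 55
```

The median is preferable to the mean here because sentence-length outliers would otherwise dominate a per-language comparison.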
+## Data Source 💾
+
+The data is from the validation set of the [Amazon Massive](https://huggingface.co/datasets/AmazonScience/massive) dataset, consisting of 2,033 short sentences and phrases translated into 51 different languages.
+
+Learn more from [Amazon's blog post](https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding).
+
+## Getting Started 🚀
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Run the app
+python app.py
+```
+
+The app will be available at `http://localhost:7860`.
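The comparison the app performs reduces to "encode each sentence, count the tokens". A minimal offline sketch; `ToyTokenizer` is a stand-in assumption for a real tokenizer object (e.g. one loaded via `transformers.AutoTokenizer.from_pretrained`), which exposes the same `encode` method:

```python
class ToyTokenizer:
    """Stand-in for a real tokenizer; splits on whitespace only."""

    def encode(self, text: str) -> list[int]:
        # Real tokenizers return token ids; dummy indices suffice for counting.
        return list(range(len(text.split())))


def token_lengths(tokenizer, sentences: list[str]) -> list[int]:
    """Token count per sentence -- the quantity compared across languages."""
    return [len(tokenizer.encode(s)) for s in sentences]


# Example utterances in the style of the Massive dataset.
sentences = ["wake me up at nine am on friday", "turn off the lights"]
print(token_lengths(ToyTokenizer(), sentences))  # [8, 4]
```

Swapping in a real subword tokenizer changes only the `encode` call; the counting and aggregation logic stays the same.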
app.png ADDED (stored via Git LFS)
app.py CHANGED
@@ -121,8 +121,7 @@ with gr.Blocks(title="Tokenizer Language Comparison") as demo:
 (e.g. try English vs. Burmese).
 
 This is part of a larger project of measuring inequality in NLP.
-See the original article
-(https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+See the original article 'All languages are NOT created (tokenized) equal'
 on [Art Fish Intelligence](https://www.artfish.ai/).
 """
 )