jgalego committed · Commit e86313c · 1 Parent(s): 8bedd21

Update docs

Files changed (4):
  1. .gitattributes +1 -0
  2. README.md +39 -2
  3. app.png +3 -0
  4. app.py +1 -2
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 *.csv filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -11,6 +11,43 @@ license: apache-2.0
 short_description: Comparing LLM tokenizers in multiple languages
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-
-> Adapted from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+# All languages are NOT created (tokenized) equal! 🌐
+
+Gradio app that compares the tokenization length for different languages across various LLM tokenizers.
+
+For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
+
+> 📺 Live version available at [hf.co/spaces/jgalego/tokenizers-languages](https://hf.co/spaces/jgalego/tokenizers-languages)
+
+> 🙏 Adapted, modified and updated from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+
+![app](app.png)
+
+## Features ✨
+
+- **Interactive Tokenizer Comparison**: Select from 16 different tokenizers, including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
+- **Multi-Language Analysis**: Compare tokenization across 51 languages from the Amazon Massive dataset
+- **Visual Analytics**:
+  - Token distribution plots with customizable histograms
+  - Median token length metrics for selected languages
+  - Bar charts showing the languages with the shortest/longest token counts
+  - Random example texts with token counts
+- **Real-time Updates**: Dynamic visualizations that update as you change selections
+
+## Data Source 💾
+
+The data comes from the validation set of the [Amazon Massive](https://huggingface.co/datasets/AmazonScience/massive) dataset, consisting of 2,033 short sentences and phrases translated into 51 languages.
+
+Learn more from [Amazon's blog post](https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding).
+
+## Getting Started 🚀
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Run the app
+python app.py
+```
+
+The app will be available at `http://localhost:7860`.
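The 10-20x disparity called out in the README above has a mechanical root: most modern LLM tokenizers operate over UTF-8 bytes, and scripts like Burmese need three bytes per character before any subword merging happens. As a rough, self-contained proxy (this is not any of the app's actual tokenizers, and the Burmese sample string is purely illustrative), comparing UTF-8 byte lengths shows the baseline disadvantage:

```python
# Rough proxy: byte-level BPE tokenizers start from UTF-8 bytes, so scripts
# that need more bytes per character tend to yield more tokens per message.

def utf8_bytes(text: str) -> int:
    """Return the UTF-8 encoded length of `text` in bytes."""
    return len(text.encode("utf-8"))

english = "What is the weather today?"  # ASCII: 1 byte per character
burmese = "ဒီနေ့ ရာသီဥတု ဘယ်လိုလဲ"      # illustrative Burmese text

print(utf8_bytes(english), len(english))  # bytes == characters for ASCII
print(utf8_bytes(burmese), len(burmese))  # roughly 3 bytes per character
```

Real tokenizers then merge frequent byte sequences learned from largely English-heavy corpora, so in practice the gap tends to widen rather than close.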
app.png ADDED

Git LFS Details

  • SHA256: cf697d7a4466ebc64b145b06f24e24254490713afd7620da9206b0ed7408a490
  • Pointer size: 131 Bytes
  • Size of remote file: 169 kB
app.py CHANGED
@@ -121,8 +121,7 @@ with gr.Blocks(title="Tokenizer Language Comparison") as demo:
 (e.g. try English vs. Burmese).
 
 This is part of a larger project of measuring inequality in NLP.
-See the original article: [All languages are NOT created (tokenized) equal]
-(https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
+See the original article 'All languages are NOT created (tokenized) equal'
 on [Art Fish Intelligence](https://www.artfish.ai/).
 """
 )
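Under the hood, the per-language metrics the README lists (median token length, shortest/longest bar charts) reduce to tokenizing every sentence and aggregating counts per language. A minimal sketch of that aggregation, using a stand-in character-level tokenizer rather than any of the app's real LLM tokenizers (the `dataset` dict here is illustrative, not the Amazon Massive data):

```python
from statistics import median

def char_tokenize(text: str) -> list[str]:
    # Stand-in for a real tokenizer: one token per character.
    return list(text)

def median_token_count(sentences, tokenize=char_tokenize):
    """Median number of tokens across a list of sentences."""
    return median(len(tokenize(s)) for s in sentences)

# Illustrative mini-dataset: language code -> sentences
dataset = {
    "en": ["turn on the lights", "what time is it"],
    "my": ["မီးဖွင့်ပါ"],
}

for lang, sentences in dataset.items():
    print(lang, median_token_count(sentences))
```

Swapping `char_tokenize` for a real tokenizer's encode function, and the toy dict for the 51-language validation split, gives the per-language numbers the app plots.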