tokenizers-languages / README.md

jgalego

Update docs

e86313c about 1 month ago


	---
	title: Tokenizers Languages
	emoji: 📉
	colorFrom: red
	colorTo: yellow
	sdk: gradio
	sdk_version: 6.4.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: Comparing LLM tokenizers in multiple languages
	---

	# All languages are NOT created (tokenized) equal! 🌐

	A Gradio app that compares how many tokens various LLM tokenizers need to encode comparable text in different languages.

	For some tokenizers, tokenizing a message in one language may result in 10-20x more tokens than a comparable message in another language (e.g., try English vs. Burmese).
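To make the disparity concrete, here is a minimal sketch of how a token-count ratio between two texts could be computed. It is not the app's code: `byte_fallback_encode`, `count_tokens`, and `token_ratio` are hypothetical names, and the stand-in encoder simply emits one token per UTF-8 byte, which is the upper bound for a byte-level BPE tokenizer that has learned no merges for a script. Any real tokenizer's encode function can be passed in as `encode` instead.

```python
from typing import Callable, Sequence

# An encoder maps text to a sequence of token ids.
# Any real tokenizer's encode function fits this shape.
Encoder = Callable[[str], Sequence[int]]


def byte_fallback_encode(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def count_tokens(text: str, encode: Encoder = byte_fallback_encode) -> int:
    """Number of tokens the given encoder produces for the text."""
    return len(encode(text))


def token_ratio(text_a: str, text_b: str, encode: Encoder = byte_fallback_encode) -> float:
    """How many times more tokens text_a needs than text_b."""
    baseline = count_tokens(text_b, encode)
    if baseline == 0:
        raise ValueError("text_b produced no tokens")
    return count_tokens(text_a, encode) / baseline


english = "hello"
# A Burmese word written with escapes: 9 code points from the Myanmar block,
# each of which takes 3 bytes in UTF-8.
burmese = "\u1019\u1004\u103a\u1039\u1002\u101c\u102c\u1015\u102b"

print(count_tokens(english))  # 5
print(count_tokens(burmese))  # 27
```

Real tokenizers merge frequent byte sequences, so the actual gap depends on how well each script was represented in the tokenizer's training data.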

	> 📺 Live version available at [hf.co/spaces/jgalego/tokenizers-languages](https://hf.co/spaces/jgalego/tokenizers-languages)

	> 🙏 Adapted, modified and updated from [All languages are NOT created (tokenized) equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)

	![app](app.png)

	## Features ✨

	- Interactive Tokenizer Comparison: Select from 16 different tokenizers including GPT-4, Claude, Llama 3, Mistral, Gemma, and more
	- Multi-Language Analysis: Compare tokenization across 51 languages from the Amazon Massive dataset
	- Visual Analytics:
	  - Token distribution plots with customizable histograms
	  - Median token length metrics for selected languages
	  - Bar charts showing languages with shortest/longest token counts
	  - Random example texts with token counts
	- Real-time Updates: Dynamic visualizations that update as you change selections
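The median metric and the shortest/longest ranking listed above can be sketched as follows. This is an illustration under stated assumptions rather than the app's implementation: `median_token_counts` and `rank_languages` are hypothetical helpers, the sample texts are made up, and a whitespace splitter stands in for a real tokenizer.

```python
from statistics import median
from typing import Callable, Mapping, Sequence

Encoder = Callable[[str], Sequence]


def median_token_counts(
    texts_by_language: Mapping[str, Sequence[str]], encode: Encoder
) -> dict[str, float]:
    """Median number of tokens per text, for each language that has texts."""
    return {
        lang: median(len(encode(text)) for text in texts)
        for lang, texts in texts_by_language.items()
        if texts
    }


def rank_languages(medians: Mapping[str, float], n: int = 5):
    """Return the n languages with the lowest and the n with the highest medians."""
    ordered = sorted(medians.items(), key=lambda item: item[1])
    shortest = ordered[:n]
    longest = ordered[-n:][::-1]  # highest median first
    return shortest, longest


# Made-up sample data; a whitespace splitter stands in for a real tokenizer.
samples = {
    "aa": ["one two", "one two three four"],
    "bb": ["one", "one two", "one two three"],
    "cc": ["a b c d e f"],
}
medians = median_token_counts(samples, str.split)
shortest, longest = rank_languages(medians, n=1)
print(shortest)  # [('bb', 2)]
print(longest)   # [('cc', 6)]
```

The median is less sensitive than the mean to a few unusually long texts, which matters when token counts are heavily skewed.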

	## Data Source 💾

	The data is from the validation set of the [Amazon Massive](https://huggingface.co/datasets/AmazonScience/massive) dataset, consisting of 2,033 short sentences and phrases translated into 51 different languages.

	Learn more from [Amazon's blog post](https://www.amazon.science/blog/amazon-releases-51-language-dataset-for-language-understanding).
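As a rough sketch of how the validation sentences could be grouped by language before tokenizing, the snippet below reads JSON Lines records. It assumes each record carries `locale`, `partition`, and `utt` fields and that the validation split is labeled `dev`; these names are assumptions about the raw dataset files and are not confirmed by this README. The app itself may load the data differently. `utterances_by_locale` is a hypothetical helper and the sample records are made up.

```python
import json
from collections import defaultdict
from typing import Iterable


def utterances_by_locale(lines: Iterable[str], partition: str = "dev") -> dict[str, list[str]]:
    """Group utterance text by locale, keeping only the requested partition.

    Assumes one JSON object per line with "locale", "partition", and "utt" keys.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("partition") == partition:
            grouped[record["locale"]].append(record["utt"])
    return dict(grouped)


# Made-up records for illustration only.
sample_lines = [
    '{"locale": "en-US", "partition": "dev", "utt": "wake me up at nine"}',
    '{"locale": "en-US", "partition": "train", "utt": "play some jazz"}',
    '{"locale": "pt-PT", "partition": "dev", "utt": "acorda-me às nove"}',
    "",
]
grouped = utterances_by_locale(sample_lines)
print(sorted(grouped))  # ['en-US', 'pt-PT']
```

The resulting mapping from locale to texts is the shape a per-language token-count comparison needs as input.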

	## Getting Started 🚀

	```bash
	# Install dependencies
	pip install -r requirements.txt

	# Run the app
	python app.py
	```

	The app will be available at `http://localhost:7860`, Gradio's default port.