---
title: Tokenizer Playground
emoji: 🤗
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---
# 🤗 Tokenizer Playground

An interactive web application for experimenting with Hugging Face tokenizers, aimed at NLP researchers and developers who need to quickly test and compare tokenization strategies.
## Features

### Tokenize Tab

- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed per-token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
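Under the hood, the Tokenize tab boils down to a few `transformers` calls. A minimal sketch (the function and field names are illustrative, not the app's actual code; `openai-community/gpt2` is used as a stand-in model):

```python
from transformers import AutoTokenizer

def tokenize(text: str, model_id: str = "openai-community/gpt2",
             add_special_tokens: bool = True) -> dict:
    """Tokenize text and return tokens, IDs, and basic statistics."""
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(text, add_special_tokens=add_special_tokens)
    return {
        "tokens": tok.convert_ids_to_tokens(ids),
        "ids": ids,
        "tokens_per_char": len(ids) / max(len(text), 1),
        "vocab_size": tok.vocab_size,
    }

result = tokenize("Hello world")
```

Note that BPE tokenizers like GPT-2's mark leading spaces inside the token string itself (e.g. `Ġworld`), which is why the tab shows raw token strings alongside the decoded text.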
### Detokenize Tab

- Convert token IDs back to text
- Support for various input formats (Python list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
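Accepting all three input formats is mostly a parsing problem: once the IDs are extracted, decoding is a single call. A hypothetical sketch (not the app's actual implementation):

```python
import re
from transformers import AutoTokenizer

def parse_ids(raw: str) -> list[int]:
    """Accept '[1, 2, 3]', '1,2,3', or '1 2 3' and return a list of ints."""
    return [int(m) for m in re.findall(r"-?\d+", raw)]

def detokenize(raw: str, model_id: str = "openai-community/gpt2",
               skip_special_tokens: bool = False) -> str:
    """Parse the ID string and decode it back to text."""
    tok = AutoTokenizer.from_pretrained(model_id)
    return tok.decode(parse_ids(raw), skip_special_tokens=skip_special_tokens)
```

Round-trip verification then amounts to checking that `detokenize` applied to the output of the Tokenize tab reproduces the original text (exact for GPT-2-style byte-level tokenizers; lossy tokenizers such as BERT's lowercasing WordPiece may not round-trip exactly).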
### Compare Tab

- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
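The comparison itself is a loop over tokenizers plus a sort. A simplified sketch of what the Compare tab computes (names are illustrative):

```python
from transformers import AutoTokenizer

def compare(text: str, model_ids: list[str]) -> list[dict]:
    """Token counts per model, sorted ascending (fewest tokens first)."""
    rows = []
    for model_id in model_ids:
        tok = AutoTokenizer.from_pretrained(model_id)
        n = len(tok.encode(text, add_special_tokens=False))
        rows.append({
            "model": model_id,
            "tokens": n,
            "tokens_per_char": round(n / max(len(text), 1), 3),
        })
    return sorted(rows, key=lambda row: row["tokens"])
```

Special tokens are excluded here so the counts reflect the text itself rather than each model's framing tokens.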
### Vocabulary Tab

- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary
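All of this information is exposed directly by the tokenizer object. A sketch of how the tab's summary could be assembled (the function and dict keys are illustrative):

```python
from transformers import AutoTokenizer

def vocab_summary(model_id: str, n_preview: int = 100) -> dict:
    """Summarize a tokenizer's vocabulary and special-token configuration."""
    tok = AutoTokenizer.from_pretrained(model_id)
    vocab = tok.get_vocab()  # maps token string -> token ID
    preview = sorted(vocab, key=vocab.get)[:n_preview]  # lowest IDs first
    return {
        "vocab_size": len(vocab),
        "tokenizer_class": type(tok).__name__,
        "special_tokens": tok.special_tokens_map,
        "preview": preview,
    }
```

`get_vocab()` includes added special tokens, so its length can differ slightly from `tok.vocab_size` for some models.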
## Supported Models

### Pre-configured Models

- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models

You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:

- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
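Loading an arbitrary Hub ID can fail for gated repos, typos, or repos without tokenizer files, so it is worth wrapping the load in an error boundary. A hypothetical helper (not the app's actual code):

```python
from transformers import AutoTokenizer

def load_custom(model_id: str):
    """Try to load any Hub tokenizer, preferring the fast (Rust) variant."""
    try:
        return AutoTokenizer.from_pretrained(model_id, use_fast=True)
    except Exception as err:  # gated/missing repos, no tokenizer files, etc.
        raise ValueError(f"Could not load tokenizer '{model_id}': {err}") from err
```

In a Gradio UI, the raised `ValueError` message can be surfaced to the user instead of a stack trace.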
## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
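Tokenizer caching can be as simple as memoizing the loader, so repeated requests for the same model skip the (comparatively slow) load from disk or the Hub. One possible approach, not necessarily the app's:

```python
from functools import lru_cache
from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    """Load each tokenizer once; repeated calls return the cached object."""
    return AutoTokenizer.from_pretrained(model_id)
```

`maxsize=32` bounds memory use; least-recently-used tokenizers are evicted first.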
## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields
## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization lets models handle out-of-vocabulary words by splitting them into known pieces
- Token efficiency directly affects inference costs and API usage
## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860`.
## License

This project is licensed under the MIT License.