---
title: Tokenizer Playground
emoji: πŸ”€
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

# πŸ”€ Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.

## Features

### πŸ”€ Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
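
The core of the Tokenize tab can be sketched in a few lines with the Transformers library. This is a minimal standalone illustration using GPT-2 (one of the pre-configured models), not the app's exact code:

```python
from transformers import AutoTokenizer

# A sketch of the Tokenize tab's core logic, using GPT-2 as an example.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Tokenizers split text into subword units."
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(tokens)
print(ids)
print(f"{len(ids)} tokens for {len(text)} characters "
      f"({len(ids) / len(text):.2f} tokens/char)")
print(f"vocabulary size: {tokenizer.vocab_size}")
```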

### πŸ”„ Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
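
The flexible input parsing and round-trip check can be sketched as follows; `parse_token_ids` is a hypothetical helper written for this example, not necessarily the app's implementation:

```python
from transformers import AutoTokenizer

def parse_token_ids(raw: str) -> list[int]:
    """Accept '[15496, 995]', '15496,995', or '15496 995' alike
    (a sketch of the tab's flexible input parsing)."""
    parts = raw.strip().strip("[]").replace(",", " ").split()
    return [int(p) for p in parts]

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
ids = parse_token_ids("[15496, 995]")
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "Hello world"

# Round-trip verification: re-encoding the decoded text should
# reproduce the original IDs.
assert tokenizer.encode(text, add_special_tokens=False) == ids
```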

### πŸ“Š Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
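
A comparison like the one this tab performs boils down to counting tokens per model and sorting. The model IDs below are illustrative (the app offers many more):

```python
from transformers import AutoTokenizer

# A sketch of the Compare tab: tokenize one text with several models
# and sort by token count (most efficient first).
text = "Subword tokenization handles rare words gracefully."
model_ids = ["openai-community/gpt2", "google-bert/bert-base-uncased"]

results = []
for model_id in model_ids:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    count = len(tokenizer.encode(text, add_special_tokens=False))
    results.append((model_id, count))

for model_id, count in sorted(results, key=lambda r: r[1]):
    print(f"{model_id}: {count} tokens")
```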

### πŸ“– Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse first 100 tokens in the vocabulary
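
The vocabulary details shown here are all exposed by the tokenizer object itself; a minimal sketch (again using GPT-2 as an example):

```python
from transformers import AutoTokenizer

# A sketch of the Vocabulary tab's inspection logic.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

print("tokenizer type:", type(tokenizer).__name__)
print("vocab size:", tokenizer.vocab_size)
print("special tokens:", tokenizer.special_tokens_map)

# First 100 vocabulary entries, ordered by token ID
first_100 = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])[:100]
```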

## Supported Models

### Pre-configured Models
- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
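
Loading a custom tokenizer uses the same one-liner as the pre-configured models, so any of the IDs above (or any other Hub tokenizer) should work interchangeably:

```python
from transformers import AutoTokenizer

# facebook/bart-base is one of the example custom model IDs listed above.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
print(tokenizer.tokenize("Tokenizer Playground"))
```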

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
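
The tokenizer caching mentioned above could be implemented with a simple memoized loader; this is a sketch of the idea, not necessarily the app's actual code:

```python
from functools import lru_cache

from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    # Load each tokenizer at most once; repeat requests return the
    # cached object instead of reloading from disk or the Hub.
    return AutoTokenizer.from_pretrained(model_id)

first = get_tokenizer("openai-community/gpt2")
second = get_tokenizer("openai-community/gpt2")
assert first is second  # second call hit the cache
```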

## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API usage

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860` (Gradio's default port).

## License

This project is licensed under the MIT License.