---
title: Tokenizer Playground
emoji: πŸ”€
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

# πŸ”€ Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.

## Features

### πŸ”€ Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
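
The core of the Tokenize tab can be sketched in a few lines with the Transformers library. This is a minimal standalone illustration using GPT-2 (one of the pre-configured models), not the app's exact code:

```python
from transformers import AutoTokenizer

# A sketch of the Tokenize tab's core logic, using GPT-2 as an example.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Tokenizers split text into subword units."
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(tokens)
print(ids)
print(f"{len(ids)} tokens for {len(text)} characters "
      f"({len(ids) / len(text):.2f} tokens/char)")
print(f"vocabulary size: {tokenizer.vocab_size}")
```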

### πŸ”„ Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
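
The flexible input parsing and round-trip check can be sketched as follows; `parse_token_ids` is a hypothetical helper written for this example, not necessarily the app's implementation:

```python
from transformers import AutoTokenizer

def parse_token_ids(raw: str) -> list[int]:
    """Accept '[15496, 995]', '15496,995', or '15496 995' alike
    (a sketch of the tab's flexible input parsing)."""
    parts = raw.strip().strip("[]").replace(",", " ").split()
    return [int(p) for p in parts]

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
ids = parse_token_ids("[15496, 995]")
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "Hello world"

# Round-trip verification: re-encoding the decoded text should
# reproduce the original IDs.
assert tokenizer.encode(text, add_special_tokens=False) == ids
```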

### πŸ“Š Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
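
A comparison like the one this tab performs boils down to counting tokens per model and sorting. The model IDs below are illustrative (the app offers many more):

```python
from transformers import AutoTokenizer

# A sketch of the Compare tab: tokenize one text with several models
# and sort by token count (most efficient first).
text = "Subword tokenization handles rare words gracefully."
model_ids = ["openai-community/gpt2", "google-bert/bert-base-uncased"]

results = []
for model_id in model_ids:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    count = len(tokenizer.encode(text, add_special_tokens=False))
    results.append((model_id, count))

for model_id, count in sorted(results, key=lambda r: r[1]):
    print(f"{model_id}: {count} tokens")
```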

### πŸ“– Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse first 100 tokens in the vocabulary
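
The vocabulary details shown here are all exposed by the tokenizer object itself; a minimal sketch (again using GPT-2 as an example):

```python
from transformers import AutoTokenizer

# A sketch of the Vocabulary tab's inspection logic.
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

print("tokenizer type:", type(tokenizer).__name__)
print("vocab size:", tokenizer.vocab_size)
print("special tokens:", tokenizer.special_tokens_map)

# First 100 vocabulary entries, ordered by token ID
first_100 = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])[:100]
```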

## Supported Models

### Pre-configured Models
- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
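
Loading a custom tokenizer uses the same one-liner as the pre-configured models, so any of the IDs above (or any other Hub tokenizer) should work interchangeably:

```python
from transformers import AutoTokenizer

# facebook/bart-base is one of the example custom model IDs listed above.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
print(tokenizer.tokenize("Tokenizer Playground"))
```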

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
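
The tokenizer caching mentioned above could be implemented with a simple memoized loader; this is a sketch of the idea, not necessarily the app's actual code:

```python
from functools import lru_cache

from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    # Load each tokenizer at most once; repeat requests return the
    # cached object instead of reloading from disk or the Hub.
    return AutoTokenizer.from_pretrained(model_id)

first = get_tokenizer("openai-community/gpt2")
second = get_tokenizer("openai-community/gpt2")
assert first is second  # second call hit the cache
```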

## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API usage

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860` (Gradio's default port).

## License

This project is licensed under the MIT License.