---
title: Tokenizer Playground
emoji: 🤗
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
- Qwen/Qwen3-0.6B
- Qwen/Qwen2.5-7B
- meta-llama/Llama-3.1-8B
- openai-community/gpt2
- mistralai/Mistral-7B-v0.1
- google/gemma-7b
tags:
- tokenizer
- nlp
- text-processing
- research-tool
short_description: Interactive tokenizer tool for NLP researchers
---
# 🤗 Tokenizer Playground
An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.
## Features
### Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
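Under the hood, tokenizing with the Transformers library comes down to a few calls. A minimal sketch using GPT-2 (chosen here only because it is small; any Hub model ID works the same way):

```python
from transformers import AutoTokenizer

# Load any tokenizer by its Hugging Face Hub model ID.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers split text into subword units."
ids = tok.encode(text, add_special_tokens=False)  # token IDs
tokens = tok.convert_ids_to_tokens(ids)           # human-readable tokens

print(tokens)
print(ids)
print(f"{len(ids)} tokens for {len(text)} characters "
      f"({len(ids) / len(text):.2f} tokens/char)")
```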
### Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
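Round-trip verification can be sketched in a few lines; GPT-2 is assumed here as the example model, and its byte-level BPE makes encode/decode lossless for plain text:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example model; any Hub ID works

# Whatever input format the IDs arrive in, they reduce to a list of ints.
ids = tok.encode("The quick brown fox", add_special_tokens=False)

# Decode back to text; skip_special_tokens drops markers like <s> / </s>.
decoded = tok.decode(ids, skip_special_tokens=True)
print(decoded)

# Round-trip check: the decoded text matches the original input.
assert decoded == "The quick brown fox"
```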
### Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
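The comparison logic amounts to encoding the same text with each tokenizer and sorting by token count. A sketch with two small tokenizers picked for illustration (the app compares whichever models you select):

```python
from transformers import AutoTokenizer

text = "Tokenization efficiency varies widely between vocabularies."

counts = {}
for model_id in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(model_id)
    counts[model_id] = len(tok.encode(text, add_special_tokens=False))

# Sort ascending: fewer tokens means a more efficient tokenizer for this text.
for model_id, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{model_id}: {n} tokens")
```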
### Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse first 100 tokens in the vocabulary
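The same vocabulary details are exposed directly on the tokenizer object; a sketch using `bert-base-uncased` as an assumed example model:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model

print("Vocabulary size:", tok.vocab_size)
print("Tokenizer class:", type(tok).__name__)
print("Special tokens:", tok.all_special_tokens)

# First tokens ordered by ID, as the Vocabulary tab lists them.
vocab = tok.get_vocab()                 # token -> id mapping
by_id = sorted(vocab, key=vocab.get)
print(by_id[:10])
```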
## Supported Models
### Pre-configured Models
- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM
### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
## Technical Details
- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
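Tokenizer caching can be done in several ways; one illustrative sketch (not necessarily the app's exact code) memoizes on the model ID so repeated requests reuse the loaded object:

```python
from functools import lru_cache

from transformers import AutoTokenizer


# Memoize by model ID: the first call downloads/loads the tokenizer,
# later calls for the same ID return the cached instance.
@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    return AutoTokenizer.from_pretrained(model_id)


first = get_tokenizer("gpt2")
second = get_tokenizer("gpt2")
assert first is second  # second call hit the cache; nothing was reloaded
```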
## Quick Start
1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields
## Tips
- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API usage
## Local Development
To run this application locally:
```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground
# Install dependencies
pip install -r requirements.txt
# Run the application
python app.py
```
The application will be available at `http://localhost:7860`.
## License
This project is licensed under the MIT License.