---
title: Tokenizer Playground
emoji: 🤗
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---
# 🤗 Tokenizer Playground

An interactive web application for experimenting with Hugging Face tokenizers, aimed at NLP researchers and developers who need to quickly test and compare tokenization strategies.
## Features

### Tokenize Tab

- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed per-token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
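Under the hood, the Tokenize tab boils down to a few `transformers` calls. A minimal sketch (the function and field names are illustrative, not the app's actual code; `openai-community/gpt2` is used as a stand-in model):

```python
from transformers import AutoTokenizer

def tokenize(text: str, model_id: str = "openai-community/gpt2",
             add_special_tokens: bool = True) -> dict:
    """Tokenize text and return tokens, IDs, and basic statistics."""
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(text, add_special_tokens=add_special_tokens)
    return {
        "tokens": tok.convert_ids_to_tokens(ids),
        "ids": ids,
        "tokens_per_char": len(ids) / max(len(text), 1),
        "vocab_size": tok.vocab_size,
    }

result = tokenize("Hello world")
```

Note that BPE tokenizers like GPT-2's mark leading spaces inside the token string itself (e.g. `Ġworld`), which is why the tab shows raw token strings alongside the decoded text.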
### Detokenize Tab

- Convert token IDs back to text
- Support for various input formats (Python list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
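Accepting all three input formats is mostly a parsing problem: once the IDs are extracted, decoding is a single call. A hypothetical sketch (not the app's actual implementation):

```python
import re
from transformers import AutoTokenizer

def parse_ids(raw: str) -> list[int]:
    """Accept '[1, 2, 3]', '1,2,3', or '1 2 3' and return a list of ints."""
    return [int(m) for m in re.findall(r"-?\d+", raw)]

def detokenize(raw: str, model_id: str = "openai-community/gpt2",
               skip_special_tokens: bool = False) -> str:
    """Parse the ID string and decode it back to text."""
    tok = AutoTokenizer.from_pretrained(model_id)
    return tok.decode(parse_ids(raw), skip_special_tokens=skip_special_tokens)
```

Round-trip verification then amounts to checking that `detokenize` applied to the output of the Tokenize tab reproduces the original text (exact for GPT-2-style byte-level tokenizers; lossy tokenizers such as BERT's lowercasing WordPiece may not round-trip exactly).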
### Compare Tab

- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
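The comparison itself is a loop over tokenizers plus a sort. A simplified sketch of what the Compare tab computes (names are illustrative):

```python
from transformers import AutoTokenizer

def compare(text: str, model_ids: list[str]) -> list[dict]:
    """Token counts per model, sorted ascending (fewest tokens first)."""
    rows = []
    for model_id in model_ids:
        tok = AutoTokenizer.from_pretrained(model_id)
        n = len(tok.encode(text, add_special_tokens=False))
        rows.append({
            "model": model_id,
            "tokens": n,
            "tokens_per_char": round(n / max(len(text), 1), 3),
        })
    return sorted(rows, key=lambda row: row["tokens"])
```

Special tokens are excluded here so the counts reflect the text itself rather than each model's framing tokens.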
### Vocabulary Tab

- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary
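All of this information is exposed directly by the tokenizer object. A sketch of how the tab's summary could be assembled (the function and dict keys are illustrative):

```python
from transformers import AutoTokenizer

def vocab_summary(model_id: str, n_preview: int = 100) -> dict:
    """Summarize a tokenizer's vocabulary and special-token configuration."""
    tok = AutoTokenizer.from_pretrained(model_id)
    vocab = tok.get_vocab()  # maps token string -> token ID
    preview = sorted(vocab, key=vocab.get)[:n_preview]  # lowest IDs first
    return {
        "vocab_size": len(vocab),
        "tokenizer_class": type(tok).__name__,
        "special_tokens": tok.special_tokens_map,
        "preview": preview,
    }
```

`get_vocab()` includes added special tokens, so its length can differ slightly from `tok.vocab_size` for some models.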
## Supported Models

### Pre-configured Models

- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models

You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:

- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
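Loading an arbitrary Hub ID can fail for gated repos, typos, or repos without tokenizer files, so it is worth wrapping the load in an error boundary. A hypothetical helper (not the app's actual code):

```python
from transformers import AutoTokenizer

def load_custom(model_id: str):
    """Try to load any Hub tokenizer, preferring the fast (Rust) variant."""
    try:
        return AutoTokenizer.from_pretrained(model_id, use_fast=True)
    except Exception as err:  # gated/missing repos, no tokenizer files, etc.
        raise ValueError(f"Could not load tokenizer '{model_id}': {err}") from err
```

In a Gradio UI, the raised `ValueError` message can be surfaced to the user instead of a stack trace.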
## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
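Tokenizer caching can be as simple as memoizing the loader, so repeated requests for the same model skip the (comparatively slow) load from disk or the Hub. One possible approach, not necessarily the app's:

```python
from functools import lru_cache
from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    """Load each tokenizer once; repeated calls return the cached object."""
    return AutoTokenizer.from_pretrained(model_id)
```

`maxsize=32` bounds memory use; least-recently-used tokenizers are evicted first.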
## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields
## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization lets models handle out-of-vocabulary words by splitting them into known pieces
- Token efficiency directly affects inference costs and API usage
## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860`.
## License

This project is licensed under the MIT License.