add tokenizer

Files changed:
- DEPLOYMENT.md +191 -0
- README.md +110 -8
- app.py +467 -0
- requirements.txt +8 -0
- test_tokenizer.py +165 -0

DEPLOYMENT.md (added)
# Deployment Instructions for Hugging Face Spaces

This guide will help you deploy the Tokenizer Playground to Hugging Face Spaces.

## Prerequisites

1. A Hugging Face account (create one at https://huggingface.co/join)
2. Git installed on your local machine
3. (Optional) Hugging Face CLI installed: `pip install huggingface-hub`

## Step 1: Create a New Space

1. Go to https://huggingface.co/spaces
2. Click on "Create new Space"
3. Fill in the following:
   - **Space name**: Choose a unique name (e.g., "tokenizer-playground")
   - **Space SDK**: Choose **Gradio**
   - **Space hardware**: Start with **CPU basic** (free tier)
   - **Repo type**: Public or Private (your choice)
4. Click "Create Space"

## Step 2: Clone Your Space Repository

After creating the Space, you'll be redirected to its page. Clone the repository:

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME
```

## Step 3: Add the Application Files

Copy all the files from this project to your Space repository:

```bash
# Copy the application files
cp path/to/tokenizer/app.py .
cp path/to/tokenizer/requirements.txt .
cp path/to/tokenizer/README.md .
cp path/to/tokenizer/.gitignore .
```

## Step 4: Commit and Push

```bash
git add .
git commit -m "Initial deployment of Tokenizer Playground"
git push
```

## Step 5: Monitor the Build

1. Go to your Space URL: https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
2. Click the "Files" tab to verify all files are uploaded
3. Click the "Logs" tab to monitor the build process
4. The Space will automatically build and deploy

## Step 6: (Optional) Configure Settings

### Secrets and Environment Variables

If you want to use private models or add API keys:

1. Go to your Space settings
2. Add secrets under "Repository secrets"
3. Access them in your code using `os.environ['SECRET_NAME']`
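For example, a Space can read a Hugging Face token stored as a repository secret. The secret name `HF_TOKEN` below is just an example; use whatever name you configured in the Space settings:

```python
import os

# Hypothetical secret name; match it to the repository secret you created.
hf_token = os.environ.get("HF_TOKEN")  # None if the secret is not set

if hf_token is None:
    print("HF_TOKEN is not set; gated models will not be downloadable.")
```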
### Hardware Upgrade

For better performance:

1. Go to Settings → Hardware
2. Select a GPU tier (T4 small, T4 medium, A10G small, etc.)
3. Note: GPU tiers are paid options

### Persistent Storage

For caching tokenizers:

1. Go to Settings → Persistent storage
2. Enable persistent storage (paid feature)
3. This will cache downloaded models between restarts
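To make downloads actually land on the persistent disk, point the Hugging Face cache at it before any tokenizer is loaded. This sketch assumes the Spaces default mount point `/data`:

```python
import os

# Assumption: persistent storage is mounted at /data (the Spaces default).
# This must run before transformers downloads anything.
os.environ.setdefault("HF_HOME", "/data/.huggingface")
```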
## Troubleshooting

### Common Issues

1. **Build fails with dependency errors**
   - Check that all packages in requirements.txt are compatible
   - Try pinning specific versions if conflicts occur

2. **Space crashes on startup**
   - Check the logs for error messages
   - Ensure app.py calls `app.launch()` at the end
   - Verify the Python syntax is correct

3. **Models fail to load**
   - Some models require authentication
   - Add your HF token as a secret if needed
   - Some models may be too large for the free tier

4. **Slow performance**
   - Consider upgrading to GPU hardware
   - Enable persistent storage to cache models
   - Reduce the number of pre-loaded models

### Resource Limits

**Free Tier (CPU basic):**
- 2 vCPU
- 16 GB RAM
- No GPU
- Limited concurrent users

**Recommendations for Production:**
- Use T4 small or medium for a good balance of cost and performance
- Enable persistent storage to avoid re-downloading models
- Consider implementing request queuing for high traffic

## Local Testing Before Deployment

Always test locally before deploying:

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

# Test in browser at http://localhost:7860
```

## Updating Your Space

To update your deployed Space:

```bash
# Make changes to your files
git add .
git commit -m "Update: description of changes"
git push
```

The Space will automatically rebuild and redeploy.

## Using the Hugging Face CLI (Alternative Method)

If you have the Hugging Face CLI installed:

```bash
# Log in to Hugging Face
huggingface-cli login

# Upload files directly
huggingface-cli upload YOUR_USERNAME/YOUR_SPACE_NAME . . --repo-type=space
```

## Performance Optimization Tips

1. **Lazy loading**: The app already caches loaded tokenizers
2. **Model selection**: Start with smaller models for testing
3. **Batch processing**: The compare feature processes models efficiently
4. **Error handling**: Comprehensive error handling is implemented

## Security Considerations

1. **Never commit secrets**: Use environment variables for sensitive data
2. **Model access**: Some models require authentication tokens
3. **Input validation**: The app validates all inputs
4. **Rate limiting**: Consider implementing rate limiting for production

## Support

- Space-specific issues: https://huggingface.co/docs/hub/spaces
- Gradio issues: https://gradio.app/docs
- Tokenizer issues: https://huggingface.co/docs/transformers/main_classes/tokenizer

## Next Steps

After successful deployment:

1. Share your Space URL with colleagues
2. Embed the Space in websites using the embed feature
3. Monitor usage in the Analytics tab
4. Collect feedback and iterate on features
5. Consider adding more tokenizers based on user needs

---

Good luck with your deployment! The Tokenizer Playground should provide a valuable tool for the NLP research community.
README.md (changed)
---
title: Tokenizer Playground
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
- Qwen/Qwen2.5-7B
- meta-llama/Llama-3.1-8B
- openai-community/gpt2
- mistralai/Mistral-7B-v0.1
- google/gemma-7b
tags:
- tokenizer
- nlp
- text-processing
- research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

# 🤖 Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.

## Features

### Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs

### Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
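The accepted token-ID input formats can be sketched with a small parser; this is a simplified version of what the app does, and `parse_token_ids` is an illustrative name:

```python
import json

def parse_token_ids(raw: str) -> list:
    """Parse '[1, 2, 3]', '1, 2, 3', or '1 2 3' into a list of ints."""
    raw = raw.strip()
    if raw.startswith('[') and raw.endswith(']'):
        # JSON list form
        return [int(x) for x in json.loads(raw)]
    # Comma- or space-separated form
    return [int(x) for x in raw.replace(',', ' ').split()]

print(parse_token_ids("[101, 2023]"))  # → [101, 2023]
print(parse_token_ids("101, 2023"))    # → [101, 2023]
```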
### Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count

### Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary

## Supported Models

### Pre-configured Models
- **Qwen Series**: Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API usage
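As a rough, hypothetical illustration of that last point (the prices and token counts below are made up):

```python
# Hypothetical API pricing: $2.00 per million input tokens.
price_per_million = 2.00

# The same document might tokenize very differently across models:
tokens_model_a = 13_000   # efficient tokenizer
tokens_model_b = 19_500   # less efficient tokenizer (50% more tokens)

cost_a = tokens_model_a / 1_000_000 * price_per_million
cost_b = tokens_model_b / 1_000_000 * price_per_million
print(f"${cost_a:.4f} vs ${cost_b:.4f}")  # → $0.0260 vs $0.0390
```

The 50% difference in token count translates directly into a 50% difference in cost (and in context-window usage).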
## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860`.

## License

This project is licensed under the MIT License.
app.py (added)

````python
import gradio as gr
from transformers import AutoTokenizer
import json
import traceback
from typing import Optional, Dict, List, Tuple

# Popular tokenizer models
TOKENIZER_OPTIONS = {
    # Qwen Series
    "Qwen/Qwen2.5-7B": "Qwen 2.5 (7B)",
    "Qwen/Qwen2.5-72B": "Qwen 2.5 (72B)",
    "Qwen/Qwen2-7B": "Qwen 2 (7B)",
    "Qwen/Qwen2-72B": "Qwen 2 (72B)",
    "Qwen/Qwen-7B": "Qwen 1 (7B)",

    # Llama Series
    "meta-llama/Llama-3.2-1B": "Llama 3.2 (1B)",
    "meta-llama/Llama-3.2-3B": "Llama 3.2 (3B)",
    "meta-llama/Llama-3.1-8B": "Llama 3.1 (8B)",
    "meta-llama/Llama-3.1-70B": "Llama 3.1 (70B)",
    "meta-llama/Llama-2-7b-hf": "Llama 2 (7B)",
    "meta-llama/Llama-2-13b-hf": "Llama 2 (13B)",
    "meta-llama/Llama-2-70b-hf": "Llama 2 (70B)",

    # Other Popular Models
    "openai-community/gpt2": "GPT-2",
    "google/gemma-2b": "Gemma (2B)",
    "google/gemma-7b": "Gemma (7B)",
    "mistralai/Mistral-7B-v0.1": "Mistral (7B)",
    "mistralai/Mixtral-8x7B-v0.1": "Mixtral (8x7B)",
    "deepseek-ai/deepseek-coder-6.7b-base": "DeepSeek Coder (6.7B)",
    "microsoft/phi-2": "Phi-2",
    "microsoft/phi-3-mini-4k-instruct": "Phi-3 Mini",
    "01-ai/Yi-6B": "Yi (6B)",
    "01-ai/Yi-34B": "Yi (34B)",
    "google-t5/t5-base": "T5 Base",
    "google-bert/bert-base-uncased": "BERT Base (uncased)",
    "google-bert/bert-base-cased": "BERT Base (cased)",
    "EleutherAI/gpt-neox-20b": "GPT-NeoX (20B)",
    "bigscience/bloom-560m": "BLOOM (560M)",
    "facebook/opt-350m": "OPT (350M)",
    "stabilityai/stablelm-base-alpha-7b": "StableLM (7B)",
}

# Cache for loaded tokenizers
tokenizer_cache = {}

def load_tokenizer(model_id: str):
    """Load a tokenizer with caching."""
    if model_id not in tokenizer_cache:
        try:
            tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(
                model_id,
                trust_remote_code=True,
                use_fast=True  # Use the fast (Rust) tokenizer when available
            )
        except Exception as e:
            # Fall back to the slow tokenizer if the fast one is not available
            try:
                tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=True,
                    use_fast=False
                )
            except Exception:
                raise e
    return tokenizer_cache[model_id]

def tokenize_text(
    text: str,
    model_id: str,
    add_special_tokens: bool = True,
    show_special_tokens: bool = True,
    custom_model_id: Optional[str] = None
) -> Tuple[str, str, str, str, str]:
    """
    Tokenize text using the selected tokenizer.

    Returns:
        Tuple of (tokens_json, token_ids, decoded_text, token_info_json, stats)
    """
    try:
        # Use the custom model ID if provided
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "", "", "", "", "Please select or enter a tokenizer model."

        # Load tokenizer
        tokenizer = load_tokenizer(actual_model_id)

        # Tokenize
        encoded = tokenizer.encode(text, add_special_tokens=add_special_tokens)
        tokens = tokenizer.convert_ids_to_tokens(encoded)

        # Decode
        decoded = tokenizer.decode(encoded, skip_special_tokens=not show_special_tokens)

        # Create detailed token information
        token_info = []
        for i, (token, token_id) in enumerate(zip(tokens, encoded)):
            # Try to get the actual string representation of the token
            try:
                token_str = tokenizer.convert_tokens_to_string([token])
            except Exception:
                token_str = token

            token_info.append({
                "index": i,
                "token": token,
                "token_id": token_id,
                "text": token_str,
                "is_special": token_id in (tokenizer.all_special_ids if hasattr(tokenizer, 'all_special_ids') else [])
            })

        # Format outputs
        tokens_display = json.dumps(tokens, ensure_ascii=False, indent=2)
        token_ids_display = str(encoded)
        token_info_json = json.dumps(token_info, ensure_ascii=False, indent=2)

        # Statistics (guard against division by zero on empty input)
        num_chars = max(len(text), 1)
        num_tokens = max(len(tokens), 1)
        stats = f"""Statistics:
• Model: {actual_model_id}
• Number of tokens: {len(tokens)}
• Number of characters: {len(text)}
• Tokens per character: {len(tokens)/num_chars:.2f}
• Characters per token: {len(text)/num_tokens:.2f}
• Vocabulary size: {tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else 'N/A'}
• Special tokens: {', '.join(tokenizer.all_special_tokens) if hasattr(tokenizer, 'all_special_tokens') else 'N/A'}"""

        return tokens_display, token_ids_display, decoded, token_info_json, stats

    except Exception as e:
        error_msg = f"Error: {str(e)}\n{traceback.format_exc()}"
        return error_msg, "", "", "", ""

def decode_tokens(
    token_ids_str: str,
    model_id: str,
    skip_special_tokens: bool = False,
    custom_model_id: Optional[str] = None
) -> str:
    """Decode token IDs back to text."""
    try:
        # Use the custom model ID if provided
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "Please select or enter a tokenizer model."

        # Parse token IDs
        token_ids_str = token_ids_str.strip()
        if token_ids_str.startswith('[') and token_ids_str.endswith(']'):
            token_ids = json.loads(token_ids_str)
        else:
            # Try to parse as comma- or space-separated values
            token_ids = [int(x.strip()) for x in token_ids_str.replace(',', ' ').split()]

        # Load tokenizer and decode
        tokenizer = load_tokenizer(actual_model_id)
        decoded = tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)

        # Also show tokens
        tokens = tokenizer.convert_ids_to_tokens(token_ids)

        result = f"""Decoded Text:
{decoded}

Tokens:
{json.dumps(tokens, ensure_ascii=False, indent=2)}

Token Count: {len(tokens)}"""

        return result

    except Exception as e:
        return f"Error: {str(e)}\n{traceback.format_exc()}"

def compare_tokenizers(
    text: str,
    model_ids: List[str],
    add_special_tokens: bool = True
) -> str:
    """Compare tokenization across multiple models."""
    if not model_ids:
        return "Please select at least one model to compare."

    results = []

    for model_id in model_ids:
        try:
            tokenizer = load_tokenizer(model_id)
            encoded = tokenizer.encode(text, add_special_tokens=add_special_tokens)
            tokens = tokenizer.convert_ids_to_tokens(encoded)

            results.append({
                "model": model_id,
                "token_count": len(tokens),
                "tokens": tokens[:50],     # Show the first 50 tokens
                "token_ids": encoded[:50]  # Show the first 50 IDs
            })
        except Exception as e:
            results.append({
                "model": model_id,
                "error": str(e)
            })

    # Sort by token count (models that errored sort last)
    results.sort(key=lambda x: x.get("token_count", float('inf')))

    # Format output
    output = "# Tokenizer Comparison\n\n"
    output += f"Input text length: {len(text)} characters\n\n"

    num_chars = max(len(text), 1)  # guard against empty input
    for result in results:
        if "error" in result:
            output += f"## {result['model']}\n"
            output += f"Error: {result['error']}\n\n"
        else:
            output += f"## {result['model']}\n"
            output += f"**Token count:** {result['token_count']} "
            output += f"(ratio: {result['token_count']/num_chars:.2f} tokens/char)\n\n"
            output += f"**First tokens:** {result['tokens']}\n\n"
            if len(result['tokens']) == 50:
                output += "*(showing first 50 tokens)*\n\n"

    return output

def analyze_vocabulary(model_id: str, custom_model_id: Optional[str] = None) -> str:
    """Analyze tokenizer vocabulary."""
    try:
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "Please select or enter a tokenizer model."

        tokenizer = load_tokenizer(actual_model_id)

        # Get vocabulary information
        vocab_size = tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else len(tokenizer.get_vocab())

        # Get special tokens
        special_tokens = {}
        if hasattr(tokenizer, 'special_tokens_map'):
            special_tokens = tokenizer.special_tokens_map

        # Get some example tokens
        vocab = tokenizer.get_vocab()
        sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])[:100]  # First 100 tokens

        output = f"""# Tokenizer Vocabulary Analysis

**Model:** {actual_model_id}
**Vocabulary Size:** {vocab_size:,}
**Tokenizer Type:** {tokenizer.__class__.__name__}

## Special Tokens
```json
{json.dumps(special_tokens, ensure_ascii=False, indent=2)}
```

## Token Settings
• Padding Token: {tokenizer.pad_token if tokenizer.pad_token else 'None'}
• BOS Token: {tokenizer.bos_token if tokenizer.bos_token else 'None'}
• EOS Token: {tokenizer.eos_token if tokenizer.eos_token else 'None'}
• UNK Token: {tokenizer.unk_token if tokenizer.unk_token else 'None'}
• SEP Token: {tokenizer.sep_token if hasattr(tokenizer, 'sep_token') and tokenizer.sep_token else 'None'}
• CLS Token: {tokenizer.cls_token if hasattr(tokenizer, 'cls_token') and tokenizer.cls_token else 'None'}
• Mask Token: {tokenizer.mask_token if hasattr(tokenizer, 'mask_token') and tokenizer.mask_token else 'None'}

## First 100 Tokens in Vocabulary
Token → ID
"""
        for token, token_id in sorted_vocab:
            # Escape non-printable characters for display
            display_token = repr(token) if not token.isprintable() else token
            output += f"{display_token} → {token_id}\n"

        return output

    except Exception as e:
        return f"Error: {str(e)}\n{traceback.format_exc()}"

# Create the Gradio interface
with gr.Blocks(title="Tokenizer Playground", theme=gr.themes.Soft()) as app:
    gr.Markdown("""
    # Tokenizer Playground

    A comprehensive tool for NLP researchers to experiment with various Hugging Face tokenizers.
    Supports popular models including **Qwen**, **Llama**, **Mistral**, **GPT**, and many more.

    ### Features:
    - **Tokenize & Detokenize** text with any Hugging Face tokenizer
    - **Compare** tokenization across multiple models
    - **Analyze** vocabulary and special tokens
    - **Support** for custom model IDs from the Hugging Face Hub
    """)

    with gr.Tab("Tokenize"):
        with gr.Row():
            with gr.Column(scale=3):
                tokenize_input = gr.Textbox(
                    label="Input Text",
                    placeholder="Enter text to tokenize...",
                    lines=5
                )
            with gr.Column(scale=1):
                tokenize_model = gr.Dropdown(
                    label="Select Tokenizer",
                    choices=list(TOKENIZER_OPTIONS.keys()),
                    value="Qwen/Qwen2.5-7B",
                    allow_custom_value=False
                )
                tokenize_custom_model = gr.Textbox(
                    label="Or Enter Custom Model ID",
                    placeholder="e.g., facebook/bart-base",
                    info="Override selection above with any HF model"
                )
                add_special = gr.Checkbox(label="Add Special Tokens", value=True)
                show_special = gr.Checkbox(label="Show Special Tokens in Decoded", value=True)
                tokenize_btn = gr.Button("Tokenize", variant="primary")

        with gr.Row():
            with gr.Column():
                tokens_output = gr.Textbox(label="Tokens", lines=10, max_lines=20)
            with gr.Column():
                token_ids_output = gr.Textbox(label="Token IDs", lines=10, max_lines=20)

        with gr.Row():
            with gr.Column():
                decoded_output = gr.Textbox(label="Decoded Text (Verification)", lines=5)
            with gr.Column():
                token_info_output = gr.Textbox(label="Detailed Token Information", lines=10, max_lines=20)

        stats_output = gr.Textbox(label="Statistics", lines=7)

        tokenize_btn.click(
            fn=tokenize_text,
            inputs=[tokenize_input, tokenize_model, add_special, show_special, tokenize_custom_model],
            outputs=[tokens_output, token_ids_output, decoded_output, token_info_output, stats_output]
        )

    with gr.Tab("Detokenize"):
        with gr.Row():
            with gr.Column(scale=3):
                decode_input = gr.Textbox(
                    label="Token IDs",
                    placeholder="Enter token IDs as a list [101, 2023, ...] or space/comma separated",
                    lines=5
                )
            with gr.Column(scale=1):
                decode_model = gr.Dropdown(
                    label="Select Tokenizer",
                    choices=list(TOKENIZER_OPTIONS.keys()),
                    value="Qwen/Qwen2.5-7B"
                )
                decode_custom_model = gr.Textbox(
                    label="Or Enter Custom Model ID",
                    placeholder="e.g., facebook/bart-base"
                )
                skip_special = gr.Checkbox(label="Skip Special Tokens", value=False)
                decode_btn = gr.Button("Decode", variant="primary")

        decode_output = gr.Textbox(label="Decoded Result", lines=10)

        decode_btn.click(
            fn=decode_tokens,
            inputs=[decode_input, decode_model, skip_special, decode_custom_model],
            outputs=decode_output
        )

    with gr.Tab("Compare"):
        compare_input = gr.Textbox(
            label="Input Text",
            placeholder="Enter text to compare tokenization across models...",
            lines=5
        )

        compare_models = gr.CheckboxGroup(
            label="Select Models to Compare",
            choices=list(TOKENIZER_OPTIONS.keys()),
            value=["Qwen/Qwen2.5-7B", "meta-llama/Llama-3.1-8B", "openai-community/gpt2"]
        )

        compare_add_special = gr.Checkbox(label="Add Special Tokens", value=True)
        compare_btn = gr.Button("Compare Tokenizers", variant="primary")

        compare_output = gr.Markdown()

        compare_btn.click(
            fn=compare_tokenizers,
            inputs=[compare_input, compare_models, compare_add_special],
            outputs=compare_output
        )

    with gr.Tab("Vocabulary"):
        with gr.Row():
            vocab_model = gr.Dropdown(
                label="Select Tokenizer",
                choices=list(TOKENIZER_OPTIONS.keys()),
                value="Qwen/Qwen2.5-7B"
            )
            vocab_custom_model = gr.Textbox(
                label="Or Enter Custom Model ID",
                placeholder="e.g., facebook/bart-base"
            )
            vocab_btn = gr.Button("Analyze Vocabulary", variant="primary")

        vocab_output = gr.Markdown()

        vocab_btn.click(
````
|
| 412 |
+
fn=analyze_vocabulary,
|
| 413 |
+
inputs=[vocab_model, vocab_custom_model],
|
| 414 |
+
outputs=vocab_output
|
| 415 |
+
)
|
| 416 |
+
|
| 417 |
+
with gr.Tab("βΉοΈ About"):
|
| 418 |
+
gr.Markdown("""
|
| 419 |
+
## About This Tool
|
| 420 |
+
|
| 421 |
+
This tokenizer playground provides researchers and developers with an easy way to experiment
|
| 422 |
+
with various tokenizers from the Hugging Face Model Hub.
|
| 423 |
+
|
| 424 |
+
### Supported Models
|
| 425 |
+
|
| 426 |
+
**Qwen Series:** Qwen 2.5, Qwen 2, Qwen 1 (various sizes)
|
| 427 |
+
|
| 428 |
+
**Llama Series:** Llama 3.2, Llama 3.1, Llama 2 (various sizes)
|
| 429 |
+
|
| 430 |
+
**Other Popular Models:** GPT-2, Gemma, Mistral, Mixtral, DeepSeek, Phi, Yi, T5, BERT, GPT-NeoX, BLOOM, OPT, StableLM
|
| 431 |
+
|
| 432 |
+
### Custom Models
|
| 433 |
+
|
| 434 |
+
You can use any tokenizer from the Hugging Face Hub by entering its model ID in the "Custom Model ID" field.
|
| 435 |
+
For example:
|
| 436 |
+
- `facebook/bart-base`
|
| 437 |
+
- `EleutherAI/gpt-j-6b`
|
| 438 |
+
- `bigscience/bloom`
|
| 439 |
+
|
| 440 |
+
### Features Explanation
|
| 441 |
+
|
| 442 |
+
- **Tokenize:** Convert text into tokens and token IDs
|
| 443 |
+
- **Detokenize:** Convert token IDs back to text
|
| 444 |
+
- **Compare:** See how different tokenizers handle the same text
|
| 445 |
+
- **Vocabulary:** Explore tokenizer vocabulary and special tokens
|
| 446 |
+
|
| 447 |
+
### Tips
|
| 448 |
+
|
| 449 |
+
1. Different tokenizers can produce very different token counts for the same text
|
| 450 |
+
2. Special tokens (like [CLS], [SEP], <s>, </s>) are model-specific
|
| 451 |
+
3. Subword tokenization (used by most modern models) allows handling of out-of-vocabulary words
|
| 452 |
+
4. Token efficiency affects model performance and API costs
|
| 453 |
+
|
| 454 |
+
### Resources
|
| 455 |
+
|
| 456 |
+
- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer)
|
| 457 |
+
- [Understanding Tokenization](https://huggingface.co/docs/transformers/tokenizer_summary)
|
| 458 |
+
- [Model Hub](https://huggingface.co/models)
|
| 459 |
+
|
| 460 |
+
---
|
| 461 |
+
|
| 462 |
+
Made with β€οΈ for the NLP research community
|
| 463 |
+
""")
|
| 464 |
+
|
| 465 |
+
# Launch the app
|
| 466 |
+
if __name__ == "__main__":
|
| 467 |
+
app.launch()
|
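The Detokenize tab's placeholder promises to accept token IDs "as a list [101, 2023, ...] or space/comma separated". The app's actual parsing lives in `decode_tokens` earlier in app.py (not shown in this diff hunk); a minimal sketch of how such flexible input can be normalized, with an illustrative helper name, is:

```python
import re


def parse_token_ids(raw: str) -> list[int]:
    """Parse token IDs given as '[101, 2023]', '101, 2023', or '101 2023'.

    Illustrative helper, not the app's actual implementation.
    """
    # Strip optional list brackets, then split on commas and/or whitespace.
    cleaned = raw.strip().strip("[]")
    parts = [p for p in re.split(r"[\s,]+", cleaned) if p]
    return [int(p) for p in parts]


print(parse_token_ids("[101, 2023, 102]"))  # [101, 2023, 102]
print(parse_token_ids("15496 11 995"))      # [15496, 11, 995]
```

Accepting all three spellings in one field avoids forcing users to reformat IDs copied from the Tokenize tab or from Python logs.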
requirements.txt ADDED
@@ -0,0 +1,8 @@
gradio==4.44.1
transformers==4.46.0
torch==2.5.0
sentencepiece==0.2.0
protobuf==5.28.2
tokenizers==0.20.1
huggingface-hub==0.26.0
tiktoken==0.8.0
test_tokenizer.py ADDED
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""
Simple test script to verify tokenizer functionality.
This tests the core functions without launching the Gradio interface.
"""

import sys
import json

# Test imports
try:
    from transformers import AutoTokenizer
    print("✓ transformers imported successfully")
except ImportError as e:
    print(f"✗ Failed to import transformers: {e}")
    sys.exit(1)

try:
    import gradio as gr
    print("✓ gradio imported successfully")
except ImportError as e:
    print(f"✗ Failed to import gradio: {e}")
    sys.exit(1)

# Test basic tokenization
def test_basic_tokenization():
    """Test basic tokenization with a small model."""
    print("\n--- Testing Basic Tokenization ---")
    try:
        # Use GPT-2 as it's small and commonly available
        model_id = "openai-community/gpt2"
        text = "Hello, world! This is a test."

        print(f"Loading tokenizer: {model_id}")
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        print("✓ Tokenizer loaded successfully")

        # Test encoding
        encoded = tokenizer.encode(text)
        print(f"✓ Text encoded: {encoded[:10]}...")  # Show first 10 tokens

        # Test decoding
        decoded = tokenizer.decode(encoded)
        print(f"✓ Text decoded: {decoded}")

        # Verify round-trip
        assert decoded == text, "Round-trip tokenization failed"
        print("✓ Round-trip tokenization successful")

        # Test token conversion
        tokens = tokenizer.convert_ids_to_tokens(encoded)
        print(f"✓ Tokens: {tokens[:5]}...")  # Show first 5 tokens

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        return False

def test_special_tokens():
    """Test special token handling."""
    print("\n--- Testing Special Tokens ---")
    try:
        model_id = "openai-community/gpt2"
        text = "Test text"

        tokenizer = AutoTokenizer.from_pretrained(model_id)

        # With special tokens
        encoded_with = tokenizer.encode(text, add_special_tokens=True)
        # Without special tokens
        encoded_without = tokenizer.encode(text, add_special_tokens=False)

        print(f"✓ With special tokens: {len(encoded_with)} tokens")
        print(f"✓ Without special tokens: {len(encoded_without)} tokens")

        # Decode with and without special tokens
        decoded_with = tokenizer.decode(encoded_with, skip_special_tokens=False)
        decoded_without = tokenizer.decode(encoded_with, skip_special_tokens=True)

        print(f"✓ Decoded with special: {decoded_with}")
        print(f"✓ Decoded without special: {decoded_without}")

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        return False

def test_app_functions():
    """Test the main app functions."""
    print("\n--- Testing App Functions ---")
    try:
        # Import app functions
        from app import tokenize_text, decode_tokens, analyze_vocabulary

        # Test tokenize_text
        print("Testing tokenize_text function...")
        result = tokenize_text(
            text="Hello world",
            model_id="openai-community/gpt2",
            add_special_tokens=True,
            show_special_tokens=True,
            custom_model_id=None
        )
        assert len(result) == 5, "tokenize_text should return 5 values"
        print("✓ tokenize_text function works")

        # Test decode_tokens
        print("Testing decode_tokens function...")
        decode_result = decode_tokens(
            token_ids_str="[15496, 11, 995]",  # "Hello, world" in GPT-2
            model_id="openai-community/gpt2",
            skip_special_tokens=False,
            custom_model_id=None
        )
        assert "Decoded Text:" in decode_result, "decode_tokens should return decoded text"
        print("✓ decode_tokens function works")

        # Test analyze_vocabulary
        print("Testing analyze_vocabulary function...")
        vocab_result = analyze_vocabulary(
            model_id="openai-community/gpt2",
            custom_model_id=None
        )
        assert "Vocabulary Size:" in vocab_result, "analyze_vocabulary should return vocabulary info"
        print("✓ analyze_vocabulary function works")

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def main():
    """Run all tests."""
    print("=" * 50)
    print("Tokenizer Playground Test Suite")
    print("=" * 50)

    tests = [
        test_basic_tokenization,
        test_special_tokens,
        test_app_functions
    ]

    results = []
    for test in tests:
        results.append(test())

    print("\n" + "=" * 50)
    print("Test Summary")
    print("=" * 50)
    passed = sum(results)
    total = len(results)
    print(f"Passed: {passed}/{total}")

    if passed == total:
        print("✅ All tests passed!")
        return 0
    else:
        print("✗ Some tests failed")
        return 1

if __name__ == "__main__":
    sys.exit(main())
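`test_basic_tokenization` above verifies the encode/decode round trip against a real GPT-2 tokenizer, which requires downloading model files from the Hub. The same round-trip property can be illustrated offline with a toy reversible tokenizer (purely illustrative, not part of the app):

```python
import re


class ToyTokenizer:
    """Toy reversible tokenizer: each space-prefixed chunk maps to an ID and back."""

    def __init__(self):
        self.vocab = {}  # chunk -> id
        self.inv = {}    # id -> chunk

    def encode(self, text: str) -> list[int]:
        # Chunks keep their leading whitespace so decoding is lossless.
        chunks = re.findall(r"\s*\S+", text)
        ids = []
        for c in chunks:
            if c not in self.vocab:
                i = len(self.vocab)
                self.vocab[c] = i
                self.inv[i] = c
            ids.append(self.vocab[c])
        return ids

    def decode(self, ids: list[int]) -> str:
        return "".join(self.inv[i] for i in ids)


tok = ToyTokenizer()
text = "Hello, world! This is a test."
assert tok.decode(tok.encode(text)) == text  # round trip holds
```

Real subword tokenizers follow the same contract on clean input, which is why the decoded-text box in the Tokenize tab serves as a verification step.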