afeng committed commit af99c46 · 1 parent: 9093377

add tokenizer

Files changed (5):
  1. DEPLOYMENT.md +191 -0
  2. README.md +110 -8
  3. app.py +467 -0
  4. requirements.txt +8 -0
  5. test_tokenizer.py +165 -0
DEPLOYMENT.md ADDED
@@ -0,0 +1,191 @@
# Deployment Instructions for Hugging Face Spaces

This guide will help you deploy the Tokenizer Playground to Hugging Face Spaces.

## Prerequisites

1. A Hugging Face account (create one at https://huggingface.co/join)
2. Git installed on your local machine
3. (Optional) Hugging Face CLI installed: `pip install huggingface-hub`

## Step 1: Create a New Space

1. Go to https://huggingface.co/spaces
2. Click on "Create new Space"
3. Fill in the following:
   - **Space name**: Choose a unique name (e.g., "tokenizer-playground")
   - **Select the Space SDK**: Choose **Gradio**
   - **Select the Space hardware**: Start with **CPU basic** (free tier)
   - **Repo type**: Public or Private (your choice)
4. Click "Create Space"

## Step 2: Clone Your Space Repository

After creating the space, you'll be redirected to your space page. Clone the repository:

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME
```

## Step 3: Add the Application Files

Copy all the files from this project to your Space repository:

```bash
# Copy the application files
cp path/to/tokenizer/app.py .
cp path/to/tokenizer/requirements.txt .
cp path/to/tokenizer/README.md .
cp path/to/tokenizer/.gitignore .
```

## Step 4: Commit and Push

```bash
git add .
git commit -m "Initial deployment of Tokenizer Playground"
git push
```

## Step 5: Monitor the Build

1. Go to your Space URL: https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
2. Click on the "Files" tab to verify all files are uploaded
3. Click on the "Logs" tab to monitor the build process
4. The space will automatically build and deploy

## Step 6: (Optional) Configure Settings

### Secrets and Environment Variables

If you want to use private models or add API keys:

1. Go to your Space settings
2. Add secrets under "Repository secrets"
3. Access them in your code using `os.environ['SECRET_NAME']`
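Reading a secret in code can look like the following minimal sketch. The secret name `HF_TOKEN` is just an example key; `os.environ.get` is used instead of indexing so the app degrades gracefully when the secret was never configured:

```python
import os

# Read an optional secret; returns None if it was not configured in the
# Space settings. "HF_TOKEN" is an example name -- use whatever key you defined.
hf_token = os.environ.get("HF_TOKEN")

if hf_token is None:
    print("No HF_TOKEN secret set; private models will not be accessible.")
else:
    print("HF_TOKEN found; it can be passed to the tokenizer loader.")
```
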

### Hardware Upgrade

For better performance:

1. Go to Settings → Hardware
2. Select a GPU tier (T4 small, T4 medium, A10G small, etc.)
3. Note: GPU tiers are paid options

### Persistent Storage

For caching tokenizers:

1. Go to Settings → Persistent storage
2. Enable persistent storage (paid feature)
3. This will cache downloaded models between restarts

## Troubleshooting

### Common Issues

1. **Build fails with dependency errors**
   - Check that all packages in requirements.txt are compatible
   - Try pinning specific versions if conflicts occur

2. **Space crashes on startup**
   - Check the logs for error messages
   - Ensure the app.py file has `app.launch()` at the end
   - Verify Python syntax is correct

3. **Models fail to load**
   - Some models require authentication
   - Add your HF token as a secret if needed
   - Some models might be too large for the free tier

4. **Slow performance**
   - Consider upgrading to GPU hardware
   - Enable persistent storage to cache models
   - Reduce the number of pre-loaded models

### Resource Limits

**Free Tier (CPU basic):**
- 2 vCPU
- 16 GB RAM
- No GPU
- Limited concurrent users

**Recommendations for Production:**
- Use T4 small or medium for a good balance of cost and performance
- Enable persistent storage to avoid re-downloading models
- Consider implementing request queuing for high traffic

## Local Testing Before Deployment

Always test locally before deploying:

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

# Test in browser at http://localhost:7860
```

## Updating Your Space

To update your deployed Space:

```bash
# Make changes to your files
git add .
git commit -m "Update: description of changes"
git push
```

The Space will automatically rebuild and redeploy.

## Using the Hugging Face CLI (Alternative Method)

If you have the Hugging Face CLI installed:

```bash
# Login to Hugging Face
huggingface-cli login

# Upload files directly
huggingface-cli upload YOUR_USERNAME/YOUR_SPACE_NAME . . --repo-type=space
```

## Performance Optimization Tips

1. **Lazy Loading**: The app already implements tokenizer caching
2. **Model Selection**: Start with smaller models for testing
3. **Batch Processing**: The compare feature processes models efficiently
4. **Error Handling**: Comprehensive error handling is implemented
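The lazy-loading pattern mentioned in item 1 is essentially a dict keyed by model ID. A minimal sketch of the idea, where `load_fn` is a stand-in for the real (slow, network-bound) loader such as `AutoTokenizer.from_pretrained`:

```python
# Minimal lazy-loading cache, mirroring the pattern app.py uses for tokenizers.
_cache = {}

def load_cached(model_id, load_fn):
    # Only hit the (slow) loader the first time each model ID is requested.
    if model_id not in _cache:
        _cache[model_id] = load_fn(model_id)
    return _cache[model_id]

# Usage with a fake loader that records how often it actually runs:
calls = []
fake_loader = lambda mid: calls.append(mid) or f"tokenizer-for-{mid}"

load_cached("openai-community/gpt2", fake_loader)
load_cached("openai-community/gpt2", fake_loader)  # served from the cache
print(len(calls))  # the loader ran only once
```
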

## Security Considerations

1. **Never commit secrets**: Use environment variables for sensitive data
2. **Model Access**: Some models require authentication tokens
3. **Input Validation**: The app validates all inputs
4. **Rate Limiting**: Consider implementing rate limiting for production

## Support

- For Space-specific issues: https://huggingface.co/docs/hub/spaces
- For Gradio issues: https://gradio.app/docs
- For tokenizer issues: https://huggingface.co/docs/transformers/main_classes/tokenizer

## Next Steps

After successful deployment:

1. Share your Space URL with colleagues
2. Embed the Space in websites using the embed feature
3. Monitor usage in the Analytics tab
4. Collect feedback and iterate on features
5. Consider adding more tokenizers based on user needs

---

Good luck with your deployment! The Tokenizer Playground should provide a valuable tool for the NLP research community.
README.md CHANGED
@@ -1,14 +1,116 @@
---
title: Tokenizers
emoji: 🌖
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'a collection of tokenizers '
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
---
title: Tokenizer Playground
emoji: 🔀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
- Qwen/Qwen2.5-7B
- meta-llama/Llama-3.1-8B
- openai-community/gpt2
- mistralai/Mistral-7B-v0.1
- google/gemma-7b
tags:
- tokenizer
- nlp
- text-processing
- research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

# 🔀 Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.

## Features

### 🔀 Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs

### 🔄 Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
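The accepted ID formats can be handled by a small parser along these lines (a sketch of the approach, not the app's exact code):

```python
import json

def parse_token_ids(raw: str) -> list:
    """Parse token IDs given as a JSON-style list, comma-separated, or space-separated."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        return [int(x) for x in json.loads(raw)]
    # Normalize commas to spaces, then split on whitespace.
    return [int(x) for x in raw.replace(",", " ").split()]

print(parse_token_ids("[15496, 11, 995]"))  # [15496, 11, 995]
print(parse_token_ids("15496, 11, 995"))    # [15496, 11, 995]
print(parse_token_ids("15496 11 995"))      # [15496, 11, 995]
```
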

### 📊 Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count

### 📖 Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary

## Supported Models

### Pre-configured Models
- **Qwen Series**: Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies

## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API costs
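As a rough illustration of why token counts vary, compare the two extremes that real subword tokenizers fall between — a character-level split and a whitespace word-level split (toy splitters only, not real tokenizers):

```python
text = "Tokenization strategies differ."

# Two toy "tokenizers": character-level vs. whitespace word-level.
# Real BPE/SentencePiece vocabularies land somewhere between these counts.
char_tokens = list(text)
word_tokens = text.split()

print(len(char_tokens))  # 31
print(len(word_tokens))  # 3
```
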

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860`.

## License

This project is licensed under the MIT License.
app.py ADDED
@@ -0,0 +1,467 @@
import gradio as gr
from transformers import AutoTokenizer
import json
import traceback
from typing import Optional, List, Tuple

# Popular tokenizer models
TOKENIZER_OPTIONS = {
    # Qwen Series
    "Qwen/Qwen2.5-7B": "Qwen 2.5 (7B)",
    "Qwen/Qwen2.5-72B": "Qwen 2.5 (72B)",
    "Qwen/Qwen2-7B": "Qwen 2 (7B)",
    "Qwen/Qwen2-72B": "Qwen 2 (72B)",
    "Qwen/Qwen-7B": "Qwen 1 (7B)",

    # Llama Series
    "meta-llama/Llama-3.2-1B": "Llama 3.2 (1B)",
    "meta-llama/Llama-3.2-3B": "Llama 3.2 (3B)",
    "meta-llama/Llama-3.1-8B": "Llama 3.1 (8B)",
    "meta-llama/Llama-3.1-70B": "Llama 3.1 (70B)",
    "meta-llama/Llama-2-7b-hf": "Llama 2 (7B)",
    "meta-llama/Llama-2-13b-hf": "Llama 2 (13B)",
    "meta-llama/Llama-2-70b-hf": "Llama 2 (70B)",

    # Other Popular Models
    "openai-community/gpt2": "GPT-2",
    "google/gemma-2b": "Gemma (2B)",
    "google/gemma-7b": "Gemma (7B)",
    "mistralai/Mistral-7B-v0.1": "Mistral (7B)",
    "mistralai/Mixtral-8x7B-v0.1": "Mixtral (8x7B)",
    "deepseek-ai/deepseek-coder-6.7b-base": "DeepSeek Coder (6.7B)",
    "microsoft/phi-2": "Phi-2",
    "microsoft/phi-3-mini-4k-instruct": "Phi-3 Mini",
    "01-ai/Yi-6B": "Yi (6B)",
    "01-ai/Yi-34B": "Yi (34B)",
    "google-t5/t5-base": "T5 Base",
    "google-bert/bert-base-uncased": "BERT Base (uncased)",
    "google-bert/bert-base-cased": "BERT Base (cased)",
    "EleutherAI/gpt-neox-20b": "GPT-NeoX (20B)",
    "bigscience/bloom-560m": "BLOOM (560M)",
    "facebook/opt-350m": "OPT (350M)",
    "stabilityai/stablelm-base-alpha-7b": "StableLM (7B)",
}

# Cache for loaded tokenizers
tokenizer_cache = {}

def load_tokenizer(model_id: str):
    """Load a tokenizer with caching."""
    if model_id not in tokenizer_cache:
        try:
            tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(
                model_id,
                trust_remote_code=True,
                use_fast=True  # Use the fast (Rust) tokenizer when available
            )
        except Exception as e:
            # Fall back to the slow (Python) tokenizer if the fast one is unavailable
            try:
                tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=True,
                    use_fast=False
                )
            except Exception:
                raise e
    return tokenizer_cache[model_id]

def tokenize_text(
    text: str,
    model_id: str,
    add_special_tokens: bool = True,
    show_special_tokens: bool = True,
    custom_model_id: Optional[str] = None
) -> Tuple[str, str, str, str, str]:
    """
    Tokenize text using the selected tokenizer.

    Returns:
        Tuple of (tokens_json, token_ids, decoded_text, token_info_json, stats)
    """
    try:
        # Use the custom model ID if provided
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "", "", "", "", "Please select or enter a tokenizer model."

        if not text:
            return "", "", "", "", "Please enter some text to tokenize."

        # Load tokenizer
        tokenizer = load_tokenizer(actual_model_id)

        # Tokenize
        encoded = tokenizer.encode(text, add_special_tokens=add_special_tokens)
        tokens = tokenizer.convert_ids_to_tokens(encoded)

        # Decode
        decoded = tokenizer.decode(encoded, skip_special_tokens=not show_special_tokens)

        # Create detailed token information
        special_ids = tokenizer.all_special_ids if hasattr(tokenizer, 'all_special_ids') else []
        token_info = []
        for i, (token, token_id) in enumerate(zip(tokens, encoded)):
            # Try to get the actual string representation of the token
            try:
                token_str = tokenizer.convert_tokens_to_string([token])
            except Exception:
                token_str = token

            token_info.append({
                "index": i,
                "token": token,
                "token_id": token_id,
                "text": token_str,
                "is_special": token_id in special_ids
            })

        # Format outputs
        tokens_display = json.dumps(tokens, ensure_ascii=False, indent=2)
        token_ids_display = str(encoded)
        token_info_json = json.dumps(token_info, ensure_ascii=False, indent=2)

        # Statistics (guard against an empty token list)
        token_count = max(len(tokens), 1)
        stats = f"""Statistics:
• Model: {actual_model_id}
• Number of tokens: {len(tokens)}
• Number of characters: {len(text)}
• Tokens per character: {len(tokens)/len(text):.2f}
• Characters per token: {len(text)/token_count:.2f}
• Vocabulary size: {tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else 'N/A'}
• Special tokens: {', '.join(tokenizer.all_special_tokens) if hasattr(tokenizer, 'all_special_tokens') else 'N/A'}"""

        return tokens_display, token_ids_display, decoded, token_info_json, stats

    except Exception as e:
        error_msg = f"Error: {str(e)}\n{traceback.format_exc()}"
        return error_msg, "", "", "", ""

def decode_tokens(
    token_ids_str: str,
    model_id: str,
    skip_special_tokens: bool = False,
    custom_model_id: Optional[str] = None
) -> str:
    """Decode token IDs back to text."""
    try:
        # Use the custom model ID if provided
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "Please select or enter a tokenizer model."

        # Parse token IDs
        token_ids_str = token_ids_str.strip()
        if token_ids_str.startswith('[') and token_ids_str.endswith(']'):
            token_ids = json.loads(token_ids_str)
        else:
            # Try to parse as comma- or space-separated values
            token_ids = [int(x.strip()) for x in token_ids_str.replace(',', ' ').split()]

        # Load tokenizer and decode
        tokenizer = load_tokenizer(actual_model_id)
        decoded = tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)

        # Also show tokens
        tokens = tokenizer.convert_ids_to_tokens(token_ids)

        result = f"""Decoded Text:
{decoded}

Tokens:
{json.dumps(tokens, ensure_ascii=False, indent=2)}

Token Count: {len(tokens)}"""

        return result

    except Exception as e:
        return f"Error: {str(e)}\n{traceback.format_exc()}"

def compare_tokenizers(
    text: str,
    model_ids: List[str],
    add_special_tokens: bool = True
) -> str:
    """Compare tokenization across multiple models."""
    if not model_ids:
        return "Please select at least one model to compare."
    if not text:
        return "Please enter some text to compare."

    results = []

    for model_id in model_ids:
        try:
            tokenizer = load_tokenizer(model_id)
            encoded = tokenizer.encode(text, add_special_tokens=add_special_tokens)
            tokens = tokenizer.convert_ids_to_tokens(encoded)

            results.append({
                "model": model_id,
                "token_count": len(tokens),
                "tokens": tokens[:50],  # Show the first 50 tokens
                "token_ids": encoded[:50]  # Show the first 50 IDs
            })
        except Exception as e:
            results.append({
                "model": model_id,
                "error": str(e)
            })

    # Sort by token count (models that errored sort last)
    results.sort(key=lambda x: x.get("token_count", float('inf')))

    # Format output
    output = "# Tokenizer Comparison\n\n"
    output += f"Input text length: {len(text)} characters\n\n"

    for result in results:
        if "error" in result:
            output += f"## {result['model']}\n"
            output += f"Error: {result['error']}\n\n"
        else:
            output += f"## {result['model']}\n"
            output += f"**Token count:** {result['token_count']} "
            output += f"(ratio: {result['token_count']/len(text):.2f} tokens/char)\n\n"
            output += f"**First tokens:** {result['tokens']}\n\n"
            if len(result['tokens']) == 50:
                output += "*(showing first 50 tokens)*\n\n"

    return output

def analyze_vocabulary(model_id: str, custom_model_id: Optional[str] = None) -> str:
    """Analyze tokenizer vocabulary."""
    try:
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "Please select or enter a tokenizer model."

        tokenizer = load_tokenizer(actual_model_id)

        # Get vocabulary information
        vocab_size = tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else len(tokenizer.get_vocab())

        # Get special tokens
        special_tokens = {}
        if hasattr(tokenizer, 'special_tokens_map'):
            special_tokens = tokenizer.special_tokens_map

        # Get some example tokens
        vocab = tokenizer.get_vocab()
        sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])[:100]  # First 100 tokens

        output = f"""# Tokenizer Vocabulary Analysis

**Model:** {actual_model_id}
**Vocabulary Size:** {vocab_size:,}
**Tokenizer Type:** {tokenizer.__class__.__name__}

## Special Tokens
```json
{json.dumps(special_tokens, ensure_ascii=False, indent=2)}
```

## Token Settings
• Padding Token: {tokenizer.pad_token if tokenizer.pad_token else 'None'}
• BOS Token: {tokenizer.bos_token if tokenizer.bos_token else 'None'}
• EOS Token: {tokenizer.eos_token if tokenizer.eos_token else 'None'}
• UNK Token: {tokenizer.unk_token if tokenizer.unk_token else 'None'}
• SEP Token: {tokenizer.sep_token if hasattr(tokenizer, 'sep_token') and tokenizer.sep_token else 'None'}
• CLS Token: {tokenizer.cls_token if hasattr(tokenizer, 'cls_token') and tokenizer.cls_token else 'None'}
• Mask Token: {tokenizer.mask_token if hasattr(tokenizer, 'mask_token') and tokenizer.mask_token else 'None'}

## First 100 Tokens in Vocabulary
Token → ID
"""
        for token, token_id in sorted_vocab:
            # Escape unprintable characters for display
            display_token = repr(token) if not token.isprintable() else token
            output += f"{display_token} → {token_id}\n"

        return output

    except Exception as e:
        return f"Error: {str(e)}\n{traceback.format_exc()}"

# Create the Gradio interface
with gr.Blocks(title="🤗 Tokenizer Playground", theme=gr.themes.Soft()) as app:
    gr.Markdown("""
    # 🤗 Tokenizer Playground

    A comprehensive tool for NLP researchers to experiment with various Hugging Face tokenizers.
    Supports popular models including **Qwen**, **Llama**, **Mistral**, **GPT**, and many more.

    ### Features:
    - 🔀 **Tokenize & Detokenize** text with any Hugging Face tokenizer
    - 📊 **Compare** tokenization across multiple models
    - 📖 **Analyze** vocabulary and special tokens
    - 🎯 **Support** for custom model IDs from the Hugging Face Hub
    """)

    with gr.Tab("🔀 Tokenize"):
        with gr.Row():
            with gr.Column(scale=3):
                tokenize_input = gr.Textbox(
                    label="Input Text",
                    placeholder="Enter text to tokenize...",
                    lines=5
                )
            with gr.Column(scale=1):
                tokenize_model = gr.Dropdown(
                    label="Select Tokenizer",
                    choices=list(TOKENIZER_OPTIONS.keys()),
                    value="Qwen/Qwen2.5-7B",
                    allow_custom_value=False
                )
                tokenize_custom_model = gr.Textbox(
                    label="Or Enter Custom Model ID",
                    placeholder="e.g., facebook/bart-base",
                    info="Overrides the selection above with any HF model"
                )
                add_special = gr.Checkbox(label="Add Special Tokens", value=True)
                show_special = gr.Checkbox(label="Show Special Tokens in Decoded", value=True)
                tokenize_btn = gr.Button("Tokenize", variant="primary")

        with gr.Row():
            with gr.Column():
                tokens_output = gr.Textbox(label="Tokens", lines=10, max_lines=20)
            with gr.Column():
                token_ids_output = gr.Textbox(label="Token IDs", lines=10, max_lines=20)

        with gr.Row():
            with gr.Column():
                decoded_output = gr.Textbox(label="Decoded Text (Verification)", lines=5)
            with gr.Column():
                token_info_output = gr.Textbox(label="Detailed Token Information", lines=10, max_lines=20)

        stats_output = gr.Textbox(label="Statistics", lines=7)

        tokenize_btn.click(
            fn=tokenize_text,
            inputs=[tokenize_input, tokenize_model, add_special, show_special, tokenize_custom_model],
            outputs=[tokens_output, token_ids_output, decoded_output, token_info_output, stats_output]
        )

    with gr.Tab("🔄 Detokenize"):
        with gr.Row():
            with gr.Column(scale=3):
                decode_input = gr.Textbox(
                    label="Token IDs",
                    placeholder="Enter token IDs as a list [101, 2023, ...] or space/comma separated",
                    lines=5
                )
            with gr.Column(scale=1):
                decode_model = gr.Dropdown(
                    label="Select Tokenizer",
                    choices=list(TOKENIZER_OPTIONS.keys()),
                    value="Qwen/Qwen2.5-7B"
                )
                decode_custom_model = gr.Textbox(
                    label="Or Enter Custom Model ID",
                    placeholder="e.g., facebook/bart-base"
                )
                skip_special = gr.Checkbox(label="Skip Special Tokens", value=False)
                decode_btn = gr.Button("Decode", variant="primary")

        decode_output = gr.Textbox(label="Decoded Result", lines=10)

        decode_btn.click(
            fn=decode_tokens,
            inputs=[decode_input, decode_model, skip_special, decode_custom_model],
            outputs=decode_output
        )

    with gr.Tab("📊 Compare"):
        compare_input = gr.Textbox(
            label="Input Text",
            placeholder="Enter text to compare tokenization across models...",
            lines=5
        )

        compare_models = gr.CheckboxGroup(
            label="Select Models to Compare",
            choices=list(TOKENIZER_OPTIONS.keys()),
            value=["Qwen/Qwen2.5-7B", "meta-llama/Llama-3.1-8B", "openai-community/gpt2"]
        )

        compare_add_special = gr.Checkbox(label="Add Special Tokens", value=True)
        compare_btn = gr.Button("Compare Tokenizers", variant="primary")

        compare_output = gr.Markdown()

        compare_btn.click(
            fn=compare_tokenizers,
            inputs=[compare_input, compare_models, compare_add_special],
            outputs=compare_output
        )

    with gr.Tab("📖 Vocabulary"):
        with gr.Row():
            vocab_model = gr.Dropdown(
                label="Select Tokenizer",
                choices=list(TOKENIZER_OPTIONS.keys()),
                value="Qwen/Qwen2.5-7B"
            )
            vocab_custom_model = gr.Textbox(
                label="Or Enter Custom Model ID",
                placeholder="e.g., facebook/bart-base"
            )
            vocab_btn = gr.Button("Analyze Vocabulary", variant="primary")

        vocab_output = gr.Markdown()

        vocab_btn.click(
            fn=analyze_vocabulary,
            inputs=[vocab_model, vocab_custom_model],
            outputs=vocab_output
        )

    with gr.Tab("ℹ️ About"):
        gr.Markdown("""
        ## About This Tool

        This tokenizer playground provides researchers and developers with an easy way to experiment
        with various tokenizers from the Hugging Face Model Hub.

        ### Supported Models

        **Qwen Series:** Qwen 2.5, Qwen 2, Qwen 1 (various sizes)

        **Llama Series:** Llama 3.2, Llama 3.1, Llama 2 (various sizes)

        **Other Popular Models:** GPT-2, Gemma, Mistral, Mixtral, DeepSeek, Phi, Yi, T5, BERT, GPT-NeoX, BLOOM, OPT, StableLM

        ### Custom Models

        You can use any tokenizer from the Hugging Face Hub by entering its model ID in the "Custom Model ID" field.
        For example:
        - `facebook/bart-base`
        - `EleutherAI/gpt-j-6b`
        - `bigscience/bloom`

        ### Features Explanation

        - **Tokenize:** Convert text into tokens and token IDs
        - **Detokenize:** Convert token IDs back to text
        - **Compare:** See how different tokenizers handle the same text
        - **Vocabulary:** Explore tokenizer vocabulary and special tokens

        ### Tips

        1. Different tokenizers can produce very different token counts for the same text
        2. Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
        3. Subword tokenization (used by most modern models) allows handling of out-of-vocabulary words
        4. Token efficiency affects model performance and API costs

        ### Resources

        - [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer)
        - [Understanding Tokenization](https://huggingface.co/docs/transformers/tokenizer_summary)
        - [Model Hub](https://huggingface.co/models)

        ---

        Made with ❤️ for the NLP research community
        """)

# Launch the app
if __name__ == "__main__":
    app.launch()
requirements.txt ADDED
@@ -0,0 +1,8 @@
gradio==4.44.1
transformers==4.46.0
torch==2.5.0
sentencepiece==0.2.0
protobuf==5.28.2
tokenizers==0.20.1
huggingface-hub==0.26.0
tiktoken==0.8.0
test_tokenizer.py ADDED
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""
Simple test script to verify tokenizer functionality.
This tests the core functions without launching the Gradio interface.
"""

import sys

# Test imports
try:
    from transformers import AutoTokenizer
    print("✓ transformers imported successfully")
except ImportError as e:
    print(f"✗ Failed to import transformers: {e}")
    sys.exit(1)

try:
    import gradio as gr
    print("✓ gradio imported successfully")
except ImportError as e:
    print(f"✗ Failed to import gradio: {e}")
    sys.exit(1)

def test_basic_tokenization():
    """Test basic tokenization with a small model."""
    print("\n--- Testing Basic Tokenization ---")
    try:
        # Use GPT-2 as it's small and commonly available
        model_id = "openai-community/gpt2"
        text = "Hello, world! This is a test."

        print(f"Loading tokenizer: {model_id}")
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        print("✓ Tokenizer loaded successfully")

        # Test encoding
        encoded = tokenizer.encode(text)
        print(f"✓ Text encoded: {encoded[:10]}...")  # Show the first 10 token IDs

        # Test decoding
        decoded = tokenizer.decode(encoded)
        print(f"✓ Text decoded: {decoded}")

        # Verify round-trip
        assert decoded == text, "Round-trip tokenization failed"
        print("✓ Round-trip tokenization successful")

        # Test token conversion
        tokens = tokenizer.convert_ids_to_tokens(encoded)
        print(f"✓ Tokens: {tokens[:5]}...")  # Show the first 5 tokens

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        return False

def test_special_tokens():
    """Test special token handling."""
    print("\n--- Testing Special Tokens ---")
    try:
        model_id = "openai-community/gpt2"
        text = "Test text"

        tokenizer = AutoTokenizer.from_pretrained(model_id)

        # With special tokens
        encoded_with = tokenizer.encode(text, add_special_tokens=True)
        # Without special tokens
        encoded_without = tokenizer.encode(text, add_special_tokens=False)

        print(f"✓ With special tokens: {len(encoded_with)} tokens")
        print(f"✓ Without special tokens: {len(encoded_without)} tokens")

        # Decode with and without special tokens
        decoded_with = tokenizer.decode(encoded_with, skip_special_tokens=False)
        decoded_without = tokenizer.decode(encoded_with, skip_special_tokens=True)

        print(f"✓ Decoded with special: {decoded_with}")
        print(f"✓ Decoded without special: {decoded_without}")

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        return False

def test_app_functions():
    """Test the main app functions."""
    print("\n--- Testing App Functions ---")
    try:
        # Import app functions
        from app import tokenize_text, decode_tokens, analyze_vocabulary

        # Test tokenize_text
        print("Testing tokenize_text function...")
        result = tokenize_text(
            text="Hello world",
            model_id="openai-community/gpt2",
            add_special_tokens=True,
            show_special_tokens=True,
            custom_model_id=None
        )
        assert len(result) == 5, "tokenize_text should return 5 values"
        print("✓ tokenize_text function works")

        # Test decode_tokens
        print("Testing decode_tokens function...")
        decode_result = decode_tokens(
            token_ids_str="[15496, 11, 995]",  # "Hello, world" in GPT-2
            model_id="openai-community/gpt2",
            skip_special_tokens=False,
            custom_model_id=None
        )
        assert "Decoded Text:" in decode_result, "decode_tokens should return decoded text"
        print("✓ decode_tokens function works")

        # Test analyze_vocabulary
        print("Testing analyze_vocabulary function...")
        vocab_result = analyze_vocabulary(
            model_id="openai-community/gpt2",
            custom_model_id=None
        )
        assert "Vocabulary Size:" in vocab_result, "analyze_vocabulary should return vocabulary info"
        print("✓ analyze_vocabulary function works")

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def main():
    """Run all tests."""
    print("=" * 50)
    print("Tokenizer Playground Test Suite")
    print("=" * 50)

    tests = [
        test_basic_tokenization,
        test_special_tokens,
        test_app_functions
    ]

    results = []
    for test in tests:
        results.append(test())

    print("\n" + "=" * 50)
    print("Test Summary")
    print("=" * 50)
    passed = sum(results)
    total = len(results)
    print(f"Passed: {passed}/{total}")

    if passed == total:
        print("✅ All tests passed!")
        return 0
    else:
        print("❌ Some tests failed")
        return 1

if __name__ == "__main__":
    sys.exit(main())