afeng committed commit af99c46 · 1 parent: 9093377

add tokenizer

Files changed (5):
  1. DEPLOYMENT.md +191 -0
  2. README.md +110 -8
  3. app.py +467 -0
  4. requirements.txt +8 -0
  5. test_tokenizer.py +165 -0
DEPLOYMENT.md ADDED
@@ -0,0 +1,191 @@
# Deployment Instructions for Hugging Face Spaces

This guide will help you deploy the Tokenizer Playground to Hugging Face Spaces.

## Prerequisites

1. A Hugging Face account (create one at https://huggingface.co/join)
2. Git installed on your local machine
3. (Optional) Hugging Face CLI installed: `pip install huggingface-hub`

## Step 1: Create a New Space

1. Go to https://huggingface.co/spaces
2. Click on "Create new Space"
3. Fill in the following:
   - **Space name**: Choose a unique name (e.g., "tokenizer-playground")
   - **Select the Space SDK**: Choose **Gradio**
   - **Select the Space hardware**: Start with **CPU basic** (free tier)
   - **Repo type**: Public or Private (your choice)
4. Click "Create Space"

## Step 2: Clone Your Space Repository

After creating the space, you'll be redirected to your space page. Clone the repository:

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME
```

## Step 3: Add the Application Files

Copy all the files from this project to your Space repository:

```bash
# Copy the application files
cp path/to/tokenizer/app.py .
cp path/to/tokenizer/requirements.txt .
cp path/to/tokenizer/README.md .
cp path/to/tokenizer/.gitignore .
```

## Step 4: Commit and Push

```bash
git add .
git commit -m "Initial deployment of Tokenizer Playground"
git push
```

## Step 5: Monitor the Build

1. Go to your Space URL: https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
2. Click on the "Files" tab to verify all files are uploaded
3. Click on the "Logs" tab to monitor the build process
4. The space will automatically build and deploy

## Step 6: (Optional) Configure Settings

### Secrets and Environment Variables

If you want to use private models or add API keys:

1. Go to your Space settings
2. Add secrets under "Repository secrets"
3. Access them in your code using `os.environ['SECRET_NAME']`
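Reading a secret in code can look like the following minimal sketch. The secret name `HF_TOKEN` is just an example key; `os.environ.get` is used instead of indexing so the app degrades gracefully when the secret was never configured:

```python
import os

# Read an optional secret; returns None if it was not configured in the
# Space settings. "HF_TOKEN" is an example name -- use whatever key you defined.
hf_token = os.environ.get("HF_TOKEN")

if hf_token is None:
    print("No HF_TOKEN secret set; private models will not be accessible.")
else:
    print("HF_TOKEN found; it can be passed to the tokenizer loader.")
```
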

### Hardware Upgrade

For better performance:

1. Go to Settings → Hardware
2. Select a GPU tier (T4 small, T4 medium, A10G small, etc.)
3. Note: GPU tiers are paid options

### Persistent Storage

For caching tokenizers:

1. Go to Settings → Persistent storage
2. Enable persistent storage (paid feature)
3. This will cache downloaded models between restarts

## Troubleshooting

### Common Issues

1. **Build fails with dependency errors**
   - Check that all packages in requirements.txt are compatible
   - Try pinning specific versions if conflicts occur

2. **Space crashes on startup**
   - Check the logs for error messages
   - Ensure the app.py file has `app.launch()` at the end
   - Verify Python syntax is correct

3. **Models fail to load**
   - Some models require authentication
   - Add your HF token as a secret if needed
   - Some models might be too large for the free tier

4. **Slow performance**
   - Consider upgrading to GPU hardware
   - Enable persistent storage to cache models
   - Reduce the number of pre-loaded models

### Resource Limits

**Free Tier (CPU basic):**
- 2 vCPU
- 16 GB RAM
- No GPU
- Limited concurrent users

**Recommendations for Production:**
- Use T4 small or medium for a good balance of cost and performance
- Enable persistent storage to avoid re-downloading models
- Consider implementing request queuing for high traffic

## Local Testing Before Deployment

Always test locally before deploying:

```bash
# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

# Test in browser at http://localhost:7860
```

## Updating Your Space

To update your deployed Space:

```bash
# Make changes to your files
git add .
git commit -m "Update: description of changes"
git push
```

The Space will automatically rebuild and redeploy.

## Using the Hugging Face CLI (Alternative Method)

If you have the Hugging Face CLI installed:

```bash
# Login to Hugging Face
huggingface-cli login

# Upload files directly
huggingface-cli upload YOUR_USERNAME/YOUR_SPACE_NAME . . --repo-type=space
```

## Performance Optimization Tips

1. **Lazy Loading**: The app already implements tokenizer caching
2. **Model Selection**: Start with smaller models for testing
3. **Batch Processing**: The compare feature processes models efficiently
4. **Error Handling**: Comprehensive error handling is implemented
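The lazy-loading pattern mentioned in item 1 is essentially a dict keyed by model ID. A minimal sketch of the idea, where `load_fn` is a stand-in for the real (slow, network-bound) loader such as `AutoTokenizer.from_pretrained`:

```python
# Minimal lazy-loading cache, mirroring the pattern app.py uses for tokenizers.
_cache = {}

def load_cached(model_id, load_fn):
    # Only hit the (slow) loader the first time each model ID is requested.
    if model_id not in _cache:
        _cache[model_id] = load_fn(model_id)
    return _cache[model_id]

# Usage with a fake loader that records how often it actually runs:
calls = []
fake_loader = lambda mid: calls.append(mid) or f"tokenizer-for-{mid}"

load_cached("openai-community/gpt2", fake_loader)
load_cached("openai-community/gpt2", fake_loader)  # served from the cache
print(len(calls))  # the loader ran only once
```
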

## Security Considerations

1. **Never commit secrets**: Use environment variables for sensitive data
2. **Model Access**: Some models require authentication tokens
3. **Input Validation**: The app validates all inputs
4. **Rate Limiting**: Consider implementing rate limiting for production

## Support

- For Space-specific issues: https://huggingface.co/docs/hub/spaces
- For Gradio issues: https://gradio.app/docs
- For tokenizer issues: https://huggingface.co/docs/transformers/main_classes/tokenizer

## Next Steps

After successful deployment:

1. Share your Space URL with colleagues
2. Embed the Space in websites using the embed feature
3. Monitor usage in the Analytics tab
4. Collect feedback and iterate on features
5. Consider adding more tokenizers based on user needs

---

Good luck with your deployment! The Tokenizer Playground should provide a valuable tool for the NLP research community.
README.md CHANGED
@@ -1,14 +1,116 @@
---
title: Tokenizers
emoji: 🌖
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: 'a collection of tokenizers '
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
---
title: Tokenizer Playground
emoji: 🔀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
- Qwen/Qwen2.5-7B
- meta-llama/Llama-3.1-8B
- openai-community/gpt2
- mistralai/Mistral-7B-v0.1
- google/gemma-7b
tags:
- tokenizer
- nlp
- text-processing
- research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

# 🔀 Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.

## Features

### 🔀 Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs

### 🔄 Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
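The accepted ID formats can be handled by a small parser along these lines (a sketch of the approach, not the app's exact code):

```python
import json

def parse_token_ids(raw: str) -> list:
    """Parse token IDs given as a JSON-style list, comma-separated, or space-separated."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        return [int(x) for x in json.loads(raw)]
    # Normalize commas to spaces, then split on whitespace.
    return [int(x) for x in raw.replace(",", " ").split()]

print(parse_token_ids("[15496, 11, 995]"))  # [15496, 11, 995]
print(parse_token_ids("15496, 11, 995"))    # [15496, 11, 995]
print(parse_token_ids("15496 11 995"))      # [15496, 11, 995]
```
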

### 📊 Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count

### 📖 Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary

## Supported Models

### Pre-configured Models
- **Qwen Series**: Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies

## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API costs
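As a rough illustration of why token counts vary, compare the two extremes that real subword tokenizers fall between — a character-level split and a whitespace word-level split (toy splitters only, not real tokenizers):

```python
text = "Tokenization strategies differ."

# Two toy "tokenizers": character-level vs. whitespace word-level.
# Real BPE/SentencePiece vocabularies land somewhere between these counts.
char_tokens = list(text)
word_tokens = text.split()

print(len(char_tokens))  # 31
print(len(word_tokens))  # 3
```
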

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860`.

## License

This project is licensed under the MIT License.
app.py ADDED
@@ -0,0 +1,467 @@
import gradio as gr
from transformers import AutoTokenizer
import json
import traceback
from typing import Optional, List, Tuple

# Popular tokenizer models
TOKENIZER_OPTIONS = {
    # Qwen Series
    "Qwen/Qwen2.5-7B": "Qwen 2.5 (7B)",
    "Qwen/Qwen2.5-72B": "Qwen 2.5 (72B)",
    "Qwen/Qwen2-7B": "Qwen 2 (7B)",
    "Qwen/Qwen2-72B": "Qwen 2 (72B)",
    "Qwen/Qwen-7B": "Qwen 1 (7B)",

    # Llama Series
    "meta-llama/Llama-3.2-1B": "Llama 3.2 (1B)",
    "meta-llama/Llama-3.2-3B": "Llama 3.2 (3B)",
    "meta-llama/Llama-3.1-8B": "Llama 3.1 (8B)",
    "meta-llama/Llama-3.1-70B": "Llama 3.1 (70B)",
    "meta-llama/Llama-2-7b-hf": "Llama 2 (7B)",
    "meta-llama/Llama-2-13b-hf": "Llama 2 (13B)",
    "meta-llama/Llama-2-70b-hf": "Llama 2 (70B)",

    # Other Popular Models
    "openai-community/gpt2": "GPT-2",
    "google/gemma-2b": "Gemma (2B)",
    "google/gemma-7b": "Gemma (7B)",
    "mistralai/Mistral-7B-v0.1": "Mistral (7B)",
    "mistralai/Mixtral-8x7B-v0.1": "Mixtral (8x7B)",
    "deepseek-ai/deepseek-coder-6.7b-base": "DeepSeek Coder (6.7B)",
    "microsoft/phi-2": "Phi-2",
    "microsoft/phi-3-mini-4k-instruct": "Phi-3 Mini",
    "01-ai/Yi-6B": "Yi (6B)",
    "01-ai/Yi-34B": "Yi (34B)",
    "google-t5/t5-base": "T5 Base",
    "google-bert/bert-base-uncased": "BERT Base (uncased)",
    "google-bert/bert-base-cased": "BERT Base (cased)",
    "EleutherAI/gpt-neox-20b": "GPT-NeoX (20B)",
    "bigscience/bloom-560m": "BLOOM (560M)",
    "facebook/opt-350m": "OPT (350M)",
    "stabilityai/stablelm-base-alpha-7b": "StableLM (7B)",
}

# Cache for loaded tokenizers
tokenizer_cache = {}

def load_tokenizer(model_id: str):
    """Load a tokenizer with caching."""
    if model_id not in tokenizer_cache:
        try:
            tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(
                model_id,
                trust_remote_code=True,
                use_fast=True  # Use the fast (Rust) tokenizer when available
            )
        except Exception as e:
            # Fall back to the slow (Python) tokenizer if the fast one is unavailable
            try:
                tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=True,
                    use_fast=False
                )
            except Exception:
                raise e
    return tokenizer_cache[model_id]

def tokenize_text(
    text: str,
    model_id: str,
    add_special_tokens: bool = True,
    show_special_tokens: bool = True,
    custom_model_id: Optional[str] = None
) -> Tuple[str, str, str, str, str]:
    """
    Tokenize text using the selected tokenizer.

    Returns:
        Tuple of (tokens_json, token_ids, decoded_text, token_info_json, stats)
    """
    try:
        # Use the custom model ID if provided
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "", "", "", "", "Please select or enter a tokenizer model."

        if not text:
            return "", "", "", "", "Please enter some text to tokenize."

        # Load tokenizer
        tokenizer = load_tokenizer(actual_model_id)

        # Tokenize
        encoded = tokenizer.encode(text, add_special_tokens=add_special_tokens)
        tokens = tokenizer.convert_ids_to_tokens(encoded)

        # Decode
        decoded = tokenizer.decode(encoded, skip_special_tokens=not show_special_tokens)

        # Create detailed token information
        special_ids = tokenizer.all_special_ids if hasattr(tokenizer, 'all_special_ids') else []
        token_info = []
        for i, (token, token_id) in enumerate(zip(tokens, encoded)):
            # Try to get the actual string representation of the token
            try:
                token_str = tokenizer.convert_tokens_to_string([token])
            except Exception:
                token_str = token

            token_info.append({
                "index": i,
                "token": token,
                "token_id": token_id,
                "text": token_str,
                "is_special": token_id in special_ids
            })

        # Format outputs
        tokens_display = json.dumps(tokens, ensure_ascii=False, indent=2)
        token_ids_display = str(encoded)
        token_info_json = json.dumps(token_info, ensure_ascii=False, indent=2)

        # Statistics (guard against an empty token list)
        token_count = max(len(tokens), 1)
        stats = f"""Statistics:
• Model: {actual_model_id}
• Number of tokens: {len(tokens)}
• Number of characters: {len(text)}
• Tokens per character: {len(tokens)/len(text):.2f}
• Characters per token: {len(text)/token_count:.2f}
• Vocabulary size: {tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else 'N/A'}
• Special tokens: {', '.join(tokenizer.all_special_tokens) if hasattr(tokenizer, 'all_special_tokens') else 'N/A'}"""

        return tokens_display, token_ids_display, decoded, token_info_json, stats

    except Exception as e:
        error_msg = f"Error: {str(e)}\n{traceback.format_exc()}"
        return error_msg, "", "", "", ""

def decode_tokens(
    token_ids_str: str,
    model_id: str,
    skip_special_tokens: bool = False,
    custom_model_id: Optional[str] = None
) -> str:
    """Decode token IDs back to text."""
    try:
        # Use the custom model ID if provided
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "Please select or enter a tokenizer model."

        # Parse token IDs
        token_ids_str = token_ids_str.strip()
        if token_ids_str.startswith('[') and token_ids_str.endswith(']'):
            token_ids = json.loads(token_ids_str)
        else:
            # Try to parse as comma- or space-separated values
            token_ids = [int(x.strip()) for x in token_ids_str.replace(',', ' ').split()]

        # Load tokenizer and decode
        tokenizer = load_tokenizer(actual_model_id)
        decoded = tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)

        # Also show tokens
        tokens = tokenizer.convert_ids_to_tokens(token_ids)

        result = f"""Decoded Text:
{decoded}

Tokens:
{json.dumps(tokens, ensure_ascii=False, indent=2)}

Token Count: {len(tokens)}"""

        return result

    except Exception as e:
        return f"Error: {str(e)}\n{traceback.format_exc()}"

def compare_tokenizers(
    text: str,
    model_ids: List[str],
    add_special_tokens: bool = True
) -> str:
    """Compare tokenization across multiple models."""
    if not model_ids:
        return "Please select at least one model to compare."
    if not text:
        return "Please enter some text to compare."

    results = []

    for model_id in model_ids:
        try:
            tokenizer = load_tokenizer(model_id)
            encoded = tokenizer.encode(text, add_special_tokens=add_special_tokens)
            tokens = tokenizer.convert_ids_to_tokens(encoded)

            results.append({
                "model": model_id,
                "token_count": len(tokens),
                "tokens": tokens[:50],  # Show the first 50 tokens
                "token_ids": encoded[:50]  # Show the first 50 IDs
            })
        except Exception as e:
            results.append({
                "model": model_id,
                "error": str(e)
            })

    # Sort by token count (models that errored sort last)
    results.sort(key=lambda x: x.get("token_count", float('inf')))

    # Format output
    output = "# Tokenizer Comparison\n\n"
    output += f"Input text length: {len(text)} characters\n\n"

    for result in results:
        if "error" in result:
            output += f"## {result['model']}\n"
            output += f"Error: {result['error']}\n\n"
        else:
            output += f"## {result['model']}\n"
            output += f"**Token count:** {result['token_count']} "
            output += f"(ratio: {result['token_count']/len(text):.2f} tokens/char)\n\n"
            output += f"**First tokens:** {result['tokens']}\n\n"
            if len(result['tokens']) == 50:
                output += "*(showing first 50 tokens)*\n\n"

    return output

def analyze_vocabulary(model_id: str, custom_model_id: Optional[str] = None) -> str:
    """Analyze tokenizer vocabulary."""
    try:
        actual_model_id = custom_model_id.strip() if custom_model_id and custom_model_id.strip() else model_id

        if not actual_model_id:
            return "Please select or enter a tokenizer model."

        tokenizer = load_tokenizer(actual_model_id)

        # Get vocabulary information
        vocab_size = tokenizer.vocab_size if hasattr(tokenizer, 'vocab_size') else len(tokenizer.get_vocab())

        # Get special tokens
        special_tokens = {}
        if hasattr(tokenizer, 'special_tokens_map'):
            special_tokens = tokenizer.special_tokens_map

        # Get some example tokens
        vocab = tokenizer.get_vocab()
        sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])[:100]  # First 100 tokens

        output = f"""# Tokenizer Vocabulary Analysis

**Model:** {actual_model_id}
**Vocabulary Size:** {vocab_size:,}
**Tokenizer Type:** {tokenizer.__class__.__name__}

## Special Tokens
```json
{json.dumps(special_tokens, ensure_ascii=False, indent=2)}
```

## Token Settings
• Padding Token: {tokenizer.pad_token if tokenizer.pad_token else 'None'}
• BOS Token: {tokenizer.bos_token if tokenizer.bos_token else 'None'}
• EOS Token: {tokenizer.eos_token if tokenizer.eos_token else 'None'}
• UNK Token: {tokenizer.unk_token if tokenizer.unk_token else 'None'}
• SEP Token: {tokenizer.sep_token if hasattr(tokenizer, 'sep_token') and tokenizer.sep_token else 'None'}
• CLS Token: {tokenizer.cls_token if hasattr(tokenizer, 'cls_token') and tokenizer.cls_token else 'None'}
• Mask Token: {tokenizer.mask_token if hasattr(tokenizer, 'mask_token') and tokenizer.mask_token else 'None'}

## First 100 Tokens in Vocabulary
Token → ID
"""
        for token, token_id in sorted_vocab:
            # Escape unprintable characters for display
            display_token = repr(token) if not token.isprintable() else token
            output += f"{display_token} → {token_id}\n"

        return output

    except Exception as e:
        return f"Error: {str(e)}\n{traceback.format_exc()}"

# Create the Gradio interface
with gr.Blocks(title="🤗 Tokenizer Playground", theme=gr.themes.Soft()) as app:
    gr.Markdown("""
    # 🤗 Tokenizer Playground

    A comprehensive tool for NLP researchers to experiment with various Hugging Face tokenizers.
    Supports popular models including **Qwen**, **Llama**, **Mistral**, **GPT**, and many more.

    ### Features:
    - 🔀 **Tokenize & Detokenize** text with any Hugging Face tokenizer
    - 📊 **Compare** tokenization across multiple models
    - 📖 **Analyze** vocabulary and special tokens
    - 🎯 **Support** for custom model IDs from the Hugging Face Hub
    """)

    with gr.Tab("🔀 Tokenize"):
        with gr.Row():
            with gr.Column(scale=3):
                tokenize_input = gr.Textbox(
                    label="Input Text",
                    placeholder="Enter text to tokenize...",
                    lines=5
                )
            with gr.Column(scale=1):
                tokenize_model = gr.Dropdown(
                    label="Select Tokenizer",
                    choices=list(TOKENIZER_OPTIONS.keys()),
                    value="Qwen/Qwen2.5-7B",
                    allow_custom_value=False
                )
                tokenize_custom_model = gr.Textbox(
                    label="Or Enter Custom Model ID",
                    placeholder="e.g., facebook/bart-base",
                    info="Overrides the selection above with any HF model"
                )
                add_special = gr.Checkbox(label="Add Special Tokens", value=True)
                show_special = gr.Checkbox(label="Show Special Tokens in Decoded", value=True)
                tokenize_btn = gr.Button("Tokenize", variant="primary")

        with gr.Row():
            with gr.Column():
                tokens_output = gr.Textbox(label="Tokens", lines=10, max_lines=20)
            with gr.Column():
                token_ids_output = gr.Textbox(label="Token IDs", lines=10, max_lines=20)

        with gr.Row():
            with gr.Column():
                decoded_output = gr.Textbox(label="Decoded Text (Verification)", lines=5)
            with gr.Column():
                token_info_output = gr.Textbox(label="Detailed Token Information", lines=10, max_lines=20)

        stats_output = gr.Textbox(label="Statistics", lines=7)

        tokenize_btn.click(
            fn=tokenize_text,
            inputs=[tokenize_input, tokenize_model, add_special, show_special, tokenize_custom_model],
            outputs=[tokens_output, token_ids_output, decoded_output, token_info_output, stats_output]
        )

    with gr.Tab("🔄 Detokenize"):
        with gr.Row():
            with gr.Column(scale=3):
                decode_input = gr.Textbox(
                    label="Token IDs",
                    placeholder="Enter token IDs as a list [101, 2023, ...] or space/comma separated",
                    lines=5
                )
            with gr.Column(scale=1):
                decode_model = gr.Dropdown(
                    label="Select Tokenizer",
                    choices=list(TOKENIZER_OPTIONS.keys()),
                    value="Qwen/Qwen2.5-7B"
                )
                decode_custom_model = gr.Textbox(
                    label="Or Enter Custom Model ID",
                    placeholder="e.g., facebook/bart-base"
                )
                skip_special = gr.Checkbox(label="Skip Special Tokens", value=False)
                decode_btn = gr.Button("Decode", variant="primary")

        decode_output = gr.Textbox(label="Decoded Result", lines=10)

        decode_btn.click(
            fn=decode_tokens,
            inputs=[decode_input, decode_model, skip_special, decode_custom_model],
            outputs=decode_output
        )

    with gr.Tab("📊 Compare"):
        compare_input = gr.Textbox(
            label="Input Text",
            placeholder="Enter text to compare tokenization across models...",
            lines=5
        )

        compare_models = gr.CheckboxGroup(
            label="Select Models to Compare",
            choices=list(TOKENIZER_OPTIONS.keys()),
            value=["Qwen/Qwen2.5-7B", "meta-llama/Llama-3.1-8B", "openai-community/gpt2"]
        )

        compare_add_special = gr.Checkbox(label="Add Special Tokens", value=True)
        compare_btn = gr.Button("Compare Tokenizers", variant="primary")

        compare_output = gr.Markdown()

        compare_btn.click(
            fn=compare_tokenizers,
            inputs=[compare_input, compare_models, compare_add_special],
            outputs=compare_output
        )

    with gr.Tab("📖 Vocabulary"):
        with gr.Row():
            vocab_model = gr.Dropdown(
                label="Select Tokenizer",
                choices=list(TOKENIZER_OPTIONS.keys()),
                value="Qwen/Qwen2.5-7B"
            )
            vocab_custom_model = gr.Textbox(
                label="Or Enter Custom Model ID",
                placeholder="e.g., facebook/bart-base"
            )
            vocab_btn = gr.Button("Analyze Vocabulary", variant="primary")

        vocab_output = gr.Markdown()

        vocab_btn.click(
            fn=analyze_vocabulary,
            inputs=[vocab_model, vocab_custom_model],
            outputs=vocab_output
        )

    with gr.Tab("ℹ️ About"):
        gr.Markdown("""
        ## About This Tool

        This tokenizer playground provides researchers and developers with an easy way to experiment
        with various tokenizers from the Hugging Face Model Hub.

        ### Supported Models

        **Qwen Series:** Qwen 2.5, Qwen 2, Qwen 1 (various sizes)

        **Llama Series:** Llama 3.2, Llama 3.1, Llama 2 (various sizes)

        **Other Popular Models:** GPT-2, Gemma, Mistral, Mixtral, DeepSeek, Phi, Yi, T5, BERT, GPT-NeoX, BLOOM, OPT, StableLM

        ### Custom Models

        You can use any tokenizer from the Hugging Face Hub by entering its model ID in the "Custom Model ID" field.
        For example:
        - `facebook/bart-base`
        - `EleutherAI/gpt-j-6b`
        - `bigscience/bloom`

        ### Features Explanation

        - **Tokenize:** Convert text into tokens and token IDs
        - **Detokenize:** Convert token IDs back to text
        - **Compare:** See how different tokenizers handle the same text
        - **Vocabulary:** Explore tokenizer vocabulary and special tokens

        ### Tips

        1. Different tokenizers can produce very different token counts for the same text
        2. Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
        3. Subword tokenization (used by most modern models) allows handling of out-of-vocabulary words
        4. Token efficiency affects model performance and API costs

        ### Resources

        - [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer)
        - [Understanding Tokenization](https://huggingface.co/docs/transformers/tokenizer_summary)
        - [Model Hub](https://huggingface.co/models)

        ---

        Made with ❤️ for the NLP research community
        """)

# Launch the app
if __name__ == "__main__":
    app.launch()
requirements.txt ADDED
@@ -0,0 +1,8 @@
gradio==4.44.1
transformers==4.46.0
torch==2.5.0
sentencepiece==0.2.0
protobuf==5.28.2
tokenizers==0.20.1
huggingface-hub==0.26.0
tiktoken==0.8.0
test_tokenizer.py ADDED
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""
Simple test script to verify tokenizer functionality.
This tests the core functions without launching the Gradio interface.
"""

import sys

# Test imports
try:
    from transformers import AutoTokenizer
    print("✓ transformers imported successfully")
except ImportError as e:
    print(f"✗ Failed to import transformers: {e}")
    sys.exit(1)

try:
    import gradio as gr
    print("✓ gradio imported successfully")
except ImportError as e:
    print(f"✗ Failed to import gradio: {e}")
    sys.exit(1)

def test_basic_tokenization():
    """Test basic tokenization with a small model."""
    print("\n--- Testing Basic Tokenization ---")
    try:
        # Use GPT-2 as it's small and commonly available
        model_id = "openai-community/gpt2"
        text = "Hello, world! This is a test."

        print(f"Loading tokenizer: {model_id}")
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        print("✓ Tokenizer loaded successfully")

        # Test encoding
        encoded = tokenizer.encode(text)
        print(f"✓ Text encoded: {encoded[:10]}...")  # Show the first 10 token IDs

        # Test decoding
        decoded = tokenizer.decode(encoded)
        print(f"✓ Text decoded: {decoded}")

        # Verify round-trip
        assert decoded == text, "Round-trip tokenization failed"
        print("✓ Round-trip tokenization successful")

        # Test token conversion
        tokens = tokenizer.convert_ids_to_tokens(encoded)
        print(f"✓ Tokens: {tokens[:5]}...")  # Show the first 5 tokens

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        return False

def test_special_tokens():
    """Test special token handling."""
    print("\n--- Testing Special Tokens ---")
    try:
        model_id = "openai-community/gpt2"
        text = "Test text"

        tokenizer = AutoTokenizer.from_pretrained(model_id)

        # With special tokens
        encoded_with = tokenizer.encode(text, add_special_tokens=True)
        # Without special tokens
        encoded_without = tokenizer.encode(text, add_special_tokens=False)

        print(f"✓ With special tokens: {len(encoded_with)} tokens")
        print(f"✓ Without special tokens: {len(encoded_without)} tokens")

        # Decode with and without special tokens
        decoded_with = tokenizer.decode(encoded_with, skip_special_tokens=False)
        decoded_without = tokenizer.decode(encoded_with, skip_special_tokens=True)

        print(f"✓ Decoded with special: {decoded_with}")
        print(f"✓ Decoded without special: {decoded_without}")

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        return False

def test_app_functions():
    """Test the main app functions."""
    print("\n--- Testing App Functions ---")
    try:
        # Import app functions
        from app import tokenize_text, decode_tokens, analyze_vocabulary

        # Test tokenize_text
        print("Testing tokenize_text function...")
        result = tokenize_text(
            text="Hello world",
            model_id="openai-community/gpt2",
            add_special_tokens=True,
            show_special_tokens=True,
            custom_model_id=None
        )
        assert len(result) == 5, "tokenize_text should return 5 values"
        print("✓ tokenize_text function works")

        # Test decode_tokens
        print("Testing decode_tokens function...")
        decode_result = decode_tokens(
            token_ids_str="[15496, 11, 995]",  # "Hello, world" in GPT-2
            model_id="openai-community/gpt2",
            skip_special_tokens=False,
            custom_model_id=None
        )
        assert "Decoded Text:" in decode_result, "decode_tokens should return decoded text"
        print("✓ decode_tokens function works")

        # Test analyze_vocabulary
        print("Testing analyze_vocabulary function...")
        vocab_result = analyze_vocabulary(
            model_id="openai-community/gpt2",
            custom_model_id=None
        )
        assert "Vocabulary Size:" in vocab_result, "analyze_vocabulary should return vocabulary info"
        print("✓ analyze_vocabulary function works")

        return True
    except Exception as e:
        print(f"✗ Test failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def main():
    """Run all tests."""
    print("=" * 50)
    print("Tokenizer Playground Test Suite")
    print("=" * 50)

    tests = [
        test_basic_tokenization,
        test_special_tokens,
        test_app_functions
    ]

    results = []
    for test in tests:
        results.append(test())

    print("\n" + "=" * 50)
    print("Test Summary")
    print("=" * 50)
    passed = sum(results)
    total = len(results)
    print(f"Passed: {passed}/{total}")

    if passed == total:
        print("✅ All tests passed!")
        return 0
    else:
        print("❌ Some tests failed")
        return 1

if __name__ == "__main__":
    sys.exit(main())