# NexForge Tokenizer Examples

This directory contains example scripts demonstrating usage of the NexForge tokenizer, from basic creation to advanced configuration.

## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with default settings
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```

For a sketch of loading the resulting file and running encode/decode, see the appendix at the end of this README.

### Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection

2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing
   - Performance optimization
   - Error handling

## Running Examples

```bash
# Install in development mode
pip install -e .

# Run the basic example
python examples/basic_usage.py

# Run the advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```

## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Create a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,    # Smaller vocabulary for a specific domain
    min_frequency=3,     # Only include tokens appearing at least 3 times
    max_files=1000,      # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```

## Best Practices

1. **For General Use**
   - Use the default settings (40k vocabulary, `min_frequency=2`)
   - Process all files in your dataset
   - Test with the built-in test suite

2. **For Specialized Domains**
   - Adjust the vocabulary size to match domain complexity
   - Increase `min_frequency` if you want a smaller, higher-signal vocabulary
   - Test with domain-specific files

## Need Help?

- Check the [main README](../README.md) for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - See [LICENSE](../LICENSE) for details.
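
## Appendix: Loading and Using a Generated Tokenizer

The examples above build and save a tokenizer but do not show the encode/decode round trip that `basic_usage.py` covers. The sketch below is a minimal illustration, and it assumes the generated JSON file is compatible with the Hugging Face `tokenizers` format; if NexForge ships its own loading API, prefer that instead.

```python
# Minimal sketch, assuming the output JSON is compatible with the
# Hugging Face `tokenizers` format (an assumption, not a NexForge API).
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Encode a string into surface tokens and vocabulary IDs
encoding = tokenizer.encode("def hello(): return 'world'")
print(encoding.tokens)  # the token strings
print(encoding.ids)     # the corresponding vocabulary IDs

# Decode the IDs back into text
print(tokenizer.decode(encoding.ids))

# Inspect the vocabulary size
print(tokenizer.get_vocab_size())
```

If the round trip drops or mangles characters, check how the tokenizer was built (for example, the `min_frequency` and `special_tokens` settings shown earlier) before suspecting the loading code.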