# NexForge Tokenizer Examples

This directory contains example scripts demonstrating advanced usage of the NexForge tokenizer.

## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Build a tokenizer with the default settings (40k vocab, min_frequency=2)
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
### Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection
2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing
   - Performance optimization
   - Error handling
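The batch-processing step amounts to gathering a bounded set of input files before handing them to the tokenizer builder. A minimal, standard-library sketch of that idea (the `gather_files` helper, the glob pattern, and the file names are illustrative, not part of the actual script):

```python
import tempfile
from pathlib import Path

def gather_files(input_dir, pattern="*.txt", max_files=1000):
    """Collect up to max_files matching files, sorted for reproducibility."""
    files = sorted(Path(input_dir).rglob(pattern))
    return files[:max_files]

# Demo with a throwaway directory containing three small files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.txt"):
        (Path(tmp) / name).write_text("sample text")
    batch = gather_files(tmp, max_files=2)
    print([f.name for f in batch])  # → ['a.txt', 'b.txt']
```

Sorting before truncating keeps runs reproducible even when the filesystem returns entries in arbitrary order.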
## Running Examples

```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```
## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Build a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,  # Smaller vocabulary for a specific domain
    min_frequency=3,   # Only include tokens appearing at least 3 times
    max_files=1000,    # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
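The output file is plain JSON, so a build can be sanity-checked without extra dependencies. The structure below (a top-level `model.vocab` mapping, as in the Hugging Face `tokenizers` JSON format) is an assumption about the file layout, not a documented guarantee; the snippet writes a tiny stand-in file so it is self-contained:

```python
import json
import tempfile
from pathlib import Path

# Stand-in tokenizer file; in practice point this at my_tokenizer.json.
# The "model"/"vocab" layout is an assumed format, not guaranteed.
sample = {"model": {"vocab": {"[PAD]": 0, "[UNK]": 1, "the": 2, "token": 3}}}
path = Path(tempfile.gettempdir()) / "demo_tokenizer.json"
path.write_text(json.dumps(sample))

data = json.loads(path.read_text())
vocab = data["model"]["vocab"]
print(len(vocab))        # → 4 (vocabulary size)
print("[UNK]" in vocab)  # → True (special token present)
```

Checking that the special tokens landed in the vocabulary is a quick way to catch a misconfigured `special_tokens` argument.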
## Best Practices

1. **For General Use**
   - Use the default settings (40k vocab, min_frequency=2)
   - Process all files in your dataset
   - Test with the built-in test suite
2. **For Specialized Domains**
   - Adjust the vocabulary size to match domain complexity
   - Raise min_frequency to produce a smaller vocabulary
   - Test with domain-specific files
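The effect of `min_frequency` on vocabulary size can be estimated before a full build by counting token occurrences in a sample of your corpus. A rough sketch using whitespace tokens (the real tokenizer's segmentation will differ, so treat the numbers as a ballpark):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept"
counts = Counter(corpus.split())

def vocab_at(min_frequency):
    """Number of distinct tokens kept at a given frequency threshold."""
    return sum(1 for c in counts.values() if c >= min_frequency)

print(vocab_at(1))  # → 6 distinct tokens
print(vocab_at(2))  # → 2 ("the" appears 3x, "cat" 2x)
```

Plotting `vocab_at` over a few thresholds on a real sample shows how aggressively each `min_frequency` value prunes rare tokens.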
## Need Help?

- Check the [main README](../README.md) for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - See [LICENSE](../LICENSE) for details.