| Metadata-Version: 2.4
|
| Name: ez-tokenizer
|
| Version: 1.0.0
|
| Summary: High-performance tokenizer builder for code and text datasets with adaptive resource management
|
| Home-page: https://github.com/nexforge/nexforgetokenizer
|
| Author: NexForge Team
|
| Author-email: NexForge <jm.talbot@outlook.com>
|
| Maintainer-email: NexForge <jm.talbot@outlook.com>
|
| License: MIT with Company Restriction
|
| Classifier: Development Status :: 4 - Beta
|
| Classifier: Intended Audience :: Developers
|
| Classifier: Intended Audience :: Science/Research
|
| Classifier: License :: Other/Proprietary License
|
| Classifier: Programming Language :: Python :: 3.8
|
| Classifier: Programming Language :: Python :: 3.9
|
| Classifier: Programming Language :: Python :: 3.10
|
| Classifier: Programming Language :: Python :: 3.11
|
| Classifier: Programming Language :: Python :: 3.12
|
| Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
| Classifier: Topic :: Text Processing :: Linguistic
|
| Requires-Python: >=3.8
|
| Description-Content-Type: text/markdown
|
| License-File: LICENSE
|
| Requires-Dist: torch>=1.9.0
|
| Requires-Dist: tokenizers>=0.12.0
|
| Requires-Dist: tqdm>=4.62.0
|
| Requires-Dist: psutil>=5.9.0
|
| Requires-Dist: python-dateutil>=2.8.2
|
| Provides-Extra: dev
|
| Requires-Dist: pytest>=6.0; extra == "dev"
|
| Requires-Dist: pytest-cov>=2.12.1; extra == "dev"
|
| Requires-Dist: pytest-xdist>=2.4.0; extra == "dev"
|
| Requires-Dist: black>=21.7b0; extra == "dev"
|
| Requires-Dist: isort>=5.0.0; extra == "dev"
|
| Requires-Dist: mypy>=0.910; extra == "dev"
|
| Requires-Dist: pylint>=2.11.0; extra == "dev"
|
| Requires-Dist: pre-commit>=2.15.0; extra == "dev"
|
| Dynamic: author
|
| Dynamic: home-page
|
| Dynamic: license-file
|
| Dynamic: requires-python
|
|
|
|
|
| # EZ-Tokenizer
|
| A high-performance tool for creating custom tokenizers from your code or text datasets. Automatically adapts to your system resources while providing fine-grained control over tokenizer creation.
|
|
|
| > **Note**: This project was previously known as NexForge Tokenizer. All functionality remains the same, only the name has been updated to better reflect its ease of use and efficiency.
|
|
|
|
|
|
|
| EZ-Tokenizer is released under the MIT License with a company restriction clause. This means:
|
|
|
| - 🆓 **Free for everyone**: Individuals and small businesses can use EZ-Tokenizer for free
|
| - 🏢 **Commercial use**: Companies with more than 10 employees or $1M+ in annual revenue need a commercial license
|
| - 📝 **Full details**: See [LICENSE](LICENSE) for complete terms
|
|
|
|
|
| ## Quick Start (Windows)
|
| ### Prerequisites
|
| - Windows OS
|
| - Python 3.8 or higher installed
|
| - Administrator privileges
|
| - At least 4GB RAM (8GB+ recommended)
|
|
|
|
|
| ### Getting Started
|
| 1. **Download** the latest release or clone this repository
|
| 2. **Add your dataset**: Place training files in the `Dataset` directory
|
|    - Supported formats: `.txt`, `.py`, and other text files
|
|    - The system will process all compatible files in this directory
|
| 3. **Run as Administrator**: Right-click on `run_ez_tokenizer.bat` and select "Run as administrator"
|
| 4. **Follow the Menu**:
|
|    - Option 1: Install Dependencies (first time only)
|
|    - Option 2: Create Tokenizer (processes all files in Dataset directory)
|
|    - Option 3: Test Tokenizer (after creation)
|
|    - Option 4: Open Dataset Directory (to add/check files)
|
|    - Option 5: Exit
|
|
|
| ### Default Settings
|
| - **Vocabulary Size**: 50,000 tokens
|
| - **Minimum Frequency**: 2 (includes tokens appearing at least twice)
|
| - **File Processing**: All files in Dataset directory
|
| - **Output**: `output/ez_tokenizer.json`
|
| - **Test Results**: `test_result/test_run.txt`
|
|
|
| ### Advanced Options
|
| Customize tokenizer creation by running manually:
|
| ```bash
|
| python -m ez_tokenizer.adaptive_tokenizer [input_dir] [output_path] [vocab_size] [min_frequency] [max_files]
|
| ```
|
| Example:
|
| ```bash
|
| python -m ez_tokenizer.adaptive_tokenizer "Dataset" "output/custom_tokenizer.json" 50000 2 1000
|
| ```
|
|
|
| ---
|
|
|
| ## Advanced Usage (Manual Setup)
|
|
|
| For users who need more control or are using non-Windows systems:
|
|
|
|
|
| ### Key Features
|
| - **Adaptive Resource Management**: Automatically detects and utilizes available system resources (CPU, RAM, GPU)
|
| - **Progressive Processing**: Processes files in chunks to handle datasets larger than available memory
|
| - **Smart Batching**: Dynamically adjusts batch sizes based on available resources (see the sketch after this feature list)
|
| - **Efficient Memory Usage**: Implements memory conservation strategies for optimal performance
|
| - **High Performance**: Processes over 300,000 tokens per second on average hardware
|
| - **Perfect Reconstruction**: 100% accuracy in round-trip encoding/decoding
|
| - **Optimal Compression**: Achieves ~3.5 characters per token, exceeding industry standards
|
| - 🛠️ **Extensible**: Advanced users can customize all parameters
|
| - ✅ **Tested**: Built-in testing to verify tokenizer quality
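|
| The resource-adaptive behaviour can be pictured with a small sketch like the one below. This is purely illustrative of the idea, not the library's internal code; `pick_batch_size` is a hypothetical helper, and `psutil` is already a declared dependency.
|
| ```python
| import psutil
|
| def pick_batch_size(avg_doc_bytes: int, max_batch: int = 10_000) -> int:
|     """Illustrative only: scale the batch size to a fraction of currently free RAM."""
|     available = psutil.virtual_memory().available  # bytes of free memory right now
|     budget = available // 4                        # spend at most ~25% of free RAM on a batch
|     return max(1, min(max_batch, budget // max(avg_doc_bytes, 1)))
|
| print(pick_batch_size(avg_doc_bytes=2_000))
| ```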
|
|
|
|
|
|
|
|
|
| ### Installation
|
| ```bash
| git clone https://github.com/nexforge/nexforgetokenizer.git
|
| cd nexforgetokenizer
|
| pip install -e .
|
| ```
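|
| After installing, a quick way to confirm the package is importable (assuming the `build_tokenizer` entry point documented below) is:
|
| ```python
| # Quick post-install check: this should print the function without raising ImportError
| from ez_tokenizer import build_tokenizer
|
| print(build_tokenizer)
| ```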
|
|
|
| ### Basic Usage
|
|
|
| #### Command Line Interface
|
|
|
| ```bash
|
| # Basic usage
|
| python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json
|
|
|
| # With custom parameters
|
| python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json 50000 2
|
| ```
|
|
|
| ## Complete Usage Guide
|
|
|
| ### Command Line Arguments
|
|
|
| ```bash
|
| python -m ez_tokenizer.adaptive_tokenizer <input_path> <output_path> [vocab_size] [min_frequency]
|
| ```
|
|
|
| - **input_path**: Path to file or directory containing training data
|
| - **output_path**: Where to save the tokenizer (should end with .json)
|
| - **vocab_size** (optional, default=40000): Target vocabulary size
|
| - **min_frequency** (optional, default=2): Minimum token occurrence count
|
|
|
|
|
| ### Python API
|
| ```python
|
| from ez_tokenizer import build_tokenizer
|
|
|
| # Basic usage
|
| build_tokenizer(
|     input_dir="path/to/your/files",
|     output_path="output/tokenizer.json"
| )
|
| # Advanced usage
| build_tokenizer(
|     input_dir="path/to/your/files",
|     output_path="output/tokenizer.json",
|     vocab_size=50000,     # Larger vocabulary for specialized domains
|     min_frequency=2,      # Only include tokens appearing at least this many times
|     chunk_size=1000000,   # Characters to process at once
|     n_threads=4           # Number of threads to use
| )
|
| ```
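|
| The resulting JSON file can then be loaded for encoding and decoding. The sketch below assumes the output is a standard Hugging Face `tokenizers` file (that library is a declared dependency) and that a tokenizer has already been written to `output/tokenizer.json`:
|
| ```python
| from tokenizers import Tokenizer
|
| # Load the tokenizer produced above
| tok = Tokenizer.from_file("output/tokenizer.json")
|
| encoded = tok.encode("def add(a, b):\n    return a + b")
| print(encoded.tokens)  # sub-word tokens
| print(encoded.ids)     # integer ids
|
| # Round trip: decoding the ids should give the original text back
| print(tok.decode(encoded.ids))
| ```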
|
|
|
| ## Best Practices
|
|
|
| ### Recommended Settings
|
|
|
| #### For Most Users
|
| - **Vocabulary Size**: 40,000 (default)
|
|   - Balanced between coverage and performance
|
|   - Works well for most programming languages and natural language
|
| - **Minimum Frequency**: 2 (default)
|
|   - Includes tokens that appear at least twice
|
|   - Good balance between vocabulary size and token quality
|
|
|
| #### For Advanced Users
|
| - **Larger Vocabularies (50k+)**
|
|   - Only needed for very diverse codebases
|
|   - Requires more system resources
|
| - **Higher Minimum Frequency**
|
|   - Use 3-5 for smaller vocabularies
|
|   - Reduces vocabulary size while maintaining quality (see the sketch after this list)
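|
| As a concrete illustration of the trade-off, two builds might look like this. This is a sketch using the `build_tokenizer` parameters documented above; the paths and values are illustrative, not recommendations for every project.
|
| ```python
| from ez_tokenizer import build_tokenizer
|
| # Diverse codebase: larger vocabulary, default frequency threshold
| build_tokenizer(
|     input_dir="Dataset",
|     output_path="output/large_vocab_tokenizer.json",
|     vocab_size=50000,
|     min_frequency=2,
| )
|
| # Leaner vocabulary: require more occurrences per token
| build_tokenizer(
|     input_dir="Dataset",
|     output_path="output/lean_tokenizer.json",
|     vocab_size=30000,
|     min_frequency=4,
| )
| ```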
|
|
|
| ### Large Datasets
|
| - The batch file automatically handles large datasets
|
| - Processes files in memory-efficient chunks
|
| - Can be interrupted and resumed if needed
|
|
|
|
|
| ### Supported File Formats
|
| - Supports `.txt`, `.py`, and other text-based formats
|
| - Handles both files and directories
|
| - Automatically filters binary files (a simplified version of such a check is sketched below)
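|
| A binary filter can be approximated with a heuristic like the hypothetical helper below; this illustrates the idea and is not the library's actual check.
|
| ```python
| def looks_like_text(path: str, probe_bytes: int = 1024) -> bool:
|     """Hypothetical helper: treat a file as text if a small probe has no NUL bytes
|     and decodes as UTF-8."""
|     with open(path, "rb") as fh:
|         chunk = fh.read(probe_bytes)
|     if b"\x00" in chunk:
|         return False
|     try:
|         chunk.decode("utf-8")
|     except UnicodeDecodeError:
|         return False
|     return True
| ```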
|
|
|
|
|
| ### Performance Tips
|
| - For large datasets (>1GB), use chunking (see the example below)
|
| - On multi-core systems, increase thread count
|
| - Monitor memory usage with large vocabularies
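|
| Both knobs are exposed through `build_tokenizer` (see the Python API above). For example, a run tuned for a large dataset on a multi-core machine might look like this; the values are illustrative, not requirements:
|
| ```python
| from ez_tokenizer import build_tokenizer
|
| build_tokenizer(
|     input_dir="Dataset",
|     output_path="output/tokenizer.json",
|     chunk_size=500000,  # smaller chunks keep peak memory lower
|     n_threads=8,        # use more threads on a multi-core system
| )
| ```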
|
|
|
|
|
| ## Testing Your Tokenizer
|
| After creating your tokenizer, use the built-in test function:
|
|
|
| 1. From the batch menu, select "Test Tokenizer"
|
| 2. The system will:
|
|    - Test with 10,000 random samples
|
|    - Measure tokenization speed (typically >300k tokens/sec)
|
|    - Verify 100% round-trip accuracy
|
|    - Generate a detailed performance report
|
|
| For more detailed testing from the command line:
|
| ```bash
| python Test_tokenizer/test_tokenizer.py \
|     --tokenizer output/ez_tokenizer.json \
|     --input Dataset \
|     --sample 20000 \
|     --output test_result/detailed_test.txt
| ```
|
|
|
| ### Test Output Includes
|
| - Tokenization success rate
|
| - Sample encoded/decoded text
|
| - Basic statistics (vocab size, special tokens)
|
| - Any encoding/decoding errors
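|
| If you want to reproduce these checks yourself, a minimal sketch could look like the following; it assumes the tokenizer JSON loads with the `tokenizers` library (as in the Python API section) and was written to the default `output/ez_tokenizer.json`:
|
| ```python
| from tokenizers import Tokenizer
|
| tok = Tokenizer.from_file("output/ez_tokenizer.json")
|
| samples = ["print('hello world')", "for i in range(10):", "def main() -> None:"]
| ok = chars = tokens = 0
| for text in samples:
|     enc = tok.encode(text)
|     chars += len(text)
|     tokens += len(enc.ids)
|     if tok.decode(enc.ids) == text:  # round-trip check
|         ok += 1
|
| print(f"round-trip success: {ok}/{len(samples)}")
| print(f"characters per token: {chars / tokens:.2f}")
| ```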
|
|
|
| ## Troubleshooting
|
|
|
| ### Common Issues
|
|
|
| 1. **Out of Memory**
|
|    - Reduce chunk size
|
|    - Close other memory-intensive applications
|
|    - Use a smaller vocabulary
|
| 2. **Slow Processing**
|
|    - Increase thread count
|
|    - Process in smaller batches
|
|    - Check for system resource constraints
|
| 3. **Vocabulary Too Large**
|
|    - Increase min_frequency
|
|    - Use a smaller vocab_size
|
|    - Pre-filter your dataset
|
|
|
| ## Performance & Resource Usage
|
|
|
| The tokenizer is optimized to work efficiently across different hardware configurations:
|
|
|
| ### System Requirements
|
| - **Minimum**: 4GB RAM, 2-core CPU
|
| - **Recommended**: 8GB+ RAM, 4+ core CPU
|
| - **Disk Space**: At least 1GB free (more for large datasets)
|
|
|
| ### Typical Resource Usage
|
| - **Memory Usage**: Typically stays under 2GB for most datasets
|
| - **CPU Utilization**: Deliberately capped to prevent system slowdown
|
| - **Processing Speed**: Varies by system, but generally processes:
|
|   - Small datasets (100MB): 1-5 minutes
|
|   - Medium datasets (1GB): 10-30 minutes
|
|   - Large datasets (10GB+): 1-3 hours
|
|
|
| ### Monitoring Progress
|
| - The batch file shows progress updates
|
| - Check Task Manager for real-time resource usage, or query it from Python as sketched below
|
| - Process can be safely interrupted (CTRL+C) and resumed
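|
| On any platform you can also poll resource usage with `psutil`, which is already a dependency; a minimal sketch:
|
| ```python
| import psutil
|
| proc = psutil.Process()  # the current Python process
| mem = proc.memory_info().rss / 1024 ** 2
| sys_free = psutil.virtual_memory().available / 1024 ** 2
| print(f"process memory: {mem:.0f} MiB, free system memory: {sys_free:.0f} MiB")
| print(f"CPU usage: {psutil.cpu_percent(interval=1.0):.0f}%")
| ```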
|
|
|
|
|
| ## Examples
|
| See the `examples/` directory for:
|
| - Training on specific programming languages
|
| - Fine-tuning pre-trained tokenizers
|
| - Batch processing large datasets
|
|
|
|
|
| ## Contributing
|
| Contributions are welcome! Here's how to get started:
|
|
|
| 1. Fork the repository
|
| 2. Create a new branch
|
| 3. Make your changes
|
| 4. Run tests: `pytest`
|
| 5. Submit a pull request
|
|
|
|
|
| ## License
|
| MIT License with Company Restriction - see [LICENSE](LICENSE) for details.
|
|
|