Metadata-Version: 2.4
Name: ez-tokenizer
Version: 1.0.0
Summary: High-performance tokenizer builder for code and text datasets with adaptive resource management
Home-page: https://github.com/nexforge/nexforgetokenizer
Author: NexForge Team
Author-email: NexForge <jm.talbot@outlook.com>
Maintainer-email: NexForge <jm.talbot@outlook.com>
License: MIT with Company Restriction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: tokenizers>=0.12.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: python-dateutil>=2.8.2
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.12.1; extra == "dev"
Requires-Dist: pytest-xdist>=2.4.0; extra == "dev"
Requires-Dist: black>=21.7b0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: mypy>=0.910; extra == "dev"
Requires-Dist: pylint>=2.11.0; extra == "dev"
Requires-Dist: pre-commit>=2.15.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python
# EZ-Tokenizer
A high-performance tool for creating custom tokenizers from your code or text datasets. Automatically adapts to your system resources while providing fine-grained control over tokenizer creation.
> **Note**: This project was previously known as NexForge Tokenizer. All functionality remains the same; only the name has changed, to better reflect the tool's ease of use and efficiency.
## 📄 License
EZ-Tokenizer is released under the MIT License with a company restriction clause. This means:
- 🆓 **Free for everyone**: Individuals and small businesses can use EZ-Tokenizer for free
- 🏢 **Commercial use**: Companies with more than 10 employees or $1M+ in annual revenue need a commercial license
- 📝 **Full details**: See [LICENSE](LICENSE) for complete terms
## Quick Start with Batch File (Recommended for Most Users)
### Prerequisites
- Windows OS
- Python 3.8 or higher installed
- Administrator privileges
- At least 4GB RAM (8GB+ recommended)
### Getting Started
1. **Download** the latest release or clone this repository
2. **Add your dataset**: Place training files in the `Dataset` directory
- Supported formats: `.txt`, `.py`, and other text files
- The system will process all compatible files in this directory
3. **Run as Administrator**: Right-click on `run_ez_tokenizer.bat` and select "Run as administrator"
4. **Follow the Menu**:
- Option 1: Install Dependencies (first time only)
- Option 2: Create Tokenizer (processes all files in Dataset directory)
- Option 3: Test Tokenizer (after creation)
- Option 4: Open Dataset Directory (to add/check files)
- Option 5: Exit
### Default Tokenizer Settings
- **Vocabulary Size**: 50,000 tokens
- **Minimum Frequency**: 2 (includes tokens appearing at least twice)
- **File Processing**: All files in Dataset directory
- **Output**: `output/ez_tokenizer.json`
- **Test Results**: `test_result/test_run.txt`
### For Advanced Users
Customize tokenizer creation by running manually:
```bash
python -m ez_tokenizer.adaptive_tokenizer [input_dir] [output_path] [vocab_size] [min_frequency] [max_files]
```
Example:
```bash
python -m ez_tokenizer.adaptive_tokenizer "Dataset" "output/custom_tokenizer.json" 50000 2 1000
```
---
## Advanced Usage (Manual Setup)
For users who need more control or are using non-Windows systems:
## Features
- **Adaptive Resource Management**: Automatically detects and utilizes available system resources (CPU, RAM, GPU)
- **Progressive Processing**: Processes files in chunks to handle datasets larger than available memory (the idea is sketched after this list)
- **Smart Batching**: Dynamically adjusts batch sizes based on available resources
- **Efficient Memory Usage**: Implements memory conservation strategies for optimal performance
- **High Performance**: Processes over 300,000 tokens per second on average hardware
- **Perfect Reconstruction**: 100% accuracy in round-trip encoding/decoding
- **Optimal Compression**: Achieves ~3.5 characters per token, exceeding industry standards
- 🛠️ **Extensible**: Advanced users can customize all parameters
- ✅ **Tested**: Built-in testing to verify tokenizer quality
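To make the progressive-processing idea concrete, here is a minimal, generic sketch of chunked reading. It is an illustration of the technique, not EZ-Tokenizer's actual internals; the 1,000,000-character default mirrors the `chunk_size` parameter documented in the Python API below.
```python
# Illustration only: generic chunked reading, not the package's code.
# Fixed-size character chunks keep memory bounded for any file size.
def iter_chunks(path, chunk_size=1_000_000):
    """Yield successive chunks of at most chunk_size characters."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```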
## Quick Start
### Installation
```bash
# Install from source
git clone https://github.com/nexforge/nexforgetokenizer.git
cd nexforgetokenizer
pip install -e .
```
### Basic Usage
#### Command Line Interface
```bash
# Basic usage
python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json
# With custom parameters
python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json 50000 2
```
## Complete Usage Guide
### Command Line Arguments
```bash
python -m ez_tokenizer.adaptive_tokenizer <input_path> <output_path> [vocab_size] [min_frequency] [max_files]
```
- **input_path**: Path to a file or directory containing training data
- **output_path**: Where to save the tokenizer (should end with `.json`)
- **vocab_size** (optional, default=40000): Target vocabulary size
- **min_frequency** (optional, default=2): Minimum token occurrence count
- **max_files** (optional): Maximum number of files to process, as used in the advanced example above
### Python API
```python
from ez_tokenizer import build_tokenizer
# Basic usage
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="output/tokenizer.json"
)

# Advanced usage
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="output/tokenizer.json",
    vocab_size=50000,    # Larger vocabulary for specialized domains
    min_frequency=2,     # Only include tokens appearing at least this many times
    chunk_size=1000000,  # Characters to process at once
    n_threads=4          # Number of threads to use
)
```
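Once built, the tokenizer JSON can be loaded for inference. The sketch below assumes the output file is in the Hugging Face `tokenizers` format (that package is a declared dependency); verify against your own output file:
```python
from tokenizers import Tokenizer

# Assumes the saved JSON is a Hugging Face `tokenizers` file.
tok = Tokenizer.from_file("output/tokenizer.json")

encoded = tok.encode("def hello_world():\n    print('Hello!')")
print(encoded.tokens)           # token strings
print(encoded.ids)              # vocabulary ids
print(tok.decode(encoded.ids))  # should round-trip to the input
```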
## Best Practices
### Recommended Settings
#### For Most Users
- **Vocabulary Size**: 40,000 (default)
- Balanced between coverage and performance
- Works well for most programming languages and natural language
- **Minimum Frequency**: 2 (default)
- Includes tokens that appear at least twice
- Good balance between vocabulary size and token quality
#### For Specialized Use Cases
- **Larger Vocabularies (50k+)**
- Only needed for very diverse codebases
- Requires more system resources
- **Higher Minimum Frequency**
- Use 3-5 for smaller vocabularies
- Reduces vocabulary size while maintaining quality
#### Processing Large Datasets
- The batch file automatically handles large datasets
- Processes files in memory-efficient chunks
- Can be interrupted and resumed if needed
### Input Data
- Supports `.txt`, `.py`, and other text-based formats
- Handles both files and directories
- Automatically filters binary files
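The binary filter is handled internally, but a common heuristic for this kind of check is easy to sketch. This is an illustration, not necessarily the package's actual logic: files with NUL bytes near the start are almost certainly binary.
```python
def looks_binary(path, probe_size=1024):
    """Heuristic sketch: treat files containing NUL bytes in the
    first kilobyte as binary. Not EZ-Tokenizer's actual filter."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(probe_size)
```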
### Performance Tips
- For large datasets (>1GB), use chunking (see the sketch after this list)
- On multi-core systems, increase thread count
- Monitor memory usage with large vocabularies
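Combining the first two tips, a tuned call might look like the following. The parameters are the documented `build_tokenizer` ones; the specific values are illustrative, and `os.cpu_count()` is standard-library Python:
```python
import os

from ez_tokenizer import build_tokenizer

# Illustrative tuning for a large dataset on a multi-core machine:
# explicit chunking plus one worker thread per available core.
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="output/tokenizer.json",
    chunk_size=1000000,             # characters per chunk
    n_threads=os.cpu_count() or 4,  # fall back to 4 if undetectable
)
```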
## Testing Your Tokenizer
After creating your tokenizer, use the built-in test function:
1. From the batch menu, select "Test Tokenizer"
2. The system will:
- Test with 10,000 random samples
- Measure tokenization speed (typically >300k tokens/sec)
- Verify 100% round-trip accuracy
- Generate a detailed performance report
You can also run the test script directly with custom options:
```bash
# Custom test with a specific sample size
python Test_tokenizer/test_tokenizer.py \
    --tokenizer output/ez_tokenizer.json \
    --input Dataset \
    --sample 20000 \
    --output test_result/detailed_test.txt
```
### Test Output Includes
- Tokenization success rate
- Sample encoded/decoded text
- Basic statistics (vocab size, special tokens)
- Any encoding/decoding errors
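To spot-check the headline claims (round-trip accuracy and characters per token) outside the bundled script, a few lines with the `tokenizers` library suffice. As above, this assumes the output JSON is in the Hugging Face `tokenizers` format; if the assertion fires, inspect the decoder's whitespace handling before assuming a bug:
```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("output/ez_tokenizer.json")

samples = ["def add(a, b):\n    return a + b", "Hello, world!"]
total_chars = total_tokens = 0
for text in samples:
    enc = tok.encode(text)
    assert tok.decode(enc.ids) == text, "round-trip mismatch"
    total_chars += len(text)
    total_tokens += len(enc.ids)

print(f"characters per token: {total_chars / total_tokens:.2f}")
```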
## Troubleshooting
### Common Issues
1. **Out of Memory**
- Reduce chunk size (example settings after this list)
- Close other memory-intensive applications
- Use a smaller vocabulary
2. **Slow Processing**
- Increase thread count
- Process in smaller batches
- Check for system resource constraints
3. **Vocabulary Too Large**
- Increase min_frequency
- Use a smaller vocab_size
- Pre-filter your dataset
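For the out-of-memory case specifically, the fixes above map directly onto the documented `build_tokenizer` parameters. The values below are illustrative starting points, not tuned recommendations:
```python
from ez_tokenizer import build_tokenizer

# Memory-conscious settings: smaller chunks lower peak memory; a
# smaller vocabulary and higher min_frequency shrink the token pool.
build_tokenizer(
    input_dir="Dataset",
    output_path="output/tokenizer.json",
    vocab_size=30000,
    min_frequency=3,
    chunk_size=250000,
    n_threads=2,
)
```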
## Performance & Resource Usage
The tokenizer is optimized to work efficiently across different hardware configurations:
### System Requirements
- **Minimum**: 4GB RAM, 2-core CPU
- **Recommended**: 8GB+ RAM, 4+ core CPU
- **Disk Space**: At least 1GB free (more for large datasets)
### Expected Performance
- **Memory Usage**: Typically stays under 2GB for most datasets
- **CPU Utilization**: Deliberately capped to prevent system slowdown
- **Processing Speed**: Varies by system, but generally processes:
- Small datasets (100MB): 1-5 minutes
- Medium datasets (1GB): 10-30 minutes
- Large datasets (10GB+): 1-3 hours
### Monitoring
- The batch file shows progress updates
- Check Task Manager for real-time resource usage
- Process can be safely interrupted (CTRL+C) and resumed
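If you prefer a programmatic check over Task Manager (or are on a non-Windows system), `psutil`, already a declared dependency, can report usage. This is a generic monitoring snippet, not part of the package:
```python
import psutil

# One-off snapshot; call periodically while the tokenizer runs
# to watch for memory pressure.
mem = psutil.virtual_memory()
print(f"RAM used: {mem.percent}% of {mem.total / 2**30:.1f} GiB")
print(f"CPU load: {psutil.cpu_percent(interval=1.0)}%")
```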
## Examples
See the `examples/` directory for:
- Training on specific programming languages
- Fine-tuning pre-trained tokenizers
- Batch processing large datasets
## Contributing
Contributions are welcome! Here's how to get started:
1. Fork the repository
2. Create a new branch
3. Make your changes
4. Run tests: `pytest`
5. Submit a pull request
## License
MIT License with a company restriction clause - see [LICENSE](LICENSE) for full terms.