Metadata-Version: 2.4
Name: ez-tokenizer
Version: 1.0.0
Summary: High-performance tokenizer builder for code and text datasets with adaptive resource management
Home-page: https://github.com/nexforge/nexforgetokenizer
Author: NexForge Team
Author-email: NexForge <jm.talbot@outlook.com>
Maintainer-email: NexForge <jm.talbot@outlook.com>
License: MIT with Company Restriction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: tokenizers>=0.12.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: python-dateutil>=2.8.2
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.12.1; extra == "dev"
Requires-Dist: pytest-xdist>=2.4.0; extra == "dev"
Requires-Dist: black>=21.7b0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: mypy>=0.910; extra == "dev"
Requires-Dist: pylint>=2.11.0; extra == "dev"
Requires-Dist: pre-commit>=2.15.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python
# EZ-Tokenizer
A high-performance tool for creating custom tokenizers from your code or text datasets. Automatically adapts to your system resources while providing fine-grained control over tokenizer creation.
> **Note**: This project was previously known as NexForge Tokenizer. All functionality remains the same; only the name has changed, to better reflect the tool's ease of use and efficiency.
## 📄 License
EZ-Tokenizer is released under the MIT License with a company restriction clause. This means:
- 🆓 **Free for everyone**: Individuals and small businesses can use EZ-Tokenizer for free
- 🏢 **Commercial use**: Companies with more than 10 employees or $1M+ in annual revenue need a commercial license
- 📝 **Full details**: See [LICENSE](LICENSE) for complete terms
## Quick Start with Batch File (Recommended for Most Users)
### Prerequisites
- Windows OS
- Python 3.8 or higher installed
- Administrator privileges
- At least 4GB RAM (8GB+ recommended)
### Getting Started
1. **Download** the latest release or clone this repository
2. **Add your dataset**: Place training files in the `Dataset` directory
- Supported formats: `.txt`, `.py`, and other text files
- The system will process all compatible files in this directory
3. **Run as Administrator**: Right-click on `run_ez_tokenizer.bat` and select "Run as administrator"
4. **Follow the Menu**:
- Option 1: Install Dependencies (first time only)
- Option 2: Create Tokenizer (processes all files in Dataset directory)
- Option 3: Test Tokenizer (after creation)
- Option 4: Open Dataset Directory (to add/check files)
- Option 5: Exit
### Default Tokenizer Settings
- **Vocabulary Size**: 50,000 tokens
- **Minimum Frequency**: 2 (includes tokens appearing at least twice)
- **File Processing**: All files in Dataset directory
- **Output**: `output/ez_tokenizer.json`
- **Test Results**: `test_result/test_run.txt`
### For Advanced Users
Customize tokenizer creation by running manually:
```bash
python -m ez_tokenizer.adaptive_tokenizer [input_dir] [output_path] [vocab_size] [min_frequency] [max_files]
```
Example:
```bash
python -m ez_tokenizer.adaptive_tokenizer "Dataset" "output/custom_tokenizer.json" 50000 2 1000
```
---
## Advanced Usage (Manual Setup)
For users who need more control or are using non-Windows systems:
## Features
- **Adaptive Resource Management**: Automatically detects and utilizes available system resources (CPU, RAM, GPU)
- **Progressive Processing**: Processes files in chunks to handle datasets larger than available memory (the idea is sketched after this list)
- **Smart Batching**: Dynamically adjusts batch sizes based on available resources
- **Efficient Memory Usage**: Implements memory conservation strategies for optimal performance
- **High Performance**: Processes over 300,000 tokens per second on average hardware
- **Perfect Reconstruction**: 100% accuracy in round-trip encoding/decoding
- **Optimal Compression**: Achieves ~3.5 characters per token, exceeding industry standards
- 🛠️ **Extensible**: Advanced users can customize all parameters
- ✅ **Tested**: Built-in testing to verify tokenizer quality
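To make the progressive-processing idea concrete, here is a minimal, generic sketch of chunked reading. It is an illustration of the technique, not EZ-Tokenizer's actual internals; the 1,000,000-character default mirrors the `chunk_size` parameter documented in the Python API below.
```python
# Illustration only: generic chunked reading, not the package's code.
# Fixed-size character chunks keep memory bounded for any file size.
def iter_chunks(path, chunk_size=1_000_000):
    """Yield successive chunks of at most chunk_size characters."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```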
## Quick Start
### Installation
```bash
# Install from source
git clone https://github.com/nexforge/nexforgetokenizer.git
cd nexforgetokenizer
pip install -e .
```
### Basic Usage
#### Command Line Interface
```bash
# Basic usage
python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json
# With custom parameters
python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json 50000 2
```
## Complete Usage Guide
### Command Line Arguments
```bash
python -m ez_tokenizer.adaptive_tokenizer <input_path> <output_path> [vocab_size] [min_frequency] [max_files]
```
- **input_path**: Path to a file or directory containing training data
- **output_path**: Where to save the tokenizer (should end with `.json`)
- **vocab_size** (optional, default=40000): Target vocabulary size
- **min_frequency** (optional, default=2): Minimum token occurrence count
- **max_files** (optional): Maximum number of files to process, as used in the advanced example above
### Python API
```python
from ez_tokenizer import build_tokenizer
# Basic usage
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="output/tokenizer.json"
)

# Advanced usage
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="output/tokenizer.json",
    vocab_size=50000,    # Larger vocabulary for specialized domains
    min_frequency=2,     # Only include tokens appearing at least this many times
    chunk_size=1000000,  # Characters to process at once
    n_threads=4          # Number of threads to use
)
```
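Once built, the tokenizer JSON can be loaded for inference. The sketch below assumes the output file is in the Hugging Face `tokenizers` format (that package is a declared dependency); verify against your own output file:
```python
from tokenizers import Tokenizer

# Assumes the saved JSON is a Hugging Face `tokenizers` file.
tok = Tokenizer.from_file("output/tokenizer.json")

encoded = tok.encode("def hello_world():\n    print('Hello!')")
print(encoded.tokens)           # token strings
print(encoded.ids)              # vocabulary ids
print(tok.decode(encoded.ids))  # should round-trip to the input
```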
## Best Practices
### Recommended Settings
#### For Most Users
- **Vocabulary Size**: 40,000 (default)
- Balanced between coverage and performance
- Works well for most programming languages and natural language
- **Minimum Frequency**: 2 (default)
- Includes tokens that appear at least twice
- Good balance between vocabulary size and token quality
#### For Specialized Use Cases
- **Larger Vocabularies (50k+)**
- Only needed for very diverse codebases
- Requires more system resources
- **Higher Minimum Frequency**
- Use 3-5 for smaller vocabularies
- Reduces vocabulary size while maintaining quality
#### Processing Large Datasets
- The batch file automatically handles large datasets
- Processes files in memory-efficient chunks
- Can be interrupted and resumed if needed
### Input Data
- Supports `.txt`, `.py`, and other text-based formats
- Handles both files and directories
- Automatically filters binary files
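The binary filter is handled internally, but a common heuristic for this kind of check is easy to sketch. This is an illustration, not necessarily the package's actual logic: files with NUL bytes near the start are almost certainly binary.
```python
def looks_binary(path, probe_size=1024):
    """Heuristic sketch: treat files containing NUL bytes in the
    first kilobyte as binary. Not EZ-Tokenizer's actual filter."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(probe_size)
```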
### Performance Tips
- For large datasets (>1GB), use chunking (see the sketch after this list)
- On multi-core systems, increase thread count
- Monitor memory usage with large vocabularies
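Combining the first two tips, a tuned call might look like the following. The parameters are the documented `build_tokenizer` ones; the specific values are illustrative, and `os.cpu_count()` is standard-library Python:
```python
import os

from ez_tokenizer import build_tokenizer

# Illustrative tuning for a large dataset on a multi-core machine:
# explicit chunking plus one worker thread per available core.
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="output/tokenizer.json",
    chunk_size=1000000,             # characters per chunk
    n_threads=os.cpu_count() or 4,  # fall back to 4 if undetectable
)
```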
## Testing Your Tokenizer
After creating your tokenizer, use the built-in test function:
1. From the batch menu, select "Test Tokenizer"
2. The system will:
- Test with 10,000 random samples
- Measure tokenization speed (typically >300k tokens/sec)
- Verify 100% round-trip accuracy
- Generate a detailed performance report
You can also run the test script directly with custom options:
```bash
# Custom test with a specific sample size
python Test_tokenizer/test_tokenizer.py \
    --tokenizer output/ez_tokenizer.json \
    --input Dataset \
    --sample 20000 \
    --output test_result/detailed_test.txt
```
### Test Output Includes
- Tokenization success rate
- Sample encoded/decoded text
- Basic statistics (vocab size, special tokens)
- Any encoding/decoding errors
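To spot-check the headline claims (round-trip accuracy and characters per token) outside the bundled script, a few lines with the `tokenizers` library suffice. As above, this assumes the output JSON is in the Hugging Face `tokenizers` format; if the assertion fires, inspect the decoder's whitespace handling before assuming a bug:
```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("output/ez_tokenizer.json")

samples = ["def add(a, b):\n    return a + b", "Hello, world!"]
total_chars = total_tokens = 0
for text in samples:
    enc = tok.encode(text)
    assert tok.decode(enc.ids) == text, "round-trip mismatch"
    total_chars += len(text)
    total_tokens += len(enc.ids)

print(f"characters per token: {total_chars / total_tokens:.2f}")
```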
## Troubleshooting
### Common Issues
1. **Out of Memory**
- Reduce chunk size (example settings after this list)
- Close other memory-intensive applications
- Use a smaller vocabulary
2. **Slow Processing**
- Increase thread count
- Process in smaller batches
- Check for system resource constraints
3. **Vocabulary Too Large**
- Increase min_frequency
- Use a smaller vocab_size
- Pre-filter your dataset
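For the out-of-memory case specifically, the fixes above map directly onto the documented `build_tokenizer` parameters. The values below are illustrative starting points, not tuned recommendations:
```python
from ez_tokenizer import build_tokenizer

# Memory-conscious settings: smaller chunks lower peak memory; a
# smaller vocabulary and higher min_frequency shrink the token pool.
build_tokenizer(
    input_dir="Dataset",
    output_path="output/tokenizer.json",
    vocab_size=30000,
    min_frequency=3,
    chunk_size=250000,
    n_threads=2,
)
```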
## Performance & Resource Usage
The tokenizer is optimized to work efficiently across different hardware configurations:
### System Requirements
- **Minimum**: 4GB RAM, 2-core CPU
- **Recommended**: 8GB+ RAM, 4+ core CPU
- **Disk Space**: At least 1GB free (more for large datasets)
### Expected Performance
- **Memory Usage**: Typically stays under 2GB for most datasets
- **CPU Utilization**: Deliberately capped to prevent system slowdown
- **Processing Speed**: Varies by system, but generally processes:
- Small datasets (100MB): 1-5 minutes
- Medium datasets (1GB): 10-30 minutes
- Large datasets (10GB+): 1-3 hours
### Monitoring
- The batch file shows progress updates
- Check Task Manager for real-time resource usage
- Process can be safely interrupted (CTRL+C) and resumed
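If you prefer a programmatic check over Task Manager (or are on a non-Windows system), `psutil`, already a declared dependency, can report usage. This is a generic monitoring snippet, not part of the package:
```python
import psutil

# One-off snapshot; call periodically while the tokenizer runs
# to watch for memory pressure.
mem = psutil.virtual_memory()
print(f"RAM used: {mem.percent}% of {mem.total / 2**30:.1f} GiB")
print(f"CPU load: {psutil.cpu_percent(interval=1.0)}%")
```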
## Examples
See the `examples/` directory for:
- Training on specific programming languages
- Fine-tuning pre-trained tokenizers
- Batch processing large datasets
## Contributing
Contributions are welcome! Here's how to get started:
1. Fork the repository
2. Create a new branch
3. Make your changes
4. Run tests: `pytest`
5. Submit a pull request
## License
MIT License with a company restriction clause - see [LICENSE](LICENSE) for full terms.