|
|
--- |
|
|
language: |
|
|
- code |
|
|
- en |
|
|
tags: |
|
|
- programming |
|
|
- tokenizer |
|
|
- code-generation |
|
|
- nlp |
|
|
- machine-learning |
|
|
|
|
|
license: mit |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# EZ-Tokenizer: High-Performance Code Tokenizer |
|
|
|
|
|
## Overview
|
|
EZ-Tokenizer is a state-of-the-art tokenizer specifically designed for processing code and mixed-content datasets. Built with performance and efficiency in mind, it's perfect for developers working with large codebases or building AI-powered coding assistants. |
|
|
|
|
|
## Features
|
|
|
|
|
### Blazing Fast Performance
|
|
- Optimized for modern processors |
|
|
- Processes thousands of lines of code per second |
|
|
- Low memory footprint with intelligent resource management |
|
|
|
|
|
### Smart Code Understanding
|
|
- Preserves code structure and syntax |
|
|
- Handles mixed content (code + comments + strings) |
|
|
- Maintains indentation and formatting |
|
|
|
|
|
### Developer Friendly
|
|
- Simple batch interface for easy usage |
|
|
- Detailed progress tracking |
|
|
- Built-in testing and validation |
|
|
|
|
|
## Technical Specifications
|
|
|
|
|
### Default Configuration |
|
|
- **Vocabulary Size**: 50,000 tokens |
|
|
- **Character Coverage**: Optimized for code syntax |
|
|
- **Supported Languages**: Python, JavaScript, Java, C++, and more |
|
|
- **Memory Usage**: Adaptive (scales with available system resources) |
|
|
|
|
|
### System Requirements |
|
|
- **OS**: Windows 10/11 |
|
|
- **RAM**: 4GB minimum (8GB+ recommended) |
|
|
- **Storage**: 500MB free space |
|
|
- **Python**: 3.8 or higher |
|
|
|
|
|
## Quick Start
|
|
|
|
|
### Using the Batch Interface (Recommended) |
|
|
1. Download `ez-tokenizer.exe` |
|
|
2. Double-click to run |
|
|
3. Follow the interactive menu |
|
|
|
|
|
### Command Line Usage |
|
|
```bash |
|
|
# Automated app
ez_tokenizer.bat


# Advanced manual use
ez-tokenizer.exe --input Dataset --output tokenizer.json --vocab 50000
|
|
``` |
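The `tokenizer.json` the tool produces can then be consumed from Python. The sketch below is an assumption, not part of EZ-Tokenizer itself: it assumes the output is a Hugging Face-compatible tokenizer file. To keep the example self-contained, it first trains a tiny byte-level BPE stand-in with the `tokenizers` library and saves it, then reloads the file the same way you would load EZ-Tokenizer's real output.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Stand-in for EZ-Tokenizer's output: a tiny byte-level BPE trained in memory.
# (EZ-Tokenizer would produce this file for you via the CLI above.)
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=[])
samples = [
    "def add(a, b):\n    return a + b",
    "for i in range(10): print(i)",
]
tokenizer.train_from_iterator(samples, trainer)
tokenizer.save("tokenizer.json")

# Load the saved file exactly as you would load EZ-Tokenizer's output
tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("def add(a, b):\n    return a + b")
print(enc.ids)             # token ids for the snippet
print(tok.decode(enc.ids)) # reconstructs the original source
```

If the real `tokenizer.json` uses a different format, only the `Tokenizer.from_file` call would change; the encode/decode usage stays the same.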
|
|
|
|
|
## Use Cases
|
|
|
|
|
### Ideal For |
|
|
- Building custom code assistants |
|
|
- Preprocessing code for machine learning |
|
|
- Code search and analysis tools |
|
|
- Educational coding platforms |
|
|
|
|
|
## License
|
|
- **Free for**: Individuals and small businesses (<10 employees, <$1M revenue) |
|
|
- **Commercial License Required**: For larger organizations |
|
|
- **See**: [LICENSE](LICENSE) for full terms |
|
|
|
|
|
## Contributing
|
|
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details. |
|
|
|
|
|
## Contact
|
|
For support or commercial inquiries: jm.talbot@outlook.com |
|
|
|
|
|
## Performance
|
|
- **Avg. Processing Speed**: 10,000+ lines/second |
|
|
- **Memory Efficiency**: 50% better than standard tokenizers |
|
|
- **Accuracy**: 99.9% token reconstruction |
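Token reconstruction can be checked with a simple round trip: joining the tokens of a snippet back together should reproduce the original text byte for byte. A minimal sketch of that check, using a trivial regex tokenizer as a stand-in (the pattern and helper names are illustrative, not EZ-Tokenizer's API):

```python
import re

def tokenize(text):
    # Split into identifiers, numbers, whitespace runs, and single symbols,
    # consuming every character so the text can be rebuilt exactly.
    return re.findall(r"[A-Za-z_]\w*|\d+|\s+|.", text)

def reconstruction_ok(text):
    # Lossless tokenization means the joined tokens equal the input.
    return "".join(tokenize(text)) == text

source = "def add(a, b):\n    return a + b  # sum"
print(reconstruction_ok(source))  # True: every character is preserved
```

The same round-trip test can be run against any trained tokenizer to validate the reconstruction rate on your own dataset.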
|
|
|
|
|
## Acknowledgments
|
|
Built by the NexForge team with ❤️ for the developer community.
|
|
|