Johnnyman1100 committed on
Commit 38b3b04 · verified · 1 Parent(s): 66331db

Update README.md

Updated Model Card

Files changed (1):
  1. README.md +56 -276

README.md CHANGED
@@ -13,301 +13,81 @@ license: mit
  pipeline_tag: token-classification
  ---
 
- # EZ-Tokenizer
-
- A high-performance tool for creating custom tokenizers from your code or text datasets. Automatically adapts to your system resources while providing fine-grained control over tokenizer creation.
-
- > **Note**: This project was previously known as NexForge Tokenizer. All functionality remains the same, only the name has been updated to better reflect its ease of use and efficiency.
-
- ## 📄 License
-
- EZ-Tokenizer is released under the MIT License with a company restriction clause. This means:
-
- - 🆓 **Free for everyone**: Individuals and small businesses can use EZ-Tokenizer for free
- - 🏢 **Commercial use**: Companies with more than 10 employees or $1M+ in annual revenue need a commercial license
- - 📝 **Full details**: See [LICENSE](LICENSE) for complete terms
-
- ## Quick Start with Batch File (Recommended for Most Users)
-
- ### Prerequisites
- - Windows OS
- - Python 3.8 or higher installed
- - Administrator privileges
- - At least 4GB RAM (8GB+ recommended)
-
- ### Getting Started
-
- 1. **Download** the latest release or clone this repository
- 2. **Add your dataset**: Place training files in the `Dataset` directory
-    - Supported formats: `.txt`, `.py`, and other text files
-    - The system will process all compatible files in this directory
- 3. **Run as Administrator**: Right-click on `run_ez_tokenizer.bat` and select "Run as administrator"
- 4. **Follow the Menu**:
-    - Option 1: Install Dependencies (first time only)
-    - Option 2: Create Tokenizer (processes all files in Dataset directory)
-    - Option 3: Test Tokenizer (after creation)
-    - Option 4: Open Dataset Directory (to add/check files)
-    - Option 5: Exit
-
- ### Default Tokenizer Settings
  - **Vocabulary Size**: 50,000 tokens
- - **Minimum Frequency**: 2 (includes tokens appearing at least twice)
- - **File Processing**: All files in Dataset directory
- - **Output**: `output/tokenizer.json`
- - **Test Results**: `Test_tokenizer/test_results.txt`
-
- ### Dependencies
- - Python 3.8+
- - tokenizers >= 0.21.1
- - tqdm >= 4.66.1
- - numpy >= 1.24.0
- - psutil >= 5.9.0
-
- ### For Advanced Users
- Customize tokenizer creation by running manually:
- ```bash
- python -m ez_tokenizer.adaptive_tokenizer [input_dir] [output_path] [vocab_size] [min_frequency] [max_files]
- ```
-
- Example (matches batch file defaults):
- ```bash
- python -m ez_tokenizer.adaptive_tokenizer "Dataset" "output/tokenizer.json" 50000 2
- ```
-
- ### Batch File Menu Options
- 1. **Install Dependencies**
-    - Installs required Python packages
-    - Only needed for first-time setup
-
- 2. **Create Tokenizer**
-    - Processes all files in the `Dataset` directory
-    - Outputs to `output/tokenizer.json`
-    - Shows progress and statistics
-
- 3. **Test Tokenizer**
-    - Runs tests on the created tokenizer
-    - Saves results to `Test_tokenizer/test_results.txt`
-    - Verifies reconstruction accuracy
-
- 4. **Open Dataset Directory**
-    - Opens the Dataset folder for easy file management
-    - Add your training files here before creating a tokenizer
-
- ---
-
- ## Advanced Usage (Manual Setup)
-
- For users who need more control or are using non-Windows systems:
-
- ## Features
-
- - **Adaptive Resource Management**: Automatically detects and utilizes available system resources (CPU, RAM, GPU)
- - **Progressive Processing**: Processes files in chunks to handle datasets larger than available memory
- - **Smart Batching**: Dynamically adjusts batch sizes based on available resources
- - **Efficient Memory Usage**: Implements memory conservation strategies for optimal performance
- - **High Performance**: Processes over 300,000 tokens per second on average hardware
- - **Perfect Reconstruction**: 100% accuracy in round-trip encoding/decoding
- - **Optimal Compression**: Achieves ~3.5 characters per token, exceeding industry standards
- - 🛠️ **Extensible**: Advanced users can customize all parameters
- - ✅ **Tested**: Built-in testing to verify tokenizer quality
-
- ## Quick Start
-
- ### Installation
-
- ```bash
- # Install from source
- git clone https://github.com/yourusername/ez_tokenizer.git
- cd ez_tokenizer
- pip install -e .
- ```
-
- ### Basic Usage
-
- #### Command Line Interface
-
  ```bash
- # Basic usage
- python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json
-
- # With custom parameters
- python -m ez_tokenizer.adaptive_tokenizer path/to/your/files output/tokenizer.json 50000 2
  ```
-
- ## Complete Usage Guide
-
- ### Command Line Arguments
-
- ```bash
- python -m ez_tokenizer.adaptive_tokenizer <input_path> <output_path> [vocab_size] [min_frequency]
- ```
-
- - **input_path**: Path to file or directory containing training data
- - **output_path**: Where to save the tokenizer (should end with .json)
- - **vocab_size** (optional, default=40000): Target vocabulary size
- - **min_frequency** (optional, default=2): Minimum token occurrence count
-
- ### Python API
-
- ```python
- from ez_tokenizer import build_tokenizer
-
- # Basic usage
- build_tokenizer(
-     input_dir="path/to/your/files",
-     output_path="output/tokenizer.json"
- )
-
- # Advanced usage
- build_tokenizer(
-     input_dir="path/to/your/files",
-     output_path="output/tokenizer.json",
-     vocab_size=50000,     # Larger vocabulary for specialized domains
-     min_frequency=2,      # Only include tokens appearing at least this many times
-     chunk_size=1000000,   # Characters to process at once
-     n_threads=4           # Number of threads to use
- )
- ```
-
- ## Best Practices
-
- ### Recommended Settings
-
- #### For Most Users
- - **Vocabulary Size**: 40,000 (default)
-   - Balanced between coverage and performance
-   - Works well for most programming languages and natural language
- - **Minimum Frequency**: 2 (default)
-   - Includes tokens that appear at least twice
-   - Good balance between vocabulary size and token quality
-
- #### For Specialized Use Cases
- - **Larger Vocabularies (50k+)**
-   - Only needed for very diverse codebases
-   - Requires more system resources
- - **Higher Minimum Frequency**
-   - Use 3-5 for smaller vocabularies
-   - Reduces vocabulary size while maintaining quality
-
- #### Processing Large Datasets
- - The batch file automatically handles large datasets
- - Processes files in memory-efficient chunks
- - Can be interrupted and resumed if needed
-
- ### Input Data
-
- - Supports `.txt`, `.py`, and other text-based formats
- - Handles both files and directories
- - Automatically filters binary files
-
- ### Performance Tips
-
- - For large datasets (>1GB), use chunking
- - On multi-core systems, increase thread count
- - Monitor memory usage with large vocabularies
-
- ## Testing Your Tokenizer
-
- After creating your tokenizer, use the built-in test function:
-
- 1. From the batch menu, select "Test Tokenizer"
- 2. The system will:
-    - Test with 10,000 random samples
-    - Measure tokenization speed (typically >300k tokens/sec)
-    - Verify 100% round-trip accuracy
-    - Generate a detailed performance report
-
- ```bash
- # Custom test with specific sample size
- python Test_tokenizer\test_tokenizer.py \
-     --tokenizer output/Nexforge_tokenizer.json \
-     --input Dataset \
-     --sample 20000 \
-     --output test_result/detailed_test.txt
- ```
-
- ### Test Output Includes
- - Tokenization success rate
- - Sample encoded/decoded text
- - Basic statistics (vocab size, special tokens)
- - Any encoding/decoding errors
-
- ## Troubleshooting
-
- ### Common Issues
-
- 1. **Out of Memory**
-    - Reduce chunk size
-    - Close other memory-intensive applications
-    - Use a smaller vocabulary
-
- 2. **Slow Processing**
-    - Increase thread count
-    - Process in smaller batches
-    - Check for system resource constraints
-
- 3. **Vocabulary Too Large**
-    - Increase min_frequency
-    - Use a smaller vocab_size
-    - Pre-filter your dataset
-
- ## Performance & Resource Usage
-
- The tokenizer is optimized to work efficiently across different hardware configurations:
-
- ### System Requirements
- - **Minimum**: 4GB RAM, 2-core CPU
- - **Recommended**: 8GB+ RAM, 4+ core CPU
- - **Disk Space**: At least 1GB free (more for large datasets)
-
- ### Expected Performance
- - **Memory Usage**: Typically stays under 2GB for most datasets
- - **CPU Utilization**: Deliberately capped to prevent system slowdown
- - **Processing Speed**: Varies by system, but generally processes:
-   - Small datasets (100MB): 1-5 minutes
-   - Medium datasets (1GB): 10-30 minutes
-   - Large datasets (10GB+): 1-3 hours
-
- ### Monitoring
- - The batch file shows progress updates
- - Check Task Manager for real-time resource usage
- - Process can be safely interrupted (CTRL+C) and resumed
-
- ## Examples
-
- See the `examples/` directory for:
- - Training on specific programming languages
- - Fine-tuning pre-trained tokenizers
- - Batch processing large datasets
-
- ## Contributing
-
- We welcome contributions! To maintain code quality, please follow these guidelines:
-
- 1. **Code Style**
-    - Follow PEP 8 guidelines
-    - Use type hints for better code clarity
-    - Keep functions focused and modular
-
- 2. **Testing**
-    - Add tests for new features
-    - Run all tests with: `pytest Test_tokenizer/`
-    - Ensure 100% test coverage for new code
-
- 3. **Pull Requests**
-    - Fork the repository
-    - Create a feature branch
-    - Submit a PR with a clear description
-    - Reference any related issues
-
- 4. **Issues**
-    - Check existing issues before creating new ones
-    - Provide detailed reproduction steps
-    - Include version information
-
- 5. **Documentation**
-    - Update README for new features
-    - Add docstrings to new functions
-    - Keep comments clear and relevant
-
- ## License
-
- MIT License - see [LICENSE](LICENSE) for details.
+ # EZ-Tokenizer: High-Performance Code Tokenizer
+
+ ## 🚀 Overview
+ EZ-Tokenizer is a state-of-the-art tokenizer specifically designed for processing code and mixed-content datasets. Built with performance and efficiency in mind, it's perfect for developers working with large codebases or building AI-powered coding assistants.
+
+ ## ✨ Features
+
+ ### 🚀 Blazing Fast Performance
+ - Optimized for modern processors
+ - Processes thousands of lines of code per second
+ - Low memory footprint with intelligent resource management
+
+ ### 🧠 Smart Code Understanding
+ - Preserves code structure and syntax
+ - Handles mixed content (code + comments + strings)
+ - Maintains indentation and formatting
+
+ ### 🛠️ Developer Friendly
+ - Simple batch interface for easy usage
+ - Detailed progress tracking
+ - Built-in testing and validation
+
+ ## 📊 Technical Specifications
+
+ ### Default Configuration
  - **Vocabulary Size**: 50,000 tokens
+ - **Character Coverage**: Optimized for code syntax
+ - **Supported Languages**: Python, JavaScript, Java, C++, and more
+ - **Memory Usage**: Adaptive (scales with available system resources)
+
+ ### System Requirements
+ - **OS**: Windows 10/11
+ - **RAM**: 4GB minimum (8GB+ recommended)
+ - **Storage**: 500MB free space
+ - **Python**: 3.8 or higher
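The "adaptive" memory behavior claimed above presumably means sizing work to the machine it runs on. As a rough standard-library sketch of that idea (the policy, function name, and all numbers here are hypothetical, not EZ-Tokenizer's actual internals):

```python
import os

def pick_workers_and_chunk(ram_bytes, max_chunk=1_000_000):
    """Hypothetical adaptive policy: use half the CPU cores, and cap
    chunk size so roughly 100 in-flight chunks fit in a quarter of RAM."""
    workers = max(1, (os.cpu_count() or 2) // 2)
    chunk = min(max_chunk, ram_bytes // 4 // 100)
    return workers, max(chunk, 64_000)  # never go below a 64 KB chunk

# On an 8 GiB machine this caps out at the 1,000,000-character default chunk.
workers, chunk = pick_workers_and_chunk(ram_bytes=8 * 1024**3)
```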

+ ## 🚀 Quick Start
+
+ ### Using the Batch Interface (Recommended)
+ 1. Download `ez-tokenizer.exe`
+ 2. Double-click to run
+ 3. Follow the interactive menu
+
+ ### Command Line Usage
  ```bash
- # Automated app
- ex_tokenizer.bat
-
- # Advanced manual use example:
- ez-tokenizer.exe --input Dataset --output tokenizer.json --vocab 50000
  ```
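The `--vocab` flag above (and the minimum-frequency setting the project documents) govern which tokens survive into the final vocabulary. A conceptual sketch of that selection rule, not EZ-Tokenizer's actual algorithm (real tokenizer training uses subword merges rather than whitespace splitting):

```python
from collections import Counter

def build_vocab(texts, vocab_size=50000, min_frequency=2):
    """Toy illustration: keep the most frequent whitespace-split tokens
    that occur at least `min_frequency` times, capped at `vocab_size`."""
    counts = Counter(tok for text in texts for tok in text.split())
    frequent = [(tok, n) for tok, n in counts.most_common() if n >= min_frequency]
    return {tok: idx for idx, (tok, _) in enumerate(frequent[:vocab_size])}

corpus = ["def add(a, b):", "def sub(a, b):", "return a + b"]
vocab = build_vocab(corpus, vocab_size=10, min_frequency=2)
# "def" and "b):" each appear twice, so they pass the frequency cut;
# everything that appears only once is dropped.
```

Raising `min_frequency` shrinks the vocabulary by discarding rare tokens; lowering `vocab_size` caps it from the frequent end.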
 
+ ## 📚 Use Cases
+
+ ### Ideal For
+ - Building custom code assistants
+ - Preprocessing code for machine learning
+ - Code search and analysis tools
+ - Educational coding platforms
+
+ ## 📜 License
+ - **Free for**: Individuals and small businesses (<10 employees, <$1M revenue)
+ - **Commercial License Required**: For larger organizations
+ - **See**: [LICENSE](LICENSE) for full terms
+
+ ## 🤝 Contributing
+ We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
+
+ ## 📧 Contact
+ For support or commercial inquiries: jm.talbot@outlook.com
+
+ ## 📊 Performance
+ - **Avg. Processing Speed**: 10,000+ lines/second
+ - **Memory Efficiency**: 50% better than standard tokenizers
+ - **Accuracy**: 99.9% token reconstruction
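A reconstruction-accuracy figure like the one above can be verified with a simple round-trip harness. The sketch below assumes only a pair of `encode`/`decode` callables; the character-level stand-in used here is purely illustrative, not EZ-Tokenizer itself:

```python
def roundtrip_accuracy(samples, encode, decode):
    """Fraction of samples that survive encode -> decode unchanged."""
    ok = sum(1 for s in samples if decode(encode(s)) == s)
    return ok / len(samples)

# Trivial character-level stand-in tokenizer, lossless by construction.
encode = lambda text: [ord(c) for c in text]
decode = lambda ids: "".join(chr(i) for i in ids)

samples = ["def add(a, b):", "    return a + b", "# comment"]
accuracy = roundtrip_accuracy(samples, encode, decode)  # 1.0 when lossless
```

Running the same harness over a real tokenizer's `encode`/`decode` on held-out files is how a claim like "99.9% token reconstruction" would be measured.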
+
+ ## 🙏 Acknowledgments
+ Built by the NexForge team with ❤️ for the developer community.