# NexForge Tokenizer Testing
This directory contains tools for testing the NexForge tokenizer on your code or text files.
## Quick Start
1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
- Tests 10,000 random samples by default
- Results saved to `test_result/test_run.txt`
## Advanced Testing
### Prerequisites
- Python 3.8+
- NexForge tokenizer package installed
### Test Scripts
1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file
## Installation
Dependencies are automatically installed when you run the main installer. For manual setup:
```bash
pip install tokenizers python-Levenshtein
```
## Project Structure
```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```
## test_tokenizer.py
Comprehensive testing with detailed metrics and batch processing.
### Basic Usage
```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py
# Or specify custom paths
python test_tokenizer.py \
--tokenizer ../output/Nexforge_tokenizer.json \
--input ../Dataset \
--output ../test_result/detailed_test.txt
```
### What's Tested
- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
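The core of the accuracy check is a round trip: encode the text, decode it back, and compare against the original. Below is a minimal sketch of that idea; `encode_fn`/`decode_fn` stand in for the real tokenizer calls (e.g. `Tokenizer.from_file(...).encode`/`.decode` from the `tokenizers` package), and the toy whitespace tokenizer exists only to make the example runnable.

```python
def round_trip_accuracy(text, encode_fn, decode_fn):
    """Return (is_perfect_match, character_accuracy) for one text sample."""
    ids = encode_fn(text)
    decoded = decode_fn(ids)
    if decoded == text:
        return True, 1.0
    # Character accuracy: fraction of aligned positions that survive the trip.
    matches = sum(a == b for a, b in zip(decoded, text))
    longest = max(len(decoded), len(text), 1)
    return False, matches / longest

# Toy whitespace "tokenizer" used only to demonstrate the check.
vocab, inv_vocab = {}, {}

def toy_encode(text):
    ids = []
    for word in text.split(" "):
        if word not in vocab:
            vocab[word] = len(vocab)
            inv_vocab[vocab[word]] = word
        ids.append(vocab[word])
    return ids

def toy_decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

print(round_trip_accuracy("def main(): pass", toy_encode, toy_decode))
# → (True, 1.0): a lossless tokenizer round-trips perfectly
```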
### Command Line Options
```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
--tokenizer path/to/your/tokenizer.json \
--input path/to/your/code/directory \
--output custom_results/custom_test.txt \
--file-types py,js,json \
--max-files 20 \
--sample 50000
# Process only specific file types
python test_tokenizer.py --file-types py,js,json
# Process all files but limit to first 20
python test_tokenizer.py --max-files 20
# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js
# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
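The flags above map onto a straightforward `argparse` interface. The sketch below is an approximation of what `test_tokenizer.py` accepts (defaults taken from the examples in this README); the script's actual parser may differ in help text and defaults.

```python
import argparse

def build_parser():
    # Illustrative reconstruction of the CLI described above.
    p = argparse.ArgumentParser(description="NexForge tokenizer batch test")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json",
                   help="Path to the tokenizer JSON file")
    p.add_argument("--input", default="../Dataset",
                   help="Directory of files to tokenize")
    p.add_argument("--output", default="../test_result/test_run.txt",
                   help="Where to write the test report")
    p.add_argument("--file-types", default="py,js,json",
                   help="Comma-separated extensions to include")
    p.add_argument("--max-files", type=int, default=10,
                   help="File limit; 0 means no limit")
    p.add_argument("--sample", type=int, default=100000,
                   help="Characters sampled per file; 0 means full content")
    return p

args = build_parser().parse_args(["--max-files", "20", "--file-types", "py,js"])
print(args.max_files, args.file_types)  # → 20 py,js
```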
## test_tokenizer_simple.py
Quick verification of tokenizer functionality.
### Usage
```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py
# Test with custom tokenizer
python test_tokenizer_simple.py \
--tokenizer ../output/Nexforge_tokenizer.json \
--input sample.py
```
### When to Use
- Quick validation of tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required
## Understanding Test Results
### Sample Output
```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB
=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%
See test_result/test_run.txt for full report
```
### Interpreting Results
- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
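One plausible way to bucket round-trip results into the three categories above is shown below, using the standard-library `difflib` for a similarity score. The real scripts may use `python-Levenshtein` and different thresholds; the 0.95 cutoff here is an illustrative assumption.

```python
import difflib

def classify_round_trip(original, decoded, major_threshold=0.95):
    """Classify a decode result as 'perfect', 'minor', or 'major'."""
    if decoded == original:
        return "perfect"
    # Whitespace-only differences count as minor.
    if "".join(original.split()) == "".join(decoded.split()):
        return "minor"
    # Similarity ratio in [0, 1]; below the threshold is a major issue.
    ratio = difflib.SequenceMatcher(None, original, decoded).ratio()
    return "minor" if ratio >= major_threshold else "major"

print(classify_round_trip("a = 1\n", "a = 1\n"))     # → perfect
print(classify_round_trip("a = 1\n", "a = 1"))       # → minor (whitespace)
print(classify_round_trip("a = 1\n", "b = 2; c()"))  # → major
```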
## Need Help?
If you encounter any issues:
1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files
For additional support, please open an issue on our GitHub repository.
### test_tokenizer.py Output
```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file
=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
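The two summary metrics are simple ratios over the aggregated counts. A sketch, with illustrative names and numbers (the real script aggregates per-file results):

```python
def summary_metrics(total_chars, total_tokens, elapsed_seconds):
    # Guard against division by zero on empty or instantaneous runs.
    return {
        "avg_chars_per_token": total_chars / max(total_tokens, 1),
        "avg_tokens_per_sec": total_tokens / max(elapsed_seconds, 1e-9),
    }

m = summary_metrics(total_chars=1_000_000, total_tokens=288_000,
                    elapsed_seconds=23.0)
print(round(m["avg_chars_per_token"], 2))  # → 3.47
print(round(m["avg_tokens_per_sec"], 2))
```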
### test_tokenizer_simple.py Output
```
=== TOKENIZER TEST SUMMARY ================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt
--------------------------------------------------------------------------------
Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (higher is better)
```
## Troubleshooting
- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for `Nexforge_tokenizer.json` in `../output/` (one level up from `Test_tokenizer/`)
## License
This tool is part of the Nexforge project. See the main project for licensing information.