# NexForge Tokenizer Testing

This directory contains tools for testing the NexForge tokenizer on your code or text files.

## Quick Start

1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
   - Tests 10,000 random samples by default
   - Results are saved to `test_result/test_run.txt`

## Advanced Testing

### Prerequisites

- Python 3.8+
- NexForge tokenizer package installed

### Test Scripts

1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file
## Installation

Dependencies are installed automatically when you run the main installer. For manual setup:

```bash
pip install tokenizers python-Levenshtein
```
## Project Structure

```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```
## test_tokenizer.py

Comprehensive testing with detailed metrics and batch processing.

### Basic Usage

```bash
# Run with default settings (uses the tokenizer from the parent directory)
python test_tokenizer.py

# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```
### What's Tested

- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
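The round-trip accuracy check at the heart of these tests can be sketched as follows. This is a minimal illustration using a toy whitespace tokenizer; the real script loads the trained tokenizer from `../output/` instead:

```python
# Sketch of the round-trip accuracy check (illustrative helper, not the
# actual test_tokenizer.py implementation).
def round_trip_accuracy(encode, decode, text):
    """Encode then decode `text`; return the fraction of matching characters."""
    decoded = decode(encode(text))
    matches = sum(a == b for a, b in zip(text, decoded))
    return matches / max(len(text), 1)

# Toy whitespace "tokenizer" used only to demonstrate the check:
encode = lambda s: s.split(" ")
decode = lambda tokens: " ".join(tokens)

print(round_trip_accuracy(encode, decode, "def add(a, b): return a + b"))  # 1.0
```

A lossless tokenizer scores 1.0 here; anything lower means the decode step is dropping or altering characters.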
### Command Line Options

```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000

# Process only specific file types
python test_tokenizer.py --file-types py,js,json

# Process all files but limit to the first 20
python test_tokenizer.py --max-files 20

# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js

# Process the full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
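The flags above map naturally onto a standard `argparse` setup. The sketch below is an assumption about the script's internals, not its actual code; the defaults shown are taken from the documented behavior:

```python
import argparse

# Hypothetical reconstruction of test_tokenizer.py's argument parsing.
parser = argparse.ArgumentParser(description="Test a NexForge tokenizer on a directory of files.")
parser.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json")
parser.add_argument("--input", default="../Dataset")
parser.add_argument("--output", default="../test_result/test_run.txt")
parser.add_argument("--file-types", default="py,js,json",
                    help="Comma-separated extensions to include")
parser.add_argument("--max-files", type=int, default=10,
                    help="Maximum number of files to process; 0 means no limit")
parser.add_argument("--sample", type=int, default=100000,
                    help="Characters sampled per file; 0 means full content")

# Example: reproduce `--max-files 0 --file-types py,js` from above
args = parser.parse_args(["--max-files", "0", "--file-types", "py,js"])
extensions = args.file_types.split(",")
print(args.max_files, extensions)  # 0 ['py', 'js']
```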
## test_tokenizer_simple.py

Quick verification of tokenizer functionality.

### Usage

```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py

# Test with a custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```
### When to Use

- Quick validation of a tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required
## Understanding Test Results

### Sample Output

```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB

=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%

See test_result/test_run.txt for full report
```
### Interpreting Results

- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
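One way the three buckets above could be computed is with a similarity ratio over the round-trip result. This sketch uses `difflib` from the standard library; both the approach and the threshold are illustrative, not the script's actual values:

```python
import difflib

def classify_round_trip(original, decoded, minor_threshold=0.9):
    """Bucket a round-trip result as 'perfect', 'minor', or 'major'."""
    if original == decoded:
        return "perfect"
    ratio = difflib.SequenceMatcher(None, original, decoded).ratio()
    return "minor" if ratio >= minor_threshold else "major"

print(classify_round_trip("x = 1\n", "x = 1\n"))   # perfect
print(classify_round_trip("x = 1\n", "x = 1 \n"))  # minor: trailing space only
print(classify_round_trip("x = 1\n", "y == 2"))    # major
```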
## Need Help?

If you encounter any issues:

1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files

For additional support, please open an issue on our GitHub repository.
### test_tokenizer.py Output

```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
### test_tokenizer_simple.py Output

```
=== TOKENIZER TEST SUMMARY ====================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt
--------------------------------------------------------------------------------
Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```
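The per-line figures in a summary like this (perfect matches, chars per token) boil down to simple aggregation. The following is an illustrative sketch of the arithmetic, not the script's code:

```python
def summarize(original_lines, decoded_lines, token_counts):
    """Aggregate per-line round-trip results into summary metrics."""
    perfect = sum(o == d for o, d in zip(original_lines, decoded_lines))
    total_chars = sum(len(o) for o in original_lines)
    total_tokens = sum(token_counts)
    return {
        "perfect_pct": 100.0 * perfect / len(original_lines),
        "chars_per_token": total_chars / total_tokens,
    }

# Two 5-character lines, 3 tokens each; the second line round-trips imperfectly.
stats = summarize(["a b c", "d e f"], ["a b c", "d  e f"], [3, 3])
print(f"Perfect matches: {stats['perfect_pct']:.1f}%")    # Perfect matches: 50.0%
print(f"Chars per token: {stats['chars_per_token']:.2f}") # Chars per token: 1.67
```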
## Troubleshooting

- **Missing Dependencies**: Install the required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for `tokenizer.json` in `../output/` (one level up from `Test_tokenizer`)
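The default lookup described in the last bullet can be reproduced with `pathlib`. This is a sketch of the documented behavior; the scripts' actual resolution logic may differ:

```python
from pathlib import Path

def default_tokenizer_path(script_dir):
    """Return ../output/tokenizer.json relative to `script_dir`, or None if absent."""
    candidate = Path(script_dir).resolve().parent / "output" / "tokenizer.json"
    return candidate if candidate.exists() else None
```

Running this against your `Test_tokenizer` directory quickly confirms whether the default path actually points at a tokenizer file.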
## License

This tool is part of the NexForge project. See the main project for licensing information.