# NexForge Tokenizer Testing

This directory contains tools for testing the NexForge tokenizer on your code or text files.

## Quick Start

1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
   - Tests 10,000 random samples by default
   - Results saved to `test_result/test_run.txt`

## Advanced Testing

### Prerequisites

- Python 3.8+
- NexForge tokenizer package installed

### Test Scripts

1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file

## Installation

Dependencies are installed automatically by the main installer. For manual setup:

```bash
pip install tokenizers python-Levenshtein
```

## Project Structure

```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```

## test_tokenizer.py

Comprehensive testing with detailed metrics and batch processing.
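The accuracy figures in the test reports come from round-tripping each sample through encode/decode and comparing the result with the original text. The sketch below shows the idea in self-contained form; the helper name and the whitespace stand-in tokenizer are illustrative only (the real scripts load the trained `Nexforge_tokenizer.json` via the `tokenizers` library, and may compute diffs differently, e.g. with `python-Levenshtein`):

```python
from difflib import SequenceMatcher


def round_trip_metrics(text: str, encode, decode) -> dict:
    """Compute per-sample report metrics: perfect match,
    character accuracy, and chars per token."""
    tokens = encode(text)
    decoded = decode(tokens)
    # Fraction of characters that survive the encode/decode round trip
    accuracy = SequenceMatcher(None, text, decoded).ratio()
    return {
        "perfect_match": decoded == text,
        "char_accuracy": round(accuracy * 100, 2),
        "chars_per_token": round(len(text) / max(len(tokens), 1), 2),
    }


# Stand-in whitespace "tokenizer" so the sketch runs on its own;
# note it drops newlines and indentation, which shows up as a
# "minor difference" rather than a perfect match.
encode = str.split
decode = " ".join

print(round_trip_metrics("def add(a, b):\n    return a + b\n", encode, decode))
```

A perfect round trip yields `perfect_match: True` and 100% character accuracy; lossy tokenization lands in the "minor differences" bucket of the report.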
### Basic Usage

```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py

# Or specify custom paths
python test_tokenizer.py \
  --tokenizer ../output/Nexforge_tokenizer.json \
  --input ../Dataset \
  --output ../test_result/detailed_test.txt
```

### What's Tested

- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility

### Command Line Options

```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
  --tokenizer path/to/your/tokenizer.json \
  --input path/to/your/code/directory \
  --output custom_results/custom_test.txt \
  --file-types py,js,json \
  --max-files 20 \
  --sample 50000

# Process only specific file types
python test_tokenizer.py --file-types py,js,json

# Process all files but limit to the first 20
python test_tokenizer.py --max-files 20

# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js

# Process the full content of each file (no sampling)
python test_tokenizer.py --sample 0
```

## test_tokenizer_simple.py

Quick verification of tokenizer functionality.

### Usage

```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py

# Test with a custom tokenizer
python test_tokenizer_simple.py \
  --tokenizer ../output/Nexforge_tokenizer.json \
  --input sample.py
```

### When to Use

- Quick validation of a tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required

## Understanding Test Results

### Sample Output

```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB

=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%

See test_result/test_run.txt for full report
```

### Interpreting Results

- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention

### test_tokenizer.py Output

```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```

### test_tokenizer_simple.py Output

```
=== TOKENIZER TEST SUMMARY ================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt
--------------------------------------------------------------------------------
Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```

## Need Help?

If you encounter any issues:

1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files

For additional support, please open an issue on our GitHub repository.

## Troubleshooting

- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for `tokenizer.json` in `../output/` (one level up from `Test_tokenizer`)

## License

This tool is part of the Nexforge project.
See the main project for licensing information.