# NexForge Tokenizer Testing

This directory contains tools for testing the NexForge tokenizer on your code or text files.
## Quick Start

- Create a tokenizer using the main menu (`run_nexforge.bat`)
- Run tests from the main menu
  - Tests 10,000 random samples by default
  - Results saved to `test_result/test_run.txt`
## Advanced Testing

### Prerequisites

- Python 3.8+
- NexForge tokenizer package installed
### Test Scripts

- `test_tokenizer.py` - Comprehensive testing with detailed metrics
- `test_tokenizer_simple.py` - Quick testing on a single file
### Installation

Dependencies are automatically installed when you run the main installer. For manual setup:

```bash
pip install tokenizers python-Levenshtein
```
## Project Structure

```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```
## test_tokenizer.py

Comprehensive testing with detailed metrics and batch processing.
### Basic Usage

```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py

# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```
### What's Tested

- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
### Command Line Options

```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000

# Process only specific file types
python test_tokenizer.py --file-types py,js,json

# Process all files but limit to the first 20
python test_tokenizer.py --max-files 20

# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js

# Process the full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
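To make the filtering options concrete, here is a minimal Python sketch of how `--file-types` and `--max-files` could be combined when gathering candidate files. The function name, defaults, and `max_files == 0` convention are taken from the examples above; this is illustrative, not the actual logic in `test_tokenizer.py`.

```python
from pathlib import Path

def collect_files(root, file_types=("py", "js", "json"), max_files=20):
    """Gather test candidates the way --file-types / --max-files filter them
    (hypothetical helper; not the real implementation)."""
    exts = {f".{t}" for t in file_types}
    files = sorted(p for p in Path(root).rglob("*") if p.suffix in exts)
    # max_files == 0 means "no limit", matching the CLI examples above
    return files if max_files == 0 else files[:max_files]
```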
## test_tokenizer_simple.py

Quick verification of tokenizer functionality.
### Usage

```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py

# Test with a custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```
### When to Use

- Quick validation of a tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required
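Judging from its per-line summary output, the simple test walks a file line by line and counts round-trip matches. A minimal sketch of that bookkeeping (the function and its `encode`/`decode` stand-ins are assumptions, not the script's real API):

```python
def line_match_stats(lines, encode, decode):
    """Count lines that survive encode -> decode unchanged.
    `encode` and `decode` stand in for the tokenizer calls."""
    perfect = sum(1 for line in lines if decode(encode(line)) == line)
    return perfect, len(lines)
```

With a real tokenizer you would pass something like `lambda s: tok.encode(s).ids` and `tok.decode` for the two callables.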
## Understanding Test Results

### Sample Output

```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json

Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB

=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%

See test_result/test_run.txt for full report
```
### Interpreting Results

- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
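The three buckets above can be reproduced with a whitespace-normalising comparison. The sketch below shows one plausible classification rule (treating whitespace-only differences as "minor"); it is an assumption, not the script's actual logic:

```python
import re

def classify_roundtrip(original: str, decoded: str) -> str:
    """Bucket a round-trip result as perfect / minor / major.
    Hypothetical rule: 'minor' means the texts match once all runs of
    whitespace are collapsed, i.e. only formatting differs."""
    if decoded == original:
        return "perfect"
    if re.sub(r"\s+", " ", decoded).strip() == re.sub(r"\s+", " ", original).strip():
        return "minor"
    return "major"
```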
## Need Help?

If you encounter any issues:

- Check the test results in `test_result/`
- Ensure your tokenizer was created successfully
- Verify file encodings (UTF-8 recommended)
- Check for corrupted or extremely large files

For additional support, please open an issue on our GitHub repository.

### test_tokenizer.py Output

```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
### test_tokenizer_simple.py Output

```
=== TOKENIZER TEST SUMMARY ================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt

Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```
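The ratio metrics in the summary are simple arithmetic over the raw counts. A sketch of how `Chars per token` and `Character accuracy` could be derived (the function name is an assumption for illustration):

```python
def char_metrics(total_chars: int, total_tokens: int, diff_chars: int) -> dict:
    """Derive the two ratio metrics shown in the summary output."""
    return {
        # average number of source characters represented by one token
        "chars_per_token": total_chars / total_tokens,
        # share of characters reproduced exactly after decoding
        "char_accuracy_pct": 100.0 * (1 - diff_chars / total_chars),
    }
```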
## Troubleshooting
- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, looks for tokenizer.json in `../output/` (one level up from Test_tokenizer)
## License
This tool is part of the NexForge project. See the main project for licensing information.