# NexForge Tokenizer Testing
This directory contains tools for testing the NexForge tokenizer on your code or text files.
## Quick Start
1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
   - Tests 10,000 random samples by default
   - Results saved to `test_result/test_run.txt`
## Advanced Testing
### Prerequisites
- Python 3.8+
- NexForge tokenizer package installed
### Test Scripts
1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file
## Installation
Dependencies are automatically installed when you run the main installer. For manual setup:
```bash
pip install tokenizers python-Levenshtein
```
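Before running tests, you can verify that both packages are importable. The snippet below is a small stdlib-only checker, not part of the NexForge scripts; note that `python-Levenshtein` installs under the import name `Levenshtein`.

```python
import importlib.util

# python-Levenshtein installs under the import name "Levenshtein"
REQUIRED = ["tokenizers", "Levenshtein"]

def missing_deps(names):
    """Return the subset of package names that are not installed."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_deps(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All test dependencies are installed")
```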
## Project Structure
```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py            # Main test script (batch processing)
│   └── test_tokenizer_simple.py     # Quick test script (single file)
├── output/                          # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                         # Your training/test files
└── test_result/                     # Test outputs and reports
```
## test_tokenizer.py
Comprehensive testing with detailed metrics and batch processing.
### Basic Usage
```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py
# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```
### What's Tested
- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
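The core of the accuracy check is a round trip: encode the text, decode the token IDs, and compare against the original. The sketch below illustrates the idea generically; `encode` and `decode` stand in for whatever tokenizer you load (with the HF `tokenizers` library that would be `Tokenizer.from_file(...)`, `encode(text).ids`, and `decode(ids)`). This is an illustration, not the project's actual test code.

```python
def round_trip_ok(text, encode, decode):
    """Return True if encoding then decoding reproduces the text exactly.

    encode: str -> list[int] (token IDs)
    decode: list[int] -> str
    """
    return decode(encode(text)) == text

# Illustrative stand-ins: a trivial byte-level "tokenizer" that is
# lossless by construction, so any input should round-trip.
encode = lambda s: list(s.encode("utf-8"))
decode = lambda ids: bytes(ids).decode("utf-8")

print(round_trip_ok("def main():\n    return 0\n", encode, decode))  # True
```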
### Command Line Options
```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000
# Process only specific file types
python test_tokenizer.py --file-types py,js,json
# Process all files but limit to first 20
python test_tokenizer.py --max-files 20
# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js
# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
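For reference, the flags above could be parsed with `argparse` roughly as follows. This is a plausible sketch (the real script's internals may differ); the defaults shown match the values reported in the sample output later in this document.

```python
import argparse

def build_parser():
    """Argument parser mirroring the flags shown above (illustrative sketch)."""
    p = argparse.ArgumentParser(description="NexForge tokenizer test")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json")
    p.add_argument("--input", default="../Dataset")
    p.add_argument("--output", default="../test_result/detailed_test.txt")
    p.add_argument("--file-types", default="py,js,json",
                   help="comma-separated extensions to include")
    p.add_argument("--max-files", type=int, default=10,
                   help="0 means no limit")
    p.add_argument("--sample", type=int, default=100000,
                   help="chars sampled per file; 0 means full content")
    return p

# argparse maps --file-types to args.file_types, etc.
args = build_parser().parse_args(["--file-types", "py,js", "--max-files", "0"])
print(args.file_types.split(","), args.max_files)  # ['py', 'js'] 0
```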
## test_tokenizer_simple.py
Quick verification of tokenizer functionality.
### Usage
```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py
# Test with custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```
### When to Use
- Quick validation of tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required
## Understanding Test Results
### Sample Output
```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB
=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%
See test_result/test_run.txt for full report
```
### Interpreting Results
- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
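The three buckets above can be approximated with a rule like the one below; the exact criteria used by the scripts are not documented here, so treat this as an illustrative classification, not the project's implementation.

```python
import re

def classify_round_trip(original, decoded):
    """Bucket a round-trip result (illustrative criteria, not the script's exact rules)."""
    if decoded == original:
        return "perfect"
    # Collapse runs of whitespace; if only spacing differs, call it minor.
    squash = lambda s: re.sub(r"\s+", " ", s).strip()
    if squash(decoded) == squash(original):
        return "minor"
    return "major"

print(classify_round_trip("a = 1\n", "a = 1\n"))  # perfect
print(classify_round_trip("a = 1\n", "a  =  1"))  # minor
print(classify_round_trip("a = 1\n", "a = 2\n"))  # major
```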
## Need Help?
If you encounter any issues:
1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files
For additional support, please open an issue on our GitHub repository.
### test_tokenizer.py Output
```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
### test_tokenizer_simple.py Output
```
=== TOKENIZER TEST SUMMARY ================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt
--------------------------------------------------------------------------------
Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```
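The headline metrics in these reports are simple ratios over the raw counts: chars per token is total characters divided by total tokens, and character accuracy is the fraction of characters that survived the round trip. A sketch with hypothetical counts (not taken from the sample output above):

```python
def summary_metrics(total_chars, total_tokens, char_diff):
    """Derive headline report metrics from raw counts."""
    return {
        "chars_per_token": total_chars / total_tokens,
        "char_accuracy_pct": 100.0 * (total_chars - char_diff) / total_chars,
    }

# Hypothetical example counts, chosen for round numbers
m = summary_metrics(total_chars=120_000, total_tokens=15_000, char_diff=180)
print(f"{m['chars_per_token']:.2f} chars/token, {m['char_accuracy_pct']:.2f}% accuracy")
# 8.00 chars/token, 99.85% accuracy
```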
## Troubleshooting
- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for the tokenizer JSON in `../output/` (one level up from `Test_tokenizer/`); pass `--tokenizer` to point elsewhere
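The default-path convention above can be expressed with `pathlib`; this is a sketch of the resolution logic, assuming the tokenizer file is named `Nexforge_tokenizer.json` as shown in the project structure (the docs also refer to it as `tokenizer.json` in places).

```python
from pathlib import Path

def default_tokenizer_path(script_dir):
    """Resolve ../output/Nexforge_tokenizer.json relative to the test script's directory."""
    return Path(script_dir).resolve().parent / "output" / "Nexforge_tokenizer.json"

p = default_tokenizer_path("NexForge/Test_tokenizer")
print(p.parent.name, p.name)  # output Nexforge_tokenizer.json
```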
## License
This tool is part of the Nexforge project. See the main project for licensing information.