
# NexForge Tokenizer Testing

This directory contains tools for testing the NexForge tokenizer on your code or text files.

## Quick Start

1. Create a tokenizer using the main menu (`run_nexforge.bat`)
2. Run tests from the main menu
   - Tests 10,000 random samples by default
   - Results saved to `test_result/test_run.txt`

## Advanced Testing

### Prerequisites

- Python 3.8+
- NexForge tokenizer package installed

### Test Scripts

1. `test_tokenizer.py` - comprehensive testing with detailed metrics
2. `test_tokenizer_simple.py` - quick testing on a single file

## Installation

Dependencies are installed automatically when you run the main installer. For manual setup:

```bash
pip install tokenizers python-Levenshtein
```

## Project Structure

```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```

## test_tokenizer.py

Comprehensive testing with detailed metrics and batch processing.

### Basic Usage

```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py

# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```

### What's Tested

- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
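The first item, round-trip accuracy, comes down to an encode → decode → compare loop. The toy whitespace tokenizer below is purely illustrative, a stand-in for the real NexForge tokenizer; only the round-trip pattern carries over to the actual scripts.

```python
def encode(text, vocab):
    """Toy whitespace tokenizer: assign each new word the next free id."""
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids

def decode(ids, vocab):
    """Map ids back to words and rejoin with single spaces."""
    rev = {i: w for w, i in vocab.items()}
    return " ".join(rev[i] for i in ids)

def round_trips(text, vocab):
    """True when encode -> decode reproduces the input exactly."""
    return decode(encode(text, vocab), vocab) == text

vocab = {}
print(round_trips("def main(): pass", vocab))  # True: single spaces survive
print(round_trips("a  b", vocab))              # False: the double space collapses
```

A real test run counts the failing cases and breaks them down into the whitespace-only and substantive buckets reported in the test summary.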

### Command Line Options

```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000

# Process only specific file types
python test_tokenizer.py --file-types py,js,json

# Process all files but limit to first 20
python test_tokenizer.py --max-files 20

# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js

# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
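For orientation, the flags above correspond to an argparse interface along these lines. This is a sketch, not the script's actual source: the flag names come from the examples, while the defaults are guesses based on the sample report (file types py,js,json, max files 10, 100000 chars/file).

```python
import argparse

def build_parser():
    # Flag names taken from the examples above; defaults are illustrative guesses.
    p = argparse.ArgumentParser(description="NexForge tokenizer test (sketch)")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json",
                   help="Path to the tokenizer JSON file")
    p.add_argument("--input", default="../Dataset",
                   help="Directory of files to tokenize")
    p.add_argument("--output", default="../test_result/detailed_test.txt",
                   help="Where to write the report")
    p.add_argument("--file-types", default="py,js,json",
                   help="Comma-separated extensions; other files are skipped")
    p.add_argument("--max-files", type=int, default=10,
                   help="Stop after this many files; 0 means no limit")
    p.add_argument("--sample", type=int, default=100_000,
                   help="Characters sampled per file; 0 means full content")
    return p

args = build_parser().parse_args(["--file-types", "py,js", "--max-files", "0"])
print(args.file_types.split(","), args.max_files)  # ['py', 'js'] 0
```

Note that `--sample 0` and `--max-files 0` act as "no limit" sentinels rather than literal zeroes, which is why `type=int` alone is enough here.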

## test_tokenizer_simple.py

Quick verification of tokenizer functionality.

### Usage

```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py

# Test with a custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```

### When to Use

- Quick validation of the tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required

## Understanding Test Results

### Sample Output

```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567

Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB

=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%

See test_result/test_run.txt for full report
```

### Interpreting Results

- **Success Rate**: percentage of files processed without errors
- **Perfect Matches**: files that round-trip encode/decode perfectly
- **Minor Differences**: small whitespace or formatting differences
- **Major Issues**: significant differences requiring attention
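One plausible way to bucket each result into these three categories is sketched below. This is not the actual test_tokenizer.py logic, and stdlib `difflib` stands in for the project's python-Levenshtein dependency so the sketch runs without extra packages; the 0.99 threshold is an illustrative assumption.

```python
import difflib

def classify(original: str, decoded: str) -> str:
    """Bucket a round-trip result as 'perfect', 'minor', or 'major'."""
    if decoded == original:
        return "perfect"
    # Whitespace-only difference: collapse all runs of whitespace and recompare.
    if " ".join(original.split()) == " ".join(decoded.split()):
        return "minor"
    # Otherwise measure how much content actually changed.
    similarity = difflib.SequenceMatcher(None, original, decoded).ratio()
    return "minor" if similarity >= 0.99 else "major"

print(classify("a = 1", "a = 1"))   # perfect
print(classify("a = 1", "a =  1"))  # minor (whitespace only)
print(classify("a = 1", "b = 2"))   # major
```

Aggregating these labels over all files yields the three percentages shown in the detailed metrics.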

## Need Help?

If you encounter any issues:

1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files

For additional support, please open an issue on our GitHub repository.

### test_tokenizer.py Output

```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
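The two summary metrics are simple ratios over the whole run; a sketch of how they could be computed (the function name, rounding, and example numbers are illustrative, not taken from the script):

```python
def summarize(total_chars: int, total_tokens: int, elapsed_s: float) -> dict:
    """Compute the two summary metrics shown above.

    avg_chars_per_token: compression ratio (higher = fewer tokens per character).
    avg_tokens_per_sec: throughput of the encode loop."""
    return {
        "avg_chars_per_token": round(total_chars / total_tokens, 2),
        "avg_tokens_per_sec": round(total_tokens / elapsed_s, 2),
    }

# 347,000 characters tokenized into 100,000 tokens in 8 seconds:
print(summarize(347_000, 100_000, 8.0))
# {'avg_chars_per_token': 3.47, 'avg_tokens_per_sec': 12500.0}
```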


### test_tokenizer_simple.py Output

```
=== TOKENIZER TEST SUMMARY ===
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt

Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```


## Troubleshooting

- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, looks for tokenizer.json in `../output/` (one level up from Test_tokenizer)

## License

This tool is part of the NexForge project. See the main project for licensing information.