
# NexForge Tokenizer Testing

This directory contains tools for testing the NexForge tokenizer on your code or text files.

## Quick Start

1. Create a tokenizer using the main menu (`run_nexforge.bat`)
2. Run tests from the main menu
   - Tests 10,000 random samples by default
   - Results saved to `test_result/test_run.txt`

## Advanced Testing

### Prerequisites

- Python 3.8+
- NexForge tokenizer package installed

### Test Scripts

1. `test_tokenizer.py` - comprehensive testing with detailed metrics
2. `test_tokenizer_simple.py` - quick testing on a single file

## Installation

Dependencies are installed automatically when you run the main installer. For manual setup:

```bash
pip install tokenizers python-Levenshtein
```

## Project Structure

```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```

## test_tokenizer.py

Comprehensive testing with detailed metrics and batch processing.

### Basic Usage

```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py

# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```

### What's Tested

- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
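The first item, round-trip accuracy, comes down to an encode → decode → compare loop. The toy whitespace tokenizer below is purely illustrative, a stand-in for the real NexForge tokenizer; only the round-trip pattern carries over to the actual scripts.

```python
def encode(text, vocab):
    """Toy whitespace tokenizer: assign each new word the next free id."""
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids

def decode(ids, vocab):
    """Map ids back to words and rejoin with single spaces."""
    rev = {i: w for w, i in vocab.items()}
    return " ".join(rev[i] for i in ids)

def round_trips(text, vocab):
    """True when encode -> decode reproduces the input exactly."""
    return decode(encode(text, vocab), vocab) == text

vocab = {}
print(round_trips("def main(): pass", vocab))  # True: single spaces survive
print(round_trips("a  b", vocab))              # False: the double space collapses
```

A real test run counts the failing cases and breaks them down into the whitespace-only and substantive buckets reported in the test summary.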

### Command Line Options

```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000

# Process only specific file types
python test_tokenizer.py --file-types py,js,json

# Process all files but limit to first 20
python test_tokenizer.py --max-files 20

# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js

# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
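For orientation, the flags above correspond to an argparse interface along these lines. This is a sketch, not the script's actual source: the flag names come from the examples, while the defaults are guesses based on the sample report (file types py,js,json, max files 10, 100000 chars/file).

```python
import argparse

def build_parser():
    # Flag names taken from the examples above; defaults are illustrative guesses.
    p = argparse.ArgumentParser(description="NexForge tokenizer test (sketch)")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json",
                   help="Path to the tokenizer JSON file")
    p.add_argument("--input", default="../Dataset",
                   help="Directory of files to tokenize")
    p.add_argument("--output", default="../test_result/detailed_test.txt",
                   help="Where to write the report")
    p.add_argument("--file-types", default="py,js,json",
                   help="Comma-separated extensions; other files are skipped")
    p.add_argument("--max-files", type=int, default=10,
                   help="Stop after this many files; 0 means no limit")
    p.add_argument("--sample", type=int, default=100_000,
                   help="Characters sampled per file; 0 means full content")
    return p

args = build_parser().parse_args(["--file-types", "py,js", "--max-files", "0"])
print(args.file_types.split(","), args.max_files)  # ['py', 'js'] 0
```

Note that `--sample 0` and `--max-files 0` act as "no limit" sentinels rather than literal zeroes, which is why `type=int` alone is enough here.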

## test_tokenizer_simple.py

Quick verification of tokenizer functionality.

### Usage

```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py

# Test with a custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```

### When to Use

- Quick validation of the tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required

## Understanding Test Results

### Sample Output

```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567

Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB

=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%

See test_result/test_run.txt for full report
```

### Interpreting Results

- **Success Rate**: percentage of files processed without errors
- **Perfect Matches**: files that round-trip encode/decode perfectly
- **Minor Differences**: small whitespace or formatting differences
- **Major Issues**: significant differences requiring attention
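One plausible way to bucket each result into these three categories is sketched below. This is not the actual test_tokenizer.py logic, and stdlib `difflib` stands in for the project's python-Levenshtein dependency so the sketch runs without extra packages; the 0.99 threshold is an illustrative assumption.

```python
import difflib

def classify(original: str, decoded: str) -> str:
    """Bucket a round-trip result as 'perfect', 'minor', or 'major'."""
    if decoded == original:
        return "perfect"
    # Whitespace-only difference: collapse all runs of whitespace and recompare.
    if " ".join(original.split()) == " ".join(decoded.split()):
        return "minor"
    # Otherwise measure how much content actually changed.
    similarity = difflib.SequenceMatcher(None, original, decoded).ratio()
    return "minor" if similarity >= 0.99 else "major"

print(classify("a = 1", "a = 1"))   # perfect
print(classify("a = 1", "a =  1"))  # minor (whitespace only)
print(classify("a = 1", "b = 2"))   # major
```

Aggregating these labels over all files yields the three percentages shown in the detailed metrics.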

## Need Help?

If you encounter any issues:

1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files

For additional support, please open an issue on our GitHub repository.

### test_tokenizer.py Output

```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
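The two summary metrics are simple ratios over the whole run; a sketch of how they could be computed (the function name, rounding, and example numbers are illustrative, not taken from the script):

```python
def summarize(total_chars: int, total_tokens: int, elapsed_s: float) -> dict:
    """Compute the two summary metrics shown above.

    avg_chars_per_token: compression ratio (higher = fewer tokens per character).
    avg_tokens_per_sec: throughput of the encode loop."""
    return {
        "avg_chars_per_token": round(total_chars / total_tokens, 2),
        "avg_tokens_per_sec": round(total_tokens / elapsed_s, 2),
    }

# 347,000 characters tokenized into 100,000 tokens in 8 seconds:
print(summarize(347_000, 100_000, 8.0))
# {'avg_chars_per_token': 3.47, 'avg_tokens_per_sec': 12500.0}
```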


### test_tokenizer_simple.py Output

```
=== TOKENIZER TEST SUMMARY ===
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt

Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```


## Troubleshooting

- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, looks for tokenizer.json in `../output/` (one level up from Test_tokenizer)

## License

This tool is part of the NexForge project. See the main project for licensing information.