# NexForge Tokenizer Testing
This directory contains tools for testing the NexForge tokenizer on your code or text files.
## Quick Start
1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
- Tests 10,000 random samples by default
- Results saved to `test_result/test_run.txt`
## Advanced Testing
### Prerequisites
- Python 3.8+
- NexForge tokenizer package installed
### Test Scripts
1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file
## Installation
Dependencies are automatically installed when you run the main installer. For manual setup:
```bash
pip install tokenizers python-Levenshtein
```
## Project Structure
```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py         # Main test script (batch processing)
│   └── test_tokenizer_simple.py  # Quick test script (single file)
├── output/                       # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                      # Your training/test files
└── test_result/                  # Test outputs and reports
```
## test_tokenizer.py
Comprehensive testing with detailed metrics and batch processing.
### Basic Usage
```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py
# Or specify custom paths
python test_tokenizer.py \
--tokenizer ../output/Nexforge_tokenizer.json \
--input ../Dataset \
--output ../test_result/detailed_test.txt
```
### What's Tested
- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
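The core of the accuracy check is a round trip: encode the text, decode it back, and compare against the original. Below is a minimal sketch of that idea; `encode_fn`/`decode_fn` stand in for the real tokenizer calls (e.g. `Tokenizer.from_file(...).encode`/`.decode` from the `tokenizers` package), and the toy whitespace tokenizer exists only to make the example runnable.

```python
def round_trip_accuracy(text, encode_fn, decode_fn):
    """Return (is_perfect_match, character_accuracy) for one text sample."""
    ids = encode_fn(text)
    decoded = decode_fn(ids)
    if decoded == text:
        return True, 1.0
    # Character accuracy: fraction of aligned positions that survive the trip.
    matches = sum(a == b for a, b in zip(decoded, text))
    longest = max(len(decoded), len(text), 1)
    return False, matches / longest

# Toy whitespace "tokenizer" used only to demonstrate the check.
vocab, inv_vocab = {}, {}

def toy_encode(text):
    ids = []
    for word in text.split(" "):
        if word not in vocab:
            vocab[word] = len(vocab)
            inv_vocab[vocab[word]] = word
        ids.append(vocab[word])
    return ids

def toy_decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

print(round_trip_accuracy("def main(): pass", toy_encode, toy_decode))
# → (True, 1.0): a lossless tokenizer round-trips perfectly
```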
### Command Line Options
```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
--tokenizer path/to/your/tokenizer.json \
--input path/to/your/code/directory \
--output custom_results/custom_test.txt \
--file-types py,js,json \
--max-files 20 \
--sample 50000
# Process only specific file types
python test_tokenizer.py --file-types py,js,json
# Process all files but limit to first 20
python test_tokenizer.py --max-files 20
# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js
# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
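The flags above map onto a straightforward `argparse` interface. The sketch below is an approximation of what `test_tokenizer.py` accepts (defaults taken from the examples in this README); the script's actual parser may differ in help text and defaults.

```python
import argparse

def build_parser():
    # Illustrative reconstruction of the CLI described above.
    p = argparse.ArgumentParser(description="NexForge tokenizer batch test")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json",
                   help="Path to the tokenizer JSON file")
    p.add_argument("--input", default="../Dataset",
                   help="Directory of files to tokenize")
    p.add_argument("--output", default="../test_result/test_run.txt",
                   help="Where to write the test report")
    p.add_argument("--file-types", default="py,js,json",
                   help="Comma-separated extensions to include")
    p.add_argument("--max-files", type=int, default=10,
                   help="File limit; 0 means no limit")
    p.add_argument("--sample", type=int, default=100000,
                   help="Characters sampled per file; 0 means full content")
    return p

args = build_parser().parse_args(["--max-files", "20", "--file-types", "py,js"])
print(args.max_files, args.file_types)  # → 20 py,js
```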
## test_tokenizer_simple.py
Quick verification of tokenizer functionality.
### Usage
```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py
# Test with custom tokenizer
python test_tokenizer_simple.py \
--tokenizer ../output/Nexforge_tokenizer.json \
--input sample.py
```
### When to Use
- Quick validation of tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required
## Understanding Test Results
### Sample Output
```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB
=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%
See test_result/test_run.txt for full report
```
### Interpreting Results
- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
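One plausible way to bucket round-trip results into the three categories above is shown below, using the standard-library `difflib` for a similarity score. The real scripts may use `python-Levenshtein` and different thresholds; the 0.95 cutoff here is an illustrative assumption.

```python
import difflib

def classify_round_trip(original, decoded, major_threshold=0.95):
    """Classify a decode result as 'perfect', 'minor', or 'major'."""
    if decoded == original:
        return "perfect"
    # Whitespace-only differences count as minor.
    if "".join(original.split()) == "".join(decoded.split()):
        return "minor"
    # Similarity ratio in [0, 1]; below the threshold is a major issue.
    ratio = difflib.SequenceMatcher(None, original, decoded).ratio()
    return "minor" if ratio >= major_threshold else "major"

print(classify_round_trip("a = 1\n", "a = 1\n"))     # → perfect
print(classify_round_trip("a = 1\n", "a = 1"))       # → minor (whitespace)
print(classify_round_trip("a = 1\n", "b = 2; c()"))  # → major
```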
## Need Help?
If you encounter any issues:
1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files
For additional support, please open an issue on our GitHub repository.
### test_tokenizer.py Output
```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file
=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
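The two summary metrics are simple ratios over the aggregated counts. A sketch, with illustrative names and numbers (the real script aggregates per-file results):

```python
def summary_metrics(total_chars, total_tokens, elapsed_seconds):
    # Guard against division by zero on empty or instantaneous runs.
    return {
        "avg_chars_per_token": total_chars / max(total_tokens, 1),
        "avg_tokens_per_sec": total_tokens / max(elapsed_seconds, 1e-9),
    }

m = summary_metrics(total_chars=1_000_000, total_tokens=288_000,
                    elapsed_seconds=23.0)
print(round(m["avg_chars_per_token"], 2))  # → 3.47
print(round(m["avg_tokens_per_sec"], 2))
```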
### test_tokenizer_simple.py Output
```
=== TOKENIZER TEST SUMMARY ================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt
--------------------------------------------------------------------------------
Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (higher is better)
```
## Troubleshooting
- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for `Nexforge_tokenizer.json` in `../output/` (one level up from `Test_tokenizer/`)
## License
This tool is part of the Nexforge project. See the main project for licensing information.