# NexForge Tokenizer Testing
This directory contains tools for testing the NexForge tokenizer on your code or text files.
## Quick Start
1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
   - Tests 10,000 random samples by default
   - Results saved to `test_result/test_run.txt`
## Advanced Testing
### Prerequisites
- Python 3.8+
- NexForge tokenizer package installed
### Test Scripts
1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file
## Installation
Dependencies are automatically installed when you run the main installer. For manual setup:
```bash
pip install tokenizers python-Levenshtein
```
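Before running tests, you can verify that both packages are importable. The snippet below is a small stdlib-only checker, not part of the NexForge scripts; note that `python-Levenshtein` installs under the import name `Levenshtein`.

```python
import importlib.util

# python-Levenshtein installs under the import name "Levenshtein"
REQUIRED = ["tokenizers", "Levenshtein"]

def missing_deps(names):
    """Return the subset of package names that are not installed."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_deps(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All test dependencies are installed")
```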
## Project Structure
```
NexForge/
├── Test_tokenizer/
│   ├── test_tokenizer.py            # Main test script (batch processing)
│   └── test_tokenizer_simple.py     # Quick test script (single file)
├── output/                          # Tokenizer output (Nexforge_tokenizer.json)
├── Dataset/                         # Your training/test files
└── test_result/                     # Test outputs and reports
```
## test_tokenizer.py
Comprehensive testing with detailed metrics and batch processing.
### Basic Usage
```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py
# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```
### What's Tested
- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
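The core of the accuracy check is a round trip: encode the text, decode the token IDs, and compare against the original. The sketch below illustrates the idea generically; `encode` and `decode` stand in for whatever tokenizer you load (with the HF `tokenizers` library that would be `Tokenizer.from_file(...)`, `encode(text).ids`, and `decode(ids)`). This is an illustration, not the project's actual test code.

```python
def round_trip_ok(text, encode, decode):
    """Return True if encoding then decoding reproduces the text exactly.

    encode: str -> list[int] (token IDs)
    decode: list[int] -> str
    """
    return decode(encode(text)) == text

# Illustrative stand-ins: a trivial byte-level "tokenizer" that is
# lossless by construction, so any input should round-trip.
encode = lambda s: list(s.encode("utf-8"))
decode = lambda ids: bytes(ids).decode("utf-8")

print(round_trip_ok("def main():\n    return 0\n", encode, decode))  # True
```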
### Command Line Options
```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000
# Process only specific file types
python test_tokenizer.py --file-types py,js,json
# Process all files but limit to first 20
python test_tokenizer.py --max-files 20
# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js
# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
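For reference, the flags above could be parsed with `argparse` roughly as follows. This is a plausible sketch (the real script's internals may differ); the defaults shown match the values reported in the sample output later in this document.

```python
import argparse

def build_parser():
    """Argument parser mirroring the flags shown above (illustrative sketch)."""
    p = argparse.ArgumentParser(description="NexForge tokenizer test")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json")
    p.add_argument("--input", default="../Dataset")
    p.add_argument("--output", default="../test_result/detailed_test.txt")
    p.add_argument("--file-types", default="py,js,json",
                   help="comma-separated extensions to include")
    p.add_argument("--max-files", type=int, default=10,
                   help="0 means no limit")
    p.add_argument("--sample", type=int, default=100000,
                   help="chars sampled per file; 0 means full content")
    return p

# argparse maps --file-types to args.file_types, etc.
args = build_parser().parse_args(["--file-types", "py,js", "--max-files", "0"])
print(args.file_types.split(","), args.max_files)  # ['py', 'js'] 0
```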
## test_tokenizer_simple.py
Quick verification of tokenizer functionality.
### Usage
```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py
# Test with custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```
### When to Use
- Quick validation of tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required
## Understanding Test Results
### Sample Output
```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567
Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB
=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%
See test_result/test_run.txt for full report
```
### Interpreting Results
- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
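The three buckets above can be approximated with a rule like the one below; the exact criteria used by the scripts are not documented here, so treat this as an illustrative classification, not the project's implementation.

```python
import re

def classify_round_trip(original, decoded):
    """Bucket a round-trip result (illustrative criteria, not the script's exact rules)."""
    if decoded == original:
        return "perfect"
    # Collapse runs of whitespace; if only spacing differs, call it minor.
    squash = lambda s: re.sub(r"\s+", " ", s).strip()
    if squash(decoded) == squash(original):
        return "minor"
    return "major"

print(classify_round_trip("a = 1\n", "a = 1\n"))  # perfect
print(classify_round_trip("a = 1\n", "a  =  1"))  # minor
print(classify_round_trip("a = 1\n", "a = 2\n"))  # major
```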
## Need Help?
If you encounter any issues:
1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files
For additional support, please open an issue on our GitHub repository.
### test_tokenizer.py Output
```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```
### test_tokenizer_simple.py Output
```
=== TOKENIZER TEST SUMMARY ================================================
Test Script: test_tokenizer_simple.py
Timestamp: 20250524_154835
Tokenizer: ../output/tokenizer.json
Chunk file: example.txt
--------------------------------------------------------------------------------
Lines processed: 1000
Perfect matches: 987 (98.7%)
Average tokens/line: 15.23
Total characters: 1,234,567
Total tokens: 15,230
Character accuracy: 99.85%
Character diff: 1,845 chars (0.15%)
Chars per token: 7.92 (lower is better)
```
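The headline metrics in these reports are simple ratios over the raw counts: chars per token is total characters divided by total tokens, and character accuracy is the fraction of characters that survived the round trip. A sketch with hypothetical counts (not taken from the sample output above):

```python
def summary_metrics(total_chars, total_tokens, char_diff):
    """Derive headline report metrics from raw counts."""
    return {
        "chars_per_token": total_chars / total_tokens,
        "char_accuracy_pct": 100.0 * (total_chars - char_diff) / total_chars,
    }

# Hypothetical example counts, chosen for round numbers
m = summary_metrics(total_chars=120_000, total_tokens=15_000, char_diff=180)
print(f"{m['chars_per_token']:.2f} chars/token, {m['char_accuracy_pct']:.2f}% accuracy")
# 8.00 chars/token, 99.85% accuracy
```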
## Troubleshooting
- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for the tokenizer JSON in `../output/` (one level up from `Test_tokenizer/`); pass `--tokenizer` to point elsewhere
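The default-path convention above can be expressed with `pathlib`; this is a sketch of the resolution logic, assuming the tokenizer file is named `Nexforge_tokenizer.json` as shown in the project structure (the docs also refer to it as `tokenizer.json` in places).

```python
from pathlib import Path

def default_tokenizer_path(script_dir):
    """Resolve ../output/Nexforge_tokenizer.json relative to the test script's directory."""
    return Path(script_dir).resolve().parent / "output" / "Nexforge_tokenizer.json"

p = default_tokenizer_path("NexForge/Test_tokenizer")
print(p.parent.name, p.name)  # output Nexforge_tokenizer.json
```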
## License
This tool is part of the Nexforge project. See the main project for licensing information.