# NexForge Tokenizer Testing

This directory contains tools for testing the NexForge tokenizer on your code or text files.

## Quick Start

1. **Create a tokenizer** using the main menu (`run_nexforge.bat`)
2. **Run tests** from the main menu
   - Tests 10,000 random samples by default
   - Results saved to `test_result/test_run.txt`

## Advanced Testing

### Prerequisites
- Python 3.8+
- NexForge tokenizer package installed

### Test Scripts

1. **test_tokenizer.py** - Comprehensive testing with detailed metrics
2. **test_tokenizer_simple.py** - Quick testing on a single file

## Installation

Dependencies are automatically installed when you run the main installer. For manual setup:

```bash
pip install tokenizers python-Levenshtein
```

## Project Structure

```
NexForge/
β”œβ”€β”€ Test_tokenizer/
β”‚   β”œβ”€β”€ test_tokenizer.py         # Main test script (batch processing)
β”‚   └── test_tokenizer_simple.py  # Quick test script (single file)
β”œβ”€β”€ output/                      # Tokenizer output (Nexforge_tokenizer.json)
β”œβ”€β”€ Dataset/                     # Your training/test files
└── test_result/                 # Test outputs and reports
```

## test_tokenizer.py

Comprehensive testing with detailed metrics and batch processing.

### Basic Usage

```bash
# Run with default settings (uses tokenizer from parent directory)
python test_tokenizer.py

# Or specify custom paths
python test_tokenizer.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input ../Dataset \
    --output ../test_result/detailed_test.txt
```

### What's Tested
- Tokenization/decoding accuracy
- Special token handling
- Performance metrics
- File format compatibility
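
The accuracy check is a round trip: encode the text to token IDs, decode the IDs back, and compare the result with the original. A minimal sketch of that check (the helper name and structure are illustrative, not the script's actual internals):

```python
def round_trip(text, encode, decode):
    """Encode text, decode the IDs back, and report whether it matches.

    `encode` and `decode` are callables; with the `tokenizers` library
    they could be wired up as, e.g.:
        tok = Tokenizer.from_file("../output/Nexforge_tokenizer.json")
        round_trip(src, lambda t: tok.encode(t).ids, tok.decode)
    """
    ids = encode(text)          # text -> token IDs
    decoded = decode(ids)       # token IDs -> text
    return decoded, decoded == text
```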

### Command Line Options

```bash
# Custom tokenizer, input, and output paths
python test_tokenizer.py \
    --tokenizer path/to/your/tokenizer.json \
    --input path/to/your/code/directory \
    --output custom_results/custom_test.txt \
    --file-types py,js,json \
    --max-files 20 \
    --sample 50000

# Process only specific file types
python test_tokenizer.py --file-types py,js,json

# Process all files but limit to first 20
python test_tokenizer.py --max-files 20

# Process all files of specific types (no limit)
python test_tokenizer.py --max-files 0 --file-types py,js

# Process full content of each file (no sampling)
python test_tokenizer.py --sample 0
```
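
For orientation, the flags above could be declared with `argparse` roughly as follows. This is a hypothetical sketch, not the script's actual source, and the default values shown are assumptions:

```python
# Hypothetical sketch of the documented command-line interface;
# the real test_tokenizer.py may parse its flags differently.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Batch-test a NexForge tokenizer")
    p.add_argument("--tokenizer", default="../output/Nexforge_tokenizer.json")
    p.add_argument("--input", default="../Dataset")
    p.add_argument("--output", default="../test_result/test_run.txt")
    p.add_argument("--file-types", default="py,js,json",
                   help="comma-separated file extensions to include")
    p.add_argument("--max-files", type=int, default=10,
                   help="limit on files processed; 0 means no limit")
    p.add_argument("--sample", type=int, default=100000,
                   help="characters sampled per file; 0 means full content")
    return p

args = build_parser().parse_args(["--max-files", "0", "--file-types", "py,js"])
```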

## test_tokenizer_simple.py

Quick verification of tokenizer functionality.

### Usage

```bash
# Quick test on a single file
python test_tokenizer_simple.py --input sample.py

# Test with custom tokenizer
python test_tokenizer_simple.py \
    --tokenizer ../output/Nexforge_tokenizer.json \
    --input sample.py
```

### When to Use
- Quick validation of tokenizer
- Debugging specific files
- Verifying tokenization quality
- Minimal setup required

## Understanding Test Results

### Sample Output

```
=== NexForge Tokenizer Test Results ===
Tested on: 2025-05-25 13:30:00
Tokenizer: ../output/Nexforge_tokenizer.json
Files processed: 42
Total tokens: 1,234,567

Success Rate: 99.8%
Avg. tokens/file: 29,394
Max memory used: 1.2GB

=== Detailed Metrics ===
- Perfect matches: 98.2%
- Minor differences: 1.5%
- Major issues: 0.3%

See test_result/test_run.txt for full report
```

### Interpreting Results
- **Success Rate**: Percentage of files processed without errors
- **Perfect Matches**: Files that round-trip encode/decode perfectly
- **Minor Differences**: Small whitespace or formatting differences
- **Major Issues**: Significant differences requiring attention
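
One simple way to implement this three-way classification (an illustrative sketch, not the script's actual logic): treat a byte-for-byte match as perfect, a whitespace-only difference as minor, and anything else as major.

```python
def classify_round_trip(original: str, decoded: str) -> str:
    """Classify a decode result as 'perfect', 'minor', or 'major'."""
    if decoded == original:
        return "perfect"                  # exact round trip
    if decoded.split() == original.split():
        return "minor"                    # only whitespace/formatting differs
    return "major"                        # content changed; needs attention
```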

## Need Help?

If you encounter any issues:
1. Check the test results in `test_result/`
2. Ensure your tokenizer was created successfully
3. Verify file encodings (UTF-8 recommended)
4. Check for corrupted or extremely large files

For additional support, please open an issue on our GitHub repository.

### test_tokenizer.py Output

```
File types: py,js,json
Max files: 10
Sample size: 100000 chars/file

=== Summary ===
Processed files: 10
Skipped files: 0
avg_chars_per_token: 3.47
avg_tokens_per_sec: 12500.34
```

### test_tokenizer_simple.py Output

```
=== TOKENIZER TEST SUMMARY ================================================
Test Script:       test_tokenizer_simple.py
Timestamp:          20250524_154835
Tokenizer:          ../output/tokenizer.json
Chunk file:         example.txt
--------------------------------------------------------------------------------
Lines processed:     1000
Perfect matches:     987 (98.7%)
Average tokens/line:  15.23
Total characters:    1,234,567
Total tokens:        15,230
Character accuracy:   99.85%
Character diff:      1,845 chars (0.15%)
Chars per token:     7.92 (higher means better compression)
```
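
The character-level figures can be derived from a diff of the original and decoded text. A sketch using the standard library's `difflib` (the real script may use `python-Levenshtein` instead; the exact formula here is an assumption):

```python
from difflib import SequenceMatcher

def character_accuracy(original: str, decoded: str) -> float:
    """Percentage of the original's characters recovered after decoding."""
    if not original:
        return 100.0
    matcher = SequenceMatcher(None, original, decoded)
    # Total length of all matching substrings between the two texts.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(original)
```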

## Troubleshooting

- **Missing Dependencies**: Install required packages with `pip install -r requirements.txt`
- **File Not Found**: Ensure the tokenizer and input paths are correct
- **Empty Results**: Check that your input directory contains files with the specified extensions
- **Tokenizer Not Found**: By default, the scripts look for the tokenizer JSON in `../output/` (one level up from `Test_tokenizer`)

## License

This tool is part of the NexForge project. See the main project for licensing information.