Tokenizer evaluation

timestamp: 2025-11-03 05:56:20

Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---:|---:|---:|---:|---:|---:|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010352 | 4.86 | +4.6% |

Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---:|---:|---:|---:|---:|---:|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010352 | 4.86 | +1.9% |
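
The report does not state how the Ratio and Relative Diff % columns are derived. The numbers are consistent with Ratio being bytes per token (higher means better compression) and Relative Diff % being the share of tokens saved relative to the baseline tokenizer (positive means ours uses fewer tokens). The sketch below reproduces three rows under that assumption; the function names `compression_ratio` and `relative_diff_pct` are hypothetical and chosen only for illustration.

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes of text per token; higher means better compression."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline tokenizer.

    Assumed formula, inferred from the table values.
    Positive: ours needs fewer tokens than the baseline.
    """
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100


# (label, bytes, baseline tokens, ours tokens), copied from the tables
rows = [
    ("news vs GPT-2", 1819, 404, 375),
    ("math vs GPT-2", 1834, 936, 966),
    ("korean vs GPT-4", 893, 364, 712),
]

for label, num_bytes, baseline, ours in rows:
    print(
        f"{label}: baseline {compression_ratio(num_bytes, baseline):.2f}, "
        f"ours {compression_ratio(num_bytes, ours):.2f}, "
        f"diff {relative_diff_pct(baseline, ours):+.1f}%"
    )
```

Read this way, the -95.6% on the korean row means our tokenizer emits nearly twice as many tokens as GPT-4 on that sample (712 against 364), while the fwe-train and fwe-val rows show a gain of about 2%.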