stefan-it/nanochat-german-data
This repository hosts a tokenizer trained on the German nanochat dataset.
Following the original nanochat tokenizer training process, we trained the tokenizer on 2 billion characters:
`python -m scripts.tok_train --max_chars=2000000000`
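We then evaluated the trained tokenizer against the GPT-2 and GPT-4 tokenizers using:
`python -m scripts.tok_eval`
In the tables below, a ratio is the number of bytes per token (higher means better compression), and a positive relative difference means our tokenizer needs fewer tokens than the baseline.
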
Comparison with the GPT-2 tokenizer:

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---|---|---|---|---|---|
| news | 1883 | 731 | 2.58 | 385 | 4.89 | +47.3% |
| korean | 893 | 745 | 1.20 | 802 | 1.11 | -7.7% |
| code | 1259 | 576 | 2.19 | 662 | 1.90 | -14.9% |
| math | 9172 | 4627 | 1.98 | 4062 | 2.26 | +12.2% |
| science | 1698 | 643 | 2.64 | 334 | 5.08 | +48.1% |
| fwe-train | 4555564 | 1694319 | 2.69 | 926779 | 4.92 | +45.3% |
| fwe-val | 4063797 | 1520703 | 2.67 | 841356 | 4.83 | +44.7% |

Comparison with the GPT-4 tokenizer:

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---|---|---|---|---|---|
| news | 1883 | 541 | 3.48 | 385 | 4.89 | +28.8% |
| korean | 893 | 364 | 2.45 | 802 | 1.11 | -120.3% |
| code | 1259 | 309 | 4.07 | 662 | 1.90 | -114.2% |
| math | 9172 | 3573 | 2.57 | 4062 | 2.26 | -13.7% |
| science | 1698 | 467 | 3.64 | 334 | 5.08 | +28.5% |
| fwe-train | 4555564 | 1296818 | 3.51 | 926779 | 4.92 | +28.5% |
| fwe-val | 4063797 | 1166775 | 3.48 | 841356 | 4.83 | +27.9% |
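
The ratio and relative-difference columns can be recomputed from the byte and token counts. The following is a minimal sketch, assuming the ratio is bytes divided by tokens and the relative difference is measured against the baseline token count. Both definitions are inferred from the numbers in the tables, and the helper names `compression_ratio` and `relative_diff_pct` are hypothetical, not part of the nanochat scripts.

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes per token; higher means the tokenizer compresses the text better."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline tokenizer.

    Positive: our tokenizer needs fewer tokens than the baseline.
    Negative: our tokenizer needs more tokens than the baseline.
    """
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100


# Values taken from the "news" rows of the two tables above.
news_bytes = 1883
gpt2_tokens, gpt4_tokens, ours_tokens = 731, 541, 385

print(f"{compression_ratio(news_bytes, ours_tokens):.2f}")   # 4.89
print(f"{relative_diff_pct(gpt2_tokens, ours_tokens):+.1f}%")  # +47.3%
print(f"{relative_diff_pct(gpt4_tokens, ours_tokens):+.1f}%")  # +28.8%
```

Under these definitions, the large negative values in the korean and code rows mean that our tokenizer needs more than twice as many tokens as GPT-4 on that text.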
Note: The original tokenizer evaluation script contains English examples. We did not simply translate these examples into German. Instead, we used authentic German examples from newspaper articles, lecture notes, and theses. The modified evaluation script can be found here.