nanochat German: Tokenizer

This repository hosts a tokenizer trained on the German nanochat dataset.

Following the original nanochat tokenizer training process, we trained the tokenizer on 2 billion characters:

python -m scripts.tok_train --max_chars=2000000000

Stats

max_chars: 2,000,000,000
doc_cap: 10,000
vocab_size: 65,536
train_time: 117.8557
num_special_tokens: 9
token_bytes_min: 1
token_bytes_max: 66
token_bytes_mean: 7.5642
token_bytes_std: 3.6434
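
As a rough illustration of what the token_bytes_* fields describe, the sketch below computes the same four summary values over the byte lengths of a toy vocabulary. It is not the nanochat training script: the function name `token_byte_stats` and the toy vocabulary are hypothetical, and two choices are assumptions, namely that zero-length entries (standing in for special tokens) are excluded and that the sample standard deviation is used.

```python
import statistics


def token_byte_stats(vocab: list[bytes]) -> dict[str, float]:
    """Summarize token lengths in bytes (hypothetical helper, not from nanochat)."""
    # Assumption: zero-length entries stand in for special tokens and are skipped.
    lengths = [len(token) for token in vocab if len(token) > 0]
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": statistics.mean(lengths),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "token_bytes_std": statistics.stdev(lengths),
    }


# Toy vocabulary; "straße" is 7 bytes in UTF-8 because "ß" takes 2 bytes.
toy_vocab = [b"a", b"der", b" und", "straße".encode("utf-8"), b""]
stats = token_byte_stats(toy_vocab)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Lengths are counted in bytes rather than characters, so German umlauts and "ß" count as two bytes each.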

Evaluation

We evaluated the tokenizer with the evaluation script:

python -m scripts.tok_eval

Comparison with GPT-2

Text Type	Bytes	GPT-2 Tokens	GPT-2 Ratio	Ours Tokens	Ours Ratio	Relative Diff %
news	1883	731	2.58	385	4.89	+47.3%
korean	893	745	1.20	802	1.11	-7.7%
code	1259	576	2.19	662	1.90	-14.9%
math	9172	4627	1.98	4062	2.26	+12.2%
science	1698	643	2.64	334	5.08	+48.1%
fwe-train	4555564	1694319	2.69	926779	4.92	+45.3%
fwe-val	4063797	1520703	2.67	841356	4.83	+44.7%

Comparison with GPT-4

Text Type	Bytes	GPT-4 Tokens	GPT-4 Ratio	Ours Tokens	Ours Ratio	Relative Diff %
news	1883	541	3.48	385	4.89	+28.8%
korean	893	364	2.45	802	1.11	-120.3%
code	1259	309	4.07	662	1.90	-114.2%
math	9172	3573	2.57	4062	2.26	-13.7%
science	1698	467	3.64	334	5.08	+28.5%
fwe-train	4555564	1296818	3.51	926779	4.92	+28.5%
fwe-val	4063797	1166775	3.48	841356	4.83	+27.9%
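
The derived columns in both comparison tables can be reproduced from the raw counts. The numbers are consistent with the ratio being bytes per token and the relative difference being the share of baseline tokens saved, (baseline tokens - our tokens) / baseline tokens. The minimal sketch below checks this on two rows; the function names are illustrative and are not taken from the evaluation script.

```python
def bytes_per_token(num_bytes: int, num_tokens: int) -> float:
    """Compression ratio: higher means fewer tokens for the same text."""
    return num_bytes / num_tokens


def relative_diff_percent(baseline_tokens: int, our_tokens: int) -> float:
    """Percentage of baseline tokens saved; negative means we need more tokens."""
    return (baseline_tokens - our_tokens) / baseline_tokens * 100


# "news" row of the GPT-2 comparison: 1883 bytes, 731 vs. 385 tokens.
print(f"{bytes_per_token(1883, 731):.2f}")         # 2.58
print(f"{bytes_per_token(1883, 385):.2f}")         # 4.89
print(f"{relative_diff_percent(731, 385):+.1f}%")  # +47.3%

# "korean" row of the GPT-4 comparison: 364 vs. 802 tokens.
print(f"{relative_diff_percent(364, 802):+.1f}%")  # -120.3%
```

Because the difference is normalized by the baseline count, values below -100% are possible: on the "korean" and "code" rows this tokenizer needs more than twice as many tokens as GPT-4.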

Note: The original tokenizer evaluation script contains English examples. We did not simply translate these examples into German. Instead, we used genuine German examples from newspaper articles, lecture notes, and theses. The modified evaluation script can be found here.
