metadata
library_name: transformers
license: cc0-1.0
datasets:
- deutsche-telekom/Ger-RAG-eval
tags:
- tokenization
language:
- de
Small German Tokenizer
This is a small tokenizer optimized for German, released under CC0 1.0 (effectively public domain).
Special Tokens
- End-of-Sequence token: [EOS]
- Padding token: [PAD]
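A minimal sketch of how these two tokens might be used with transformers. The `repo_id` argument is a placeholder for this repository's Hub ID, and the helper names are hypothetical. The sketch assumes the tokenizer does not append [EOS] on its own, which this card does not state, so the token is added to the text explicitly.

```python
def load_tokenizer(repo_id: str):
    """Load the tokenizer from the Hub. `repo_id` is a placeholder, not a real ID."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id)


def with_eos(texts, eos_token="[EOS]"):
    """Append the EOS token string to each text.

    Assumption: the tokenizer does not add [EOS] automatically.
    """
    return [text + eos_token for text in texts]


def encode_padded(tokenizer, texts):
    """Encode a batch, padding to the longest sequence with the tokenizer's pad token."""
    return tokenizer(with_eos(texts), padding=True)
```

With a loaded tokenizer, `encode_padded(tokenizer, ["Guten Morgen", "Hallo"])` should return `input_ids` of equal length per row, with the shorter row filled up using the [PAD] ID.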
Training
This tokenizer was trained on the context column of the task1 and task4 configs of the deutsche-telekom/Ger-RAG-eval dataset.
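The following is a rough sketch of what such a training run could look like, not the procedure actually used here. The card does not state the tokenizer model, vocabulary size, pre-tokenizer, or which dataset splits were used, so the BPE model, the `vocab_size=8000` default, the whitespace pre-tokenizer, and the use of all splits are assumptions. The function names are hypothetical.

```python
def iter_contexts(splits):
    """Yield the `context` string of every row in every split."""
    for split in splits:
        for row in split:
            yield row["context"]


def load_context_splits(configs=("task1", "task4")):
    """Load all splits of the given configs (assumption: every split is used)."""
    from datasets import load_dataset  # lazy import; requires the datasets package

    splits = []
    for config in configs:
        dataset_dict = load_dataset("deutsche-telekom/Ger-RAG-eval", config)
        splits.extend(dataset_dict.values())
    return splits


def train_bpe(texts, vocab_size=8000):
    """Train a BPE tokenizer on an iterator of strings.

    Assumptions: BPE, whitespace pre-tokenization, and the vocab size are
    illustrative choices, not taken from this card.
    """
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[EOS]", "[PAD]"],  # the two special tokens listed in this card
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

A full run would then be `train_bpe(iter_contexts(load_context_splits()))`, which downloads the dataset from the Hub.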
Limitations
Because the training corpus is small, this tokenizer may split words it has rarely or never seen into several smaller pieces. Some less common special tokens are also missing; add them manually if you need them.
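Both limitations can be checked and worked around with a few lines. The sketch below uses hypothetical helper names and expects an already loaded transformers tokenizer. `tokens_per_word` gives a rough measure of how much a text gets split, and `add_missing_special_tokens` registers only those special tokens that are not set yet. The `[BOS]` token in the usage note is an example, not a token this card defines.

```python
def tokens_per_word(tokenizer, text: str) -> float:
    """Average number of tokens per whitespace-separated word.

    Higher values mean the text is split into more pieces.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(tokenizer.tokenize(text)) / len(words)


def add_missing_special_tokens(tokenizer, wanted: dict) -> dict:
    """Add the special tokens from `wanted` that the tokenizer does not define yet.

    `wanted` maps attribute names such as "bos_token" or "unk_token" to token strings.
    Returns the subset that was actually added.
    """
    missing = {
        name: token
        for name, token in wanted.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        tokenizer.add_special_tokens(missing)
    return missing
```

For example, `add_missing_special_tokens(tokenizer, {"bos_token": "[BOS]"})` would add a beginning-of-sequence token only if none is set, and comparing `tokens_per_word` on in-domain and out-of-domain German text gives a quick impression of how fragmented the output is.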