---
library_name: transformers
license: cc0-1.0
datasets:
- deutsche-telekom/Ger-RAG-eval
tags:
- tokenization
language:
- de
---

# Small German Tokenizer

This is a small tokenizer optimized for German, released under CC0 1.0 (public-domain-like).

## Special Tokens

- End-of-Sequence token: `[EOS]`
- Padding token: `[PAD]`

## Training

This tokenizer was trained on the `context` column of the `task1` and `task4` configs in [deutsche-telekom/Ger-RAG-eval](https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval).

## Limitations

Because the training corpus is small, this tokenizer may split rare or out-of-domain words into more subword pieces than a tokenizer trained on a larger corpus would. Some less common special tokens are also missing; if you need them, you have to add them manually.
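
Adding missing special tokens manually, as mentioned under Limitations, could look roughly like the sketch below. It assumes the tokenizer is loaded through `transformers.AutoTokenizer` and uses the library's `add_tokens(..., special_tokens=True)` call. The helper names `add_missing_special_tokens` and `load_and_extend`, the `repo_id` parameter, and any extra token such as `[BOS]` are illustrative assumptions and not part of this repository.

```python
from typing import Iterable, List


def add_missing_special_tokens(tokenizer, wanted: Iterable[str]) -> List[str]:
    """Register every token in `wanted` that the vocabulary lacks.

    `tokenizer` is expected to behave like a Hugging Face tokenizer, i.e. to
    expose `get_vocab()` and `add_tokens()`. Returns the tokens that were added.
    """
    vocab = tokenizer.get_vocab()
    # dict.fromkeys drops duplicates while keeping the requested order.
    missing = [tok for tok in dict.fromkeys(wanted) if tok not in vocab]
    if missing:
        # special_tokens=True marks the new entries as special tokens,
        # so they are never split into smaller pieces.
        tokenizer.add_tokens(missing, special_tokens=True)
    return missing


def load_and_extend(repo_id: str, wanted: Iterable[str]):
    """Load a tokenizer from the Hub and add any missing special tokens.

    `repo_id` is a placeholder for wherever this tokenizer is hosted.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    added = add_missing_special_tokens(tokenizer, wanted)
    return tokenizer, added
```

Calling `load_and_extend(repo_id, ["[BOS]", "[UNK]"])` with the actual repository id would return the extended tokenizer together with the list of tokens that were really added. If the tokenizer is paired with a model, the model's embedding matrix usually has to be resized afterwards, for example with `model.resize_token_embeddings(len(tokenizer))`.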