---
library_name: transformers
license: cc0-1.0
datasets:
- deutsche-telekom/Ger-RAG-eval
tags:
- tokenization
language:
- de
---
# Small German Tokenizer
This is a small tokenizer optimized for German, released under a public-domain-like license (CC0 1.0).
## Special Tokens
- End-of-Sequence token: `[EOS]`
- Padding token: `[PAD]`
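
As a rough illustration of how these two tokens are typically used when batching, the sketch below appends an end-of-sequence id to each sequence and pads the batch to a common length. The function name and the numeric ids in the example are hypothetical; they are not taken from this tokenizer's vocabulary.

```python
from typing import List, Sequence, Tuple


def append_eos_and_pad(
    sequences: Sequence[Sequence[int]],
    eos_id: int,
    pad_id: int,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Append `eos_id` to every sequence, then right-pad with `pad_id`.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    (including the EOS token) and 0 for padding.
    """
    with_eos = [list(seq) + [eos_id] for seq in sequences]
    max_len = max((len(seq) for seq in with_eos), default=0)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in with_eos]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in with_eos
    ]
    return input_ids, attention_mask


# Hypothetical ids: 0 stands in for `[PAD]`, 1 for `[EOS]`.
batch_ids, batch_mask = append_eos_and_pad([[5, 6, 7], [8]], eos_id=1, pad_id=0)
```

With a loaded `transformers` tokenizer, the real ids could be looked up with `tokenizer.convert_tokens_to_ids("[EOS]")` and `tokenizer.convert_tokens_to_ids("[PAD]")` instead of the placeholder values used here.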
## Training
This tokenizer was trained on the `context` column of the configs `task1` and `task4` in [deutsche-telekom/Ger-RAG-eval](https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval).
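
The corpus described above could be assembled roughly as follows. This is only a sketch: the helper names are hypothetical, and iterating over every split of each config is an assumption, since this card does not say which splits were used. `load_corpus` needs the `datasets` package and network access.

```python
from typing import Iterable, Iterator, Mapping

DATASET_ID = "deutsche-telekom/Ger-RAG-eval"
CONFIGS = ("task1", "task4")


def iter_contexts(
    rows: Iterable[Mapping[str, object]], column: str = "context"
) -> Iterator[str]:
    """Yield the non-empty strings found in `column`, skipping other rows."""
    for row in rows:
        text = row.get(column)
        if isinstance(text, str) and text.strip():
            yield text


def load_corpus() -> Iterator[str]:
    """Stream the `context` column of both configs."""
    # Imported lazily so that `iter_contexts` works without `datasets`.
    from datasets import load_dataset

    for config in CONFIGS:
        dataset_dict = load_dataset(DATASET_ID, config)
        # Assumption: all available splits contribute to the corpus.
        for split in dataset_dict.values():
            yield from iter_contexts(split)
```

The resulting iterator could then be handed to a trainer such as `Tokenizer.train_from_iterator` from the `tokenizers` library. The algorithm and vocabulary size used for this tokenizer are not stated here, so any such settings would be your own choice.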
## Limitations
Because it was trained on a small corpus, this tokenizer may split words into more and smaller pieces than a tokenizer trained on a larger corpus would. Also, some uncommon special tokens aren't present; you'll have to add them manually if needed.
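
Adding a missing special token manually might look like the sketch below. The helper name and the example token strings (`[BOS]`, `[UNK]`) are hypothetical; the only library call it relies on is the `add_special_tokens` method of `transformers` tokenizers, and it assumes an unset special token reads as `None` on the tokenizer object.

```python
from typing import Dict


def add_missing_special_tokens(tokenizer, wanted: Dict[str, str]) -> Dict[str, str]:
    """Register the special tokens in `wanted` that the tokenizer lacks.

    `wanted` maps attribute names such as "bos_token" or "unk_token" to
    token strings. Returns the subset that was actually added.
    """
    missing = {
        name: token
        for name, token in wanted.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        tokenizer.add_special_tokens(missing)
    return missing


# Example call, assuming `tokenizer` was loaded with `AutoTokenizer`:
# add_missing_special_tokens(tokenizer, {"bos_token": "[BOS]", "unk_token": "[UNK]"})
```

If the tokenizer is paired with a model, the model's embedding matrix usually has to grow as well, for example with `model.resize_token_embeddings(len(tokenizer))`.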