---
library_name: transformers
license: cc0-1.0
datasets:
- deutsche-telekom/Ger-RAG-eval
tags:
- tokenization
language:
- de
---
# Small German Tokenizer
This is a small tokenizer optimized for German, released under the CC0 1.0 public-domain dedication.
## Special Tokens
- End-of-Sequence token: `[EOS]`
- Padding token: `[PAD]`
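
As a rough illustration of how these two tokens are used, the sketch below pads a batch and appends `[EOS]` by hand. To stay runnable offline it trains a tiny stand-in tokenizer on two example sentences; the BPE model, the toy corpus, and the `vocab_size` are assumptions made for the example and are not the published vocabulary. With the real tokenizer you would instead load this repository through `AutoTokenizer.from_pretrained`. Whether the published tokenizer appends `[EOS]` automatically is not stated here, so the sketch adds it explicitly.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Stand-in tokenizer trained on two sentences (NOT the published vocabulary).
corpus = ["Das ist ein kleiner Beispielsatz.", "Noch ein Satz auf Deutsch."]
raw = Tokenizer(models.BPE())  # assumption: BPE; the card does not name the algorithm
raw.pre_tokenizer = pre_tokenizers.Whitespace()
raw.train_from_iterator(
    corpus,
    trainer=trainers.BpeTrainer(vocab_size=200, special_tokens=["[PAD]", "[EOS]"]),
)

# Register the special tokens so that padding knows which token to use.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw, pad_token="[PAD]", eos_token="[EOS]"
)

texts = ["ein", "Das ist ein kleiner Beispielsatz."]
# Append [EOS] by hand, then pad the shorter sequence to the longest one.
batch = tokenizer([t + tokenizer.eos_token for t in texts], padding=True)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Padded positions carry `tokenizer.pad_token_id` and a `0` in the attention mask, so a model can ignore them.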
## Training
This tokenizer was trained on the `context` column of the `task1` and `task4` configs of [deutsche-telekom/Ger-RAG-eval](https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval).
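
A comparable training run might look roughly like the sketch below. It is not the exact script used for this tokenizer: the BPE model, the whitespace pre-tokenizer, the `vocab_size` values, and the helper names `iter_contexts` and `train_tokenizer` are all assumptions made for illustration. `iter_contexts` downloads the dataset when called, so the demo at the bottom uses two sample sentences instead.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def iter_contexts(configs=("task1", "task4")):
    """Yield every `context` string from the given configs (needs network access)."""
    from datasets import load_dataset  # imported lazily so the demo below runs offline

    for config in configs:
        # Without a `split` argument, load_dataset returns a dict of all splits.
        for split in load_dataset("deutsche-telekom/Ger-RAG-eval", config).values():
            yield from split["context"]


def train_tokenizer(texts, vocab_size=8000):
    """Train a BPE tokenizer (assumed algorithm) with [PAD] and [EOS] reserved."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size, special_tokens=["[PAD]", "[EOS]"]
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Offline demo on a toy corpus; pass iter_contexts() instead to use the real data.
sample = ["Das ist ein kleiner Beispielsatz.", "Noch ein Satz auf Deutsch."]
tok = train_tokenizer(sample, vocab_size=200)
print(tok.encode("Das ist ein Satz.").tokens)
```

To train on the actual corpus, call `train_tokenizer(iter_contexts())` and save the result with `tok.save("tokenizer.json")`.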
## Limitations
Because its training corpus is small, this tokenizer may split words, especially rare ones, into many short subword pieces. Some less common special tokens are also missing; add them manually if you need them.
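
One way to add missing special tokens is `add_special_tokens` from `transformers`, sketched below. The token strings `[BOS]` and `[SEP]` are only examples, and the tokenizer is again a tiny stand-in trained on two sentences (an assumption to keep the snippet self-contained), not the published one.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Stand-in tokenizer (NOT the published vocabulary); BPE is an assumption.
raw = Tokenizer(models.BPE())
raw.pre_tokenizer = pre_tokenizers.Whitespace()
raw.train_from_iterator(
    ["Das ist ein kleiner Beispielsatz.", "Noch ein Satz auf Deutsch."],
    trainer=trainers.BpeTrainer(vocab_size=200, special_tokens=["[PAD]", "[EOS]"]),
)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw, pad_token="[PAD]", eos_token="[EOS]"
)

size_before = len(tokenizer)
# Example tokens only; choose whatever your model or task requires.
num_added = tokenizer.add_special_tokens({"bos_token": "[BOS]", "sep_token": "[SEP]"})
print(num_added, size_before, len(tokenizer))
```

If the tokenizer is paired with a model, the embedding matrix has to grow accordingly, for example with `model.resize_token_embeddings(len(tokenizer))`.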