---
library_name: transformers
license: cc0-1.0
datasets:
- deutsche-telekom/Ger-RAG-eval
tags:
- tokenization
language:
- de
---

# Small German Tokenizer

This is a small tokenizer optimized for German, released under CC0 1.0 (a public-domain-like license).

## Special Tokens

- End-of-Sequence token: `[EOS]`
- Padding token: `[PAD]`
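
To confirm that the two tokens listed above are wired up as expected, a minimal sketch along these lines may help. The helper names `special_token_report` and `load_and_check` are made up for illustration, and the sketch assumes the tokenizer files expose `[EOS]` and `[PAD]` through the standard `eos_token` and `pad_token` attributes. This card does not state a repository ID, so the path is left as a parameter you supply.

```python
def special_token_report(tokenizer):
    """Return the EOS and padding tokens the tokenizer reports, with their IDs."""
    return {
        "eos": (tokenizer.eos_token, tokenizer.eos_token_id),
        "pad": (tokenizer.pad_token, tokenizer.pad_token_id),
    }


def load_and_check(path_or_repo_id):
    """Load the tokenizer and print its special tokens plus a padded batch.

    `path_or_repo_id` is a placeholder: pass the local directory or Hub
    repository that actually contains this tokenizer.
    """
    # Imported lazily so special_token_report() has no heavy dependencies.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path_or_repo_id)
    print(special_token_report(tokenizer))

    # Padding a batch only works if a padding token is configured.
    batch = tokenizer(["Guten Morgen", "Hallo"], padding=True)
    print(batch["input_ids"])
    print(batch["attention_mask"])
    return tokenizer
```

If `special_token_report` shows `None` for either entry, the token is probably not registered in the tokenizer configuration and would need to be set before padding or sequence termination can rely on it.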

## Training

This tokenizer was trained on the `context` column of the configs `task1` and `task4` in [deutsche-telekom/Ger-RAG-eval](https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval).
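
For readers who want to train a comparable tokenizer on the same data, the following is a rough sketch and not the procedure actually used here. This card does not state the algorithm, vocabulary size, or pre-tokenization, so the byte-level BPE model, `vocab_size=8000`, and the helper names `iter_contexts`, `load_contexts`, and `train_tokenizer` are all assumptions.

```python
def iter_contexts(splits, column="context"):
    """Yield every non-empty `column` value from every split.

    `splits` maps split names to iterables of row dicts, which is what
    `datasets.load_dataset` returns for a single config.
    """
    for rows in splits.values():
        for row in rows:
            text = row.get(column)
            if text:
                yield text


def load_contexts(configs=("task1", "task4")):
    """Stream the `context` column of the configs named in this card."""
    from datasets import load_dataset  # lazy import: needs network access

    for config in configs:
        yield from iter_contexts(
            load_dataset("deutsche-telekom/Ger-RAG-eval", config)
        )


def train_tokenizer(texts, vocab_size=8000):
    """Train a byte-level BPE tokenizer (assumed algorithm and size)."""
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,  # assumption: not stated in this card
        special_tokens=["[EOS]", "[PAD]"],  # the two tokens listed in this card
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

Calling `train_tokenizer(load_contexts())` would download both configs and train on them. The result will not match this tokenizer exactly unless the assumed settings happen to coincide with the original ones.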

## Limitations

Because the training corpus is small, this tokenizer may split words that are rare or absent in that corpus into several smaller pieces. Some less common special tokens are not included; add them manually if you need them.
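
Both limitations can be checked and worked around in a few lines. The sketch below is illustrative only: `tokens_per_word` and `add_missing_special_tokens` are hypothetical helpers, and `[BOS]` in the usage note is just an example of a token you might want. They rely on the standard `tokenize` and `add_special_tokens` methods of Transformers tokenizers.

```python
def tokens_per_word(tokenizer, text):
    """Average number of tokens per whitespace-separated word.

    Values well above 1 suggest the tokenizer fragments this text heavily.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(tokenizer.tokenize(text)) / len(words)


def add_missing_special_tokens(tokenizer, wanted):
    """Register special tokens from `wanted` that the tokenizer lacks.

    `wanted` maps attribute names such as "bos_token" or "unk_token" to
    token strings. Tokens that are already set are left untouched.
    Returns the subset that was added.
    """
    missing = {
        name: token
        for name, token in wanted.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        tokenizer.add_special_tokens(missing)
    return missing
```

For example, `add_missing_special_tokens(tokenizer, {"bos_token": "[BOS]"})` would add a beginning-of-sequence token only if none is set. If the tokenizer is paired with a model, the model's embedding matrix usually has to grow as well, for instance with `model.resize_token_embeddings(len(tokenizer))`.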