---
library_name: transformers
license: cc0-1.0
datasets:
- deutsche-telekom/Ger-RAG-eval
tags:
- tokenization
language:
- de
---

# Small German Tokenizer

This is a small tokenizer optimized for German. It is released under CC0 1.0, so it can be used much like a public-domain work.

## Special Tokens

- End-of-Sequence token: `[EOS]`
- Padding token: `[PAD]`
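
The two tokens above are typically used to mark the end of each sequence and to pad a batch to a common length. The following is a minimal sketch, assuming the tokenizer is loaded through `transformers` and exposes `[EOS]` and `[PAD]` as its `eos_token` and `pad_token`. The helper name `encode_with_eos` is hypothetical, and the card does not say whether `[EOS]` is appended automatically, so the helper checks before adding it.

```python
def encode_with_eos(tokenizer, texts):
    """Encode texts, end each with [EOS], and pad the batch with [PAD].

    `tokenizer` is assumed to be a `transformers` tokenizer whose
    `eos_token` and `pad_token` are set to `[EOS]` and `[PAD]`.
    """
    sequences = []
    for text in texts:
        ids = tokenizer.encode(text)
        # Assumption: the tokenizer may or may not append [EOS] itself,
        # so only add it when it is not already the last id.
        if not ids or ids[-1] != tokenizer.eos_token_id:
            ids.append(tokenizer.eos_token_id)
        sequences.append(ids)
    # Pad every sequence to the length of the longest one in the batch.
    return tokenizer.pad({"input_ids": sequences}, padding=True)
```

A tokenizer obtained with `AutoTokenizer.from_pretrained(...)` for this repository could be passed in directly. The returned batch contains `input_ids` and an `attention_mask` that is 0 at padded positions.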

## Training

This tokenizer was trained on the `context` column of the configs `task1` and `task4` in [deutsche-telekom/Ger-RAG-eval](https://huggingface.co/datasets/deutsche-telekom/Ger-RAG-eval).
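
The card does not state the tokenization algorithm, vocabulary size, or training script. As a rough sketch of how a comparable tokenizer could be trained, the example below assumes a byte-level BPE model from the `tokenizers` library and a vocabulary size of 8000; both are assumptions, as are the helper names `iter_contexts` and `train_tokenizer`. Only the dataset name, the configs, the `context` column, and the two special tokens come from this card.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

SPECIAL_TOKENS = ["[EOS]", "[PAD]"]


def iter_contexts(configs=("task1", "task4")):
    """Yield the `context` column from every split of the given configs."""
    # Imported lazily because loading the dataset needs network access.
    from datasets import load_dataset

    for config in configs:
        dataset_dict = load_dataset("deutsche-telekom/Ger-RAG-eval", config)
        for split in dataset_dict.values():
            yield from split["context"]


def train_tokenizer(texts, vocab_size=8000):
    """Train a byte-level BPE tokenizer (assumed setup, not confirmed by the card)."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,  # hypothetical value
        special_tokens=SPECIAL_TOKENS,
        # Seed the vocabulary with all byte symbols so no input is unknown.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

Calling `train_tokenizer(iter_contexts())` would train on the data described above, and `tokenizer.save("tokenizer.json")` would write the result to disk.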

## Limitations

Because it was trained on a small corpus, this tokenizer may split words into more, smaller pieces than a tokenizer trained on a larger corpus would. Some uncommon special tokens are also missing; add them manually if you need them.
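
Both limitations can be checked and worked around in a few lines. The sketch below assumes a tokenizer loaded through `transformers`; the helper names `tokens_per_word` and `add_missing_special_tokens` are hypothetical, and `[BOS]` in the usage note is only an example of a token you might need.

```python
def tokens_per_word(tokenizer, text):
    """Rough fragmentation measure: tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenizer.tokenize(text)) / len(words)


def add_missing_special_tokens(tokenizer, tokens):
    """Add any of `tokens` that are not yet in the vocabulary; return those added."""
    vocab = tokenizer.get_vocab()
    missing = [token for token in tokens if token not in vocab]
    if missing:
        # special_tokens=True registers them as special, so they are not split.
        tokenizer.add_tokens(missing, special_tokens=True)
    return missing
```

For example, `add_missing_special_tokens(tokenizer, ["[BOS]"])` adds `[BOS]` only if it is absent. If the tokenizer is paired with a `transformers` model, the embedding matrix has to grow to match, for instance with `model.resize_token_embeddings(len(tokenizer))`.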