library_name: transformers
license: cc0-1.0
datasets:
  - deutsche-telekom/Ger-RAG-eval
tags:
  - tokenization
language:
  - de

Small German Tokenizer

This is a small tokenizer optimized for German, released under CC0 1.0 (a public domain dedication).

Special Tokens

  • End-of-Sequence token: [EOS]
  • Padding token: [PAD]

Training

This tokenizer was trained on the context column of the task1 and task4 configs of deutsche-telekom/Ger-RAG-eval.
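A training run along these lines could look roughly like the sketch below. It is not the author's actual script. The BPE model, the Whitespace pre-tokenizer, the default vocabulary size, and the helper names train_tokenizer and ger_rag_contexts are all assumptions made for illustration. Only the dataset name, the two configs, the context column, and the two special tokens come from this card. The sketch relies on the tokenizers and datasets libraries.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def ger_rag_contexts(configs=("task1", "task4")):
    """Yield the `context` column from every split of the given configs."""
    # Imported lazily so the rest of the sketch works without `datasets`.
    from datasets import load_dataset

    for config in configs:
        dataset_dict = load_dataset("deutsche-telekom/Ger-RAG-eval", config)
        for split in dataset_dict.values():
            for row in split:
                yield row["context"]


def train_tokenizer(texts, vocab_size=8000):
    """Train a BPE tokenizer on an iterable of strings.

    The model type and vocab_size are assumptions, not values from this card.
    """
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[EOS]", "[PAD]"],  # the two tokens listed in this card
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Hypothetical usage (downloads the dataset):
# tokenizer = train_tokenizer(ger_rag_contexts())
# tokenizer.save("tokenizer.json")
```

Passing the texts as a plain iterable keeps the training function independent of where the corpus comes from, so it can be tried on a few in-memory sentences before running it on the full dataset.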

Limitations

Because it was trained on a small corpus, this tokenizer may split words into more pieces than a tokenizer trained on a larger corpus would. Some less common special tokens are also missing; add them manually if you need them.
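As a minimal sketch of adding a missing special token with transformers: the example below builds a tiny stand-in tokenizer that only knows [EOS] and [PAD], then registers a beginning-of-sequence token. The stand-in vocabulary and the token string [BOS] are assumptions chosen for illustration; they are not part of this tokenizer.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Stand-in for the real tokenizer: a vocabulary holding only the two
# special tokens named in this card (an assumption for illustration).
backend = Tokenizer(BPE(vocab={"[EOS]": 0, "[PAD]": 1}, merges=[]))
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    eos_token="[EOS]",
    pad_token="[PAD]",
)

size_before = len(tokenizer)

# Register a token that is not present yet; "[BOS]" is a hypothetical choice.
num_added = tokenizer.add_special_tokens({"bos_token": "[BOS]"})

print(num_added, tokenizer.bos_token, tokenizer.bos_token_id)
```

With the published tokenizer, the same add_special_tokens call would be applied to the object returned by AutoTokenizer.from_pretrained for this repository. If the tokenizer is paired with a model, the embedding matrix usually has to grow as well, for example with model.resize_token_embeddings(len(tokenizer)).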