title: README
emoji: 📈
colorFrom: red
colorTo: red
sdk: static
pinned: false

🇩🇪 German Tokenizer Benchmark

A curated collection of German datasets for comprehensive tokenizer evaluation across diverse domains and text types.

This organization hosts datasets used by the GerTokEval framework to evaluate German tokenizers with standardized metrics and fairness analysis.
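As a rough illustration of what a standardized tokenizer metric can look like, the sketch below computes fertility (average tokens per whitespace-separated word) over a few German sentences. It is not taken from GerTokEval: the `toy_tokenize` and `fertility` helpers and the sample sentences are hypothetical, and the metrics the framework actually reports may be defined differently.

```python
from typing import Callable, Iterable, List


def toy_tokenize(text: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer (an assumption, not a real model):
    splits every whitespace-separated word into chunks of at most max_len characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return tokens


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


# Made-up sample sentences; a long compound drives the token count up.
sample = ["Die Donaudampfschifffahrt ist lang", "Das ist kurz"]
score = fertility(sample, toy_tokenize)
print(f"fertility: {score:.2f}")
```

To compare real tokenizers, replace `toy_tokenize` with any callable that maps a string to a list of tokens. A lower fertility means the tokenizer splits German words into fewer pieces.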

🔎 Datasets

The following datasets are currently supported in the main framework:

These datasets were chosen so that tokenizers are evaluated across a broad range of domains.

❤️ Acknowledgements

Many thanks to Clara Meister for releasing the amazing TokEval framework!