| <!-- DISABLE-FRONTMATTER-SECTIONS --> | |
| # Tokenizers | |
| Fast, state-of-the-art tokenizers, optimized for both research and | |
| production. | |
| [🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides | |
| implementations of today's most widely used tokenizers, with a focus on | |
| performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers). | |
| # Main features: | |
| - Train new vocabularies and tokenize text, using today's most widely used tokenizers. | |
| - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU. | |
| - Easy to use, but also extremely versatile. | |
| - Designed for both research and production. | |
| - Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token. | |
| - Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs. | |
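
The features above can be illustrated with a minimal sketch using the Python bindings. The toy corpus, the vocabulary size, the maximum length of 8, and the BERT-style special token names are all assumptions chosen for illustration, not recommended settings. The sketch trains a small BPE vocabulary, applies a lossy normalizer (lowercasing), and then uses the offsets to recover the original text behind each token.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus; real training data would be far larger.
corpus = [
    "hello world",
    "hello tokenizers",
    "fast tokenizers for research and production",
] * 10

# Illustrative special tokens (BERT-style names are an assumption).
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]"]

# Build the pipeline: normalizer -> pre-tokenizer -> BPE model.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()      # lossy: case information is discarded
tokenizer.pre_tokenizer = Whitespace()  # split into words and punctuation

# Train a new vocabulary directly from an in-memory iterator.
trainer = BpeTrainer(
    vocab_size=100, special_tokens=special_tokens, show_progress=False
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Post-processing: wrap every sequence in the special tokens the model needs.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Truncate and pad every encoding to exactly 8 positions.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=8
)

text = "Hello WORLD"
encoding = tokenizer.encode(text)

# Alignment tracking: offsets index into the ORIGINAL string, so the
# original casing can be recovered even though tokens are lowercased.
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    if token in special_tokens:
        continue  # special and padding tokens have no source span
    print(f"{token!r} <- {text[start:end]!r}")
```

The `attention_mask` attribute of the encoding marks which positions hold real tokens and which hold padding, which is typically what a model consumes alongside `ids`.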