| Tokenizers |
| ==================================================================================================== |
|
|
| Fast State-of-the-art tokenizers, optimized for both research and production |
|
|
| `π€ Tokenizers`_ provides an implementation of today's most used tokenizers, with |
| a focus on performance and versatility. These tokenizers are also used in |
| `π€ Transformers`_. |
|
|
| .. _π€ Tokenizers: https://github.com/huggingface/tokenizers |
| .. _π€ Transformers: https://github.com/huggingface/transformers |
|
|
| Main features: |
| ---------------------------------------------------------------------------------------------------- |
|
|
| - Train new vocabularies and tokenize, using today's most used tokenizers. |
| - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes |
| less than 20 seconds to tokenize a GB of text on a server's CPU. |
| - Easy to use, but also extremely versatile. |
| - Designed for both research and production. |
| - Full alignment tracking. Even with destructive normalization, it's always possible to get |
| the part of the original sentence that corresponds to any token. |
| - Does all the pre-processing: Truncation, Padding, add the special tokens your model needs. |
|
|
|
|
| .. toctree:: |
| :maxdepth: 2 |
| :caption: Getting Started |
|
|
| quicktour |
| installation/main |
| pipeline |
| components |
|
|
| .. toctree-tags:: |
| :maxdepth: 3 |
| :caption: Using π€ Tokenizers |
| :glob: |
|
|
| :python:tutorials/python/* |
|
|
| .. toctree:: |
| :maxdepth: 3 |
| :caption: API Reference |
|
|
| api/reference |
|
|
| .. include:: entities.inc |
|
|