# Tokenizers
Fast, state-of-the-art tokenizers, optimized for both research and
production.
[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an
implementation of today's most used tokenizers, with a focus on
performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).
# Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
- Does all the pre-processing: truncation, padding, and adding the special tokens your model needs.
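
To make the alignment-tracking feature concrete, here is a minimal conceptual sketch in plain Python. It is not the library's implementation or API: the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalizer (lowercasing plus accent stripping) and whitespace splitting are assumptions chosen only to show how offsets into the original text can survive a destructive normalization.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents (a destructive normalization).

    Returns the normalized string plus, for every normalized character,
    the index of the original character it came from.
    """
    chars, origins = [], []
    for i, ch in enumerate(text):
        # NFD splits e.g. "é" into "e" + a combining accent; drop the accent.
        for piece in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(piece):
                continue
            chars.append(piece)
            origins.append(i)
    return "".join(chars), origins


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace (hypothetical pre-tokenizer)
    and report each token with its (start, end) span in the ORIGINAL text."""
    normalized, origins = normalize_with_alignment(text)
    tokens, start = [], None
    # A trailing space acts as a sentinel so the last token is flushed.
    for pos, ch in enumerate(normalized + " "):
        if ch.isspace():
            if start is not None:
                span = (origins[start], origins[pos - 1] + 1)
                tokens.append((normalized[start:pos], span))
                start = None
        elif start is None:
            start = pos
    return tokens


if __name__ == "__main__":
    text = "H\u00e9llo  W\u00d6RLD"  # "Héllo  WÖRLD"
    for token, (start, end) in tokenize_with_offsets(text):
        print(token, "<-", text[start:end])
```

In this sketch the normalized tokens `hello` and `world` map back to the original spans `Héllo` and `WÖRLD`, even though the accents and casing were discarded. The library itself tracks this kind of mapping through its whole pipeline; in the Python bindings, the result of encoding exposes it as `offsets`.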
