	# Tokenizers

	Fast, state-of-the-art tokenizers, optimized for both research and
	production.

	[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides
	implementations of today's most widely used tokenizers, with a focus on
	performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

	## Main features:

	- Train new vocabularies and tokenize, using today's most widely used tokenizers.
	- Extremely fast (both training and tokenization), thanks to the Rust implementation. It takes less than 20 seconds to tokenize 1 GB of text on a server CPU.
	- Easy to use, but also extremely versatile.
	- Designed for both research and production.
	- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
	- Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
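
To make the alignment-tracking feature concrete, here is a minimal pure-Python sketch of the idea. It is a toy illustration, not the library's Rust implementation: the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and lowercasing plus accent stripping is just one assumed example of a destructive normalization. The point is that recording, for every normalized character, the index of the original character it came from is enough to map any token back to a span of the original text.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording for each output character
    the index of the original character it was derived from."""
    out_chars = []
    origins = []
    for i, ch in enumerate(text):
        # NFD splits an accented letter into a base letter plus combining marks.
        for piece in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(piece):
                continue  # destructive step: the accent is dropped
            out_chars.append(piece)
            origins.append(i)
    return "".join(out_chars), origins


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and return
    (token, (start, end)) pairs, where start/end index the ORIGINAL text."""
    normalized, origins = normalize_with_alignment(text)
    tokens = []
    start = None
    # The trailing sentinel space flushes the last token.
    for pos, ch in enumerate(normalized + " "):
        if ch.isspace():
            if start is not None:
                span = (origins[start], origins[pos - 1] + 1)
                tokens.append((normalized[start:pos], span))
                start = None
        elif start is None:
            start = pos
    return tokens


text = "H\u00e9llo W\u00f6rld"  # "Héllo Wörld"
for token, (begin, end) in tokenize_with_offsets(text):
    # Each normalized token maps back to its accented, cased original.
    print(token, (begin, end), text[begin:end])
```

Even though normalization discards case and accents, slicing the original string with the recorded offsets recovers `Héllo` for the token `hello` and `Wörld` for the token `world`.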
