	<!-- DISABLE-FRONTMATTER-SECTIONS -->

	# Tokenizers

	Fast, state-of-the-art tokenizers, optimized for both research and
	production.

	[🤗 Tokenizers](https://github.com/huggingface/tokenizers) provides an
	implementation of today's most used tokenizers, with a focus on
	performance and versatility. These tokenizers are also used in [🤗 Transformers](https://github.com/huggingface/transformers).

	## Main features

	- Train new vocabularies and tokenize, using today's most used tokenizers.
	- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
	- Easy to use, but also extremely versatile.
	- Designed for both research and production.
	- Full alignment tracking. Even with destructive normalization, it's always possible to get the part of the original sentence that corresponds to any token.
	- Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
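
The features above can be illustrated with a minimal sketch using the Python bindings of 🤗 Tokenizers. The toy corpus, the `[UNK]`/`[PAD]` token names, the vocabulary size, and the target length of 8 are assumptions chosen for illustration, not recommended settings. The sketch trains a small BPE vocabulary, maps tokens back to the original text after a destructive normalization, and then enables truncation and padding.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Toy corpus (illustrative only); real training would use far more text.
corpus = [
    "Tokenizers are fast.",
    "Héllo wörld, tokenizers track alignment.",
    "Padding and truncation are built in.",
]

# BPE model with a destructive normalizer: accents are stripped, text is lowercased.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents(), normalizers.Lowercase()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train a new vocabulary; special tokens receive the first ids in the order given.
trainer = trainers.BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Alignment tracking: offsets index into the ORIGINAL string, so the accented
# characters can be recovered even though normalization removed them.
text = "Héllo wörld"
encoding = tokenizer.encode(text)
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(f"{token!r} <- {text[start:end]!r}")

# Pre-processing: truncate long inputs and pad short ones to a fixed length.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=8
)
batch = tokenizer.encode_batch(corpus)
print([len(e.ids) for e in batch])
```

The exact tokens produced depend on the merges learned from the corpus, so the sketch prints them rather than assuming a particular split.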