	Tokenizers
	====================================================================================================

	Fast, state-of-the-art tokenizers, optimized for both research and production.

	`🤗 Tokenizers`_ provides an implementation of today's most used tokenizers, with
	a focus on performance and versatility. These tokenizers are also used in
	`🤗 Transformers`_.

	.. _🤗 Tokenizers: https://github.com/huggingface/tokenizers
	.. _🤗 Transformers: https://github.com/huggingface/transformers

	Main features:
	----------------------------------------------------------------------------------------------------

	- Train new vocabularies and tokenize, using today's most used tokenizers.
	- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
	  less than 20 seconds to tokenize a GB of text on a server's CPU.
	- Easy to use, but also extremely versatile.
	- Designed for both research and production.
	- Full alignment tracking. Even with destructive normalization, it's always possible to get
	  the part of the original sentence that corresponds to any token.
	- Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
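
The alignment-tracking and pre-processing features above can be illustrated with a small standard-library sketch. This is not the library's API or its Rust implementation. It is a minimal illustration that assumes a normalizer which lowercases and strips accents, a whitespace pre-tokenizer, and hypothetical special-token ids (``pad_id=0``, ``cls_id=1``, ``sep_id=2``). All function names here are invented for the example.

```python
import re
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, remembering where each output character came from."""
    chars, alignment = [], []
    for index, char in enumerate(text):
        # NFD splits an accented character into a base character plus combining marks.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue  # destructive step: the accent is dropped
            for lowered in piece.lower():
                chars.append(lowered)
                alignment.append(index)  # position in the original text
    return "".join(chars), alignment


def token_spans(normalized):
    """Whitespace pre-tokenization: (token, start, end) spans in the normalized text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", normalized)]


def original_span(alignment, start, end):
    """Map a non-empty span of the normalized text back to a span of the original text."""
    return alignment[start], alignment[end - 1] + 1


def add_special_tokens_and_pad(ids, max_length, pad_id=0, cls_id=1, sep_id=2):
    """Truncate, wrap in hypothetical [CLS]/[SEP] ids, then pad; requires max_length >= 2."""
    body = ids[: max_length - 2]  # leave room for the two special tokens
    sequence = [cls_id] + body + [sep_id]
    padding = max_length - len(sequence)
    attention_mask = [1] * len(sequence) + [0] * padding
    return sequence + [pad_id] * padding, attention_mask


text = "H\u00e9llo W\u00f6rld"  # "Héllo Wörld"
normalized, alignment = normalize_with_alignment(text)
for token, start, end in token_spans(normalized):
    lo, hi = original_span(alignment, start, end)
    print(token, "->", text[lo:hi])

print(add_special_tokens_and_pad([5, 6, 7], max_length=6))
```

The key design choice is that the normalizer records, for every character it emits, the index of the source character. A token's span in the normalized text can then be translated back to the original sentence even though accents and casing were discarded.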


	.. toctree::
	    :maxdepth: 2
	    :caption: Getting Started

	    quicktour
	    installation/main
	    pipeline
	    components

	.. toctree-tags::
	    :maxdepth: 3
	    :caption: Using 🤗 Tokenizers
	    :glob:

	    :python:tutorials/python/*

	.. toctree::
	    :maxdepth: 3
	    :caption: API Reference

	    api/reference

	.. include:: entities.inc