offline_compression_graph_code

72c0672 verified 4 months ago

24.4 kB

	Quicktour
	====================================================================================================

	Let's have a quick look at the 🤗 Tokenizers library features. The library provides an
	implementation of today's most used tokenizers that is both easy to use and blazing fast.

	.. only:: python

	It can be used to instantiate a :ref:`pretrained tokenizer <pretrained>` but we will start our
	quicktour by building one from scratch and see how we can train it.


	Build a tokenizer from scratch
	----------------------------------------------------------------------------------------------------

	To illustrate how fast the 🤗 Tokenizers library is, let's train a new tokenizer on `wikitext-103
	<https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ (516M of
	text) in just a few seconds. First things first, you will need to download this dataset and unzip it
	with:

	.. code-block:: bash

	wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
	unzip wikitext-103-raw-v1.zip

	Training the tokenizer
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	.. entities:: python

	BpeTrainer
	:class:`~tokenizers.trainers.BpeTrainer`
	vocab_size
	:obj:`vocab_size`
	min_frequency
	:obj:`min_frequency`
	special_tokens
	:obj:`special_tokens`
	unk_token
	:obj:`unk_token`
	pad_token
	:obj:`pad_token`

	.. entities:: rust

	BpeTrainer
	:rust_struct:`~tokenizers::models::bpe::BpeTrainer`
	vocab_size
	:obj:`vocab_size`
	min_frequency
	:obj:`min_frequency`
	special_tokens
	:obj:`special_tokens`
	unk_token
	:obj:`unk_token`
	pad_token
	:obj:`pad_token`

	.. entities:: node

	BpeTrainer
	BpeTrainer
	vocab_size
	:obj:`vocabSize`
	min_frequency
	:obj:`minFrequency`
	special_tokens
	:obj:`specialTokens`
	unk_token
	:obj:`unkToken`
	pad_token
	:obj:`padToken`

	In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information
	about the different type of tokenizers, check out this `guide
	<https://huggingface.co/docs/transformers/main/en/tokenizer_summary#summary-of-the-tokenizers>`__ in the 🤗 Transformers
	documentation. Here, training the tokenizer means it will learn merge rules by:

	- Start with all the characters present in the training corpus as tokens.
	- Identify the most common pair of tokens and merge it into one token.
	- Repeat until the vocabulary (e.g., the number of tokens) has reached the size we want.

	The main API of the library is the :entity:`class` :entity:`Tokenizer`, here is how we instantiate
	one with a BPE model:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START init_tokenizer
	:end-before: END init_tokenizer
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_init_tokenizer
	:end-before: END quicktour_init_tokenizer
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START init_tokenizer
	:end-before: END init_tokenizer
	:dedent: 4

	To train our tokenizer on the wikitext files, we will need to instantiate a `trainer`, in this case
	a :entity:`BpeTrainer`

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START init_trainer
	:end-before: END init_trainer
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_init_trainer
	:end-before: END quicktour_init_trainer
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START init_trainer
	:end-before: END init_trainer
	:dedent: 4

	We can set the training arguments like :entity:`vocab_size` or :entity:`min_frequency` (here left at
	their default values of 30,000 and 0) but the most important part is to give the
	:entity:`special_tokens` we plan to use later on (they are not used at all during training) so that
	they get inserted in the vocabulary.

	.. note::

	The order in which you write the special tokens list matters: here :obj:`"[UNK]"` will get the
	ID 0, :obj:`"[CLS]"` will get the ID 1 and so forth.

	We could train our tokenizer right now, but it wouldn't be optimal. Without a pre-tokenizer that
	will split our inputs into words, we might get tokens that overlap several words: for instance we
	could get an :obj:`"it is"` token since those two words often appear next to each other. Using a
	pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer. Here we want
	to train a subword BPE tokenizer, and we will use the easiest pre-tokenizer possible by splitting
	on whitespace.

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START init_pretok
	:end-before: END init_pretok
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_init_pretok
	:end-before: END quicktour_init_pretok
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START init_pretok
	:end-before: END init_pretok
	:dedent: 4

	Now, we can just call the :entity:`Tokenizer.train` method with any list of files we want
	to use:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START train
	:end-before: END train
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_train
	:end-before: END quicktour_train
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START train
	:end-before: END train
	:dedent: 4

	This should only take a few seconds to train our tokenizer on the full wikitext dataset!
	To save the tokenizer in one file that contains all its configuration and vocabulary, just use the
	:entity:`Tokenizer.save` method:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START save
	:end-before: END save
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_save
	:end-before: END quicktour_save
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START save
	:end-before: END save
	:dedent: 4

	and you can reload your tokenizer from that file with the :entity:`Tokenizer.from_file`
	:entity:`classmethod`:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START reload_tokenizer
	:end-before: END reload_tokenizer
	:dedent: 12

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_reload_tokenizer
	:end-before: END quicktour_reload_tokenizer
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START reload_tokenizer
	:end-before: END reload_tokenizer
	:dedent: 4

	Using the tokenizer
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	Now that we have trained a tokenizer, we can use it on any text we want with the
	:entity:`Tokenizer.encode` method:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START encode
	:end-before: END encode
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_encode
	:end-before: END quicktour_encode
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START encode
	:end-before: END encode
	:dedent: 4

	This applied the full pipeline of the tokenizer on the text, returning an
	:entity:`Encoding` object. To learn more about this pipeline, and how to apply (or
	customize) parts of it, check out :doc:`this page <pipeline>`.

	This :entity:`Encoding` object then has all the attributes you need for your deep
	learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in
	tokens:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_tokens
	:end-before: END print_tokens
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_tokens
	:end-before: END quicktour_print_tokens
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_tokens
	:end-before: END print_tokens
	:dedent: 4

	Similarly, the :obj:`ids` attribute will contain the index of each of those tokens in the
	tokenizer's vocabulary:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_ids
	:end-before: END print_ids
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_ids
	:end-before: END quicktour_print_ids
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_ids
	:end-before: END print_ids
	:dedent: 4

	An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking,
	meaning you can always get the part of your original sentence that corresponds to a given token.
	Those are stored in the :obj:`offsets` attribute of our :entity:`Encoding` object. For
	instance, let's assume we would want to find back what caused the :obj:`"[UNK]"` token to appear,
	which is the token at index 9 in the list, we can just ask for the offset at the index:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_offsets
	:end-before: END print_offsets
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_offsets
	:end-before: END quicktour_print_offsets
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_offsets
	:end-before: END print_offsets
	:dedent: 4

	and those are the indices that correspond to the emoji in the original sentence:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START use_offsets
	:end-before: END use_offsets
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_use_offsets
	:end-before: END quicktour_use_offsets
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START use_offsets
	:end-before: END use_offsets
	:dedent: 4

	Post-processing
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	We might want our tokenizer to automatically add special tokens, like :obj:`"[CLS]"` or
	:obj:`"[SEP]"`. To do this, we use a post-processor. :entity:`TemplateProcessing` is the
	most commonly used, you just have to specify a template for the processing of single sentences and
	pairs of sentences, along with the special tokens and their IDs.

	When we built our tokenizer, we set :obj:`"[CLS]"` and :obj:`"[SEP]"` in positions 1 and 2 of our
	list of special tokens, so this should be their IDs. To double-check, we can use the
	:entity:`Tokenizer.token_to_id` method:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START check_sep
	:end-before: END check_sep
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_check_sep
	:end-before: END quicktour_check_sep
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START check_sep
	:end-before: END check_sep
	:dedent: 4

	Here is how we can set the post-processing to give us the traditional BERT inputs:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START init_template_processing
	:end-before: END init_template_processing
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_init_template_processing
	:end-before: END quicktour_init_template_processing
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START init_template_processing
	:end-before: END init_template_processing
	:dedent: 4

	Let's go over this snippet of code in more details. First we specify the template for single
	sentences: those should have the form :obj:`"[CLS] $A [SEP]"` where :obj:`$A` represents our
	sentence.

	Then, we specify the template for sentence pairs, which should have the form
	:obj:`"[CLS] $A [SEP] $B [SEP]"` where :obj:`$A` represents the first sentence and :obj:`$B` the
	second one. The :obj:`:1` added in the template represent the `type IDs` we want for each part of
	our input: it defaults to 0 for everything (which is why we don't have :obj:`$A:0`) and here we set
	it to 1 for the tokens of the second sentence and the last :obj:`"[SEP]"` token.

	Lastly, we specify the special tokens we used and their IDs in our tokenizer's vocabulary.

	To check out this worked properly, let's try to encode the same sentence as before:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_special_tokens
	:end-before: END print_special_tokens
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_special_tokens
	:end-before: END quicktour_print_special_tokens
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_special_tokens
	:end-before: END print_special_tokens
	:dedent: 4

	To check the results on a pair of sentences, we just pass the two sentences to
	:entity:`Tokenizer.encode`:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_special_tokens_pair
	:end-before: END print_special_tokens_pair
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_special_tokens_pair
	:end-before: END quicktour_print_special_tokens_pair
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_special_tokens_pair
	:end-before: END print_special_tokens_pair
	:dedent: 4

	You can then check the type IDs attributed to each token is correct with

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_type_ids
	:end-before: END print_type_ids
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_type_ids
	:end-before: END quicktour_print_type_ids
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_type_ids
	:end-before: END print_type_ids
	:dedent: 4

	If you save your tokenizer with :entity:`Tokenizer.save`, the post-processor will be saved along.

	Encoding multiple sentences in a batch
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	To get the full speed of the 🤗 Tokenizers library, it's best to process your texts by batches by
	using the :entity:`Tokenizer.encode_batch` method:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START encode_batch
	:end-before: END encode_batch
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_encode_batch
	:end-before: END quicktour_encode_batch
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START encode_batch
	:end-before: END encode_batch
	:dedent: 4

	The output is then a list of :entity:`Encoding` objects like the ones we saw before. You
	can process together as many texts as you like, as long as it fits in memory.

	To process a batch of sentences pairs, pass two lists to the
	:entity:`Tokenizer.encode_batch` method: the list of sentences A and the list of sentences
	B:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START encode_batch_pair
	:end-before: END encode_batch_pair
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_encode_batch_pair
	:end-before: END quicktour_encode_batch_pair
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START encode_batch_pair
	:end-before: END encode_batch_pair
	:dedent: 4

	When encoding multiple sentences, you can automatically pad the outputs to the longest sentence
	present by using :entity:`Tokenizer.enable_padding`, with the :entity:`pad_token` and its ID
	(which we can double-check the id for the padding token with
	:entity:`Tokenizer.token_to_id` like before):

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START enable_padding
	:end-before: END enable_padding
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_enable_padding
	:end-before: END quicktour_enable_padding
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START enable_padding
	:end-before: END enable_padding
	:dedent: 4

	We can set the :obj:`direction` of the padding (defaults to the right) or a given :obj:`length` if
	we want to pad every sample to that specific number (here we leave it unset to pad to the size of
	the longest text).

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_batch_tokens
	:end-before: END print_batch_tokens
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_batch_tokens
	:end-before: END quicktour_print_batch_tokens
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_batch_tokens
	:end-before: END print_batch_tokens
	:dedent: 4

	In this case, the `attention mask` generated by the tokenizer takes the padding into account:

	.. only:: python

	.. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py
	:language: python
	:start-after: START print_attention_mask
	:end-before: END print_attention_mask
	:dedent: 8

	.. only:: rust

	.. literalinclude:: ../../tokenizers/tests/documentation.rs
	:language: rust
	:start-after: START quicktour_print_attention_mask
	:end-before: END quicktour_print_attention_mask
	:dedent: 4

	.. only:: node

	.. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts
	:language: javascript
	:start-after: START print_attention_mask
	:end-before: END print_attention_mask
	:dedent: 4

	.. _pretrained:

	.. only:: python

	Using a pretrained tokenizer
	------------------------------------------------------------------------------------------------

	You can load any tokenizer from the Hugging Face Hub as long as a `tokenizer.json` file is
	available in the repository.

	.. code-block:: python

	from tokenizers import Tokenizer

	tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

	Importing a pretrained tokenizer from legacy vocabulary files
	------------------------------------------------------------------------------------------------

	You can also import a pretrained tokenizer directly in, as long as you have its vocabulary file.
	For instance, here is how to import the classic pretrained BERT tokenizer:

	.. code-block:: python

	from tokenizers import BertWordPieceTokenizer

	tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

	as long as you have downloaded the file `bert-base-uncased-vocab.txt` with

	.. code-block:: bash

	wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt