File size: 23,270 Bytes

72c0672

# Quicktour

Let's have a quick look at the 🤗 Tokenizers library features. The
library provides an implementation of today's most used tokenizers that
is both easy to use and blazing fast.

## Build a tokenizer from scratch

To illustrate how fast the 🤗 Tokenizers library is, let's train a new
tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
(516M of text) in just a few seconds. First things first, you will need
to download this dataset and unzip it with:

``` bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```

### Training the tokenizer

In this tour, we will build and train a Byte-Pair Encoding (BPE)
tokenizer. For more information about the different type of tokenizers,
check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
the 🤗 Transformers documentation. Here, training the tokenizer means it
will learn merge rules by:

-   Start with all the characters present in the training corpus as
    tokens.
-   Identify the most common pair of tokens and merge it into one token.
-   Repeat until the vocabulary (e.g., the number of tokens) has reached
    the size we want.

The main API of the library is the `class` `Tokenizer`, here is how
we instantiate one with a BPE model:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_tokenizer",
"end-before": "END quicktour_init_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

To train our tokenizer on the wikitext files, we will need to
instantiate a [trainer]{.title-ref}, in this case a
`BpeTrainer`

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_trainer",
"end-before": "END quicktour_init_trainer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can set the training arguments like `vocab_size` or `min_frequency` (here
left at their default values of 30,000 and 0) but the most important
part is to give the `special_tokens` we
plan to use later on (they are not used at all during training) so that
they get inserted in the vocabulary.

<Tip>

The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
`"[CLS]"` will get the ID 1 and so forth.

</Tip>

We could train our tokenizer right now, but it wouldn't be optimal.
Without a pre-tokenizer that will split our inputs into words, we might
get tokens that overlap several words: for instance we could get an
`"it is"` token since those two words
often appear next to each other. Using a pre-tokenizer will ensure no
token is bigger than a word returned by the pre-tokenizer. Here we want
to train a subword BPE tokenizer, and we will use the easiest
pre-tokenizer possible by splitting on whitespace.

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_pretok",
"end-before": "END quicktour_init_pretok",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Now, we can just call the `Tokenizer.train` method with any list of files we want to use:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_train",
"end-before": "END quicktour_train",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

This should only take a few seconds to train our tokenizer on the full
wikitext dataset! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the
`Tokenizer.save` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_save",
"end-before": "END quicktour_save",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

and you can reload your tokenizer from that file with the
`Tokenizer.from_file`
`classmethod`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 12}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_reload_tokenizer",
"end-before": "END quicktour_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

### Using the tokenizer

Now that we have trained a tokenizer, we can use it on any text we want
with the `Tokenizer.encode` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode",
"end-before": "END quicktour_encode",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

This applied the full pipeline of the tokenizer on the text, returning
an `Encoding` object. To learn more
about this pipeline, and how to apply (or customize) parts of it, check out [this page](pipeline).

This `Encoding` object then has all the
attributes you need for your deep learning model (or other). The
`tokens` attribute contains the
segmentation of your text in tokens:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_tokens",
"end-before": "END quicktour_print_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Similarly, the `ids` attribute will
contain the index of each of those tokens in the tokenizer's
vocabulary:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_ids",
"end-before": "END quicktour_print_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. Those are stored in
the `offsets` attribute of our
`Encoding` object. For instance, let's
assume we would want to find back what caused the
`"[UNK]"` token to appear, which is the
token at index 9 in the list, we can just ask for the offset at the
index:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_offsets",
"end-before": "END quicktour_print_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

and those are the indices that correspond to the emoji in the original
sentence:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_use_offsets",
"end-before": "END quicktour_use_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

### Post-processing

We might want our tokenizer to automatically add special tokens, like
`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
`TemplateProcessing` is the most
commonly used, you just have to specify a template for the processing of
single sentences and pairs of sentences, along with the special tokens
and their IDs.

When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
and 2 of our list of special tokens, so this should be their IDs. To
double-check, we can use the `Tokenizer.token_to_id` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_check_sep",
"end-before": "END quicktour_check_sep",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Here is how we can set the post-processing to give us the traditional
BERT inputs:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_template_processing",
"end-before": "END quicktour_init_template_processing",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

Let's go over this snippet of code in more details. First we specify
the template for single sentences: those should have the form
`"[CLS] $A [SEP]"` where
`$A` represents our sentence.

Then, we specify the template for sentence pairs, which should have the
form `"[CLS] $A [SEP] $B [SEP]"` where
`$A` represents the first sentence and
`$B` the second one. The
`:1` added in the template represent the `type IDs` we want for each part of our input: it defaults
to 0 for everything (which is why we don't have
`$A:0`) and here we set it to 1 for the
tokens of the second sentence and the last `"[SEP]"` token.

Lastly, we specify the special tokens we used and their IDs in our
tokenizer's vocabulary.

To check out this worked properly, let's try to encode the same
sentence as before:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens",
"end-before": "END quicktour_print_special_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

To check the results on a pair of sentences, we just pass the two
sentences to `Tokenizer.encode`:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens_pair",
"end-before": "END quicktour_print_special_tokens_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

You can then check the type IDs attributed to each token is correct with

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_type_ids",
"end-before": "END quicktour_print_type_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along.

### Encoding multiple sentences in a batch

To get the full speed of the 🤗 Tokenizers library, it's best to
process your texts by batches by using the
`Tokenizer.encode_batch` method:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch",
"end-before": "END quicktour_encode_batch",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

The output is then a list of `Encoding`
objects like the ones we saw before. You can process together as many
texts as you like, as long as it fits in memory.

To process a batch of sentences pairs, pass two lists to the
`Tokenizer.encode_batch` method: the
list of sentences A and the list of sentences B:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch_pair",
"end-before": "END quicktour_encode_batch_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using
`Tokenizer.enable_padding`, with the
`pad_token` and its ID (which we can
double-check the id for the padding token with
`Tokenizer.token_to_id` like before):

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_enable_padding",
"end-before": "END quicktour_enable_padding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

We can set the `direction` of the padding
(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
we leave it unset to pad to the size of the longest text).

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_batch_tokens",
"end-before": "END quicktour_print_batch_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

In this case, the `attention mask` generated by the
tokenizer takes the padding into account:

<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_attention_mask",
"end-before": "END quicktour_print_attention_mask",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>

## Pretrained

<tokenizerslangcontent>
<python>
### Using a pretrained tokenizer

You can load any tokenizer from the Hugging Face Hub as long as a
`tokenizer.json` file is available in the repository.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```

### Importing a pretrained tokenizer from legacy vocabulary files

You can also import a pretrained tokenizer directly in, as long as you
have its vocabulary file. For instance, here is how to import the
classic pretrained BERT tokenizer:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
```

as long as you have downloaded the file `bert-base-uncased-vocab.txt` with

```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```
</python>
</tokenizerslangcontent>