| Quicktour |
| ==================================================================================================== |
|
|
| Let's have a quick look at the 🤗 Tokenizers library features. The library provides an |
| implementation of today's most used tokenizers that is both easy to use and blazing fast. |
|
|
| .. only:: python |
|
|
| It can be used to instantiate a :ref:`pretrained tokenizer <pretrained>`, but we will start our |
| quicktour by building one from scratch and seeing how we can train it. |
|
|
|
|
| Build a tokenizer from scratch |
| ---------------------------------------------------------------------------------------------------- |
|
|
| To illustrate how fast the 🤗 Tokenizers library is, let's train a new tokenizer on `wikitext-103 |
| <https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ (516 MB of |
| text) in just a few seconds. First things first, you will need to download this dataset and unzip it |
| with: |
|
|
| .. code-block:: bash |
|
|
| wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip |
| unzip wikitext-103-raw-v1.zip |
|
|
| Training the tokenizer |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| .. entities:: python |
|
|
| BpeTrainer |
| :class:`~tokenizers.trainers.BpeTrainer` |
| vocab_size |
| :obj:`vocab_size` |
| min_frequency |
| :obj:`min_frequency` |
| special_tokens |
| :obj:`special_tokens` |
| unk_token |
| :obj:`unk_token` |
| pad_token |
| :obj:`pad_token` |
|
|
| .. entities:: rust |
|
|
| BpeTrainer |
| :rust_struct:`~tokenizers::models::bpe::BpeTrainer` |
| vocab_size |
| :obj:`vocab_size` |
| min_frequency |
| :obj:`min_frequency` |
| special_tokens |
| :obj:`special_tokens` |
| unk_token |
| :obj:`unk_token` |
| pad_token |
| :obj:`pad_token` |
|
|
| .. entities:: node |
|
|
| BpeTrainer |
| BpeTrainer |
| vocab_size |
| :obj:`vocabSize` |
| min_frequency |
| :obj:`minFrequency` |
| special_tokens |
| :obj:`specialTokens` |
| unk_token |
| :obj:`unkToken` |
| pad_token |
| :obj:`padToken` |
|
|
| In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information |
| about the different types of tokenizers, check out this `guide |
| <https://huggingface.co/docs/transformers/main/en/tokenizer_summary#summary-of-the-tokenizers>`__ in the 🤗 Transformers |
| documentation. Here, training the tokenizer means it will learn merge rules by: |
|
|
| - Starting with all the characters present in the training corpus as tokens. |
| - Identifying the most common pair of tokens and merging it into one token. |
| - Repeating until the vocabulary (i.e., the number of tokens) has reached the size we want. |
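|
|
| The three steps above can be sketched in plain Python. This is a toy illustration and not the |
| library's implementation: the function name :obj:`learn_bpe_merges`, the tiny corpus and the |
| tie-breaking rule are all assumptions made for the example. |
|
```python
from collections import Counter


def learn_bpe_merges(words, vocab_size):
    """Toy BPE training: return (vocab, merges) learned from a list of words."""
    # Each distinct word is a tuple of symbols (single characters at first) with a count.
    word_counts = Counter(tuple(word) for word in words)
    # Step 1: start with every character present in the corpus.
    vocab = sorted({ch for word in word_counts for ch in word})
    merges = []

    while len(vocab) < vocab_size:
        # Step 2: count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        # Most frequent pair wins; ties are broken by pair order (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab.append(token)

        # Apply the merge everywhere it occurs.
        new_counts = Counter()
        for symbols, count in word_counts.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(token)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
        # Step 3: the loop repeats until the vocabulary reaches vocab_size.
    return vocab, merges


corpus = ["low", "low", "low", "lower", "lowest", "newer", "newer"]
vocab, merges = learn_bpe_merges(corpus, vocab_size=10)
print(merges)
```
|
| With :obj:`vocab_size=10`, the eight distinct characters leave room for two merges: :obj:`("o", "w")` |
| first, then :obj:`("l", "ow")`, so that :obj:`low` ends up as a single token. |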
|
|
| The main API of the library is the :entity:`class` :entity:`Tokenizer`. Here is how we instantiate |
| one with a BPE model: |
|
|
| .. only:: python |
|
|
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START init_tokenizer |
| :end-before: END init_tokenizer |
| :dedent: 8 |
|
|
| .. only:: rust |
|
|
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_init_tokenizer |
| :end-before: END quicktour_init_tokenizer |
| :dedent: 4 |
|
|
| .. only:: node |
|
|
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START init_tokenizer |
| :end-before: END init_tokenizer |
| :dedent: 4 |
|
|
| To train our tokenizer on the wikitext files, we will need to instantiate a `trainer`, in this case |
| a :entity:`BpeTrainer`: |
|
|
| .. only:: python |
|
|
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START init_trainer |
| :end-before: END init_trainer |
| :dedent: 8 |
|
|
| .. only:: rust |
|
|
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_init_trainer |
| :end-before: END quicktour_init_trainer |
| :dedent: 4 |
|
|
| .. only:: node |
|
|
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START init_trainer |
| :end-before: END init_trainer |
| :dedent: 4 |
|
|
| We can set training arguments such as :entity:`vocab_size` or :entity:`min_frequency` (here left at |
| their default values of 30,000 and 0), but the most important part is to give the |
| :entity:`special_tokens` we plan to use later on (they are not used at all during training) so that |
| they get inserted in the vocabulary. |
|
|
| .. note:: |
|
|
| The order in which you write the special tokens list matters: here :obj:`"[UNK]"` will get the |
| ID 0, :obj:`"[CLS]"` will get the ID 1 and so forth. |
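|
|
| As a rough picture of why the order matters, one can think of the special tokens as being placed |
| at the front of the vocabulary. The sketch below is an assumption-laden illustration: the exact list |
| (in particular the positions of :obj:`"[PAD]"` and :obj:`"[MASK]"`) and the learned tokens are |
| hypothetical. |
|
```python
# Assumed special tokens list, in the order it would be given to the trainer.
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
# Hypothetical tokens learned during training; they come after the special tokens.
learned_tokens = ["a", "b", "ab"]

# IDs are simply positions in the combined list.
vocab = {token: index for index, token in enumerate(special_tokens + learned_tokens)}
print(vocab["[UNK]"], vocab["[CLS]"], vocab["[SEP]"])
```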
|
|
| We could train our tokenizer right now, but it wouldn't be optimal. Without a pre-tokenizer that |
| will split our inputs into words, we might get tokens that overlap several words: for instance we |
| could get an :obj:`"it is"` token since those two words often appear next to each other. Using a |
| pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer. Here we want |
| to train a subword BPE tokenizer, and we will use the easiest pre-tokenizer possible by splitting |
| on whitespace. |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START init_pretok |
| :end-before: END init_pretok |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_init_pretok |
| :end-before: END quicktour_init_pretok |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START init_pretok |
| :end-before: END init_pretok |
| :dedent: 4 |
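| |
| To see what such a split looks like, here is a minimal sketch in plain Python. The regular |
| expression (runs of word characters, or runs of punctuation) is an assumption about how a |
| whitespace-style pre-tokenizer behaves, and :obj:`pre_tokenize` is a hypothetical helper, not part |
| of the library's API. |
| |
```python
import re

# Assumed splitting rule: runs of word characters, or runs of non-space symbols.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs; whitespace never ends up inside a piece."""
    return [(m.group(), (m.start(), m.end())) for m in WORD_OR_PUNCT.finditer(text)]


pieces = pre_tokenize("it is fast!")
print(pieces)
```
| |
| Because merges are only learned inside each piece, a token such as :obj:`"it is"` cannot appear: |
| :obj:`"it"` and :obj:`"is"` are separate pieces. |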
| |
| Now, we can just call the :entity:`Tokenizer.train` method with any list of files we want |
| to use: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START train |
| :end-before: END train |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_train |
| :end-before: END quicktour_train |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START train |
| :end-before: END train |
| :dedent: 4 |
| |
| Training our tokenizer on the full wikitext dataset should only take a few seconds! |
| To save the tokenizer in one file that contains all its configuration and vocabulary, just use the |
| :entity:`Tokenizer.save` method: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START save |
| :end-before: END save |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_save |
| :end-before: END quicktour_save |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START save |
| :end-before: END save |
| :dedent: 4 |
| |
| and you can reload your tokenizer from that file with the :entity:`Tokenizer.from_file` |
| :entity:`classmethod`: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START reload_tokenizer |
| :end-before: END reload_tokenizer |
| :dedent: 12 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_reload_tokenizer |
| :end-before: END quicktour_reload_tokenizer |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START reload_tokenizer |
| :end-before: END reload_tokenizer |
| :dedent: 4 |
| |
| Using the tokenizer |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Now that we have trained a tokenizer, we can use it on any text we want with the |
| :entity:`Tokenizer.encode` method: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START encode |
| :end-before: END encode |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_encode |
| :end-before: END quicktour_encode |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START encode |
| :end-before: END encode |
| :dedent: 4 |
| |
| This applied the full pipeline of the tokenizer on the text, returning an |
| :entity:`Encoding` object. To learn more about this pipeline, and how to apply (or |
| customize) parts of it, check out :doc:`this page <pipeline>`. |
| |
| This :entity:`Encoding` object then has all the attributes you need for your deep |
| learning model (or any other use). The :obj:`tokens` attribute contains the segmentation of your |
| text in tokens: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_tokens |
| :end-before: END print_tokens |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_tokens |
| :end-before: END quicktour_print_tokens |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_tokens |
| :end-before: END print_tokens |
| :dedent: 4 |
| |
| Similarly, the :obj:`ids` attribute will contain the index of each of those tokens in the |
| tokenizer's vocabulary: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_ids |
| :end-before: END print_ids |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_ids |
| :end-before: END quicktour_print_ids |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_ids |
| :end-before: END print_ids |
| :dedent: 4 |
| |
| An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking, |
| meaning you can always get the part of your original sentence that corresponds to a given token. |
| Those are stored in the :obj:`offsets` attribute of our :entity:`Encoding` object. For |
| instance, suppose we want to find out what caused the :obj:`"[UNK]"` token to appear, |
| which is the token at index 9 in the list. We can just ask for the offset at that index: |
|
|
| .. only:: python |
|
|
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_offsets |
| :end-before: END print_offsets |
| :dedent: 8 |
|
|
| .. only:: rust |
|
|
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_offsets |
| :end-before: END quicktour_print_offsets |
| :dedent: 4 |
|
|
| .. only:: node |
|
|
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_offsets |
| :end-before: END print_offsets |
| :dedent: 4 |
|
|
| and those are the indices that correspond to the emoji in the original sentence: |
|
|
| .. only:: python |
|
|
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START use_offsets |
| :end-before: END use_offsets |
| :dedent: 8 |
|
|
| .. only:: rust |
|
|
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_use_offsets |
| :end-before: END quicktour_use_offsets |
| :dedent: 4 |
|
|
| .. only:: node |
|
|
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START use_offsets |
| :end-before: END use_offsets |
| :dedent: 4 |
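|
|
| The idea behind offsets can be reproduced with a toy stand-in for the tokenizer. Everything in the |
| sketch below is an assumption made for illustration: the example sentence, the tiny set of known |
| tokens and the regular expression used for splitting. It is not how the library computes offsets. |
|
```python
import re

sentence = "Hello, y'all! How are you 😁 ?"

# Hypothetical vocabulary: anything outside it becomes "[UNK]".
known = {"Hello", ",", "y", "'", "all", "!", "How", "are", "you", "?"}

tokens, offsets = [], []
for match in re.finditer(r"\w+|[^\w\s]+", sentence):
    piece = match.group()
    tokens.append(piece if piece in known else "[UNK]")
    # Keep the (start, end) character span of every piece in the original text.
    offsets.append((match.start(), match.end()))

unk_index = tokens.index("[UNK]")
start, end = offsets[unk_index]
# Slicing the original sentence with the offsets recovers what was replaced.
print(unk_index, (start, end), sentence[start:end])
```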
|
|
| Post-processing |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| We might want our tokenizer to automatically add special tokens, like :obj:`"[CLS]"` or |
| :obj:`"[SEP]"`. To do this, we use a post-processor. :entity:`TemplateProcessing` is the |
| most commonly used: you just have to specify a template for the processing of single sentences and |
| pairs of sentences, along with the special tokens and their IDs. |
|
|
| When we built our tokenizer, we set :obj:`"[CLS]"` and :obj:`"[SEP]"` in positions 1 and 2 of our |
| list of special tokens, so these should be their IDs. To double-check, we can use the |
| :entity:`Tokenizer.token_to_id` method: |
|
|
| .. only:: python |
|
|
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START check_sep |
| :end-before: END check_sep |
| :dedent: 8 |
|
|
| .. only:: rust |
|
|
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_check_sep |
| :end-before: END quicktour_check_sep |
| :dedent: 4 |
|
|
| .. only:: node |
|
|
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START check_sep |
| :end-before: END check_sep |
| :dedent: 4 |
|
|
| Here is how we can set the post-processing to give us the traditional BERT inputs: |
|
|
| .. only:: python |
|
|
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START init_template_processing |
| :end-before: END init_template_processing |
| :dedent: 8 |
|
|
| .. only:: rust |
|
|
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_init_template_processing |
| :end-before: END quicktour_init_template_processing |
| :dedent: 4 |
|
|
| .. only:: node |
|
|
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START init_template_processing |
| :end-before: END init_template_processing |
| :dedent: 4 |
|
|
| Let's go over this snippet of code in more detail. First we specify the template for single |
| sentences: those should have the form :obj:`"[CLS] $A [SEP]"` where :obj:`$A` represents our |
| sentence. |
| |
| Then, we specify the template for sentence pairs, which should have the form |
| :obj:`"[CLS] $A [SEP] $B [SEP]"` where :obj:`$A` represents the first sentence and :obj:`$B` the |
| second one. The :obj:`:1` added in the template represents the `type IDs` we want for each part of |
| our input: it defaults to 0 for everything (which is why we don't have :obj:`$A:0`) and here we set |
| it to 1 for the tokens of the second sentence and the last :obj:`"[SEP]"` token. |
| |
| Lastly, we specify the special tokens we used and their IDs in our tokenizer's vocabulary. |
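| |
| To make the template syntax concrete, here is a small sketch that fills such a template by hand. |
| :obj:`apply_template` is a hypothetical helper written for this page, not the library's |
| post-processor, and it assumes that special tokens do not themselves contain a colon. |
| |
```python
def apply_template(template, special_tokens, a, b=None):
    """Fill a template such as "[CLS] $A [SEP] $B:1 [SEP]:1" with token lists."""
    sequences = {"$A": a, "$B": b}
    tokens, type_ids = [], []
    for piece in template.split():
        # An optional ":N" suffix sets the type ID; it defaults to 0.
        name, has_suffix, suffix = piece.partition(":")
        type_id = int(suffix) if has_suffix else 0
        if name in sequences:
            filled = sequences[name]
            if filled is None:
                raise ValueError(f"{name} is used in the template but was not provided")
        elif name in special_tokens:
            filled = [name]
        else:
            raise ValueError(f"unknown template piece: {piece!r}")
        tokens.extend(filled)
        type_ids.extend([type_id] * len(filled))
    return tokens, type_ids


special = {"[CLS]": 1, "[SEP]": 2}  # special tokens and their IDs
single_tokens, single_types = apply_template("[CLS] $A [SEP]", special, ["Hello", "there"])
pair_tokens, pair_types = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1", special, ["Hello", "there"], ["How", "are", "you"]
)
print(pair_tokens)
print(pair_types)
```
| |
| In the pair case, the first sentence and its :obj:`"[SEP]"` keep type ID 0, while the second |
| sentence and the final :obj:`"[SEP]"` get type ID 1. |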
| |
| To check that this worked properly, let's try to encode the same sentence as before: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_special_tokens |
| :end-before: END print_special_tokens |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_special_tokens |
| :end-before: END quicktour_print_special_tokens |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_special_tokens |
| :end-before: END print_special_tokens |
| :dedent: 4 |
| |
| To check the results on a pair of sentences, we just pass the two sentences to |
| :entity:`Tokenizer.encode`: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_special_tokens_pair |
| :end-before: END print_special_tokens_pair |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_special_tokens_pair |
| :end-before: END quicktour_print_special_tokens_pair |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_special_tokens_pair |
| :end-before: END print_special_tokens_pair |
| :dedent: 4 |
| |
| You can then check that the type IDs attributed to each token are correct with: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_type_ids |
| :end-before: END print_type_ids |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_type_ids |
| :end-before: END quicktour_print_type_ids |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_type_ids |
| :end-before: END print_type_ids |
| :dedent: 4 |
| |
| If you save your tokenizer with :entity:`Tokenizer.save`, the post-processor will be saved along with it. |
| |
| Encoding multiple sentences in a batch |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| To get the full speed of the 🤗 Tokenizers library, it's best to process your texts in batches |
| using the :entity:`Tokenizer.encode_batch` method: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START encode_batch |
| :end-before: END encode_batch |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_encode_batch |
| :end-before: END quicktour_encode_batch |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START encode_batch |
| :end-before: END encode_batch |
| :dedent: 4 |
| |
| The output is then a list of :entity:`Encoding` objects like the ones we saw before. You |
| can process together as many texts as you like, as long as they fit in memory. |
| |
| To process a batch of sentence pairs, pass two lists to the |
| :entity:`Tokenizer.encode_batch` method: the list of sentences A and the list of sentences |
| B: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START encode_batch_pair |
| :end-before: END encode_batch_pair |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_encode_batch_pair |
| :end-before: END quicktour_encode_batch_pair |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START encode_batch_pair |
| :end-before: END encode_batch_pair |
| :dedent: 4 |
| |
| When encoding multiple sentences, you can automatically pad the outputs to the longest sentence |
| present by using :entity:`Tokenizer.enable_padding`, with the :entity:`pad_token` and its ID |
| (we can double-check the ID of the padding token with |
| :entity:`Tokenizer.token_to_id` like before): |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START enable_padding |
| :end-before: END enable_padding |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_enable_padding |
| :end-before: END quicktour_enable_padding |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START enable_padding |
| :end-before: END enable_padding |
| :dedent: 4 |
| |
| We can set the :obj:`direction` of the padding (defaults to the right) or a given :obj:`length` if |
| we want to pad every sample to that specific number (here we leave it unset to pad to the size of |
| the longest text). |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_batch_tokens |
| :end-before: END print_batch_tokens |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_batch_tokens |
| :end-before: END quicktour_print_batch_tokens |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_batch_tokens |
| :end-before: END print_batch_tokens |
| :dedent: 4 |
| |
| In this case, the `attention mask` generated by the tokenizer takes the padding into account: |
| |
| .. only:: python |
| |
| .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py |
| :language: python |
| :start-after: START print_attention_mask |
| :end-before: END print_attention_mask |
| :dedent: 8 |
| |
| .. only:: rust |
| |
| .. literalinclude:: ../../tokenizers/tests/documentation.rs |
| :language: rust |
| :start-after: START quicktour_print_attention_mask |
| :end-before: END quicktour_print_attention_mask |
| :dedent: 4 |
| |
| .. only:: node |
| |
| .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts |
| :language: javascript |
| :start-after: START print_attention_mask |
| :end-before: END print_attention_mask |
| :dedent: 4 |
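| |
| The relationship between padding and the attention mask can be sketched in a few lines of plain |
| Python. :obj:`pad_batch` is a hypothetical helper for illustration only; its parameters loosely |
| mirror the :obj:`direction` and :obj:`length` options described above, and unlike a real tokenizer |
| it works on lists of token strings and never truncates. |
| |
```python
def pad_batch(batch, pad_token="[PAD]", direction="right", length=None):
    """Pad each token list and build the matching attention masks."""
    # Without an explicit length, pad to the longest sequence in the batch.
    target = length if length is not None else max(len(tokens) for tokens in batch)
    padded, masks = [], []
    for tokens in batch:
        n_pad = max(target - len(tokens), 0)  # this sketch never truncates
        padding = [pad_token] * n_pad
        mask = [1] * len(tokens)  # 1 marks a real token, 0 marks padding
        if direction == "right":
            padded.append(tokens + padding)
            masks.append(mask + [0] * n_pad)
        else:
            padded.append(padding + tokens)
            masks.append([0] * n_pad + mask)
    return padded, masks


batch = [["[CLS]", "Hello", "[SEP]"], ["[CLS]", "How", "are", "you", "[SEP]"]]
padded, masks = pad_batch(batch)
left_padded, left_masks = pad_batch(batch, direction="left")
print(padded[0])
print(masks[0])
```
| |
| The shorter sequence receives two padding tokens, and its mask ends with two zeros so that a model |
| can ignore those positions. |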
| |
| .. _pretrained: |
| |
| .. only:: python |
| |
| Using a pretrained tokenizer |
| ------------------------------------------------------------------------------------------------ |
| |
| You can load any tokenizer from the Hugging Face Hub as long as a `tokenizer.json` file is |
| available in the repository. |
| |
| .. code-block:: python |
| |
| from tokenizers import Tokenizer |
| |
| tokenizer = Tokenizer.from_pretrained("bert-base-uncased") |
| |
| Importing a pretrained tokenizer from legacy vocabulary files |
| ------------------------------------------------------------------------------------------------ |
| |
| You can also import a pretrained tokenizer directly, as long as you have its vocabulary file. |
| For instance, here is how to import the classic pretrained BERT tokenizer: |
| |
| .. code-block:: python |
| |
| from tokenizers import BertWordPieceTokenizer |
| |
| tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True) |
| |
| as long as you have downloaded the file `bert-base-uncased-vocab.txt` with: |
| |
| .. code-block:: bash |
| |
| wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt |
| |