| # Quicktour |
|
|
| Let's have a quick look at the 🤗 Tokenizers library features. The |
| library provides an implementation of today's most used tokenizers that |
| is both easy to use and blazing fast. |
|
|
| ## Build a tokenizer from scratch |
|
|
| To illustrate how fast the 🤗 Tokenizers library is, let's train a new |
| tokenizer on [wikitext-103](https: |
| (516M of text) in just a few seconds. First things first, you will need |
| to download this dataset and unzip it with: |
|
|
| ``` bash |
| wget https: |
| unzip wikitext-103-raw-v1.zip |
| ``` |
|
|
| ### Training the tokenizer |
|
|
| In this tour, we will build and train a Byte-Pair Encoding (BPE) |
| tokenizer. For more information about the different types of tokenizers, |
| check out this [guide](https: |
| the 🤗 Transformers documentation. Here, training the tokenizer means it |
| will learn merge rules by: |
|
|
| - Starting with all the characters present in the training corpus as |
| tokens. |
| - Identifying the most common pair of tokens and merging it into one token. |
| - Repeating until the vocabulary (i.e., the number of tokens) has reached |
| the size we want. |
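
The three steps above can be sketched in plain Python. This is a toy illustration only, not the library's implementation: the function name `learn_bpe_merges`, the word counts, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(word_counts, vocab_size):
    """Learn BPE merge rules from a {word: frequency} mapping (toy sketch)."""
    # Step 1: every word starts as a sequence of single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    vocab = sorted({ch for word in words for ch in word})
    merges = []

    while len(vocab) < vocab_size:
        # Step 2: count every adjacent pair of tokens, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token

        # Most frequent pair; ties are broken alphabetically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        vocab.append(best[0] + best[1])

        # Apply the new merge rule to every word before counting again.
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + count
        words = merged_words  # Step 3: repeat until the vocabulary is large enough.

    return vocab, merges


# Hypothetical corpus statistics: 7 distinct characters, so 3 merges reach size 10.
vocab, merges = learn_bpe_merges(
    {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, vocab_size=10
)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each merge adds exactly one token to the vocabulary, which is why the target vocabulary size controls how many merge rules are learned.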
|
|
| The main API of the library is the `Tokenizer` class. Here is how |
| we instantiate one with a BPE model: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START init_tokenizer", |
| "end-before": "END init_tokenizer", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_init_tokenizer", |
| "end-before": "END quicktour_init_tokenizer", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START init_tokenizer", |
| "end-before": "END init_tokenizer", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| To train our tokenizer on the wikitext files, we will need to |
| instantiate a trainer, in this case a |
| `BpeTrainer`: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START init_trainer", |
| "end-before": "END init_trainer", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_init_trainer", |
| "end-before": "END quicktour_init_trainer", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START init_trainer", |
| "end-before": "END init_trainer", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| We can set the training arguments like `vocab_size` or `min_frequency` (here |
| left at their default values of 30,000 and 0) but the most important |
| part is to give the `special_tokens` we |
| plan to use later on (they are not used at all during training) so that |
| they get inserted in the vocabulary. |
|
|
| <Tip> |
|
|
| The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0, |
| `"[CLS]"` will get the ID 1 and so forth. |
|
|
| </Tip> |
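
To make the ordering rule concrete, here is a minimal sketch of how IDs could be assigned when special tokens are inserted into the vocabulary first. The helper `build_vocab`, the learned tokens, and the presence of `"[PAD]"` and `"[MASK]"` in the list are assumptions for illustration; only the IDs of `"[UNK]"`, `"[CLS]"` and `"[SEP]"` come from the text above.

```python
def build_vocab(special_tokens, learned_tokens):
    """Assign IDs in insertion order, special tokens first (illustrative only)."""
    vocab = {}
    for token in list(special_tokens) + list(learned_tokens):
        # setdefault keeps the first ID if a token appears twice.
        vocab.setdefault(token, len(vocab))
    return vocab


# Hypothetical special tokens list; changing its order changes the IDs.
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab = build_vocab(special_tokens, ["a", "b", "ab"])
print(vocab["[UNK]"], vocab["[CLS]"], vocab["[SEP]"])  # 0 1 2
```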
|
|
| We could train our tokenizer right now, but it wouldn't be optimal. |
| Without a pre-tokenizer that will split our inputs into words, we might |
| get tokens that overlap several words: for instance we could get an |
| `"it is"` token since those two words |
| often appear next to each other. Using a pre-tokenizer will ensure no |
| token is bigger than a word returned by the pre-tokenizer. Here we want |
| to train a subword BPE tokenizer, and we will use the easiest |
| pre-tokenizer possible by splitting on whitespace. |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START init_pretok", |
| "end-before": "END init_pretok", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_init_pretok", |
| "end-before": "END quicktour_init_pretok", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START init_pretok", |
| "end-before": "END init_pretok", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
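
To see why pre-tokenization prevents tokens such as `"it is"`, here is a rough sketch of a whitespace-style split written with Python's `re` module. The pattern and the `pre_tokenize` helper are assumptions for illustration, and the library's own pre-tokenizer may differ in its details. Because merges only happen inside a piece, no learned token can span two pieces.

```python
import re

# Assumption: a piece is either a run of word characters or a run of
# characters that are neither word characters nor whitespace.
PIECE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Split text into pieces, keeping each piece's (start, end) offsets."""
    return [(m.group(), m.span()) for m in PIECE_PATTERN.finditer(text)]


pieces = pre_tokenize("it is what it is")
print(pieces)
# [('it', (0, 2)), ('is', (3, 5)), ('what', (6, 10)), ('it', (11, 13)), ('is', (14, 16))]
```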
|
|
| Now, we can just call the `Tokenizer.train` method with any list of files we want to use: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START train", |
| "end-before": "END train", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_train", |
| "end-before": "END quicktour_train", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START train", |
| "end-before": "END train", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| Training our tokenizer on the full wikitext dataset should only take a |
| few seconds! To save the tokenizer in one file that contains all |
| its configuration and vocabulary, just use the |
| `Tokenizer.save` method: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START save", |
| "end-before": "END save", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_save", |
| "end-before": "END quicktour_save", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START save", |
| "end-before": "END save", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| and you can reload your tokenizer from that file with the |
| `Tokenizer.from_file` |
| `classmethod`: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START reload_tokenizer", |
| "end-before": "END reload_tokenizer", |
| "dedent": 12} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_reload_tokenizer", |
| "end-before": "END quicktour_reload_tokenizer", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START reload_tokenizer", |
| "end-before": "END reload_tokenizer", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| ### Using the tokenizer |
|
|
| Now that we have trained a tokenizer, we can use it on any text we want |
| with the `Tokenizer.encode` method: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START encode", |
| "end-before": "END encode", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_encode", |
| "end-before": "END quicktour_encode", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START encode", |
| "end-before": "END encode", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| This applied the full pipeline of the tokenizer on the text, returning |
| an `Encoding` object. To learn more |
| about this pipeline, and how to apply (or customize) parts of it, check out [this page](pipeline). |
|
|
| This `Encoding` object then has all the |
| attributes you need for your deep learning model (or other). The |
| `tokens` attribute contains the |
| segmentation of your text in tokens: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_tokens", |
| "end-before": "END print_tokens", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_tokens", |
| "end-before": "END quicktour_print_tokens", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_tokens", |
| "end-before": "END print_tokens", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| Similarly, the `ids` attribute will |
| contain the index of each of those tokens in the tokenizer's |
| vocabulary: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_ids", |
| "end-before": "END print_ids", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_ids", |
| "end-before": "END quicktour_print_ids", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_ids", |
| "end-before": "END print_ids", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| An important feature of the 🤗 Tokenizers library is that it comes with |
| full alignment tracking, meaning you can always get the part of your |
| original sentence that corresponds to a given token. Those are stored in |
| the `offsets` attribute of our |
| `Encoding` object. For instance, suppose |
| we want to find what caused the |
| `"[UNK]"` token to appear. It is the |
| token at index 9 in the list, so we can just ask for the offset at that |
| index: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_offsets", |
| "end-before": "END print_offsets", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_offsets", |
| "end-before": "END quicktour_print_offsets", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_offsets", |
| "end-before": "END print_offsets", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| and those are the indices that correspond to the emoji in the original |
| sentence: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START use_offsets", |
| "end-before": "END use_offsets", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_use_offsets", |
| "end-before": "END quicktour_use_offsets", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START use_offsets", |
| "end-before": "END use_offsets", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
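
The idea behind alignment tracking can be sketched without the library: each token keeps the `(start, end)` span of the text it came from, even when the token itself is replaced by `"[UNK]"`. The sentence, the whole-word vocabulary, and the helper `encode_with_offsets` below are hypothetical and much simpler than a real BPE model. Offsets here count Python string characters.

```python
import re


def encode_with_offsets(text, vocab, unk_token="[UNK]"):
    """Return (tokens, offsets); pieces missing from vocab become unk_token."""
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]+", text):
        piece = match.group()
        tokens.append(piece if piece in vocab else unk_token)
        # Offsets always point into the original text, even for [UNK].
        offsets.append(match.span())
    return tokens, offsets


# Hypothetical sentence and vocabulary; the emoji is deliberately unknown.
sentence = "Hello, y'all! How are you 😁 ?"
known = {"Hello", ",", "y", "'", "all", "!", "How", "are", "you", "?"}
tokens, offsets = encode_with_offsets(sentence, known)

unk_index = tokens.index("[UNK]")
start, end = offsets[unk_index]
print(unk_index, (start, end), sentence[start:end])  # 9 (26, 27) 😁
```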
|
|
| ### Post-processing |
|
|
| We might want our tokenizer to automatically add special tokens, like |
| `"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor. |
| `TemplateProcessing` is the most |
| commonly used one. You just have to specify a template for the processing of |
| single sentences and pairs of sentences, along with the special tokens |
| and their IDs. |
|
|
| When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1 |
| and 2 of our list of special tokens, so these should be their IDs. To |
| double-check, we can use the `Tokenizer.token_to_id` method: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START check_sep", |
| "end-before": "END check_sep", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_check_sep", |
| "end-before": "END quicktour_check_sep", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START check_sep", |
| "end-before": "END check_sep", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| Here is how we can set the post-processing to give us the traditional |
| BERT inputs: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START init_template_processing", |
| "end-before": "END init_template_processing", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_init_template_processing", |
| "end-before": "END quicktour_init_template_processing", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START init_template_processing", |
| "end-before": "END init_template_processing", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| Let's go over this snippet of code in more detail. First we specify |
| the template for single sentences: those should have the form |
| `"[CLS] $A [SEP]"` where |
| `$A` represents our sentence. |
|
|
| Then, we specify the template for sentence pairs, which should have the |
| form `"[CLS] $A [SEP] $B [SEP]"` where |
| `$A` represents the first sentence and |
| `$B` the second one. The |
| `:1` added in the template represents the `type IDs` we want for each part of our input: it defaults |
| to 0 for everything (which is why we don't have |
| `$A:0`) and here we set it to 1 for the |
| tokens of the second sentence and the last `"[SEP]"` token. |
|
|
| Lastly, we specify the special tokens we used and their IDs in our |
| tokenizer's vocabulary. |
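
To make the template semantics concrete, here is a small sketch of how such a template could be expanded by hand. It is a simplified illustration, not how `TemplateProcessing` is implemented: the function `apply_template`, the `(token, id)` pair representation, and the token IDs 100 to 103 are made up for the example.

```python
def apply_template(template, special_ids, a, b=None):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1" (toy sketch).

    `a` and `b` are lists of (token, id) pairs; returns (tokens, ids, type_ids).
    """
    sequences = {"$A": a, "$B": b}
    tokens, ids, type_ids = [], [], []
    for item in template.split():
        # An optional ":<n>" suffix sets the type ID; it defaults to 0.
        name, _, suffix = item.partition(":")
        type_id = int(suffix) if suffix else 0
        if name in sequences:
            pieces = sequences[name]
            if pieces is None:
                raise ValueError(f"template uses {name} but no sequence was given")
        else:
            # Anything else is a special token, looked up by its ID.
            pieces = [(name, special_ids[name])]
        for token, token_id in pieces:
            tokens.append(token)
            ids.append(token_id)
            type_ids.append(type_id)
    return tokens, ids, type_ids


special_ids = {"[CLS]": 1, "[SEP]": 2}
first = [("Hello", 100), ("!", 101)]    # made-up IDs
second = [("How", 102), ("are", 103)]   # made-up IDs

tokens, ids, type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1", special_ids, first, second
)
print(tokens)    # ['[CLS]', 'Hello', '!', '[SEP]', 'How', 'are', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

Everything up to and including the first `"[SEP]"` keeps the default type ID 0, while the second sentence and the final `"[SEP]"` get type ID 1.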
|
|
| To check that this worked properly, let's try to encode the same |
| sentence as before: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_special_tokens", |
| "end-before": "END print_special_tokens", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_special_tokens", |
| "end-before": "END quicktour_print_special_tokens", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_special_tokens", |
| "end-before": "END print_special_tokens", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| To check the results on a pair of sentences, we just pass the two |
| sentences to `Tokenizer.encode`: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_special_tokens_pair", |
| "end-before": "END print_special_tokens_pair", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_special_tokens_pair", |
| "end-before": "END quicktour_print_special_tokens_pair", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_special_tokens_pair", |
| "end-before": "END print_special_tokens_pair", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| You can then check that the type IDs attributed to each token are correct with: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_type_ids", |
| "end-before": "END print_type_ids", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_type_ids", |
| "end-before": "END quicktour_print_type_ids", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_type_ids", |
| "end-before": "END print_type_ids", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it. |
|
|
| ### Encoding multiple sentences in a batch |
|
|
| To get the full speed of the 🤗 Tokenizers library, it's best to |
| process your texts in batches using the |
| `Tokenizer.encode_batch` method: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START encode_batch", |
| "end-before": "END encode_batch", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_encode_batch", |
| "end-before": "END quicktour_encode_batch", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START encode_batch", |
| "end-before": "END encode_batch", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| The output is then a list of `Encoding` |
| objects like the ones we saw before. You can process together as many |
| texts as you like, as long as they fit in memory. |
|
|
| To process a batch of sentence pairs, pass two lists to the |
| `Tokenizer.encode_batch` method: the |
| list of sentences A and the list of sentences B: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START encode_batch_pair", |
| "end-before": "END encode_batch_pair", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_encode_batch_pair", |
| "end-before": "END quicktour_encode_batch_pair", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START encode_batch_pair", |
| "end-before": "END encode_batch_pair", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| When encoding multiple sentences, you can automatically pad the outputs |
| to the longest sentence present by using |
| `Tokenizer.enable_padding`, with the |
| `pad_token` and its ID (we can |
| double-check the ID of the padding token with |
| `Tokenizer.token_to_id` like before): |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START enable_padding", |
| "end-before": "END enable_padding", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_enable_padding", |
| "end-before": "END quicktour_enable_padding", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START enable_padding", |
| "end-before": "END enable_padding", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| We can set the `direction` of the padding |
| (defaults to the right) or a given `length` if we want to pad every sample to that specific number (here |
| we leave it unset to pad to the size of the longest text). |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_batch_tokens", |
| "end-before": "END print_batch_tokens", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_batch_tokens", |
| "end-before": "END quicktour_print_batch_tokens", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_batch_tokens", |
| "end-before": "END print_batch_tokens", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
|
|
| In this case, the `attention mask` generated by the |
| tokenizer takes the padding into account: |
|
|
| <tokenizerslangcontent> |
| <python> |
| <literalinclude> |
| {"path": "../../bindings/python/tests/documentation/test_quicktour.py", |
| "language": "python", |
| "start-after": "START print_attention_mask", |
| "end-before": "END print_attention_mask", |
| "dedent": 8} |
| </literalinclude> |
| </python> |
| <rust> |
| <literalinclude> |
| {"path": "../../tokenizers/tests/documentation.rs", |
| "language": "rust", |
| "start-after": "START quicktour_print_attention_mask", |
| "end-before": "END quicktour_print_attention_mask", |
| "dedent": 4} |
| </literalinclude> |
| </rust> |
| <node> |
| <literalinclude> |
| {"path": "../../bindings/node/examples/documentation/quicktour.test.ts", |
| "language": "js", |
| "start-after": "START print_attention_mask", |
| "end-before": "END print_attention_mask", |
| "dedent": 8} |
| </literalinclude> |
| </node> |
| </tokenizerslangcontent> |
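
The relationship between padding and the attention mask can be sketched in a few lines of plain Python. This is an illustration of the idea only: the helper `pad_batch`, the example IDs, and the padding ID 3 are assumptions, and unlike a real tokenizer this sketch never truncates sequences longer than `length`.

```python
def pad_batch(batch_ids, pad_id, direction="right", length=None):
    """Pad lists of token IDs to a common length and build attention masks."""
    # With no explicit length, pad to the longest sequence in the batch.
    target = length if length is not None else max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        missing = max(target - len(ids), 0)
        padding, zeros = [pad_id] * missing, [0] * missing
        real = [1] * len(ids)  # 1 marks a real token, 0 marks padding
        if direction == "right":
            padded.append(ids + padding)
            masks.append(real + zeros)
        else:
            padded.append(padding + ids)
            masks.append(zeros + real)
    return padded, masks


# Hypothetical batch: two already-encoded sentences of different lengths.
padded, masks = pad_batch([[1, 10, 11, 12, 2], [1, 20, 2]], pad_id=3)
print(padded)  # [[1, 10, 11, 12, 2], [1, 20, 2, 3, 3]]
print(masks)   # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

A model that receives the mask can then ignore the padded positions when computing attention.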
|
|
| ## Pretrained |
|
|
| <tokenizerslangcontent> |
| <python> |
| ### Using a pretrained tokenizer |
|
|
| You can load any tokenizer from the Hugging Face Hub as long as a |
| `tokenizer.json` file is available in the repository. |
|
|
| ```python |
| from tokenizers import Tokenizer |
|
|
| tokenizer = Tokenizer.from_pretrained("bert-base-uncased") |
| ``` |
|
|
| ### Importing a pretrained tokenizer from legacy vocabulary files |
|
|
| You can also import a pretrained tokenizer directly, as long as you |
| have its vocabulary file. For instance, here is how to import the |
| classic pretrained BERT tokenizer: |
|
|
| ```python |
| from tokenizers import BertWordPieceTokenizer |
|
|
| tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True) |
| ``` |
|
|
| as long as you have downloaded the file `bert-base-uncased-vocab.txt` with: |
|
|
| ```bash |
| wget https: |
| ``` |
| </python> |
| </tokenizerslangcontent> |