# Quicktour
Let's have a quick look at the 🤗 Tokenizers library features. The
library provides an implementation of today's most used tokenizers that
is both easy to use and blazing fast.
## Build a tokenizer from scratch
To illustrate how fast the 🤗 Tokenizers library is, let's train a new
tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
(516M of text) in just a few seconds. First things first, you will need
to download this dataset and unzip it with:
``` bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
### Training the tokenizer
In this tour, we will build and train a Byte-Pair Encoding (BPE)
tokenizer. For more information about the different types of tokenizers,
check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
the 🤗 Transformers documentation. Here, training the tokenizer means it
will learn merge rules by:
- Starting with all the characters present in the training corpus as
tokens.
- Identifying the most common pair of tokens and merging it into one token.
- Repeating until the vocabulary (i.e., the number of tokens) has reached
the size we want.
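The three steps above can be sketched in plain Python. This is a toy illustration under simplifying assumptions (a tiny in-memory word list, deterministic tie-breaking, no special tokens), not the library's implementation; the function name `train_bpe` is hypothetical.

```python
from collections import Counter


def train_bpe(words, vocab_size):
    """Toy BPE trainer: returns (vocab, merges) learned from a list of words."""
    # Each distinct word is a tuple of symbols, starting as single characters.
    corpus = Counter(tuple(word) for word in words)
    # Step 1: the initial vocabulary is every character seen in the corpus.
    vocab = sorted({ch for word in corpus for ch in word})
    merges = []

    while len(vocab) < vocab_size:
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Pick the most common pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_token = best[0] + best[1]
        if new_token not in vocab:
            vocab.append(new_token)

        # Apply the merge everywhere it occurs in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
        # Step 3: the loop repeats until the vocabulary reaches vocab_size.
    return vocab, merges


words = ["low", "low", "lower", "lowest", "newest", "newest"]
vocab, merges = train_bpe(words, vocab_size=12)
print(merges)
print(vocab)
```

The corpus contains 8 distinct characters, so asking for a vocabulary of 12 tokens yields 4 merge rules.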
The main API of the library is the `Tokenizer` class. Here is how
we instantiate one with a BPE model:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_tokenizer",
"end-before": "END quicktour_init_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_tokenizer",
"end-before": "END init_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
To train our tokenizer on the wikitext files, we will need to
instantiate a trainer, in this case a
`BpeTrainer`:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_trainer",
"end-before": "END quicktour_init_trainer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_trainer",
"end-before": "END init_trainer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
We can set the training arguments like `vocab_size` or `min_frequency` (here
left at their default values of 30,000 and 0) but the most important
part is to give the `special_tokens` we
plan to use later on (they are not used at all during training) so that
they get inserted in the vocabulary.
<Tip>
The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
`"[CLS]"` will get the ID 1 and so forth.
</Tip>
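As a plain-Python illustration of this ordering rule (not the library's internals), the sketch below places the special tokens ahead of any learned tokens, so each one's ID is simply its position in the list. The exact special tokens list and the `learned_tokens` values are assumptions made for the example.

```python
# Assumed special tokens list; only the order matters for this illustration.
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
# Hypothetical tokens standing in for what training would learn.
learned_tokens = ["a", "b", "ab"]

# Special tokens come first, so their IDs are their positions in the list.
vocab = {token: idx for idx, token in enumerate(special_tokens + learned_tokens)}

print(vocab["[UNK]"], vocab["[CLS]"], vocab["[SEP]"])
```

Reordering `special_tokens` would change these IDs, which is why any later step that hard-codes an ID should be checked against the trained vocabulary.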
We could train our tokenizer right now, but it wouldn't be optimal.
Without a pre-tokenizer that will split our inputs into words, we might
get tokens that overlap several words: for instance we could get an
`"it is"` token since those two words
often appear next to each other. Using a pre-tokenizer will ensure no
token is bigger than a word returned by the pre-tokenizer. Here we want
to train a subword BPE tokenizer, and we will use the easiest
pre-tokenizer possible by splitting on whitespace.
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_pretok",
"end-before": "END quicktour_init_pretok",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_pretok",
"end-before": "END init_pretok",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
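To see why pre-tokenization rules out tokens such as `"it is"`, here is a minimal plain-Python sketch of whitespace-style splitting. It assumes a pattern of the form `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation); the pre-tokenizer configured in the snippet above may behave differently, and `pre_tokenize` is a hypothetical helper.

```python
import re

# Assumed pattern: runs of word characters, or runs of punctuation/symbols.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Split text into (piece, (start, end)) pairs; whitespace is dropped."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]


pieces = pre_tokenize("it is what it is!")
print(pieces)
```

Because merges are only learned inside each piece, no token can span the space between `it` and `is`.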
Now, we can just call the `Tokenizer.train` method with any list of files we want to use:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_train",
"end-before": "END quicktour_train",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START train",
"end-before": "END train",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Training our tokenizer on the full wikitext dataset should only take a
few seconds! To save the tokenizer in one file that contains all
its configuration and vocabulary, just use the
`Tokenizer.save` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_save",
"end-before": "END quicktour_save",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START save",
"end-before": "END save",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
and you can reload your tokenizer from that file with the
`Tokenizer.from_file`
`classmethod`:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 12}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_reload_tokenizer",
"end-before": "END quicktour_reload_tokenizer",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START reload_tokenizer",
"end-before": "END reload_tokenizer",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
### Using the tokenizer
Now that we have trained a tokenizer, we can use it on any text we want
with the `Tokenizer.encode` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode",
"end-before": "END quicktour_encode",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode",
"end-before": "END encode",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
This applied the full pipeline of the tokenizer on the text, returning
an `Encoding` object. To learn more
about this pipeline, and how to apply (or customize) parts of it, check out [this page](pipeline).
This `Encoding` object then has all the
attributes you need for your deep learning model (or other). The
`tokens` attribute contains the
segmentation of your text in tokens:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_tokens",
"end-before": "END quicktour_print_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_tokens",
"end-before": "END print_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Similarly, the `ids` attribute will
contain the index of each of those tokens in the tokenizer's
vocabulary:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_ids",
"end-before": "END quicktour_print_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_ids",
"end-before": "END print_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
An important feature of the 🤗 Tokenizers library is that it comes with
full alignment tracking, meaning you can always get the part of your
original sentence that corresponds to a given token. These alignments are stored in
the `offsets` attribute of our
`Encoding` object. For instance, suppose we
want to find out what caused the
`"[UNK]"` token to appear. It is the
token at index 9 in the list, so we can just ask for the offset at that
index:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_offsets",
"end-before": "END quicktour_print_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_offsets",
"end-before": "END print_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
and those are the indices that correspond to the emoji in the original
sentence:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_use_offsets",
"end-before": "END quicktour_use_offsets",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START use_offsets",
"end-before": "END use_offsets",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
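The idea behind offsets can be shown without the library. In the sketch below, the sentence, the token list, and the spans are assumed values chosen to match the description above (an unknown token at index 9 caused by an emoji); they are written by hand here rather than produced by a trained tokenizer.

```python
sentence = "Hello, y'all! How are you 😁 ?"

# Assumed encoding output: each token is paired with its (start, end) span.
tokens = ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
offsets = [(0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13),
           (14, 17), (18, 21), (22, 25), (26, 27), (28, 29)]

# Find the unknown token, then slice the original text with its span.
unk_index = tokens.index("[UNK]")
start, end = offsets[unk_index]
print(unk_index, (start, end), sentence[start:end])

# Every known token can be recovered from the sentence the same way.
for token, (s, e) in zip(tokens, offsets):
    if token != "[UNK]":
        assert sentence[s:e] == token
```

Each span is a half-open `(start, end)` pair of character indices, so it can be used directly as a Python slice.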
### Post-processing
We might want our tokenizer to automatically add special tokens, like
`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
`TemplateProcessing` is the most
commonly used one: you just have to specify a template for the processing of
single sentences and pairs of sentences, along with the special tokens
and their IDs.
When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
and 2 of our list of special tokens, so these should be their IDs. To
double-check, we can use the `Tokenizer.token_to_id` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_check_sep",
"end-before": "END quicktour_check_sep",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START check_sep",
"end-before": "END check_sep",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Here is how we can set the post-processing to give us the traditional
BERT inputs:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_init_template_processing",
"end-before": "END quicktour_init_template_processing",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START init_template_processing",
"end-before": "END init_template_processing",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
Let's go over this snippet of code in more detail. First we specify
the template for single sentences: those should have the form
`"[CLS] $A [SEP]"` where
`$A` represents our sentence.
Then, we specify the template for sentence pairs, which should have the
form `"[CLS] $A [SEP] $B [SEP]"` where
`$A` represents the first sentence and
`$B` the second one. The
`:1` added in the template represents the `type IDs` we want for each part of our input: it defaults
to 0 for everything (which is why we don't have
`$A:0`) and here we set it to 1 for the
tokens of the second sentence and the last `"[SEP]"` token.
Lastly, we specify the special tokens we used and their IDs in our
tokenizer's vocabulary.
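To make the template syntax concrete, here is a simplified plain-Python sketch of how such a template could be expanded into tokens and type IDs. It is not the library's `TemplateProcessing` implementation: `apply_template` is a hypothetical helper that only handles space-separated pieces with an optional `:<n>` suffix, and the example sentences are made up.

```python
def apply_template(template, sequences, special_ids):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1".

    sequences maps "A"/"B" to lists of tokens. Returns (tokens, type_ids).
    """
    tokens, type_ids = [], []
    for piece in template.split():
        # An optional ":<n>" suffix sets the type ID; the default is 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        if name.startswith("$"):
            # "$A" or "$B": splice in the tokens of that sentence.
            seq = sequences[name[1:]]
            tokens.extend(seq)
            type_ids.extend([type_id] * len(seq))
        else:
            # Anything else must be a declared special token.
            if name not in special_ids:
                raise KeyError(f"unknown special token: {name}")
            tokens.append(name)
            type_ids.append(type_id)
    return tokens, type_ids


special_ids = {"[CLS]": 1, "[SEP]": 2}
tokens, type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1",
    {"A": ["Hello", "there"], "B": ["How", "are", "you"]},
    special_ids,
)
print(tokens)
print(type_ids)
```

The first sentence and its surrounding special tokens keep the default type ID 0, while the second sentence and the final `"[SEP]"` get type ID 1.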
To check that this worked properly, let's try to encode the same
sentence as before:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens",
"end-before": "END quicktour_print_special_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens",
"end-before": "END print_special_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
To check the results on a pair of sentences, we just pass the two
sentences to `Tokenizer.encode`:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_special_tokens_pair",
"end-before": "END quicktour_print_special_tokens_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_special_tokens_pair",
"end-before": "END print_special_tokens_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
You can then check that the type IDs attributed to each token are correct with:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_type_ids",
"end-before": "END quicktour_print_type_ids",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_type_ids",
"end-before": "END print_type_ids",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it.
### Encoding multiple sentences in a batch
To get the full speed of the 🤗 Tokenizers library, it's best to
process your texts in batches using the
`Tokenizer.encode_batch` method:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch",
"end-before": "END quicktour_encode_batch",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch",
"end-before": "END encode_batch",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
The output is then a list of `Encoding`
objects like the ones we saw before. You can process together as many
texts as you like, as long as they fit in memory.
To process a batch of sentence pairs, pass two lists to the
`Tokenizer.encode_batch` method: the
list of sentences A and the list of sentences B:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_encode_batch_pair",
"end-before": "END quicktour_encode_batch_pair",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START encode_batch_pair",
"end-before": "END encode_batch_pair",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
When encoding multiple sentences, you can automatically pad the outputs
to the longest sentence present by using
`Tokenizer.enable_padding`, with the
`pad_token` and its ID (we can
double-check the ID of the padding token with
`Tokenizer.token_to_id` like before):
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_enable_padding",
"end-before": "END quicktour_enable_padding",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START enable_padding",
"end-before": "END enable_padding",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
We can set the `direction` of the padding
(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
we leave it unset to pad to the size of the longest text).
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_batch_tokens",
"end-before": "END quicktour_print_batch_tokens",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_batch_tokens",
"end-before": "END print_batch_tokens",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
In this case, the `attention mask` generated by the
tokenizer takes the padding into account:
<tokenizerslangcontent>
<python>
<literalinclude>
{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
"language": "python",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</python>
<rust>
<literalinclude>
{"path": "../../tokenizers/tests/documentation.rs",
"language": "rust",
"start-after": "START quicktour_print_attention_mask",
"end-before": "END quicktour_print_attention_mask",
"dedent": 4}
</literalinclude>
</rust>
<node>
<literalinclude>
{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
"language": "js",
"start-after": "START print_attention_mask",
"end-before": "END print_attention_mask",
"dedent": 8}
</literalinclude>
</node>
</tokenizerslangcontent>
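The relationship between padding and the attention mask can be sketched in plain Python. This is an illustration of the idea only, not the library's code: `pad_batch` is a hypothetical helper, and the `"[PAD]"` token name and the sample batch are assumptions.

```python
def pad_batch(batch, pad_token="[PAD]", direction="right", length=None):
    """Pad token lists to a common length and build matching attention masks."""
    # Without an explicit length, pad to the longest sequence in the batch.
    target = length if length is not None else max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = max(target - len(seq), 0)
        pad, zeros = [pad_token] * n_pad, [0] * n_pad
        ones = [1] * len(seq)  # 1 marks a real token, 0 marks padding
        if direction == "right":
            padded.append(list(seq) + pad)
            masks.append(ones + zeros)
        else:
            padded.append(pad + list(seq))
            masks.append(zeros + ones)
    return padded, masks


batch = [["Hello", "there", "!"], ["Hi"]]
padded, masks = pad_batch(batch)
print(padded)
print(masks)
```

A model that receives the mask can then ignore the positions marked 0, so padding does not affect its output.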
## Pretrained
<tokenizerslangcontent>
<python>
### Using a pretrained tokenizer
You can load any tokenizer from the Hugging Face Hub as long as a
`tokenizer.json` file is available in the repository.
```python
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```
### Importing a pretrained tokenizer from legacy vocabulary files
You can also import a pretrained tokenizer directly, as long as you
have its vocabulary file. For instance, here is how to import the
classic pretrained BERT tokenizer:
```python
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
```
as long as you have downloaded the file `bert-base-uncased-vocab.txt` with
```bash
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```
</python>
</tokenizerslangcontent>