| # Quicktour | |
| Let's have a quick look at the 🤗 Tokenizers library features. The | |
| library provides an implementation of today's most used tokenizers that | |
| is both easy to use and blazing fast. | |
| ## Build a tokenizer from scratch | |
| To illustrate how fast the 🤗 Tokenizers library is, let's train a new | |
| tokenizer on [wikitext-103](https://www.salesforce.com/blog/the-wikitext-long-term-dependency-language-modeling-dataset/) | |
| (516M of text) in just a few seconds. First things first, you will need | |
| to download this dataset and unzip it with: | |
| ``` bash | |
| wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip | |
| unzip wikitext-103-raw-v1.zip | |
| ``` | |
| ### Training the tokenizer | |
| In this tour, we will build and train a Byte-Pair Encoding (BPE) | |
| tokenizer. For more information about the different types of tokenizers, | |
| check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in | |
| the 🤗 Transformers documentation. Here, training the tokenizer means it | |
| will learn merge rules by: | |
| - Starting with all the characters present in the training corpus as | |
| tokens. | |
| - Identifying the most common pair of tokens and merging it into one token. | |
| - Repeating until the vocabulary (i.e., the number of tokens) has reached | |
| the size we want. | |
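The three steps above can be sketched in plain Python. This is only a toy illustration of the merge-learning loop, not the library's implementation: the function name `learn_bpe_merges`, the tie-breaking rule, and the tiny word list are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Step 1: every word starts as a tuple of single characters, with its count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Pick the most frequent pair; ties go to the lexicographically largest pair
        # (an arbitrary choice, made only so the result is deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Replace each occurrence of the chosen pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    # Step 3: the loop stops once the requested number of merges is reached.
    return merges


toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(toy_words, num_merges=3)
print(merges)
# [('s', 't'), ('e', 'st'), ('o', 'w')]
```

Each learned merge adds one token to the vocabulary, so the target vocabulary size determines how many merges are performed.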
| The main API of the library is the `Tokenizer` class. Here is how | |
| we instantiate one with a BPE model: | |
| ```python | |
| from tokenizers import Tokenizer | |
| from tokenizers.models import BPE | |
| tokenizer = Tokenizer(BPE(unk_token="[UNK]")) | |
| ``` | |
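| ```rust | |
| use tokenizers::models::bpe::BPE; | |
| use tokenizers::{ | |
|     DecoderWrapper, NormalizerWrapper, PostProcessorWrapper, PreTokenizerWrapper, TokenizerImpl, | |
| }; | |
| // The type parameters name the model, normalizer, pre-tokenizer, post-processor and decoder. | |
| let mut tokenizer: TokenizerImpl< | |
|     BPE, | |
|     NormalizerWrapper, | |
|     PreTokenizerWrapper, | |
|     PostProcessorWrapper, | |
|     DecoderWrapper, | |
| > = TokenizerImpl::new( | |
|     BPE::builder() | |
|         .unk_token("[UNK]".to_string()) | |
|         .build() | |
|         .unwrap(), | |
| ); | |
| ``` | |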
| ```js | |
| let { Tokenizer } = require('tokenizers') | |
| let { BPE } = require('tokenizers') | |
| let tokenizer = new Tokenizer(BPE.init({}, [], { unkToken: '[UNK]' })) | |
| ``` | |
| To train our tokenizer on the wikitext files, we will need to | |
| instantiate a trainer, in this case a | |
| `BpeTrainer`: | |
| ```python | |
| from tokenizers.trainers import BpeTrainer | |
| trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]) | |
| ``` | |
| ```rust | |
| use tokenizers::models::bpe::BpeTrainer; | |
| use tokenizers::AddedToken; | |
| let mut trainer = BpeTrainer::builder() | |
| .special_tokens(vec![ | |
| AddedToken::from("[UNK]", true), | |
| AddedToken::from("[CLS]", true), | |
| AddedToken::from("[SEP]", true), | |
| AddedToken::from("[PAD]", true), | |
| AddedToken::from("[MASK]", true), | |
| ]) | |
| .build(); | |
| ``` | |
| ```js | |
| let { bpeTrainer } = require('tokenizers') | |
| let trainer = bpeTrainer({ | |
|   specialTokens: ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'], | |
| }) | |
| ``` | |
| We can set the training arguments like `vocab_size` or `min_frequency` (here | |
| left at their default values of 30,000 and 0) but the most important | |
| part is to give the `special_tokens` we | |
| plan to use later on (they are not used at all during training) so that | |
| they get inserted in the vocabulary. | |
| The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0, | |
| `"[CLS]"` will get the ID 1 and so forth. | |
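As a quick illustration of that ordering rule, the mapping from special tokens to IDs can be pictured as a simple enumeration of the list. This is a sketch of the resulting IDs only, not of how the trainer builds its vocabulary internally; the variable names are made up for the example.

```python
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Special tokens are inserted first, so each one gets its position in the list as its ID.
special_token_ids = {token: index for index, token in enumerate(special_tokens)}
print(special_token_ids)
# {'[UNK]': 0, '[CLS]': 1, '[SEP]': 2, '[PAD]': 3, '[MASK]': 4}
```

Reordering the list would change these IDs, which matters for any code that hard-codes them.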
| We could train our tokenizer right now, but it wouldn't be optimal. | |
| Without a pre-tokenizer that will split our inputs into words, we might | |
| get tokens that overlap several words: for instance we could get an | |
| `"it is"` token since those two words | |
| often appear next to each other. Using a pre-tokenizer will ensure no | |
| token is bigger than a word returned by the pre-tokenizer. Here we want | |
| to train a subword BPE tokenizer, and we will use the easiest | |
| pre-tokenizer possible by splitting on whitespace. | |
| ```python | |
| from tokenizers.pre_tokenizers import Whitespace | |
| tokenizer.pre_tokenizer = Whitespace() | |
| ``` | |
| ```rust | |
| use tokenizers::pre_tokenizers::whitespace::Whitespace; | |
| tokenizer.with_pre_tokenizer(Some(Whitespace {})); | |
| ``` | |
| ```js | |
| let { whitespacePreTokenizer } = require('tokenizers') | |
| tokenizer.setPreTokenizer(whitespacePreTokenizer()) | |
| ``` | |
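To see what such a pre-tokenizer produces, here is a small standalone approximation in plain Python. It assumes the `Whitespace` pre-tokenizer behaves like the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation); the helper name `whitespace_pre_tokenize` is hypothetical and is not part of the library.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs; whitespace itself is dropped."""
    return [(match.group(), match.span()) for match in WORD_OR_PUNCT.finditer(text)]


pieces = whitespace_pre_tokenize("Hello, y'all! it is")
print([piece for piece, _ in pieces])
# ['Hello', ',', 'y', "'", 'all', '!', 'it', 'is']
```

Because `it` and `is` come out as separate pieces, a model trained on top of this split can never learn a single `"it is"` token.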
| Now, we can just call the `Tokenizer.train` method with any list of files we want to use: | |
| ```python | |
| files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]] | |
| tokenizer.train(files, trainer) | |
| ``` | |
| ```rust | |
| let files = vec![ | |
| "data/wikitext-103-raw/wiki.train.raw".into(), | |
| "data/wikitext-103-raw/wiki.test.raw".into(), | |
| "data/wikitext-103-raw/wiki.valid.raw".into(), | |
| ]; | |
| tokenizer.train_from_files(&mut trainer, files)?; | |
| ``` | |
| ```js | |
| let files = ['test', 'train', 'valid'].map((split) => `data/wikitext-103-raw/wiki.${split}.raw`) | |
| tokenizer.train(files, trainer) | |
| ``` | |
| Training our tokenizer on the full wikitext dataset should only take | |
| a few seconds! To save the tokenizer in one file that contains all | |
| its configuration and vocabulary, just use the | |
| `Tokenizer.save` method: | |
| ```python | |
| tokenizer.save("data/tokenizer-wiki.json") | |
| ``` | |
| ```rust | |
| tokenizer.save("data/tokenizer-wiki.json", false)?; | |
| ``` | |
| ```js | |
| tokenizer.save('data/tokenizer-wiki.json') | |
| ``` | |
| and you can reload your tokenizer from that file with the | |
| `Tokenizer.from_file` | |
| `classmethod`: | |
| ```python | |
| tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json") | |
| ``` | |
| ```rust | |
| let mut tokenizer = Tokenizer::from_file("data/tokenizer-wiki.json")?; | |
| ``` | |
| ```js | |
| tokenizer = Tokenizer.fromFile('data/tokenizer-wiki.json') | |
| ``` | |
| ### Using the tokenizer | |
| Now that we have trained a tokenizer, we can use it on any text we want | |
| with the `Tokenizer.encode` method: | |
| ```python | |
| output = tokenizer.encode("Hello, y'all! How are you 😁 ?") | |
| ``` | |
| ```rust | |
| let output = tokenizer.encode("Hello, y'all! How are you 😁 ?", true)?; | |
| ``` | |
| ```js | |
| let output = await tokenizer.encode("Hello, y'all! How are you 😁 ?") | |
| ``` | |
| This applied the full pipeline of the tokenizer on the text, returning | |
| an `Encoding` object. To learn more | |
| about this pipeline, and how to apply (or customize) parts of it, check out [this page](https://github.com/huggingface/tokenizers/blob/main/docs/source-doc-builder/pipeline.mdx). | |
| This `Encoding` object then has all the | |
| attributes you need for your deep learning model (or other). The | |
| `tokens` attribute contains the | |
| segmentation of your text in tokens: | |
| ```python | |
| print(output.tokens) | |
| # ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"] | |
| ``` | |
| ```rust | |
| println!("{:?}", output.get_tokens()); | |
| // ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?",] | |
| ``` | |
| ```js | |
| console.log(output.getTokens()) | |
| // ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"] | |
| ``` | |
| Similarly, the `ids` attribute will | |
| contain the index of each of those tokens in the tokenizer's | |
| vocabulary: | |
| ```python | |
| print(output.ids) | |
| # [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35] | |
| ``` | |
| ```rust | |
| println!("{:?}", output.get_ids()); | |
| // [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35] | |
| ``` | |
| ```js | |
| console.log(output.getIds()) | |
| // [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35] | |
| ``` | |
| An important feature of the 🤗 Tokenizers library is that it comes with | |
| full alignment tracking, meaning you can always get the part of your | |
| original sentence that corresponds to a given token. These offsets are stored in | |
| the `offsets` attribute of our | |
| `Encoding` object. For instance, suppose | |
| we want to find what caused the | |
| `"[UNK]"` token to appear. It is the | |
| token at index 9 in the list, so we can just ask for the offsets at that | |
| index: | |
| ```python | |
| print(output.offsets[9]) | |
| # (26, 27) | |
| ``` | |
| ```rust | |
| println!("{:?}", output.get_offsets()[9]); | |
| // (26, 30) | |
| ``` | |
| ```js | |
| let offsets = output.getOffsets() | |
| console.log(offsets[9]) | |
| // (26, 27) | |
| ``` | |
| and those are the indices that correspond to the emoji in the original | |
| sentence: | |
| ```python | |
| sentence = "Hello, y'all! How are you 😁 ?" | |
| sentence[26:27] | |
| # "😁" | |
| ``` | |
| ```rust | |
| let sentence = "Hello, y'all! How are you 😁 ?"; | |
| println!("{}", &sentence[26..30]); | |
| // "😁" | |
| ``` | |
| ```js | |
| let { slice } = require('tokenizers') | |
| let sentence = "Hello, y'all! How are you 😁 ?" | |
| let [start, end] = offsets[9] | |
| console.log(slice(sentence, start, end)) | |
| // "😁" | |
| ``` | |
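The Python and Rust snippets above report different end offsets for the same token, `(26, 27)` versus `(26, 30)`. One way to see why: Python strings are indexed by character, while Rust string slices are indexed by UTF-8 byte, and the emoji is one character that takes four bytes. The following sketch converts one to the other using only the standard library; the variable names are illustrative.

```python
sentence = "Hello, y'all! How are you 😁 ?"
start, end = 26, 27  # character offsets, as in the Python example above

# Convert character offsets to UTF-8 byte offsets by encoding the text before each offset.
byte_start = len(sentence[:start].encode("utf-8"))
byte_end = len(sentence[:end].encode("utf-8"))
print((byte_start, byte_end))
# (26, 30)

# Slicing the encoded bytes with the byte offsets recovers the same character.
recovered = sentence.encode("utf-8")[byte_start:byte_end].decode("utf-8")
print(recovered == sentence[start:end])
# True
```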
| ### Post-processing | |
| We might want our tokenizer to automatically add special tokens, like | |
| `"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor. | |
| `TemplateProcessing` is the most | |
| commonly used. You just have to specify a template for the processing of | |
| single sentences and pairs of sentences, along with the special tokens | |
| and their IDs. | |
| When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1 | |
| and 2 of our list of special tokens, so these should be their IDs. To | |
| double-check, we can use the `Tokenizer.token_to_id` method: | |
| ```python | |
| tokenizer.token_to_id("[SEP]") | |
| # 2 | |
| ``` | |
| ```rust | |
| println!("{}", tokenizer.token_to_id("[SEP]").unwrap()); | |
| // 2 | |
| ``` | |
| ```js | |
| console.log(tokenizer.tokenToId('[SEP]')) | |
| // 2 | |
| ``` | |
| Here is how we can set the post-processing to give us the traditional | |
| BERT inputs: | |
| ```python | |
| from tokenizers.processors import TemplateProcessing | |
| tokenizer.post_processor = TemplateProcessing( | |
| single="[CLS] $A [SEP]", | |
| pair="[CLS] $A [SEP] $B:1 [SEP]:1", | |
| special_tokens=[ | |
| ("[CLS]", tokenizer.token_to_id("[CLS]")), | |
| ("[SEP]", tokenizer.token_to_id("[SEP]")), | |
| ], | |
| ) | |
| ``` | |
| ```rust | |
| use tokenizers::processors::template::TemplateProcessing; | |
| let special_tokens = vec![ | |
| ("[CLS]", tokenizer.token_to_id("[CLS]").unwrap()), | |
| ("[SEP]", tokenizer.token_to_id("[SEP]").unwrap()), | |
| ]; | |
| tokenizer.with_post_processor(Some( | |
| TemplateProcessing::builder() | |
| .try_single("[CLS] $A [SEP]") | |
| .unwrap() | |
| .try_pair("[CLS] $A [SEP] $B:1 [SEP]:1") | |
| .unwrap() | |
| .special_tokens(special_tokens) | |
| .build()?, | |
| )); | |
| ``` | |
| ```js | |
| let { templateProcessing } = require('tokenizers') | |
| tokenizer.setPostProcessor( | |
|   templateProcessing('[CLS] $A [SEP]', '[CLS] $A [SEP] $B:1 [SEP]:1', [ | |
|     ['[CLS]', tokenizer.tokenToId('[CLS]')], | |
|     ['[SEP]', tokenizer.tokenToId('[SEP]')], | |
|   ]) | |
| ) | |
| ``` | |
| Let's go over this snippet of code in more detail. First we specify | |
| the template for single sentences: those should have the form | |
| `"[CLS] $A [SEP]"` where | |
| `$A` represents our sentence. | |
| Then, we specify the template for sentence pairs, which should have the | |
| form `"[CLS] $A [SEP] $B [SEP]"` where | |
| `$A` represents the first sentence and | |
| `$B` the second one. The | |
| `:1` added in the template represents the `type IDs` we want for each part of our input: it defaults | |
| to 0 for everything (which is why we don't have | |
| `$A:0`) and here we set it to 1 for the | |
| tokens of the second sentence and the last `"[SEP]"` token. | |
| Lastly, we specify the special tokens we used and their IDs in our | |
| tokenizer's vocabulary. | |
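To make the template semantics concrete, here is a simplified pure-Python sketch of how such a template could be expanded into tokens and type IDs. It is an illustration under assumptions, not the library's `TemplateProcessing` code: the helper `apply_template` is hypothetical, it handles only the `$A`/`$B` placeholders and the optional `:<n>` suffix described above, and it works on already-tokenized sequences.

```python
def apply_template(template, sequences):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1" (simplified sketch).

    `sequences` maps "A" and "B" to lists of tokens.
    Returns the final tokens and the type ID of each token.
    """
    tokens, type_ids = [], []
    for piece in template.split():
        # An optional ":<n>" suffix sets the type ID; without it the type ID is 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        if name.startswith("$"):
            # "$A" or "$B": splice in the corresponding sequence.
            sequence = sequences[name[1:]]
            tokens.extend(sequence)
            type_ids.extend([type_id] * len(sequence))
        else:
            # Anything else is treated as a special token and emitted as-is.
            tokens.append(name)
            type_ids.append(type_id)
    return tokens, type_ids


tokens, type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1",
    {"A": ["Hello", ",", "y", "'", "all", "!"], "B": ["How", "are", "you", "[UNK]", "?"]},
)
print(tokens)
print(type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```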
| To check that this worked properly, let's try to encode the same | |
| sentence as before: | |
| ```python | |
| output = tokenizer.encode("Hello, y'all! How are you 😁 ?") | |
| print(output.tokens) | |
| # ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"] | |
| ``` | |
| ```rust | |
| let output = tokenizer.encode("Hello, y'all! How are you 😁 ?", true)?; | |
| println!("{:?}", output.get_tokens()); | |
| // ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"] | |
| ``` | |
| ```js | |
| let output = await tokenizer.encode("Hello, y'all! How are you 😁 ?") | |
| console.log(output.getTokens()) | |
| // ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"] | |
| ``` | |
| To check the results on a pair of sentences, we just pass the two | |
| sentences to `Tokenizer.encode`: | |
| ```python | |
| output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?") | |
| print(output.tokens) | |
| # ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"] | |
| ``` | |
| ```rust | |
| let output = tokenizer.encode(("Hello, y'all!", "How are you 😁 ?"), true)?; | |
| println!("{:?}", output.get_tokens()); | |
| // ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"] | |
| ``` | |
| ```js | |
| let output = await tokenizer.encode("Hello, y'all!", 'How are you 😁 ?') | |
| console.log(output.getTokens()) | |
| // ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"] | |
| ``` | |
| You can then check that the type IDs attributed to each token are correct with: | |
| ```python | |
| print(output.type_ids) | |
| # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] | |
| ``` | |
| ```rust | |
| println!("{:?}", output.get_type_ids()); | |
| // [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] | |
| ``` | |
| ```js | |
| console.log(output.getTypeIds()) | |
| // [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] | |
| ``` | |
| If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it. | |
| ### Encoding multiple sentences in a batch | |
| To get the full speed of the 🤗 Tokenizers library, it's best to | |
| process your texts in batches using the | |
| `Tokenizer.encode_batch` method: | |
| ```python | |
| output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"]) | |
| ``` | |
| ```rust | |
| let output = tokenizer.encode_batch(vec!["Hello, y'all!", "How are you 😁 ?"], true)?; | |
| ``` | |
| ```js | |
| let output = await tokenizer.encodeBatch(["Hello, y'all!", 'How are you 😁 ?']) | |
| ``` | |
| The output is then a list of `Encoding` | |
| objects like the ones we saw before. You can process together as many | |
| texts as you like, as long as it fits in memory. | |
| To process a batch of sentence pairs, pass a list of pairs to the | |
| `Tokenizer.encode_batch` method, where each pair holds | |
| a sentence A and its matching sentence B: | |
| ```python | |
| output = tokenizer.encode_batch( | |
|     [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]] | |
| ) | |
| ``` | |
| ```rust | |
| let output = tokenizer.encode_batch( | |
|     vec![ | |
|         ("Hello, y'all!", "How are you 😁 ?"), | |
|         ("Hello to you too!", "I'm fine, thank you!"), | |
|     ], | |
|     true, | |
| )?; | |
| ``` | |
| ```js | |
| let output = await tokenizer.encodeBatch( | |
|   [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]] | |
| ); | |
| ``` | |
| When encoding multiple sentences, you can automatically pad the outputs | |
| to the longest sentence present by using | |
| `Tokenizer.enable_padding`, with the | |
| `pad_token` and its ID (which we can | |
| double-check with | |
| `Tokenizer.token_to_id` like before): | |
| ```python | |
| tokenizer.enable_padding(pad_id=3, pad_token="[PAD]") | |
| ``` | |
| ```rust | |
| use tokenizers::PaddingParams; | |
| tokenizer.with_padding(Some(PaddingParams { | |
| pad_id: 3, | |
| pad_token: "[PAD]".to_string(), | |
| ..PaddingParams::default() | |
| })); | |
| ``` | |
| ```js | |
| tokenizer.setPadding({ padId: 3, padToken: '[PAD]' }) | |
| ``` | |
| We can set the `direction` of the padding | |
| (defaults to the right) or a given `length` if we want to pad every sample to that specific number (here | |
| we leave it unset to pad to the size of the longest text). | |
| ```python | |
| output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"]) | |
| print(output[1].tokens) | |
| # ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"] | |
| ``` | |
| ```rust | |
| let output = tokenizer.encode_batch(vec!["Hello, y'all!", "How are you 😁 ?"], true)?; | |
| println!("{:?}", output[1].get_tokens()); | |
| // ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"] | |
| ``` | |
| ```js | |
| let output = await tokenizer.encodeBatch(["Hello, y'all!", 'How are you 😁 ?']) | |
| console.log(output[1].getTokens()) | |
| // ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"] | |
| ``` | |
| In this case, the `attention mask` generated by the | |
| tokenizer takes the padding into account: | |
| ```python | |
| print(output[1].attention_mask) | |
| # [1, 1, 1, 1, 1, 1, 1, 0] | |
| ``` | |
| ```rust | |
| println!("{:?}", output[1].get_attention_mask()); | |
| // [1, 1, 1, 1, 1, 1, 1, 0] | |
| ``` | |
| ```js | |
| console.log(output[1].getAttentionMask()) | |
| // [1, 1, 1, 1, 1, 1, 1, 0] | |
| ``` | |
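The relationship between padding and the attention mask can be sketched without the library. The helper below, `pad_batch`, is a hypothetical stand-in that pads lists of tokens to the longest one in the batch and marks real tokens with 1 and padding with 0. It is meant only to illustrate the idea, assuming the two token lists shown earlier.

```python
def pad_batch(batch, pad_token="[PAD]", direction="right"):
    """Pad every token list to the longest one and build matching attention masks."""
    longest = max(len(tokens) for tokens in batch)
    padded, masks = [], []
    for tokens in batch:
        padding = [pad_token] * (longest - len(tokens))
        mask = [1] * len(tokens)  # 1 marks a real token
        if direction == "right":
            padded.append(tokens + padding)
            masks.append(mask + [0] * len(padding))  # 0 marks padding
        else:
            padded.append(padding + tokens)
            masks.append([0] * len(padding) + mask)
    return padded, masks


batch = [
    ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]"],
    ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]"],
]
padded, masks = pad_batch(batch)
print(padded[1])
# ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]', '[PAD]']
print(masks[1])
# [1, 1, 1, 1, 1, 1, 1, 0]
```

A model can then use the mask to ignore the padded positions.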
| ## Pretrained | |
| ### Using a pretrained tokenizer | |
| You can load any tokenizer from the Hugging Face Hub as long as a | |
| `tokenizer.json` file is available in the repository. | |
| ```python | |
| from tokenizers import Tokenizer | |
| tokenizer = Tokenizer.from_pretrained("bert-base-uncased") | |
| ``` | |
| ### Importing a pretrained tokenizer from legacy vocabulary files | |
| You can also import a pretrained tokenizer directly, as long as you | |
| have its vocabulary file. For instance, here is how to import the | |
| classic pretrained BERT tokenizer: | |
| ```python | |
| from tokenizers import BertWordPieceTokenizer | |
| tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True) | |
| ``` | |
| as long as you have downloaded the file `bert-base-uncased-vocab.txt` with | |
| ```bash | |
| wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt | |
| ``` | |