	# Quicktour

	Let's have a quick look at the 🤗 Tokenizers library features. The
	library provides an implementation of today's most used tokenizers that
	is both easy to use and blazing fast.

	## Build a tokenizer from scratch

	To illustrate how fast the 🤗 Tokenizers library is, let's train a new
	tokenizer on [wikitext-103](https://www.salesforce.com/blog/the-wikitext-long-term-dependency-language-modeling-dataset/)
	(516 MB of text) in just a few seconds. First things first, you will need
	to download this dataset and unzip it with:

	``` bash
	wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
	unzip wikitext-103-raw-v1.zip
	```

	### Training the tokenizer

	In this tour, we will build and train a Byte-Pair Encoding (BPE)
	tokenizer. For more information about the different types of tokenizers,
	check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
	the 🤗 Transformers documentation. Here, training the tokenizer means it
	will learn merge rules by:

	- Starting with all the characters present in the training corpus as
	tokens.
	- Identifying the most common pair of tokens and merging it into one token.
	- Repeating until the vocabulary (i.e., the number of tokens) has reached
	the size we want.
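
The three steps above can be sketched in plain Python. This is a toy illustration, not the library's implementation: the function name `train_toy_bpe` and the tiny word list are made up for the example, and it stops after a fixed number of merges instead of tracking a target vocabulary size.

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word becomes a tuple of single-character tokens with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged token.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
# [('l', 'o'), ('lo', 'w')]
```

Each learned merge adds one token to the vocabulary, so limiting the number of merges is equivalent to limiting the vocabulary size in this simplified setting.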

	The main API of the library is the `Tokenizer` class. Here is how
	we instantiate one with a BPE model:

	```python
	from tokenizers import Tokenizer
	from tokenizers.models import BPE
	tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
	```

	```rust
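	use tokenizers::models::bpe::BPE;
	use tokenizers::{DecoderWrapper, NormalizerWrapper, PostProcessorWrapper, PreTokenizerWrapper, TokenizerImpl};
	let mut tokenizer: TokenizerImpl<BPE, NormalizerWrapper, PreTokenizerWrapper, PostProcessorWrapper, DecoderWrapper> = TokenizerImpl::new(
	    BPE::builder()
	        .unk_token("[UNK]".to_string())
	        .build()
	        .unwrap(),
	);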
	```

	```js
	let { Tokenizer } = require('tokenizers')
	let { BPE } = require('tokenizers')
	let tokenizer = new Tokenizer(BPE.init({}, [], { unkToken: '[UNK]' }))
	```

	To train our tokenizer on the wikitext files, we will need to
	instantiate a trainer, in this case a
	`BpeTrainer`:

	```python
	from tokenizers.trainers import BpeTrainer
	trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
	```

	```rust
	use tokenizers::models::bpe::BpeTrainer;
	use tokenizers::AddedToken;
	let mut trainer = BpeTrainer::builder()
	    .special_tokens(vec![
	        AddedToken::from("[UNK]", true),
	        AddedToken::from("[CLS]", true),
	        AddedToken::from("[SEP]", true),
	        AddedToken::from("[PAD]", true),
	        AddedToken::from("[MASK]", true),
	    ])
	    .build();
	```

	```js
	let { bpeTrainer } = require('tokenizers')
	let trainer = bpeTrainer({
	  specialTokens: ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'],
	})
	```

	We can set the training arguments like `vocab_size` or `min_frequency` (here
	left at their default values of 30,000 and 0) but the most important
	part is to give the `special_tokens` we
	plan to use later on (they are not used at all during training) so that
	they get inserted in the vocabulary.

	The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
	`"[CLS]"` will get the ID 1 and so forth.
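
As a conceptual sketch of why the order matters (this is not the library's internal code, and the learned tokens `"a"`, `"b"`, `"ab"` are hypothetical), think of the vocabulary as a mapping in which the special tokens claim the first IDs in list order and everything learned during training comes after:

```python
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Special tokens take the first IDs, in the order they are listed.
vocab = {token: index for index, token in enumerate(special_tokens)}

# Learned tokens (hypothetical here) are appended after them.
for token in ["a", "b", "ab"]:
    vocab.setdefault(token, len(vocab))

print(vocab["[UNK]"], vocab["[CLS]"], vocab["[PAD]"], vocab["ab"])
# 0 1 3 7
```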

	We could train our tokenizer right now, but it wouldn't be optimal.
	Without a pre-tokenizer that will split our inputs into words, we might
	get tokens that overlap several words: for instance we could get an
	`"it is"` token since those two words
	often appear next to each other. Using a pre-tokenizer will ensure no
	token is bigger than a word returned by the pre-tokenizer. Here we want
	to train a subword BPE tokenizer, and we will use the easiest
	pre-tokenizer possible by splitting on whitespace.

	```python
	from tokenizers.pre_tokenizers import Whitespace
	tokenizer.pre_tokenizer = Whitespace()
	```

	```rust
	use tokenizers::pre_tokenizers::whitespace::Whitespace;
	tokenizer.with_pre_tokenizer(Some(Whitespace {}));
	```

	```js
	let { whitespacePreTokenizer } = require('tokenizers')
	tokenizer.setPreTokenizer(whitespacePreTokenizer())
	```
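
To get a feel for what this pre-tokenization step produces, here is a rough standalone approximation in plain Python. It assumes the split can be described by the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation); the helper name `whitespace_pre_tokenize` is made up for this sketch.

```python
import re

def whitespace_pre_tokenize(text):
    """Split text into word and punctuation pieces, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]

pieces = whitespace_pre_tokenize("Hello, y'all!")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('y', (7, 8)), ("'", (8, 9)), ('all', (9, 12)), ('!', (12, 13))]
```

Because BPE merges are only learned inside each piece, no token can span two of these pieces.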

	Now, we can just call the `Tokenizer.train` method with any list of files we want to use:

	```python
	files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
	tokenizer.train(files, trainer)
	```

	```rust
	let files = vec![
	    "data/wikitext-103-raw/wiki.train.raw".into(),
	    "data/wikitext-103-raw/wiki.test.raw".into(),
	    "data/wikitext-103-raw/wiki.valid.raw".into(),
	];
	tokenizer.train_from_files(&mut trainer, files)?;
	```

	```js
	let files = ['test', 'train', 'valid'].map((split) => `data/wikitext-103-raw/wiki.${split}.raw`)
	tokenizer.train(files, trainer)
	```

	This should only take a few seconds to train our tokenizer on the full
	wikitext dataset! To save the tokenizer in one file that contains all
	its configuration and vocabulary, just use the
	`Tokenizer.save` method:

	```python
	tokenizer.save("data/tokenizer-wiki.json")
	```

	```rust
	tokenizer.save("data/tokenizer-wiki.json", false)?;
	```

	```js
	tokenizer.save('data/tokenizer-wiki.json')
	```

	You can reload your tokenizer from that file with the
	`Tokenizer.from_file`
	class method:

	```python
	tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
	```

	```rust
	let mut tokenizer = Tokenizer::from_file("data/tokenizer-wiki.json")?;
	```

	```js
	let tokenizer = Tokenizer.fromFile('data/tokenizer-wiki.json')
	```

	### Using the tokenizer

	Now that we have trained a tokenizer, we can use it on any text we want
	with the `Tokenizer.encode` method:

	```python
	output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
	```

	```rust
	let output = tokenizer.encode("Hello, y'all! How are you 😁 ?", true)?;
	```

	```js
	let output = await tokenizer.encode("Hello, y'all! How are you 😁 ?")
	```

	This applied the full pipeline of the tokenizer on the text, returning
	an `Encoding` object. To learn more
	about this pipeline, and how to apply (or customize) parts of it, check out [this page](https://github.com/huggingface/tokenizers/blob/main/docs/source-doc-builder/pipeline.mdx).

	This `Encoding` object then has all the
	attributes you need for your deep learning model (or other). The
	`tokens` attribute contains the
	segmentation of your text in tokens:

	```python
	print(output.tokens)
	# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
	```

	```rust
	println!("{:?}", output.get_tokens());
	// ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
	```

	```js
	console.log(output.getTokens())
	// ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
	```

	Similarly, the `ids` attribute will
	contain the index of each of those tokens in the tokenizer's
	vocabulary:

	```python
	print(output.ids)
	# [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
	```

	```rust
	println!("{:?}", output.get_ids());
	// [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
	```

	```js
	console.log(output.getIds())
	// [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
	```
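
The relationship between tokens and IDs can be pictured as a dictionary lookup with a fallback to the unknown token. The sketch below is a simplification: the `vocab` dictionary is a hypothetical slice whose IDs mirror the output shown above, and `tokens_to_ids` is a made-up helper. In the library, the model itself decides which pieces are unknown while tokenizing.

```python
# Hypothetical slice of a vocabulary; the IDs mirror the output shown above.
vocab = {"[UNK]": 0, "Hello": 27253, ",": 16, "you": 6218, "?": 35}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its vocabulary ID, falling back to the unknown token."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]

ids = tokens_to_ids(["Hello", ",", "you", "😁", "?"], vocab)
print(ids)
# [27253, 16, 6218, 0, 35]
```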

	An important feature of the 🤗 Tokenizers library is that it comes with
	full alignment tracking, meaning you can always get the part of your
	original sentence that corresponds to a given token. These spans are stored in
	the `offsets` attribute of our
	`Encoding` object. For instance, suppose
	we want to find out what caused the
	`"[UNK]"` token to appear. It is the
	token at index 9 in the list, so we can just ask for the offset at that
	index:

	```python
	print(output.offsets[9])
	# (26, 27)
	```

	```rust
	println!("{:?}", output.get_offsets()[9]);
	// (26, 30)
	```

	```js
	let offsets = output.getOffsets()
	console.log(offsets[9])
	// (26, 27)
	```

	Those are the indices that correspond to the emoji in the original
	sentence:

	```python
	sentence = "Hello, y'all! How are you 😁 ?"
	sentence[26:27]
	# "😁"
	```

	```rust
	let sentence = "Hello, y'all! How are you 😁 ?";
	println!("{}", &sentence[26..30]);
	// "😁"
	```

	```js
	let { slice } = require('tokenizers')
	let sentence = "Hello, y'all! How are you 😁 ?"
	let [start, end] = offsets[9]
	console.log(slice(sentence, start, end))
	// "😁"
	```
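
The Python and Rust snippets above report different end offsets, `(26, 27)` versus `(26, 30)`. A plausible reading, sketched below with nothing but the standard library, is that Python indexes strings by Unicode code point while Rust slices strings by UTF-8 byte, and the emoji is one code point encoded as four bytes.

```python
sentence = "Hello, y'all! How are you 😁 ?"

# Python strings are indexed by Unicode code point.
char_start = sentence.index("😁")
char_end = char_start + len("😁")

# Rust string slices are indexed by UTF-8 byte.
byte_start = len(sentence[:char_start].encode("utf-8"))
byte_end = byte_start + len("😁".encode("utf-8"))

print((char_start, char_end))  # (26, 27)
print((byte_start, byte_end))  # (26, 30)
```

The start offsets agree only because every character before the emoji is ASCII, which takes a single byte each.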

	### Post-processing

	We might want our tokenizer to automatically add special tokens, like
	`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
	`TemplateProcessing` is the most
	commonly used one: you just have to specify a template for the processing of
	single sentences and pairs of sentences, along with the special tokens
	and their IDs.

	When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
	and 2 of our list of special tokens, so these should be their IDs. To
	double-check, we can use the `Tokenizer.token_to_id` method:

	```python
	tokenizer.token_to_id("[SEP]")
	# 2
	```

	```rust
	println!("{}", tokenizer.token_to_id("[SEP]").unwrap());
	// 2
	```

	```js
	console.log(tokenizer.tokenToId('[SEP]'))
	// 2
	```

	Here is how we can set the post-processing to give us the traditional
	BERT inputs:

	```python
	from tokenizers.processors import TemplateProcessing
	tokenizer.post_processor = TemplateProcessing(
	    single="[CLS] $A [SEP]",
	    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
	    special_tokens=[
	        ("[CLS]", tokenizer.token_to_id("[CLS]")),
	        ("[SEP]", tokenizer.token_to_id("[SEP]")),
	    ],
	)
	```

	```rust
	use tokenizers::processors::template::TemplateProcessing;
	let special_tokens = vec![
	    ("[CLS]", tokenizer.token_to_id("[CLS]").unwrap()),
	    ("[SEP]", tokenizer.token_to_id("[SEP]").unwrap()),
	];
	tokenizer.with_post_processor(Some(
	    TemplateProcessing::builder()
	        .try_single("[CLS] $A [SEP]")
	        .unwrap()
	        .try_pair("[CLS] $A [SEP] $B:1 [SEP]:1")
	        .unwrap()
	        .special_tokens(special_tokens)
	        .build()?,
	));
	```

	```js
	let { templateProcessing } = require('tokenizers')
	tokenizer.setPostProcessor(
	  templateProcessing('[CLS] $A [SEP]', '[CLS] $A [SEP] $B:1 [SEP]:1', [
	    ['[CLS]', tokenizer.tokenToId('[CLS]')],
	    ['[SEP]', tokenizer.tokenToId('[SEP]')],
	  ]),
	)
	```

	Let's go over this snippet of code in more detail. First we specify
	the template for single sentences: those should have the form
	`"[CLS] $A [SEP]"` where
	`$A` represents our sentence.

	Then, we specify the template for sentence pairs, which should have the
	form `"[CLS] $A [SEP] $B [SEP]"` where
	`$A` represents the first sentence and
	`$B` the second one. The
	`:1` suffix added in the template sets the type ID we want for each part of our input: it defaults
	to 0 for everything (which is why we don't have
	`$A:0`) and here we set it to 1 for the
	tokens of the second sentence and the last `"[SEP]"` token.

	Lastly, we specify the special tokens we used and their IDs in our
	tokenizer's vocabulary.
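
To make the template syntax concrete, here is a small standalone sketch of how such a template could be expanded into tokens and type IDs. The helper `apply_template` is made up for illustration and only handles the subset of the syntax used on this page (`$A`, `$B`, special tokens, and an optional `:<n>` suffix); it is not how the library implements `TemplateProcessing`.

```python
def apply_template(template, a_tokens, b_tokens=None):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1" into tokens and type IDs."""
    tokens, type_ids = [], []
    for piece in template.split():
        # An optional ":<n>" suffix sets the type ID; it defaults to 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        if name == "$A":
            chunk = a_tokens
        elif name == "$B":
            chunk = b_tokens
        else:
            chunk = [name]  # a special token such as "[CLS]"
        tokens.extend(chunk)
        type_ids.extend([type_id] * len(chunk))
    return tokens, type_ids

tokens, type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1",
    ["Hello", ",", "y", "'", "all", "!"],
    ["How", "are", "you", "[UNK]", "?"],
)
print(tokens)
print(type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```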

	To check that this worked properly, let's try to encode the same
	sentence as before:

	```python
	output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
	print(output.tokens)
	# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]
	```

	```rust
	let output = tokenizer.encode("Hello, y'all! How are you 😁 ?", true)?;
	println!("{:?}", output.get_tokens());
	// ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]
	```

	```js
	let output = await tokenizer.encode("Hello, y'all! How are you 😁 ?")
	console.log(output.getTokens())
	// ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]
	```

	To check the results on a pair of sentences, we just pass the two
	sentences to `Tokenizer.encode`:

	```python
	output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
	print(output.tokens)
	# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
	```

	```rust
	let output = tokenizer.encode(("Hello, y'all!", "How are you 😁 ?"), true)?;
	println!("{:?}", output.get_tokens());
	// ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
	```

	```js
	let output = await tokenizer.encode("Hello, y'all!", 'How are you 😁 ?')
	console.log(output.getTokens())
	// ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
	```

	You can then check that the type IDs attributed to each token are correct with:

	```python
	print(output.type_ids)
	# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
	```

	```rust
	println!("{:?}", output.get_type_ids());
	// [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
	```

	```js
	console.log(output.getTypeIds())
	// [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
	```

	If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it.

	### Encoding multiple sentences in a batch

	To get the full speed of the 🤗 Tokenizers library, it's best to
	process your texts in batches by using the
	`Tokenizer.encode_batch` method:

	```python
	output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
	```

	```rust
	let output = tokenizer.encode_batch(vec!["Hello, y'all!", "How are you 😁 ?"], true)?;
	```

	```js
	let output = await tokenizer.encodeBatch(["Hello, y'all!", 'How are you 😁 ?'])
	```

	The output is then a list of `Encoding`
	objects like the ones we saw before. You can process together as many
	texts as you like, as long as they fit in memory.

	To process a batch of sentence pairs, pass a list of pairs to the
	`Tokenizer.encode_batch` method, where each
	pair holds a sentence A and its sentence B:

	```python
	output = tokenizer.encode_batch(
	    [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
	)
	```

	```rust
	let output = tokenizer.encode_batch(
	    vec![
	        ("Hello, y'all!", "How are you 😁 ?"),
	        ("Hello to you too!", "I'm fine, thank you!"),
	    ],
	    true,
	)?;
	```

	```js
	var output = await tokenizer.encodeBatch(
	  [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
	);
	```

	When encoding multiple sentences, you can automatically pad the outputs
	to the longest sentence present by using
	`Tokenizer.enable_padding`, with the
	`pad_token` and its ID (we can
	double-check the ID of the padding token with
	`Tokenizer.token_to_id` like before):

	```python
	tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
	```

	```rust
	use tokenizers::PaddingParams;
	tokenizer.with_padding(Some(PaddingParams {
	    pad_id: 3,
	    pad_token: "[PAD]".to_string(),
	    ..PaddingParams::default()
	}));
	```

	```js
	tokenizer.setPadding({ padId: 3, padToken: '[PAD]' })
	```

	We can set the `direction` of the padding
	(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
	we leave it unset to pad to the size of the longest text).

	```python
	output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
	print(output[1].tokens)
	# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
	```

	```rust
	let output = tokenizer.encode_batch(vec!["Hello, y'all!", "How are you 😁 ?"], true)?;
	println!("{:?}", output[1].get_tokens());
	// ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
	```

	```js
	let output = await tokenizer.encodeBatch(["Hello, y'all!", 'How are you 😁 ?'])
	console.log(output[1].getTokens())
	// ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
	```

	In this case, the `attention mask` generated by the
	tokenizer takes the padding into account:

	```python
	print(output[1].attention_mask)
	# [1, 1, 1, 1, 1, 1, 1, 0]
	```

	```rust
	println!("{:?}", output[1].get_attention_mask());
	// [1, 1, 1, 1, 1, 1, 1, 0]
	```

	```js
	console.log(output[1].getAttentionMask())
	// [1, 1, 1, 1, 1, 1, 1, 0]
	```
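
The padding behaviour shown above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (right padding, pad to the longest entry); `pad_batch` is a made-up helper, not part of the library.

```python
def pad_batch(batch, pad_token="[PAD]"):
    """Right-pad every token list to the longest one and build attention masks."""
    longest = max(len(tokens) for tokens in batch)
    padded, masks = [], []
    for tokens in batch:
        n_pad = longest - len(tokens)
        padded.append(tokens + [pad_token] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(tokens) + [0] * n_pad)
    return padded, masks

batch = [
    ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]"],
    ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]"],
]
padded, masks = pad_batch(batch)
print(padded[1])
print(masks[1])
# [1, 1, 1, 1, 1, 1, 1, 0]
```

The mask lets a downstream model ignore the padded positions, which is why it has a 0 exactly where `"[PAD]"` was added.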

	## Pretrained

	### Using a pretrained tokenizer

	You can load any tokenizer from the Hugging Face Hub as long as a
	`tokenizer.json` file is available in the repository.

	```python
	from tokenizers import Tokenizer

	tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
	```

	### Importing a pretrained tokenizer from legacy vocabulary files

	You can also import a pretrained tokenizer directly, as long as you
	have its vocabulary file. For instance, here is how to import the
	classic pretrained BERT tokenizer:

	```python
	from tokenizers import BertWordPieceTokenizer

	tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
	```

	This works as long as you have downloaded the file `bert-base-uncased-vocab.txt` with:

	```bash
	wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
	```
