	# Quicktour

	Let's have a quick look at the 🤗 Tokenizers library features. The
	library provides an implementation of today's most used tokenizers that
	is both easy to use and blazing fast.

	## Build a tokenizer from scratch

	To illustrate how fast the 🤗 Tokenizers library is, let's train a new
	tokenizer on [wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
	(516 MB of text) in just a few seconds. First things first, you will need
	to download this dataset and unzip it with:

	``` bash
	wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
	unzip wikitext-103-raw-v1.zip
	```

	### Training the tokenizer

	In this tour, we will build and train a Byte-Pair Encoding (BPE)
	tokenizer. For more information about the different types of tokenizers,
	check out this [guide](https://huggingface.co/transformers/tokenizer_summary.html) in
	the 🤗 Transformers documentation. Here, training the tokenizer means it
	will learn merge rules by:

	- Starting with all the characters present in the training corpus as
	tokens.
	- Identifying the most common pair of tokens and merging it into one token.
	- Repeating until the vocabulary (i.e., the number of tokens) has reached
	the size we want.
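
To make these steps concrete, here is a minimal pure-Python sketch of that training loop on a four-word toy corpus. It is only an illustration, not how 🤗 Tokenizers implements BPE: the function name `train_toy_bpe`, the toy corpus and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word is a tuple of symbols (single characters at first),
    # mapped to how often it occurs in the corpus.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest
        # pair (an arbitrary choice that keeps the result deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

A real trainer stops when the vocabulary reaches the requested size rather than after a fixed number of merges; a fixed count just keeps the sketch short.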

	The main API of the library is the `Tokenizer` class. Here is how
	we instantiate one with a BPE model:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START init_tokenizer",
	"end-before": "END init_tokenizer",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_init_tokenizer",
	"end-before": "END quicktour_init_tokenizer",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START init_tokenizer",
	"end-before": "END init_tokenizer",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	To train our tokenizer on the wikitext files, we will need to
	instantiate a trainer, in this case a
	`BpeTrainer`:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START init_trainer",
	"end-before": "END init_trainer",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_init_trainer",
	"end-before": "END quicktour_init_trainer",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START init_trainer",
	"end-before": "END init_trainer",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	We can set the training arguments like `vocab_size` or `min_frequency` (here
	left at their default values of 30,000 and 0) but the most important
	part is to give the `special_tokens` we
	plan to use later on (they are not used at all during training) so that
	they get inserted in the vocabulary.

	<Tip>

	The order in which you write the special tokens list matters: here `"[UNK]"` will get the ID 0,
	`"[CLS]"` will get the ID 1 and so forth.

	</Tip>
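
As a rough illustration of why the order matters, the sketch below assigns IDs by list position. The first three entries follow the positions described in this tour; `"[PAD]"` and `"[MASK]"` are assumed here to complete the list, and the plain dictionary is only a stand-in for the tokenizer's vocabulary.

```python
# Special tokens in the order they would be handed to the trainer
# (the last two entries are an assumption made for this sketch).
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Each special token takes the ID matching its position in the list.
vocab = {token: index for index, token in enumerate(special_tokens)}

print(vocab["[UNK]"], vocab["[CLS]"], vocab["[SEP]"])  # 0 1 2
```

Reordering the list changes every ID, so anything that hard-codes these IDs later on has to agree with the order used at training time.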

	We could train our tokenizer right now, but it wouldn't be optimal.
	Without a pre-tokenizer that will split our inputs into words, we might
	get tokens that overlap several words: for instance we could get an
	`"it is"` token since those two words
	often appear next to each other. Using a pre-tokenizer will ensure no
	token is bigger than a word returned by the pre-tokenizer. Here we want
	to train a subword BPE tokenizer, and we will use the easiest
	pre-tokenizer possible by splitting on whitespace.

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START init_pretok",
	"end-before": "END init_pretok",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_init_pretok",
	"end-before": "END quicktour_init_pretok",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START init_pretok",
	"end-before": "END init_pretok",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>
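
To see why pre-tokenization prevents tokens like `"it is"`, the plain-Python sketch below counts adjacent character pairs with and without a word split. The regular expression and the helper names `pre_tokenize` and `count_pairs` are assumptions chosen for this illustration; they approximate a split on whitespace and punctuation rather than reproduce the library's pre-tokenizer.

```python
import re
from collections import Counter


def pre_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


def count_pairs(units):
    """Count adjacent character pairs inside each unit, never across units."""
    counts = Counter()
    for unit in units:
        counts.update(zip(unit, unit[1:]))
    return counts


text = "it is what it is"
without_pretok = count_pairs([text])            # the whole text is one unit
with_pretok = count_pairs(pre_tokenize(text))   # one unit per word

print((" ", "i") in without_pretok)  # True: a merge could cross a word boundary
print((" ", "i") in with_pretok)     # False: pairs never span two words
```

Because merges are only ever built from counted pairs, a pair that is never counted can never become part of a token.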

	Now, we can just call the `Tokenizer.train` method with any list of files we want to use:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START train",
	"end-before": "END train",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_train",
	"end-before": "END quicktour_train",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START train",
	"end-before": "END train",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	Training our tokenizer on the full wikitext dataset should only take a
	few seconds! To save the tokenizer in one file that contains all
	its configuration and vocabulary, just use the
	`Tokenizer.save` method:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START save",
	"end-before": "END save",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_save",
	"end-before": "END quicktour_save",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START save",
	"end-before": "END save",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	You can then reload your tokenizer from that file with the
	`Tokenizer.from_file`
	class method:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START reload_tokenizer",
	"end-before": "END reload_tokenizer",
	"dedent": 12}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_reload_tokenizer",
	"end-before": "END quicktour_reload_tokenizer",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START reload_tokenizer",
	"end-before": "END reload_tokenizer",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	### Using the tokenizer

	Now that we have trained a tokenizer, we can use it on any text we want
	with the `Tokenizer.encode` method:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START encode",
	"end-before": "END encode",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_encode",
	"end-before": "END quicktour_encode",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START encode",
	"end-before": "END encode",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	This applied the full pipeline of the tokenizer on the text, returning
	an `Encoding` object. To learn more
	about this pipeline, and how to apply (or customize) parts of it, check out [this page](pipeline).

	This `Encoding` object has all the
	attributes you need for your deep learning model (or any other use). The
	`tokens` attribute contains the
	segmentation of your text into tokens:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_tokens",
	"end-before": "END print_tokens",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_tokens",
	"end-before": "END quicktour_print_tokens",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_tokens",
	"end-before": "END print_tokens",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	Similarly, the `ids` attribute will
	contain the index of each of those tokens in the tokenizer's
	vocabulary:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_ids",
	"end-before": "END print_ids",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_ids",
	"end-before": "END quicktour_print_ids",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_ids",
	"end-before": "END print_ids",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	An important feature of the 🤗 Tokenizers library is that it comes with
	full alignment tracking, meaning you can always get the part of your
	original sentence that corresponds to a given token. These spans are stored in
	the `offsets` attribute of our
	`Encoding` object. For instance, let's
	assume we want to find out what caused the
	`"[UNK]"` token to appear. It is the
	token at index 9 in the list, so we can just ask for the offset at that
	index:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_offsets",
	"end-before": "END print_offsets",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_offsets",
	"end-before": "END quicktour_print_offsets",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_offsets",
	"end-before": "END print_offsets",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	and those are the indices that correspond to the emoji in the original
	sentence:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START use_offsets",
	"end-before": "END use_offsets",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_use_offsets",
	"end-before": "END quicktour_use_offsets",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START use_offsets",
	"end-before": "END use_offsets",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>
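
The idea behind alignment tracking can be sketched in plain Python: keep the character span of every piece next to the token it produced. The helper `toy_encode`, its regular expression and the tiny vocabulary are assumptions made for this illustration; the real tokenizer tracks offsets through its whole pipeline.

```python
import re


def toy_encode(text, vocab):
    """Return tokens and their (start, end) character spans in `text`."""
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]+", text):
        piece = match.group()
        # Pieces missing from the vocabulary become "[UNK]", but the span
        # still points at the original characters.
        tokens.append(piece if piece in vocab else "[UNK]")
        offsets.append(match.span())
    return tokens, offsets


sentence = "Hello, y'all! How are you 😁 ?"
vocab = {"Hello", ",", "y", "'", "all", "!", "How", "are", "you", "?"}
tokens, offsets = toy_encode(sentence, vocab)

unk_index = tokens.index("[UNK]")
start, end = offsets[unk_index]
print(unk_index, (start, end), sentence[start:end])  # 9 (26, 27) 😁
```

Slicing the original string with the stored span recovers exactly the text that the unknown token replaced.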

	### Post-processing

	We might want our tokenizer to automatically add special tokens, like
	`"[CLS]"` or `"[SEP]"`. To do this, we use a post-processor.
	`TemplateProcessing` is the most
	commonly used one: you just have to specify a template for the processing of
	single sentences and pairs of sentences, along with the special tokens
	and their IDs.

	When we built our tokenizer, we set `"[CLS]"` and `"[SEP]"` in positions 1
	and 2 of our list of special tokens, so these should be their IDs. To
	double-check, we can use the `Tokenizer.token_to_id` method:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START check_sep",
	"end-before": "END check_sep",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_check_sep",
	"end-before": "END quicktour_check_sep",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START check_sep",
	"end-before": "END check_sep",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	Here is how we can set the post-processing to give us the traditional
	BERT inputs:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START init_template_processing",
	"end-before": "END init_template_processing",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_init_template_processing",
	"end-before": "END quicktour_init_template_processing",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START init_template_processing",
	"end-before": "END init_template_processing",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	Let's go over this snippet of code in more detail. First we specify
	the template for single sentences: those should have the form
	`"[CLS] $A [SEP]"` where
	`$A` represents our sentence.

	Then, we specify the template for sentence pairs, which should have the
	form `"[CLS] $A [SEP] $B [SEP]"` where
	`$A` represents the first sentence and
	`$B` the second one. The
	`:1` added in the template represents the `type IDs` we want for each part of our input: it defaults
	to 0 for everything (which is why we don't have
	`$A:0`) and here we set it to 1 for the
	tokens of the second sentence and the last `"[SEP]"` token.

	Lastly, we specify the special tokens we used and their IDs in our
	tokenizer's vocabulary.
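
To make the template syntax more tangible, here is a small pure-Python sketch that expands a template into IDs and type IDs. It is an illustration of the rules described above, not the library's `TemplateProcessing` implementation: the helper `apply_template` and the sentence token IDs (10, 11, 20) are made up for the example.

```python
def apply_template(template, special_ids, a_ids, b_ids=None):
    """Expand a template like "[CLS] $A [SEP] $B:1 [SEP]:1" into (ids, type_ids)."""
    sequences = {"$A": a_ids, "$B": b_ids or []}
    ids, type_ids = [], []
    for piece in template.split():
        # An optional ":N" suffix sets the type ID; it defaults to 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        # "$A" and "$B" expand to a whole sequence; anything else is a
        # single special token looked up by name.
        piece_ids = sequences[name] if name in sequences else [special_ids[name]]
        ids.extend(piece_ids)
        type_ids.extend([type_id] * len(piece_ids))
    return ids, type_ids


special_ids = {"[CLS]": 1, "[SEP]": 2}

single = apply_template("[CLS] $A [SEP]", special_ids, [10, 11])
pair = apply_template("[CLS] $A [SEP] $B:1 [SEP]:1", special_ids, [10, 11], [20])

print(single)  # ([1, 10, 11, 2], [0, 0, 0, 0])
print(pair)    # ([1, 10, 11, 2, 20, 2], [0, 0, 0, 0, 1, 1])
```

In the pair output, only the second sentence and the final `"[SEP]"` carry type ID 1, which matches the `:1` markers in the template.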

	To check that this worked properly, let's try to encode the same
	sentence as before:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_special_tokens",
	"end-before": "END print_special_tokens",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_special_tokens",
	"end-before": "END quicktour_print_special_tokens",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_special_tokens",
	"end-before": "END print_special_tokens",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	To check the results on a pair of sentences, we just pass the two
	sentences to `Tokenizer.encode`:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_special_tokens_pair",
	"end-before": "END print_special_tokens_pair",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_special_tokens_pair",
	"end-before": "END quicktour_print_special_tokens_pair",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_special_tokens_pair",
	"end-before": "END print_special_tokens_pair",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	You can then check that the type IDs attributed to each token are correct with:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_type_ids",
	"end-before": "END print_type_ids",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_type_ids",
	"end-before": "END quicktour_print_type_ids",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_type_ids",
	"end-before": "END print_type_ids",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	If you save your tokenizer with `Tokenizer.save`, the post-processor will be saved along with it.

	### Encoding multiple sentences in a batch

	To get the full speed of the 🤗 Tokenizers library, it's best to
	process your texts in batches using the
	`Tokenizer.encode_batch` method:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START encode_batch",
	"end-before": "END encode_batch",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_encode_batch",
	"end-before": "END quicktour_encode_batch",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START encode_batch",
	"end-before": "END encode_batch",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	The output is then a list of `Encoding`
	objects like the ones we saw before. You can process as many
	texts together as you like, as long as they fit in memory.

	To process a batch of sentence pairs, pass two lists to the
	`Tokenizer.encode_batch` method: the
	list of sentences A and the list of sentences B:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START encode_batch_pair",
	"end-before": "END encode_batch_pair",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_encode_batch_pair",
	"end-before": "END quicktour_encode_batch_pair",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START encode_batch_pair",
	"end-before": "END encode_batch_pair",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	When encoding multiple sentences, you can automatically pad the outputs
	to the longest sentence present by using
	`Tokenizer.enable_padding`, with the
	`pad_token` and its ID (we can
	double-check the ID of the padding token with
	`Tokenizer.token_to_id`, like before):

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START enable_padding",
	"end-before": "END enable_padding",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_enable_padding",
	"end-before": "END quicktour_enable_padding",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START enable_padding",
	"end-before": "END enable_padding",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	We can set the `direction` of the padding
	(defaults to the right) or a given `length` if we want to pad every sample to that specific number (here
	we leave it unset to pad to the size of the longest text).

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_batch_tokens",
	"end-before": "END print_batch_tokens",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_batch_tokens",
	"end-before": "END quicktour_print_batch_tokens",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_batch_tokens",
	"end-before": "END print_batch_tokens",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>

	In this case, the `attention mask` generated by the
	tokenizer takes the padding into account:

	<tokenizerslangcontent>
	<python>
	<literalinclude>
	{"path": "../../bindings/python/tests/documentation/test_quicktour.py",
	"language": "python",
	"start-after": "START print_attention_mask",
	"end-before": "END print_attention_mask",
	"dedent": 8}
	</literalinclude>
	</python>
	<rust>
	<literalinclude>
	{"path": "../../tokenizers/tests/documentation.rs",
	"language": "rust",
	"start-after": "START quicktour_print_attention_mask",
	"end-before": "END quicktour_print_attention_mask",
	"dedent": 4}
	</literalinclude>
	</rust>
	<node>
	<literalinclude>
	{"path": "../../bindings/node/examples/documentation/quicktour.test.ts",
	"language": "js",
	"start-after": "START print_attention_mask",
	"end-before": "END print_attention_mask",
	"dedent": 8}
	</literalinclude>
	</node>
	</tokenizerslangcontent>
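
How padding and the attention mask relate can be sketched in a few lines of plain Python. The helper `pad_batch` is hypothetical and the IDs are made up; `pad_id=3` simply assumes `"[PAD]"` was the fourth special token in the trainer's list.

```python
def pad_batch(batch_ids, pad_id, direction="right", length=None):
    """Pad every sequence to `length` (or to the longest one) and build masks."""
    target = length if length is not None else max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        # Sequences already at or beyond the target get no padding
        # (this sketch never truncates).
        padding = [pad_id] * (target - len(ids))
        real = [1] * len(ids)        # 1 marks a real token
        ignored = [0] * len(padding)  # 0 marks a padding position
        if direction == "right":
            padded.append(ids + padding)
            masks.append(real + ignored)
        else:
            padded.append(padding + ids)
            masks.append(ignored + real)
    return padded, masks


padded, masks = pad_batch([[1, 10, 11, 12, 2], [1, 13, 2]], pad_id=3)
print(padded)  # [[1, 10, 11, 12, 2], [1, 13, 2, 3, 3]]
print(masks)   # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The mask has a 0 wherever a padding ID was added, which is what lets a model ignore those positions.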

	## Pretrained

	<tokenizerslangcontent>
	<python>
	### Using a pretrained tokenizer

	You can load any tokenizer from the Hugging Face Hub as long as a
	`tokenizer.json` file is available in the repository.

	```python
	from tokenizers import Tokenizer

	tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
	```

	### Importing a pretrained tokenizer from legacy vocabulary files

	You can also import a pretrained tokenizer directly, as long as you
	have its vocabulary file. For instance, here is how to import the
	classic pretrained BERT tokenizer:

	```python
	from tokenizers import BertWordPieceTokenizer

	tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
	```

	This works as long as you have downloaded the file `bert-base-uncased-vocab.txt` with:

	```bash
	wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
	```
	</python>
	</tokenizerslangcontent>