Tokenizer
Tokenizer[[tokenizers.Tokenizer]]
A Tokenizer works as a pipeline. It processes some raw text as input
and outputs an Encoding.
The pipeline is structured as follows:
- The Normalizer normalizes the raw input text.
- The PreTokenizer splits the normalized text into word-level tokens.
- The Model tokenizes each word into subword tokens and maps them to IDs.
- The PostProcessor applies any final transformations (e.g., adding special tokens like [CLS] and [SEP]).
Example:
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.normalizers import Lowercase
>>> from tokenizers.pre_tokenizers import Whitespace
>>> tokenizer = Tokenizer(BPE(unk_token="<unk>"))
>>> tokenizer.normalizer = Lowercase()
>>> tokenizer.pre_tokenizer = Whitespace()
>>> # Load a pre-built tokenizer from HuggingFace Hub
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
Parameters:
model (Model) : The core algorithm that this Tokenizer should be using.
decoder[[tokenizers.Tokenizer.decoder]]
The optional Decoder in use by the Tokenizer
model[[tokenizers.Tokenizer.model]]
The Model in use by the Tokenizer
normalizer[[tokenizers.Tokenizer.normalizer]]
The optional Normalizer in use by the Tokenizer
padding[[tokenizers.Tokenizer.padding]]
Get the current padding parameters
Cannot be set, use enable_padding() instead
Returns:
(dict, *optional*)
A dict with the current padding parameters if padding is enabled
post_processor[[tokenizers.Tokenizer.post_processor]]
The optional PostProcessor in use by the Tokenizer
pre_tokenizer[[tokenizers.Tokenizer.pre_tokenizer]]
The optional PreTokenizer in use by the Tokenizer
truncation[[tokenizers.Tokenizer.truncation]]
Get the currently set truncation parameters
Cannot be set, use enable_truncation() instead
Returns:
(dict, *optional*)
A dict with the current truncation parameters if truncation is enabled
add_special_tokens[[tokenizers.Tokenizer.add_special_tokens]]
Add the given special tokens to the Tokenizer.
If these tokens are already part of the vocabulary, it just lets the Tokenizer know about them. If they don't exist, the Tokenizer creates them, giving them a new id.
These special tokens will never be processed by the model (i.e., they won't be split into multiple tokens), and they can be removed from the output when decoding.
Parameters:
tokens (A List of AddedToken or str) : The list of special tokens we want to add to the vocabulary. Each token can either be a string or an instance of AddedToken for more customization.
Returns:
int
The number of tokens that were created in the vocabulary
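As a quick illustration, here is a minimal sketch using a toy WordLevel vocabulary (the three-word vocab below is illustrative, not part of these docs). "[CLS]" is absent from the vocabulary, so add_special_tokens creates it with the next free id:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy word-level tokenizer with a fixed three-token vocabulary
tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# "[CLS]" is not in the vocabulary, so one token is created with id 3
created = tokenizer.add_special_tokens(["[CLS]"])
print(created)                         # 1
print(tokenizer.token_to_id("[CLS]"))  # 3

# Special tokens are dropped when decoding with skip_special_tokens=True
print(tokenizer.decode([3, 1]))                             # hello
print(tokenizer.decode([3, 1], skip_special_tokens=False))  # [CLS] hello
```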
add_tokens[[tokenizers.Tokenizer.add_tokens]]
Add the given tokens to the vocabulary
The given tokens are added only if they don't already exist in the vocabulary. Each added token is then assigned a new id.
Parameters:
tokens (A List of AddedToken or str) : The list of tokens we want to add to the vocabulary. Each token can be either a string or an instance of AddedToken for more customization.
Returns:
int
The number of tokens that were created in the vocabulary
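For example, with a toy WordLevel vocabulary (illustrative, not from these docs), the return value reflects how many tokens were actually new:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

added = tokenizer.add_tokens(["there"])  # "there" is new -> 1 token created
again = tokenizer.add_tokens(["there"])  # already known  -> 0 tokens created
print(added, again)                       # 1 0
print(tokenizer.encode("hello there").ids)  # [1, 2]
```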
async_decode_batch[[tokenizers.Tokenizer.async_decode_batch]]
Decode a batch of ids back to their corresponding string
Parameters:
sequences (List of List[int]) : The batch of sequences we want to decode
skip_special_tokens (bool, defaults to True) : Whether the special tokens should be removed from the decoded strings
Returns:
List[str]
A list of decoded strings
async_encode[[tokenizers.Tokenizer.async_encode]]
Asynchronously encode the given input with character offsets.
This is an async version of encode that can be awaited in async Python code.
Example:
Here are some examples of the inputs that are accepted:
await async_encode("A single sequence")
Parameters:
sequence (~tokenizers.InputSequence) : The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: - If is_pretokenized=False: TextInputSequence - If is_pretokenized=True: PreTokenizedInputSequence
pair (~tokenizers.InputSequence, optional) : An optional input sequence. The expected format is the same as for sequence.
is_pretokenized (bool, defaults to False) : Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) : Whether to add the special tokens
Returns:
[Encoding](/docs/tokenizers/pr_2003/en/api/encoding#tokenizers.Encoding)
The encoded result
async_encode_batch[[tokenizers.Tokenizer.async_encode_batch]]
Asynchronously encode the given batch of inputs with character offsets.
This is an async version of encode_batch that can be awaited in async Python code.
Example:
Here are some examples of the inputs that are accepted:
await async_encode_batch([
"A single sequence",
("A tuple with a sequence", "And its pair"),
[ "A", "pre", "tokenized", "sequence" ],
([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
Parameters:
input (A List/Tuple of ~tokenizers.EncodeInput) : A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: - If is_pretokenized=False: TextEncodeInput - If is_pretokenized=True: PreTokenizedEncodeInput
is_pretokenized (bool, defaults to False) : Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) : Whether to add the special tokens
Returns:
A List of ~tokenizers.Encoding
The encoded batch
async_encode_batch_fast[[tokenizers.Tokenizer.async_encode_batch_fast]]
Asynchronously encode the given batch of inputs without tracking character offsets.
This is an async version of encode_batch_fast that can be awaited in async Python code.
Example:
Here are some examples of the inputs that are accepted:
await async_encode_batch_fast([
"A single sequence",
("A tuple with a sequence", "And its pair"),
[ "A", "pre", "tokenized", "sequence" ],
([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
Parameters:
input (A List/Tuple of ~tokenizers.EncodeInput) : A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: - If is_pretokenized=False: TextEncodeInput - If is_pretokenized=True: PreTokenizedEncodeInput
is_pretokenized (bool, defaults to False) : Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) : Whether to add the special tokens
Returns:
A List of ~tokenizers.Encoding
The encoded batch
decode[[tokenizers.Tokenizer.decode]]
Decode the given list of ids back to a string
This is used to decode anything coming back from a Language Model
Parameters:
ids (A List/Tuple of int) : The list of ids that we want to decode
skip_special_tokens (bool, defaults to True) : Whether the special tokens should be removed from the decoded string
Returns:
str
The decoded string
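A minimal round trip might look like this (the toy WordLevel vocabulary is an illustrative assumption; no Decoder is set, so tokens are joined with spaces):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

ids = tokenizer.encode("hello world").ids
print(ids)                    # [1, 2]
print(tokenizer.decode(ids))  # hello world
```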
decode_batch[[tokenizers.Tokenizer.decode_batch]]
Decode a batch of ids back to their corresponding string
Parameters:
sequences (List of List[int]) : The batch of sequences we want to decode
skip_special_tokens (bool, defaults to True) : Whether the special tokens should be removed from the decoded strings
Returns:
List[str]
A list of decoded strings
enable_padding[[tokenizers.Tokenizer.enable_padding]]
Enable the padding
Parameters:
direction (str, optional, defaults to right) : The direction in which to pad. Can be either right or left
pad_to_multiple_of (int, optional) : If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of=8, then we will pad to 256.
pad_id (int, defaults to 0) : The id to be used when padding
pad_type_id (int, defaults to 0) : The type id to be used when padding
pad_token (str, defaults to [PAD]) : The pad token to be used when padding
length (int, optional) : If specified, the length at which to pad. If not specified we pad using the size of the longest sequence in a batch.
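As a sketch (the toy WordLevel vocabulary here is an illustrative assumption), padding to a fixed length fills both the ids and the attention mask:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(
    WordLevel({"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}, unk_token="[UNK]")
)
tokenizer.pre_tokenizer = Whitespace()

# Pad every encoding on the right up to a fixed length of 4
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=4)
enc = tokenizer.encode("hello world")
print(enc.ids)             # [2, 3, 0, 0]
print(enc.attention_mask)  # [1, 1, 0, 0]
```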
enable_truncation[[tokenizers.Tokenizer.enable_truncation]]
Enable truncation
Parameters:
max_length (int) : The max length at which to truncate
stride (int, optional) : The length of the previous first sequence to be included in the overflowing sequence
strategy (str, optional, defaults to longest_first) : The strategy used for truncation. Can be one of longest_first, only_first or only_second.
direction (str, defaults to right) : Truncate direction
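For instance (toy WordLevel vocabulary assumed for illustration), truncating to a single token keeps the remainder available through the encoding's overflowing attribute:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

tokenizer.enable_truncation(max_length=1)
enc = tokenizer.encode("hello world")
print(enc.ids)                 # [1]
print(enc.overflowing[0].ids)  # [2] -- the truncated remainder
```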
encode[[tokenizers.Tokenizer.encode]]
Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
Example:
Here are some examples of the inputs that are accepted:
encode("A single sequence")
encode("A sequence", "And its pair")
encode([ "A", "pre", "tokenized", "sequence" ], is_pretokenized=True)
encode(
[ "A", "pre", "tokenized", "sequence" ], [ "And", "its", "pair" ],
is_pretokenized=True
)
Parameters:
sequence (~tokenizers.InputSequence) : The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: - If is_pretokenized=False: TextInputSequence - If is_pretokenized=True: PreTokenizedInputSequence
pair (~tokenizers.InputSequence, optional) : An optional input sequence. The expected format is the same as for sequence.
is_pretokenized (bool, defaults to False) : Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) : Whether to add the special tokens
Returns:
[Encoding](/docs/tokenizers/pr_2003/en/api/encoding#tokenizers.Encoding)
The encoded result
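A concrete sketch (the toy WordLevel vocabulary is an illustrative assumption) showing both raw-text and pre-tokenized input, and the character offsets the Encoding carries:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

enc = tokenizer.encode("hello world")
print(enc.tokens)   # ['hello', 'world']
print(enc.ids)      # [1, 2]
print(enc.offsets)  # [(0, 5), (6, 11)]

# Pre-tokenized input skips the pre-tokenizer
enc2 = tokenizer.encode(["hello", "world"], is_pretokenized=True)
print(enc2.ids)     # [1, 2]
```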
encode_batch[[tokenizers.Tokenizer.encode_batch]]
Encode the given batch of inputs. This method accepts both raw text sequences and already pre-tokenized sequences. The reason we use PySequence is that it allows zero-cost type checking (according to PyO3), as we don't have to convert to check.
Example:
Here are some examples of the inputs that are accepted:
encode_batch([
"A single sequence",
("A tuple with a sequence", "And its pair"),
[ "A", "pre", "tokenized", "sequence" ],
([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
Parameters:
input (A List/Tuple of ~tokenizers.EncodeInput) : A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: - If is_pretokenized=False: TextEncodeInput - If is_pretokenized=True: PreTokenizedEncodeInput
is_pretokenized (bool, defaults to False) : Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) : Whether to add the special tokens
Returns:
A List of ~tokenizers.Encoding
The encoded batch
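A short sketch (toy WordLevel vocabulary assumed for illustration), encoding a small batch in one call:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encodings = tokenizer.encode_batch(["hello world", "world"])
print([e.ids for e in encodings])  # [[1, 2], [2]]
```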
encode_batch_fast[[tokenizers.Tokenizer.encode_batch_fast]]
Encode the given batch of inputs. This method is faster than encode_batch because it doesn't keep track of offsets; they will all be zero.
Example:
Here are some examples of the inputs that are accepted:
encode_batch_fast([
"A single sequence",
("A tuple with a sequence", "And its pair"),
[ "A", "pre", "tokenized", "sequence" ],
([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
Parameters:
input (A List/Tuple of ~tokenizers.EncodeInput) : A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: - If is_pretokenized=False: TextEncodeInput - If is_pretokenized=True: PreTokenizedEncodeInput
is_pretokenized (bool, defaults to False) : Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) : Whether to add the special tokens
Returns:
A List of ~tokenizers.Encoding
The encoded batch
from_buffer[[tokenizers.Tokenizer.from_buffer]]
Instantiate a new Tokenizer from the given buffer.
Parameters:
buffer (bytes) : A buffer containing a previously serialized Tokenizer
Returns:
[Tokenizer](/docs/tokenizers/pr_2003/en/api/tokenizer#tokenizers.Tokenizer)
The new tokenizer
from_file[[tokenizers.Tokenizer.from_file]]
Instantiate a new Tokenizer from the file at the given path.
Parameters:
path (str) : A path to a local JSON file representing a previously serialized Tokenizer
Returns:
[Tokenizer](/docs/tokenizers/pr_2003/en/api/tokenizer#tokenizers.Tokenizer)
The new tokenizer
from_pretrained[[tokenizers.Tokenizer.from_pretrained]]
Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.
Parameters:
identifier (str) : The identifier of a Model on the Hugging Face Hub, that contains a tokenizer.json file
revision (str, defaults to main) : A branch or commit id
token (str, optional, defaults to None) : An optional auth token used to access private repositories on the Hugging Face Hub
Returns:
[Tokenizer](/docs/tokenizers/pr_2003/en/api/tokenizer#tokenizers.Tokenizer)
The new tokenizer
from_str[[tokenizers.Tokenizer.from_str]]
Instantiate a new Tokenizer from the given JSON string.
Parameters:
json (str) : A valid JSON string representing a previously serialized Tokenizer
Returns:
[Tokenizer](/docs/tokenizers/pr_2003/en/api/tokenizer#tokenizers.Tokenizer)
The new tokenizer
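to_str and from_str form a serialization round trip, sketched here with a toy WordLevel vocabulary (an illustrative assumption):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Serialize to a JSON string, then rebuild an equivalent tokenizer from it
serialized = tokenizer.to_str()
restored = Tokenizer.from_str(serialized)
print(restored.encode("hello").ids)  # [1]
```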
get_added_tokens_decoder[[tokenizers.Tokenizer.get_added_tokens_decoder]]
Get the map from id to AddedToken for all the tokens added to the vocabulary
Returns:
Dict[int, AddedToken]
The vocabulary
get_vocab[[tokenizers.Tokenizer.get_vocab]]
Get the underlying vocabulary
Parameters:
with_added_tokens (bool, defaults to True) : Whether to include the added tokens
Returns:
Dict[str, int]
The vocabulary
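The with_added_tokens flag controls whether tokens added after construction appear in the result. A sketch with a toy WordLevel vocabulary (an illustrative assumption):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
tokenizer.add_tokens(["extra"])  # "extra" gets the next free id, 2

vocab = tokenizer.get_vocab()                          # includes "extra"
base = tokenizer.get_vocab(with_added_tokens=False)    # model vocab only
print(vocab["extra"])            # 2
print("extra" in base)           # False
print(tokenizer.get_vocab_size())  # 3
```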
get_vocab_size[[tokenizers.Tokenizer.get_vocab_size]]
Get the size of the underlying vocabulary
Parameters:
with_added_tokens (bool, defaults to True) : Whether to include the added tokens
Returns:
int
The size of the vocabulary
id_to_token[[tokenizers.Tokenizer.id_to_token]]
Convert the given id to its corresponding token if it exists
Parameters:
id (int) : The id to convert
Returns:
Optional[str]
An optional token, None if out of vocabulary
no_padding[[tokenizers.Tokenizer.no_padding]]
Disable padding
no_truncation[[tokenizers.Tokenizer.no_truncation]]
Disable truncation
num_special_tokens_to_add[[tokenizers.Tokenizer.num_special_tokens_to_add]]
Return the number of special tokens that would be added for single/pair sentences.
Parameters:
is_pair (bool) : Whether the input would be a pair of sentences rather than a single sentence
post_process[[tokenizers.Tokenizer.post_process]]
Apply all the post-processing steps to the given encodings.
The various steps are:
- Truncate according to the set truncation params (provided with enable_truncation())
- Apply the PostProcessor
- Pad according to the set padding params (provided with enable_padding())
Parameters:
encoding (Encoding) : The Encoding corresponding to the main sequence.
pair (Encoding, optional) : An optional Encoding corresponding to the pair sequence.
add_special_tokens (bool) : Whether to add the special tokens
Returns:
[Encoding](/docs/tokenizers/pr_2003/en/api/encoding#tokenizers.Encoding)
The final post-processed encoding
save[[tokenizers.Tokenizer.save]]
Save the Tokenizer to the file at the given path.
Parameters:
path (str) : A path to a file in which to save the serialized tokenizer.
pretty (bool, defaults to True) : Whether the JSON file should be pretty formatted.
to_str[[tokenizers.Tokenizer.to_str]]
Gets a serialized string representing this Tokenizer.
Parameters:
pretty (bool, defaults to False) : Whether the JSON string should be pretty formatted.
Returns:
str
A string representing the serialized Tokenizer
token_to_id[[tokenizers.Tokenizer.token_to_id]]
Convert the given token to its corresponding id if it exists
Parameters:
token (str) : The token to convert
Returns:
Optional[int]
An optional id, None if out of vocabulary
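token_to_id and id_to_token are inverse lookups, sketched here with a toy WordLevel vocabulary (an illustrative assumption):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))

print(tokenizer.token_to_id("world"))    # 2
print(tokenizer.id_to_token(1))          # hello
print(tokenizer.token_to_id("missing"))  # None (out of vocabulary)
```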
train[[tokenizers.Tokenizer.train]]
Train the Tokenizer using the given files.
Reads the files line by line, keeping all the whitespace, even new lines.
If you want to train from data stored in memory, you can check train_from_iterator().
Parameters:
files (List[str]) : A list of paths to the files that we should use for training
trainer (~tokenizers.trainers.Trainer, optional) : An optional trainer that should be used to train our Model
train_from_iterator[[tokenizers.Tokenizer.train_from_iterator]]
Train the Tokenizer using the provided iterator.
You can provide anything that is a Python Iterator:
- A list of sequences List[str]
- A generator that yields str or List[str]
- A Numpy array of strings
- ...
Parameters:
iterator (Iterator) : Any iterator over strings or list of strings
trainer (~tokenizers.trainers.Trainer, optional) : An optional trainer that should be used to train our Model
length (int, optional) : The total number of sequences in the iterator. This is used to provide meaningful progress tracking
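A minimal sketch training a word-level model from an in-memory list (the two-sentence corpus is an illustrative assumption):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

corpus = ["hello world", "hello there"]  # any iterator of str works here
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer, length=len(corpus))

# The trained vocabulary now contains the words from the corpus
print(tokenizer.token_to_id("hello") is not None)  # True
```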
The Rust API Reference is available directly on the Docs.rs website.
The node API has not been documented yet.