tokenizers
Tokenizers turn text into the integer ids a model understands, and
decode model output back into strings. Use AutoTokenizer.from_pretrained()
to load the right implementation for a model ID — the class is chosen from
the tokenizer's tokenizer_config.json.
For chat-trained models, tokenizer.apply_chat_template() renders an
OpenAI-style message list into the model's native prompt format.
Classes
AutoTokenizer
Helper class used to instantiate pretrained tokenizers via the from_pretrained function.
The chosen tokenizer class is determined by the type specified in the tokenizer config.
Example: Create an AutoTokenizer and use it to tokenize a sentence.
This will automatically detect the tokenizer type based on the tokenizer class defined in tokenizer_config.json.
import { AutoTokenizer } from '@huggingface/transformers';
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
// Tensor {
// data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
// dims: [1, 6],
// type: 'int64',
// size: 6,
// }
AutoTokenizer.from_pretrained(pretrained_model_name_or_path, options)
Instantiate one of the tokenizer classes of the library from a pretrained model.
The tokenizer class to instantiate is selected based on the tokenizer_class property of the config object
(either passed as an argument or loaded from pretrained_model_name_or_path if possible).
Parameters
pretrained_model_name_or_path(string) — The name or path of the pretrained model. Can be either:
- A string, the model id of a pretrained tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- A path to a directory containing tokenizer files, e.g., ./my_model_directory/.
options(PretrainedTokenizerOptions) — Additional options for loading the tokenizer.
Returns: Promise<PreTrainedTokenizer> — The loaded tokenizer.
PreTrainedTokenizer
PreTrainedTokenizer is the base class for all tokenizers in Transformers.js.
PreTrainedTokenizer(text, [options])
Parameters
text(string[]?)
options(TokenizerCallOptions<string[]?,boolean = true>) optional
Returns: BatchEncoding<BatchEncodingItem<string[]?, boolean = true>>
PreTrainedTokenizer.constructor(tokenizerJSON, tokenizerConfig)
Create a new PreTrainedTokenizer instance.
Parameters
tokenizerJSON(Object) — The JSON of the tokenizer.
tokenizerConfig(Object) — The config of the tokenizer.
PreTrainedTokenizer.from_pretrained(pretrained_model_name_or_path, options)
Loads a pretrained tokenizer from the given pretrained_model_name_or_path.
Parameters
pretrained_model_name_or_path(string) — The path to the pretrained tokenizer.
options(PretrainedTokenizerOptions) — Additional options for loading the tokenizer.
Returns: Promise<PreTrainedTokenizer> — A new instance of the PreTrainedTokenizer class.
Throws
Error — Throws an error if the tokenizer.json or tokenizer_config.json files are not found in the pretrained_model_name_or_path.
PreTrainedTokenizer.convert_tokens_to_ids(tokens)
Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
Parameters
tokens(string[]?) — One or several token(s) to convert to token id(s).
Returns: number | number[] — The token id or list of token ids.
PreTrainedTokenizer.tokenize(text, options)
Converts a string into a sequence of tokens.
Parameters
text(string) — The sequence to be encoded.
options(Object) — An optional object containing the following properties:
- pair(string|null) optional — A second sequence to be encoded with the first.
- add_special_tokens(boolean) optional — defaults to false — Whether or not to add the special tokens associated with the corresponding model.
Returns: string[] — The list of tokens.
PreTrainedTokenizer.encode(text, options)
Encodes a single text or a pair of texts using the model's tokenizer.
Parameters
text(string) — The text to encode.
options(Object) — An optional object containing the following properties:
- text_pair(string|null) optional — defaults to null — The optional second text to encode.
- add_special_tokens(boolean) optional — defaults to true — Whether or not to add the special tokens associated with the corresponding model.
- return_token_type_ids(boolean|null) optional — defaults to null — Whether to return token_type_ids.
Returns: number[] — An array of token IDs representing the encoded text(s).
PreTrainedTokenizer.batch_decode(batch, decode_args)
Decode a batch of tokenized sequences.
Parameters
batch(number[][]|Tensor) — List/Tensor of tokenized input sequences.
decode_args(Object) — (Optional) Object with decoding arguments.
Returns: string[] — List of decoded sequences.
PreTrainedTokenizer.decode(token_ids, [decode_args])
Decodes a sequence of token IDs back to a string.
Parameters
token_ids(number[]|bigint[]|Tensor) — List/Tensor of token IDs to decode.
decode_args(Object) optional — defaults to {}
- skip_special_tokens(boolean) optional — defaults to false — If true, special tokens are removed from the output string.
- clean_up_tokenization_spaces(boolean) optional — defaults to true — If true, spaces before punctuation and abbreviated forms are removed.
Returns: string — The decoded string.
Throws
Error — If token_ids is not a non-empty array of integers.
PreTrainedTokenizer.decode_single(token_ids, decode_args)
Decode a single list of token ids to a string.
Parameters
token_ids(number[]|bigint[]) — List of token ids to decode.
decode_args(Object) — Optional arguments for decoding.
- skip_special_tokens(boolean) optional — defaults to false — Whether to skip special tokens during decoding.
- clean_up_tokenization_spaces(boolean|null) optional — defaults to null — Whether to clean up tokenization spaces during decoding. If null, the value is set to this.decoder.cleanup if it exists, falling back to this.clean_up_tokenization_spaces if it exists, falling back to true.
Returns: string — The decoded string
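The fallback order described for clean_up_tokenization_spaces can be sketched in plain JavaScript. This is an illustration only: resolveCleanup is a hypothetical helper, and the object shape (a decoder.cleanup field and a clean_up_tokenization_spaces field) is assumed from the parameter description above rather than taken from the library source.

```javascript
// Hypothetical helper, not part of Transformers.js: picks the effective
// clean-up flag using the documented fallback order.
function resolveCleanup(explicit, tokenizerLike) {
  // `??` falls through only on null/undefined, so an explicit `false` is kept.
  return explicit
    ?? tokenizerLike.decoder?.cleanup
    ?? tokenizerLike.clean_up_tokenization_spaces
    ?? true;
}

const bare = {}; // neither field is set
const withDecoder = { decoder: { cleanup: false }, clean_up_tokenization_spaces: true };

console.log(resolveCleanup(null, bare));        // true  (final default)
console.log(resolveCleanup(null, withDecoder)); // false (decoder.cleanup wins)
console.log(resolveCleanup(true, withDecoder)); // true  (explicit argument wins)
```

Using nullish coalescing rather than `||` matters here: with `||`, an explicit false would be skipped and the next fallback used instead.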
PreTrainedTokenizer.get_chat_template(options)
Retrieve the chat template string used for tokenizing chat messages. This template is used
internally by the apply_chat_template method and can also be used externally to retrieve the model's chat
template for better generation tracking.
Parameters
options(Object) — An optional object containing the following properties:
- chat_template(string|null) optional — defaults to null — A Jinja template or the name of a template to use for this conversion. It is usually not necessary to pass anything to this argument, as the model's template will be used by default.
- tools(Object[]) optional — defaults to null — A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information.
Returns: string — The chat template string.
PreTrainedTokenizer.apply_chat_template(conversation, [options])
Converts a list of message objects with "role" and "content" keys to a list of token
ids. This method is intended for use with chat models, and will read the tokenizer's chat_template attribute to
determine the format and control tokens to use when converting.
See the chat templating guide for more information.
Parameters
conversation(Message[]) — A list of message objects with "role" and "content" keys, representing the chat history so far.
options(ApplyChatTemplateOptions<TTokenize,TReturnTensor,TReturnDict>) optional — Options controlling template rendering and tokenization.
Returns: ApplyChatTemplateReturn<TTokenize, TReturnTensor, TReturnDict> — The tokenized output.
Example: Applying a chat template to a conversation.
import { AutoTokenizer } from "@huggingface/transformers";
const tokenizer = await AutoTokenizer.from_pretrained("Xenova/mistral-tokenizer-v1");
const chat = [
{ "role": "user", "content": "Hello, how are you?" },
{ "role": "assistant", "content": "I'm doing great. How can I help you today?" },
{ "role": "user", "content": "I'd like to show off how chat templating works!" },
]
const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]
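To make the rendered string above less opaque, here is a hand-written sketch of what a Mistral-style template does with the message list. It is not the library's implementation (the real template is a Jinja string stored with the tokenizer); renderMistralStyle is a hypothetical function, and the `<s>` and `</s>` markers are assumed from the example output.

```javascript
// Hypothetical re-implementation of a Mistral-style chat template, for
// illustration only. The library renders the tokenizer's Jinja template instead.
function renderMistralStyle(messages, { bos = "<s>", eos = "</s>" } = {}) {
  let out = bos;
  for (const { role, content } of messages) {
    if (role === "user") {
      // User turns are wrapped in instruction markers.
      out += `[INST] ${content} [/INST]`;
    } else if (role === "assistant") {
      // Assistant turns are closed with the end-of-sequence token and a space.
      out += `${content}${eos} `;
    } else {
      throw new Error(`Unsupported role: ${role}`);
    }
  }
  return out;
}

const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
];

console.log(renderMistralStyle(chat));
```

For this conversation the sketch produces the same string as the tokenize: false call in the example. Other models use different markers, which is why the template ships with each tokenizer.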
Type Definitions
PretrainedTokenizerOptions
Type: PretrainedOptions
TextContent
Properties
type('text') — The type of content (must be 'text').
text(string) — The text content.
ImageContent
Properties
type('image') — The type of content (must be 'image').
image(string|RawImage) optional — Optional URL or instance of the image.
Note: This works for SmolVLM. Qwen2VL and Idefics3 have different implementations.
MessageContent
A single content block inside a chat message. Extend the union to add
custom types (e.g. AudioContent) when targeting a specific model.
Type: TextContent | ImageContent | { type: string & {}, [key: string]: any }
Message
Properties
role('user'|'assistant'|'system'|(string & {})) — The role of the message.
content(string|MessageContent[]) — The content of the message. Can be a simple string or an array of content objects.
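Because content may be either a plain string or an array of content blocks, code that consumes messages usually has to handle both shapes. A minimal sketch, assuming only the TextContent and ImageContent shapes listed above; messageText is a hypothetical helper, not a library function, and the image URL is a placeholder.

```javascript
// Hypothetical helper: collect the text of a message, whichever shape
// `content` has. Non-text blocks (e.g. images) are ignored.
function messageText(message) {
  if (typeof message.content === "string") return message.content;
  return message.content
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("");
}

const plain = { role: "user", content: "Describe this image." };
const multimodal = {
  role: "user",
  content: [
    { type: "image", image: "https://example.com/cat.png" }, // placeholder URL
    { type: "text", text: "Describe this image." },
  ],
};

console.log(messageText(plain) === messageText(multimodal)); // true
```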
BatchEncoding
The object returned from tokenizer(text). The fields are a Tensor by
default, or an Array when return_tensor: false is passed.
Properties
input_ids(any) — Token ids to be fed to the model.
attention_mask(any) — Mask indicating which tokens should be attended to (1) versus padded (0).
token_type_ids(any) optional — Segment ids, present only for tokenizers that distinguish sequence A vs B (e.g. BERT).
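The three fields are easiest to see on a sentence pair. The sketch below builds them by hand for a BERT-style layout ([CLS] A [SEP] B [SEP]) and does not call the library. buildPairEncoding is a hypothetical helper, 101 and 102 are the [CLS] and [SEP] ids of bert-base-uncased (the first and last ids in the AutoTokenizer example on this page), and the remaining ids are made-up placeholders.

```javascript
const CLS = 101; // [CLS] in bert-base-uncased
const SEP = 102; // [SEP] in bert-base-uncased

// Hypothetical helper: lays out a BERT-style pair encoding from raw id arrays.
function buildPairEncoding(idsA, idsB) {
  const input_ids = [CLS, ...idsA, SEP, ...idsB, SEP];
  // Segment 0 covers "[CLS] A [SEP]"; segment 1 covers "B [SEP]".
  const token_type_ids = [
    ...Array(idsA.length + 2).fill(0),
    ...Array(idsB.length + 1).fill(1),
  ];
  // Nothing is padded here, so every position is attended to.
  const attention_mask = Array(input_ids.length).fill(1);
  return { input_ids, attention_mask, token_type_ids };
}

const enc = buildPairEncoding([7, 8], [9]); // placeholder token ids
console.log(enc.input_ids);      // [101, 7, 8, 102, 9, 102]
console.log(enc.token_type_ids); // [0, 0, 0, 0, 1, 1]
console.log(enc.attention_mask); // [1, 1, 1, 1, 1, 1]
```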
TokenizerCallOptions
Options passed to tokenizer(text, options).
Properties
text_pair(any) optional — defaults to null — Optional second sequence to be encoded. Must match the shape of text: a string when text is a string, an array when text is an array.
padding(boolean|'max_length') optional — defaults to false — Whether to pad the input sequences.
add_special_tokens(boolean) optional — defaults to true — Whether or not to add the special tokens associated with the corresponding model.
truncation(boolean|null) optional — defaults to null — Whether to truncate the input sequences.
max_length(number|null) optional — defaults to null — Maximum length of the returned list and optionally padding length.
return_tensor(any) optional — defaults to true — Whether to return the results as Tensors or arrays.
return_token_type_ids(boolean|null) optional — defaults to null — Whether to return the token type ids.
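How padding, truncation and max_length interact is easier to see on plain arrays. This is a rough sketch under stated assumptions, not the library's logic: padBatch is a hypothetical helper, padding is applied on the right with pad id 0, and padding: true is taken to mean "pad to the longest sequence in the batch". Real tokenizers may pad on the left or use a different pad id.

```javascript
// Hypothetical helper illustrating padding/truncation on arrays of token ids.
function padBatch(batch, { padding = false, truncation = null, max_length = null, pad_id = 0 } = {}) {
  let rows = batch.map((ids) => [...ids]);

  // Truncate first, so padding is computed on the shortened rows.
  if (truncation && max_length !== null) {
    rows = rows.map((ids) => ids.slice(0, max_length));
  }

  // Decide the target length: longest row, a fixed max_length, or no padding.
  let target = null;
  if (padding === true) target = Math.max(...rows.map((ids) => ids.length));
  else if (padding === "max_length" && max_length !== null) target = max_length;

  return rows.map((ids) => {
    const padCount = target === null ? 0 : Math.max(0, target - ids.length);
    return {
      input_ids: [...ids, ...Array(padCount).fill(pad_id)],
      // 1 marks real tokens, 0 marks padding.
      attention_mask: [...Array(ids.length).fill(1), ...Array(padCount).fill(0)],
    };
  });
}

const out = padBatch([[5, 6, 7, 8], [9]], { padding: true, truncation: true, max_length: 3 });
console.log(out[0]); // input_ids [5, 6, 7], attention_mask [1, 1, 1]
console.log(out[1]); // input_ids [9, 0, 0], attention_mask [1, 0, 0]
```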
ApplyChatTemplateOptions
Properties
chat_template(string|null) optional — defaults to null — A Jinja template to use for this conversion. If omitted, the model's chat template is used.
tools(Object[]|null) optional — defaults to null — JSON Schema tool definitions exposed to templates that support function calling. See the chat templating guide.
documents(Record<string,string>[]|null) optional — defaults to null — Documents exposed to templates that support retrieval-augmented generation. See the RAG section of the chat templating guide.
add_generation_prompt(boolean) optional — defaults to false — Whether to end the prompt with the token(s) that indicate the start of an assistant message. The template must support this argument for it to have any effect.
tokenize(any) optional — defaults to true — Whether to tokenize the output. If false, the output will be a string.
padding(boolean) optional — defaults to false — Whether to pad sequences to the maximum length. Has no effect if tokenize is false.
truncation(boolean) optional — defaults to false — Whether to truncate sequences to the maximum length. Has no effect if tokenize is false.
max_length(number|null) optional — defaults to null — Maximum length (in tokens) to use for padding or truncation. If omitted, the tokenizer's max_length is used. Has no effect if tokenize is false.
return_tensor(any) optional — defaults to true — Whether to return the output as a Tensor or an Array. Has no effect if tokenize is false.
return_dict(any) optional — defaults to true — Whether to return a dictionary with named outputs. Has no effect if tokenize is false.
tokenizer_kwargs(Object) optional — defaults to {} — Additional options to pass to the tokenizer.
Callbacks
PreTrainedTokenizerCallback
Parameters
text(string[]?)
options(TokenizerCallOptions<string[]?,boolean = true>) optional
Returns: BatchEncoding<BatchEncodingItem<string[]?, boolean = true>>