tokenizers

Tokenizers turn text into the integer ids a model understands, and decode model output back into strings. Use AutoTokenizer.from_pretrained() to load the right implementation for a model ID; the class is chosen from the tokenizer class declared in the model's tokenizer_config.json.

For chat-trained models, tokenizer.apply_chat_template() renders an OpenAI-style message list into the model's native prompt format.

Classes

AutoTokenizer

Helper class which is used to instantiate pretrained tokenizers with the from_pretrained function. The chosen tokenizer class is determined by the type specified in the tokenizer config.

Example: Create an AutoTokenizer and use it to tokenize a sentence. This will automatically detect the tokenizer type based on the tokenizer class defined in tokenizer_config.json.

import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
// Tensor {
//   data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
//   dims: [1, 6],
//   type: 'int64',
//   size: 6,
// }

AutoTokenizer.from_pretrained(pretrained_model_name_or_path, options)

Instantiate one of the tokenizer classes of the library from a pretrained model.

The tokenizer class to instantiate is selected based on the tokenizer_class property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible).

Parameters

  • pretrained_model_name_or_path (string) — The name or path of the pretrained model. Can be either:
    • A string, the model id of a pretrained tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
    • A path to a directory containing tokenizer files, e.g., ./my_model_directory/.
  • options (PretrainedTokenizerOptions) — Additional options for loading the tokenizer.

Returns: Promise<PreTrainedTokenizer> — The loaded tokenizer.

PreTrainedTokenizer

PreTrainedTokenizer is the base class for all tokenizers in Transformers.js.

PreTrainedTokenizer(text, [options])

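Parameters

  • text (string | string[]) — The text, or batch of texts, to tokenize.
  • options (TokenizerCallOptions) optional — Options passed to tokenizer(text, options), such as padding, truncation, and the return format.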

Returns: BatchEncoding<BatchEncodingItem<string[]?, boolean = true>>

PreTrainedTokenizer.constructor(tokenizerJSON, tokenizerConfig)

Create a new PreTrainedTokenizer instance.

Parameters

  • tokenizerJSON (Object) — The JSON of the tokenizer.
  • tokenizerConfig (Object) — The config of the tokenizer.

PreTrainedTokenizer.from_pretrained(pretrained_model_name_or_path, options)

Loads a pretrained tokenizer from the given pretrained_model_name_or_path.

Parameters

  • pretrained_model_name_or_path (string) — The path to the pretrained tokenizer.
  • options (PretrainedTokenizerOptions) — Additional options for loading the tokenizer.

Returns: Promise<PreTrainedTokenizer> — A new instance of the PreTrainedTokenizer class.

Throws

  • Error — Throws an error if the tokenizer.json or tokenizer_config.json files are not found in the pretrained_model_name_or_path.

PreTrainedTokenizer.convert_tokens_to_ids(tokens)

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

Parameters

  • tokens (string | string[]) — One or several token(s) to convert to token id(s).

Returns: number | number[] — The token id or list of token ids.

PreTrainedTokenizer.tokenize(text, options)

Converts a string into a sequence of tokens.

Parameters

  • text (string) — The sequence to be encoded.
  • options (Object) — An optional object containing the following properties:
    • pair (string | null) optional — A second sequence to be encoded with the first.
    • add_special_tokens (boolean) optional — defaults to false — Whether or not to add the special tokens associated with the corresponding model.

Returns: string[] — The list of tokens.
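
Example: Pair each token with its vocabulary id. This is a minimal sketch of how tokenize() and convert_tokens_to_ids() fit together. tokenIdPairs is a hypothetical helper, not part of the library, and tokenizer is assumed to be an instance returned by AutoTokenizer.from_pretrained().

```javascript
// Hypothetical helper (not part of Transformers.js): pair each token string
// with its vocabulary id. `tokenizer` is assumed to be an instance returned
// by AutoTokenizer.from_pretrained().
function tokenIdPairs(tokenizer, text) {
  // tokenize() leaves special tokens out unless add_special_tokens is true.
  const tokens = tokenizer.tokenize(text);
  // Passing an array of tokens returns an array of ids in the same order.
  const ids = tokenizer.convert_tokens_to_ids(tokens);
  return tokens.map((token, i) => [token, ids[i]]);
}

// Usage sketch (fetches tokenizer files, so it needs network access):
// import { AutoTokenizer } from '@huggingface/transformers';
// const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
// console.log(tokenIdPairs(tokenizer, 'I love transformers!'));
```

Because tokenize() defaults to add_special_tokens: false, the pairs cover only the text itself, which is usually what you want when inspecting how a string is split.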

PreTrainedTokenizer.encode(text, options)

Encodes a single text or a pair of texts using the model's tokenizer.

Parameters

  • text (string) — The text to encode.
  • options (Object) — An optional object containing the following properties:
    • text_pair (string | null) optional — defaults to null — The optional second text to encode.
    • add_special_tokens (boolean) optional — defaults to true — Whether or not to add the special tokens associated with the corresponding model.
    • return_token_type_ids (boolean | null) optional — defaults to null — Whether to return token_type_ids.

Returns: number[] — An array of token IDs representing the encoded text(s).
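
Example: Count tokens and encode a sentence pair. The sketch below only uses the encode() options listed above. countTokens and encodePair are hypothetical helpers, and tokenizer is assumed to come from AutoTokenizer.from_pretrained().

```javascript
// Hypothetical helper: count the tokens a text occupies, excluding the
// special tokens the model would normally add around it.
function countTokens(tokenizer, text) {
  const ids = tokenizer.encode(text, { add_special_tokens: false });
  return ids.length;
}

// Hypothetical helper: encode two texts as one sequence. With the default
// add_special_tokens: true, the model's own separators are inserted too.
function encodePair(tokenizer, first, second) {
  return tokenizer.encode(first, { text_pair: second });
}

// Usage sketch (needs network access to fetch the tokenizer files):
// import { AutoTokenizer } from '@huggingface/transformers';
// const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
// console.log(countTokens(tokenizer, 'I love transformers!'));
// console.log(encodePair(tokenizer, 'How are you?', 'Fine, thanks.'));
```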

PreTrainedTokenizer.batch_decode(batch, decode_args)

Decode a batch of tokenized sequences.

Parameters

  • batch (number[][] | Tensor) — List/Tensor of tokenized input sequences.
  • decode_args (Object) — (Optional) Object with decoding arguments.

Returns: string[] — List of decoded sequences.

PreTrainedTokenizer.decode(token_ids, [decode_args])

Decodes a sequence of token IDs back to a string.

Parameters

  • token_ids (number[] | bigint[] | Tensor) — List/Tensor of token IDs to decode.
  • decode_args (Object) optional — defaults to {}
    • skip_special_tokens (boolean) optional — defaults to false — If true, special tokens are removed from the output string.
    • clean_up_tokenization_spaces (boolean) optional — defaults to true — If true, spaces before punctuation and abbreviated forms are removed.

Returns: string — The decoded string.

Throws

  • Error — If token_ids is not a non-empty array of integers.
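
Example: Round-trip a string and guard against empty input. This is a sketch under the assumption that tokenizer was loaded with AutoTokenizer.from_pretrained(). roundTrip and safeDecode are hypothetical helpers, and safeDecode only handles plain arrays, not Tensors.

```javascript
// Hypothetical helper: encode a text and decode it again. Special tokens are
// skipped so the result can be compared with the input.
function roundTrip(tokenizer, text) {
  const ids = tokenizer.encode(text);
  return tokenizer.decode(ids, { skip_special_tokens: true });
}

// Hypothetical helper: decode() throws when it is not given a non-empty array
// of integers, so return an empty string for empty or non-array input instead.
function safeDecode(tokenizer, ids) {
  if (!Array.isArray(ids) || ids.length === 0) return '';
  return tokenizer.decode(ids, { skip_special_tokens: true });
}
```

A round trip is not guaranteed to reproduce the input exactly. An uncased tokenizer such as bert-base-uncased, for example, returns lower-cased text.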

PreTrainedTokenizer.decode_single(token_ids, decode_args)

Decode a single list of token ids to a string.

Parameters

  • token_ids (number[] | bigint[]) — List of token ids to decode.
  • decode_args (Object) — Optional arguments for decoding.
    • skip_special_tokens (boolean) optional — defaults to false — Whether to skip special tokens during decoding.
    • clean_up_tokenization_spaces (boolean | null) optional — defaults to null — Whether to clean up tokenization spaces during decoding. If null, the value is set to this.decoder.cleanup if it exists, falling back to this.clean_up_tokenization_spaces if it exists, falling back to true.

Returns: string — The decoded string.

PreTrainedTokenizer.get_chat_template(options)

Retrieve the chat template string used for tokenizing chat messages. This template is used internally by the apply_chat_template method and can also be used externally to retrieve the model's chat template for better generation tracking.

Parameters

  • options (Object) — An optional object containing the following properties:
    • chat_template (string | null) optional — defaults to null — A Jinja template or the name of a template to use for this conversion. It is usually not necessary to pass anything to this argument, as the model's template will be used by default.
    • tools (Object[]) optional — defaults to null — A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information.

Returns: string — The chat template string.
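
Example: Check whether a tokenizer can render chat prompts. hasChatTemplate is a hypothetical helper. The sketch assumes that get_chat_template() throws when no template is configured and none is passed in; a missing or empty return value is treated the same way, so the helper should behave sensibly even if that assumption does not hold.

```javascript
// Hypothetical helper: report whether a tokenizer has a usable chat template.
// Assumption: get_chat_template() throws when no template is available.
function hasChatTemplate(tokenizer) {
  try {
    const template = tokenizer.get_chat_template();
    return typeof template === 'string' && template.length > 0;
  } catch (err) {
    // No template configured (or the lookup failed): treat as unsupported.
    return false;
  }
}
```

A check like this can be used to fall back to plain-text prompting for models that were not trained for chat.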

PreTrainedTokenizer.apply_chat_template(conversation, [options])

Converts a list of message objects with "role" and "content" keys to a list of token ids. This method is intended for use with chat models, and will read the tokenizer's chat_template attribute to determine the format and control tokens to use when converting.

See the chat templating guide for more information.

Parameters

  • conversation (Message[]) — A list of message objects with "role" and "content" keys, representing the chat history so far.
  • options (ApplyChatTemplateOptions<TTokenize, TReturnTensor, TReturnDict>) optional — Options controlling template rendering and tokenization.

Returns: ApplyChatTemplateReturn<TTokenize, TReturnTensor, TReturnDict> — The tokenized output.

Example: Applying a chat template to a conversation.

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/mistral-tokenizer-v1");

const chat = [
  { "role": "user", "content": "Hello, how are you?" },
  { "role": "assistant", "content": "I'm doing great. How can I help you today?" },
  { "role": "user", "content": "I'd like to show off how chat templating works!" },
]

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false, return_dict: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]
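
Example: Build a prompt that ends where the assistant should start writing. When the rendered text is fed to a model for generation, add_generation_prompt: true is usually wanted. buildPrompt is a hypothetical helper, and tokenizer is assumed to be a chat-capable tokenizer loaded with AutoTokenizer.from_pretrained().

```javascript
// Hypothetical helper: append the next user turn to the history and render
// the whole conversation as a prompt string.
function buildPrompt(tokenizer, history, userText) {
  // Copy the history so the caller's array is not modified.
  const conversation = [...history, { role: 'user', content: userText }];
  return tokenizer.apply_chat_template(conversation, {
    tokenize: false, // return the rendered string, not token ids
    add_generation_prompt: true, // only has an effect if the template supports it
  });
}
```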

Type Definitions

PretrainedTokenizerOptions

Type: PretrainedOptions

TextContent

Properties

  • type ('text') — The type of content (must be 'text').
  • text (string) — The text content.

ImageContent

Properties

  • type ('image') — The type of content (must be 'image').

  • image (string | RawImage) optional — Optional URL or instance of the image.

    Note: This works for SmolVLM. Qwen2VL and Idefics3 have different implementations.

MessageContent

A single content block inside a chat message. Extend the union to add custom types (e.g. AudioContent) when targeting a specific model.

Type: TextContent | ImageContent | { type: string & {}, [key: string]: any }

Message

Properties

  • role ('user' | 'assistant' | 'system' | (string & {})) — The role of the message.
  • content (string | MessageContent[]) — The content of the message. Can be a simple string or an array of content objects.
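
Example: Build a message whose content mixes an image and text. This sketch follows the TextContent and ImageContent shapes listed above. makeImageMessage is a hypothetical helper, and the order of the content blocks is an arbitrary choice; which shapes a given model's template accepts varies, as the note under ImageContent points out.

```javascript
// Hypothetical helper: a user message with one image block and one text block.
// `image` may be a URL string (or a RawImage instance, per ImageContent).
function makeImageMessage(text, image) {
  return {
    role: 'user',
    content: [
      { type: 'image', image },
      { type: 'text', text },
    ],
  };
}

// A plain-text message can use a string for `content` instead.
const textOnly = { role: 'user', content: 'Hello, how are you?' };
const withImage = makeImageMessage('What is in this picture?', 'https://example.com/cat.png');
```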

BatchEncoding

The object returned from tokenizer(text). The fields are a Tensor by default, or an Array when return_tensor: false is passed.

Properties

  • input_ids (any) — Token ids to be fed to the model.
  • attention_mask (any) — Mask indicating which tokens should be attended to (1) versus padded (0).
  • token_type_ids (any) optional — Segment ids, present only for tokenizers that distinguish sequence A vs B (e.g. BERT).
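
Example: Convert a Tensor field into nested arrays of numbers. The AutoTokenizer example shows input_ids as a Tensor with a flat BigInt64Array in data and the shape in dims. The sketch below relies only on those two properties. tensorToRows is a hypothetical helper and assumes a 2D tensor.

```javascript
// Hypothetical helper: turn a 2D int64 tensor-like object ({ data, dims })
// into an array of rows holding plain JavaScript numbers.
function tensorToRows(tensor) {
  const [rows, cols] = tensor.dims;
  const out = [];
  for (let r = 0; r < rows; r++) {
    // Slice one row out of the flat buffer and convert each BigInt to a number.
    const row = Array.from(tensor.data.slice(r * cols, (r + 1) * cols), (v) => Number(v));
    out.push(row);
  }
  return out;
}
```

Passing return_tensor: false to the tokenizer avoids the conversion altogether when Tensors are not needed.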

TokenizerCallOptions

Options passed to tokenizer(text, options).

Properties

  • text_pair (any) optional — defaults to null — Optional second sequence to be encoded. Must match the shape of text — string when text is a string, array when text is an array.
  • padding (boolean | 'max_length') optional — defaults to false — Whether to pad the input sequences.
  • add_special_tokens (boolean) optional — defaults to true — Whether or not to add the special tokens associated with the corresponding model.
  • truncation (boolean | null) optional — defaults to null — Whether to truncate the input sequences.
  • max_length (number | null) optional — defaults to null — Maximum length of the returned list and optionally padding length.
  • return_tensor (any) optional — defaults to true — Whether to return the results as Tensors or arrays.
  • return_token_type_ids (boolean | null) optional — defaults to null — Whether to return the token type ids.
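
Example: Tokenize a batch with padding and truncation. This is a sketch that combines the options above. encodeBatch is a hypothetical helper, and tokenizer is assumed to be a callable instance loaded with AutoTokenizer.from_pretrained().

```javascript
// Hypothetical helper: tokenize several texts into equal-length arrays and
// report how many real (non-padding) tokens each sequence contains.
function encodeBatch(tokenizer, texts, maxLength) {
  const { input_ids, attention_mask } = tokenizer(texts, {
    padding: true, // pad shorter sequences
    truncation: true, // cut sequences longer than max_length
    max_length: maxLength,
    return_tensor: false, // plain arrays instead of Tensors
  });
  // The attention mask is 1 for real tokens and 0 for padding, so its sum
  // is the unpadded length of each sequence.
  const lengths = attention_mask.map((mask) => mask.reduce((sum, m) => sum + Number(m), 0));
  return { input_ids, attention_mask, lengths };
}
```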

ApplyChatTemplateOptions

Properties

  • chat_template (string | null) optional — defaults to null — A Jinja template to use for this conversion. If omitted, the model's chat template is used.
  • tools (Object[] | null) optional — defaults to null — JSON Schema tool definitions exposed to templates that support function calling. See the chat templating guide.
  • documents (Record<string, string>[] | null) optional — defaults to null — Documents exposed to templates that support retrieval-augmented generation. See the RAG section of the chat templating guide.
  • add_generation_prompt (boolean) optional — defaults to false — Whether to end the prompt with the token(s) that indicate the start of an assistant message. The template must support this argument for it to have any effect.
  • tokenize (any) optional — defaults to true — Whether to tokenize the output. If false, the output will be a string.
  • padding (boolean) optional — defaults to false — Whether to pad sequences to the maximum length. Has no effect if tokenize is false.
  • truncation (boolean) optional — defaults to false — Whether to truncate sequences to the maximum length. Has no effect if tokenize is false.
  • max_length (number | null) optional — defaults to null — Maximum length (in tokens) to use for padding or truncation. If omitted, the tokenizer's max_length is used. Has no effect if tokenize is false.
  • return_tensor (any) optional — defaults to true — Whether to return the output as a Tensor or an Array. Has no effect if tokenize is false.
  • return_dict (any) optional — defaults to true — Whether to return a dictionary with named outputs. Has no effect if tokenize is false.
  • tokenizer_kwargs (Object) optional — defaults to {} — Additional options to pass to the tokenizer.
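
Example: Pass a tool definition to a template that supports function calling. This is a hedged sketch: the tool name and its fields are made up for illustration, and the outer { type: 'function', function: { ... } } wrapper is an assumption based on the convention many chat templates follow. Check the chat templating guide for the shape your model expects. renderWithTools is a hypothetical helper.

```javascript
// Illustrative tool definition in JSON Schema form (name and fields are
// invented for this example).
const weatherTool = {
  type: 'function',
  function: {
    name: 'get_current_temperature',
    description: 'Get the current temperature at a location.',
    parameters: {
      type: 'object',
      properties: {
        location: { type: 'string', description: 'City and country, e.g. "Paris, France"' },
      },
      required: ['location'],
    },
  },
};

// Hypothetical helper: render a conversation together with the available tools.
// If the template does not support function calling, `tools` has no effect.
function renderWithTools(tokenizer, conversation, tools) {
  return tokenizer.apply_chat_template(conversation, {
    tools,
    tokenize: false,
    add_generation_prompt: true,
  });
}
```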

Callbacks

PreTrainedTokenizerCallback

Parameters

Returns: BatchEncoding<BatchEncodingItem<string[]?, boolean = true>>
