	# tokenizers

	Tokenizers turn text into the integer ids a model understands, and
	decode model output back into strings. Use `AutoTokenizer.from_pretrained()`
	to load the right implementation for a model ID — the class is chosen from
	the tokenizer's `tokenizer_config.json`.

	For chat-trained models, `tokenizer.apply_chat_template()` renders an
	OpenAI-style message list into the model's native prompt format.

	## Classes

	### AutoTokenizer

	Helper class that instantiates pretrained tokenizers through its `from_pretrained` function.
	The tokenizer class is chosen from the type specified in the tokenizer config.

	Example: Create an `AutoTokenizer` and use it to tokenize a sentence.
	This will automatically detect the tokenizer type based on the tokenizer class defined in `tokenizer_config.json`.

	```javascript
	import { AutoTokenizer } from '@huggingface/transformers';

	const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
	const { input_ids } = await tokenizer('I love transformers!');
	// Tensor {
	// data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
	// dims: [1, 6],
	// type: 'int64',
	// size: 6,
	// }
	```

	#### `AutoTokenizer.from_pretrained(pretrained_model_name_or_path, options)`

	Instantiate one of the tokenizer classes of the library from a pretrained model.

	The tokenizer class to instantiate is selected based on the `tokenizer_class` property of the config object
	(either passed as an argument or loaded from `pretrained_model_name_or_path` if possible).

	Parameters

	- `pretrained_model_name_or_path` (`string`) — The name or path of the pretrained model. Can be either:
	  - A string, the model id of a pretrained tokenizer hosted inside a model repo on huggingface.co.
	    Valid model ids can be located at the root level, like `bert-base-uncased`, or namespaced under a
	    user or organization name, like `dbmdz/bert-base-german-cased`.
	  - A path to a directory containing tokenizer files, e.g., `./my_model_directory/`.
	- `options` ([`PretrainedTokenizerOptions`](./tokenizers#module_tokenizers.PretrainedTokenizerOptions)) — Additional options for loading the tokenizer.

	Returns: `Promise`<[`PreTrainedTokenizer`](./tokenizers#module_tokenizers.PreTrainedTokenizer)> — The loaded tokenizer.
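
	The selection step described above can be pictured as a registry lookup. The following is only a sketch, not the library's source: `TOKENIZER_CLASSES`, `resolveTokenizerClass`, and the stand-in classes are hypothetical, and the fallback to the base class for unknown names is an assumption.

	```javascript
	// Hypothetical stand-ins for real tokenizer classes.
	class PreTrainedTokenizer {}
	class BertTokenizer extends PreTrainedTokenizer {}
	class LlamaTokenizer extends PreTrainedTokenizer {}

	// Hypothetical registry keyed by the `tokenizer_class` value from tokenizer_config.json.
	const TOKENIZER_CLASSES = new Map([
	  ['BertTokenizer', BertTokenizer],
	  ['LlamaTokenizer', LlamaTokenizer],
	]);

	// Pick a class from the config.
	// Assumption: unknown or missing names fall back to the base class.
	function resolveTokenizerClass(tokenizerConfig) {
	  const name = tokenizerConfig.tokenizer_class ?? 'PreTrainedTokenizer';
	  return TOKENIZER_CLASSES.get(name) ?? PreTrainedTokenizer;
	}

	const cls = resolveTokenizerClass({ tokenizer_class: 'BertTokenizer' });
	console.log(cls.name); // "BertTokenizer"
	```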

	### PreTrainedTokenizer

	`PreTrainedTokenizer` is the base class for all tokenizers in Transformers.js.

	#### `PreTrainedTokenizer(text, [options])`

	Parameters

	- `text` (`string` \| `string[]`)
	- `options` ([`TokenizerCallOptions`](./tokenizers#module_tokenizers.TokenizerCallOptions)<`string[]?`, `boolean = true`>) _optional_

	Returns: [`BatchEncoding`](./tokenizers#module_tokenizers.BatchEncoding)<`BatchEncodingItem`<`string[]?`, `boolean = true`>>

	#### `PreTrainedTokenizer.constructor(tokenizerJSON, tokenizerConfig)`

	Create a new PreTrainedTokenizer instance.

	Parameters

	- `tokenizerJSON` (`Object`) — The JSON of the tokenizer.
	- `tokenizerConfig` (`Object`) — The config of the tokenizer.

	#### `PreTrainedTokenizer.from_pretrained(pretrained_model_name_or_path, options)`

	Loads a pretrained tokenizer from the given `pretrained_model_name_or_path`.

	Parameters

	- `pretrained_model_name_or_path` (`string`) — The path to the pretrained tokenizer.
	- `options` ([`PretrainedTokenizerOptions`](./tokenizers#module_tokenizers.PretrainedTokenizerOptions)) — Additional options for loading the tokenizer.

	Returns: `Promise`<[`PreTrainedTokenizer`](./tokenizers#module_tokenizers.PreTrainedTokenizer)> — A new instance of the `PreTrainedTokenizer` class.

	Throws

	- `Error` — If the `tokenizer.json` or `tokenizer_config.json` file is not found in `pretrained_model_name_or_path`.

	#### `PreTrainedTokenizer.convert_tokens_to_ids(tokens)`

	Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

	Parameters

	- `tokens` (`string` \| `string[]`) — One or several token(s) to convert to token id(s).

	Returns: `number` \| `number[]` — The token id or list of token ids.
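
	Conceptually, this is a vocabulary lookup that accepts either one token or a list. The sketch below uses a toy vocabulary and a hypothetical `convertTokensToIds` helper; mapping unknown tokens to an `[UNK]` id is an assumption made for the illustration.

	```javascript
	// Toy vocabulary; a real tokenizer loads its full vocabulary from tokenizer.json.
	const vocab = new Map([
	  ['[UNK]', 100], ['[CLS]', 101], ['[SEP]', 102],
	  ['i', 1045], ['love', 2293], ['!', 999],
	]);

	// Hypothetical helper: accepts one token or an array of tokens.
	// Assumption: tokens missing from the vocabulary map to the [UNK] id.
	function convertTokensToIds(tokens) {
	  const lookup = (token) => vocab.get(token) ?? vocab.get('[UNK]');
	  return Array.isArray(tokens) ? tokens.map(lookup) : lookup(tokens);
	}

	console.log(convertTokensToIds('love'));             // 2293
	console.log(convertTokensToIds(['i', 'love', '?'])); // [ 1045, 2293, 100 ]
	```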

	#### `PreTrainedTokenizer.tokenize(text, options)`

	Converts a string into a sequence of tokens.

	Parameters

	- `text` (`string`) — The sequence to be encoded.
	- `options` (`Object`) — An optional object containing the following properties:
	  - `pair` (`string` \| `null`) _optional_ — A second sequence to be encoded with the first.
	  - `add_special_tokens` (`boolean`) _optional_ — defaults to `false` — Whether or not to add the special tokens associated with the corresponding model.

	Returns: `string[]` — The list of tokens.

	#### `PreTrainedTokenizer.encode(text, options)`

	Encodes a single text or a pair of texts using the model's tokenizer.

	Parameters

	- `text` (`string`) — The text to encode.
	- `options` (`Object`) — An optional object containing the following properties:
	  - `text_pair` (`string` \| `null`) _optional_ — defaults to `null` — The optional second text to encode.
	  - `add_special_tokens` (`boolean`) _optional_ — defaults to `true` — Whether or not to add the special tokens associated with the corresponding model.
	  - `return_token_type_ids` (`boolean` \| `null`) _optional_ — defaults to `null` — Whether to return token_type_ids.

	Returns: `number[]` — An array of token IDs representing the encoded text(s).
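
	To illustrate what `text_pair` and `add_special_tokens` change, here is a minimal sketch that assumes a BERT-style layout (`[CLS] A [SEP] B [SEP]`). The `buildInputs` helper and the special-token ids are hypothetical, and other model families arrange special tokens differently.

	```javascript
	// Hypothetical special-token ids for the illustration (BERT-style layout assumed).
	const CLS = 101;
	const SEP = 102;

	// Combine already-converted token ids for one text or for a pair of texts.
	function buildInputs(idsA, idsB = null, { add_special_tokens = true } = {}) {
	  if (!add_special_tokens) {
	    const input_ids = idsB ? [...idsA, ...idsB] : [...idsA];
	    return { input_ids, token_type_ids: input_ids.map((_, i) => (i < idsA.length ? 0 : 1)) };
	  }
	  const first = [CLS, ...idsA, SEP];
	  const second = idsB ? [...idsB, SEP] : [];
	  return {
	    input_ids: [...first, ...second],
	    // Segment A (with its special tokens) is 0, segment B is 1.
	    token_type_ids: [...first.map(() => 0), ...second.map(() => 1)],
	  };
	}

	const pair = buildInputs([1045, 2293], [999]);
	// pair.input_ids      is [101, 1045, 2293, 102, 999, 102]
	// pair.token_type_ids is [0, 0, 0, 0, 1, 1]
	console.log(pair);
	```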

	#### `PreTrainedTokenizer.batch_decode(batch, decode_args)`

	Decode a batch of tokenized sequences.

	Parameters

	- `batch` (`number[][]` \| [`Tensor`](./utils/tensor#module_utils/tensor.Tensor)) — List/Tensor of tokenized input sequences.
	- `decode_args` (`Object`) — (Optional) Object with decoding arguments.

	Returns: `string[]` — List of decoded sequences.

	#### `PreTrainedTokenizer.decode(token_ids, [decode_args])`

	Decodes a sequence of token IDs back to a string.

	Parameters

	- `token_ids` (`number[]` \| `bigint[]` \| [`Tensor`](./utils/tensor#module_utils/tensor.Tensor)) — List/Tensor of token IDs to decode.
	- `decode_args` (`Object`) _optional_ — defaults to `{}`
	  - `skip_special_tokens` (`boolean`) _optional_ — defaults to `false` — If true, special tokens are removed from the output string.
	  - `clean_up_tokenization_spaces` (`boolean`) _optional_ — defaults to `true` — If true, spaces before punctuation and abbreviated forms are removed.

	Returns: `string` — The decoded string.

	Throws

	- `Error` — If `token_ids` is not a non-empty array of integers.
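
	The effect of the two decoding flags can be sketched on plain token strings. This is an illustration under stated assumptions, not the library's decoder: `decodeTokens`, `cleanUpSpaces`, the special-token set, and the exact clean-up rules are all hypothetical, and the sketch works on tokens rather than ids.

	```javascript
	// Hypothetical set of special tokens for the illustration.
	const SPECIAL_TOKENS = new Set(['[CLS]', '[SEP]', '[PAD]']);

	// Remove spaces before punctuation and before common English contractions.
	// The exact rule set is an assumption for this sketch.
	function cleanUpSpaces(text) {
	  return text
	    .replace(/ ([.,!?])/g, '$1')
	    .replace(/ (n't|'m|'s|'ve|'re)/g, '$1');
	}

	function decodeTokens(tokens, { skip_special_tokens = false, clean_up_tokenization_spaces = true } = {}) {
	  const kept = skip_special_tokens ? tokens.filter((t) => !SPECIAL_TOKENS.has(t)) : tokens;
	  const text = kept.join(' ');
	  return clean_up_tokenization_spaces ? cleanUpSpaces(text) : text;
	}

	const tokens = ['[CLS]', 'i', 'do', "n't", 'love', 'bugs', '!', '[SEP]'];
	console.log(decodeTokens(tokens, { skip_special_tokens: true }));
	// i don't love bugs!
	console.log(decodeTokens(tokens, { clean_up_tokenization_spaces: false }));
	// [CLS] i do n't love bugs ! [SEP]
	```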

	#### `PreTrainedTokenizer.decode_single(token_ids, decode_args)`

	Decode a single list of token ids to a string.

	Parameters

	- `token_ids` (`number[]` \| `bigint[]`) — List of token ids to decode
	- `decode_args` (`Object`) — Optional arguments for decoding
	  - `skip_special_tokens` (`boolean`) _optional_ — defaults to `false` — Whether to skip special tokens during decoding
	  - `clean_up_tokenization_spaces` (`boolean` \| `null`) _optional_ — defaults to `null` — Whether to clean up tokenization spaces during decoding.
	    If null, the value is set to `this.decoder.cleanup` if it exists, falling back to `this.clean_up_tokenization_spaces` if it exists, falling back to `true`.

	Returns: `string` — The decoded string
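
	The fallback order for `clean_up_tokenization_spaces` maps naturally onto a chain of nullish-coalescing operators. The `resolveCleanUp` helper below is a hypothetical sketch of that order, using plain objects as stand-ins for the tokenizer.

	```javascript
	// Hypothetical helper mirroring the fallback order described above:
	// explicit argument, then decoder.cleanup, then the tokenizer-level setting, then true.
	function resolveCleanUp(decodeArgs, tokenizer) {
	  return decodeArgs.clean_up_tokenization_spaces
	    ?? tokenizer.decoder?.cleanup
	    ?? tokenizer.clean_up_tokenization_spaces
	    ?? true;
	}

	// Stand-in objects; only the fields read by the helper are modeled.
	console.log(resolveCleanUp({}, { decoder: { cleanup: false } }));        // false
	console.log(resolveCleanUp({}, { clean_up_tokenization_spaces: false })); // false
	console.log(resolveCleanUp({}, {}));                                      // true
	console.log(resolveCleanUp({ clean_up_tokenization_spaces: true }, { decoder: { cleanup: false } })); // true
	```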

	#### `PreTrainedTokenizer.get_chat_template(options)`

	Retrieve the chat template string used for tokenizing chat messages. This template is used
	internally by the `apply_chat_template` method and can also be used externally to retrieve the model's chat
	template for better generation tracking.

	Parameters

	- `options` (`Object`) — An optional object containing the following properties:
	  - `chat_template` (`string` \| `null`) _optional_ — defaults to `null` — A Jinja template or the name of a template to use for this conversion.
	    It is usually not necessary to pass anything to this argument,
	    as the model's template will be used by default.
	  - `tools` (`Object[]`) _optional_ — defaults to `null` — A list of tools (callable functions) that will be accessible to the model. If the template does not
	    support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema,
	    giving the name, description and argument types for the tool. See our
	    [chat templating guide](https://huggingface.co/docs/transformers/main/en/chat_templating#automated-function-conversion-for-tool-use)
	    for more information.

	Returns: `string` — The chat template string.

	#### `PreTrainedTokenizer.apply_chat_template(conversation, [options])`

	Converts a list of message objects with `"role"` and `"content"` keys to a list of token
	ids. This method is intended for use with chat models, and will read the tokenizer's chat_template attribute to
	determine the format and control tokens to use when converting.

	See the [chat templating guide](https://huggingface.co/docs/transformers/chat_templating) for more information.

	Parameters

	- `conversation` ([`Message`](./tokenizers#module_tokenizers.Message)[]) — A list of message objects with `"role"` and `"content"` keys,
	representing the chat history so far.
	- `options` ([`ApplyChatTemplateOptions`](./tokenizers#module_tokenizers.ApplyChatTemplateOptions)<`TTokenize`, `TReturnTensor`, `TReturnDict`>) _optional_ — Options controlling
	template rendering and tokenization.

	Returns: `ApplyChatTemplateReturn`<`TTokenize`, `TReturnTensor`, `TReturnDict`> — The tokenized output.

	Example: Applying a chat template to a conversation.
	```javascript
	import { AutoTokenizer } from "@huggingface/transformers";

	const tokenizer = await AutoTokenizer.from_pretrained("Xenova/mistral-tokenizer-v1");

	const chat = [
	  { "role": "user", "content": "Hello, how are you?" },
	  { "role": "assistant", "content": "I'm doing great. How can I help you today?" },
	  { "role": "user", "content": "I'd like to show off how chat templating works!" },
	];

	const text = tokenizer.apply_chat_template(chat, { tokenize: false });
	// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

	const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false, return_dict: false });
	// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]
	```
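
	To make the rendering step less opaque, the sketch below reproduces a Mistral-style prompt with a hand-written function. It is a stand-in only: real chat templates are Jinja strings shipped with the tokenizer, and `renderChat` and its `bos`/`eos` options are hypothetical names.

	```javascript
	// Hand-written stand-in for a Mistral-style chat template (not the Jinja template itself).
	function renderChat(messages, { bos = '<s>', eos = '</s>' } = {}) {
	  let prompt = bos;
	  for (const { role, content } of messages) {
	    if (role === 'user') {
	      // User turns are wrapped in instruction markers.
	      prompt += `[INST] ${content} [/INST]`;
	    } else if (role === 'assistant') {
	      // Assistant turns end with the end-of-sequence token and a space.
	      prompt += `${content}${eos} `;
	    } else {
	      throw new Error(`Unsupported role in this sketch: ${role}`);
	    }
	  }
	  return prompt;
	}

	const sampleChat = [
	  { role: 'user', content: 'Hello, how are you?' },
	  { role: 'assistant', content: "I'm doing great. How can I help you today?" },
	  { role: 'user', content: "I'd like to show off how chat templating works!" },
	];

	console.log(renderChat(sampleChat));
	// <s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
	```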

	## Type Definitions

	### PretrainedTokenizerOptions

	_Type:_ [`PretrainedOptions`](./utils/hub#module_utils/hub.PretrainedOptions)

	### TextContent

	Properties

	- `type` (`'text'`) — The type of content (must be 'text').
	- `text` (`string`) — The text content.

	### ImageContent

	Properties

	- `type` (`'image'`) — The type of content (must be 'image').
	- `image` (`string` \| [`RawImage`](./utils/image#module_utils/image.RawImage)) _optional_ — Optional URL or instance of the image.

	Note: This works for SmolVLM. Qwen2VL and Idefics3 have different implementations.

	### MessageContent

	A single content block inside a chat message. Extend the union to add
	custom types (e.g. `AudioContent`) when targeting a specific model.

	_Type:_ [`TextContent`](./tokenizers#module_tokenizers.TextContent) \| [`ImageContent`](./tokenizers#module_tokenizers.ImageContent) \| `{ type: string & {}, [key: string]: any }`

	### Message

	Properties

	- `role` (`'user'` \| `'assistant'` \| `'system'` \| `(string & {})`) — The role of the message.
	- `content` (`string` \| [`MessageContent`](./tokenizers#module_tokenizers.MessageContent)[]) — The content of the message. Can be a simple string or an array of content objects.
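
	As an illustration of the two shapes `content` can take, the sketch below builds one text-only message and one message made of content blocks. The `messageText` helper is hypothetical and the image URL is a placeholder.

	```javascript
	// A text-only message and a multimodal message that uses content blocks.
	const messages = [
	  { role: 'system', content: 'You are a helpful assistant.' },
	  {
	    role: 'user',
	    content: [
	      { type: 'image', image: 'https://example.com/cat.png' }, // placeholder URL
	      { type: 'text', text: 'What is in this picture?' },
	    ],
	  },
	];

	// Hypothetical helper: collect the plain text of a message, whichever form `content` takes.
	function messageText(message) {
	  if (typeof message.content === 'string') return message.content;
	  return message.content
	    .filter((block) => block.type === 'text')
	    .map((block) => block.text)
	    .join('\n');
	}

	console.log(messages.map(messageText));
	// [ 'You are a helpful assistant.', 'What is in this picture?' ]
	```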

	### BatchEncoding

	The object returned from `tokenizer(text)`. The fields are a `Tensor` by
	default, or an `Array` when `return_tensor: false` is passed.

	Properties

	- `input_ids` (`any`) — Token ids to be fed to the model.
	- `attention_mask` (`any`) — Mask indicating which tokens should be attended to (1) versus padded (0).
	- `token_type_ids` (`any`) _optional_ — Segment ids, present only for tokenizers that distinguish sequence A vs B (e.g. BERT).

	### TokenizerCallOptions

	Options passed to `tokenizer(text, options)`.

	Properties

	- `text_pair` (`any`) _optional_ — defaults to `null` — Optional second sequence to be encoded. Must match the shape of `text` — string when `text` is a string, array when `text` is an array.
	- `padding` (`boolean` \| `'max_length'`) _optional_ — defaults to `false` — Whether to pad the input sequences.
	- `add_special_tokens` (`boolean`) _optional_ — defaults to `true` — Whether or not to add the special tokens associated with the corresponding model.
	- `truncation` (`boolean` \| `null`) _optional_ — defaults to `null` — Whether to truncate the input sequences.
	- `max_length` (`number` \| `null`) _optional_ — defaults to `null` — Maximum length of the returned list and optionally padding length.
	- `return_tensor` (`any`) _optional_ — defaults to `true` — Whether to return the results as Tensors or arrays.
	- `return_token_type_ids` (`boolean` \| `null`) _optional_ — defaults to `null` — Whether to return the token type ids.
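
	The interplay of `padding`, `truncation`, and `max_length` can be sketched on plain arrays of ids. The `padBatch` helper is hypothetical, and right-side padding, a pad id of 0, and the `pad_id` option itself are assumptions made for the illustration.

	```javascript
	// Hypothetical helper: truncate and pad a batch of id arrays, and build the attention mask.
	// Assumptions: right-side padding, pad id 0 by default.
	function padBatch(batch, { padding = false, truncation = null, max_length = null, pad_id = 0 } = {}) {
	  // Truncate each row only when truncation is requested and a length is given.
	  const rows = batch.map((ids) => (truncation && max_length !== null ? ids.slice(0, max_length) : [...ids]));

	  // padding: true pads to the longest row; 'max_length' pads to max_length.
	  let target = null;
	  if (padding === true) target = Math.max(...rows.map((r) => r.length));
	  if (padding === 'max_length' && max_length !== null) target = max_length;

	  const fill = (row, value) => (target === null ? [] : Array(Math.max(0, target - row.length)).fill(value));
	  return {
	    input_ids: rows.map((r) => [...r, ...fill(r, pad_id)]),
	    // 1 marks real tokens, 0 marks padding.
	    attention_mask: rows.map((r) => [...r.map(() => 1), ...fill(r, 0)]),
	  };
	}

	const padded = padBatch([[5, 6, 7], [8]], { padding: true });
	// padded.input_ids      is [[5, 6, 7], [8, 0, 0]]
	// padded.attention_mask is [[1, 1, 1], [1, 0, 0]]
	console.log(padded);
	```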

	### ApplyChatTemplateOptions

	Properties

	- `chat_template` (`string` \| `null`) _optional_ — defaults to `null` — A Jinja template to use for this conversion. If omitted, the model's chat template is used.
	- `tools` (`Object[]` \| `null`) _optional_ — defaults to `null` — JSON Schema tool definitions exposed to templates that support function calling.
	See the [chat templating guide](https://huggingface.co/docs/transformers/main/en/chat_templating#automated-function-conversion-for-tool-use).
	- `documents` (`Record`<`string`, `string`>[] \| `null`) _optional_ — defaults to `null` — Documents exposed to templates that support retrieval-augmented generation.
	See the [RAG section](https://huggingface.co/docs/transformers/main/en/chat_templating#arguments-for-RAG) of the chat templating guide.
	- `add_generation_prompt` (`boolean`) _optional_ — defaults to `false` — Whether to end the prompt with the token(s) that indicate the start of an assistant message.
	The template must support this argument for it to have any effect.
	- `tokenize` (`any`) _optional_ — defaults to `true` — Whether to tokenize the output. If false, the output will be a string.
	- `padding` (`boolean`) _optional_ — defaults to `false` — Whether to pad sequences to the maximum length. Has no effect if tokenize is false.
	- `truncation` (`boolean`) _optional_ — defaults to `false` — Whether to truncate sequences to the maximum length. Has no effect if tokenize is false.
	- `max_length` (`number` \| `null`) _optional_ — defaults to `null` — Maximum length (in tokens) to use for padding or truncation. If omitted, the tokenizer's `max_length` is used.
	Has no effect if tokenize is false.
	- `return_tensor` (`any`) _optional_ — defaults to `true` — Whether to return the output as a Tensor or an Array. Has no effect if tokenize is false.
	- `return_dict` (`any`) _optional_ — defaults to `true` — Whether to return a dictionary with named outputs. Has no effect if tokenize is false.
	- `tokenizer_kwargs` (`Object`) _optional_ — defaults to `{}` — Additional options to pass to the tokenizer.

	## Callbacks

	### PreTrainedTokenizerCallback

	Parameters

	- `text` (`string` \| `string[]`)
	- `options` ([`TokenizerCallOptions`](./tokenizers#module_tokenizers.TokenizerCallOptions)<`string[]?`, `boolean = true`>) _optional_

	Returns: [`BatchEncoding`](./tokenizers#module_tokenizers.BatchEncoding)<`BatchEncodingItem`<`string[]?`, `boolean = true`>>
