| # tokenizers | |
| Tokenizers are used to prepare textual inputs for a model. | |
| **Example:** Create an `AutoTokenizer` and use it to tokenize a sentence. | |
| The tokenizer type is detected automatically from the tokenizer class declared in `tokenizer_config.json`. | |
| ```javascript | |
| import { AutoTokenizer } from '@huggingface/transformers'; | |
| const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased'); | |
| const { input_ids } = await tokenizer('I love transformers!'); | |
| // Tensor { | |
| // data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n], | |
| // dims: [1, 6], | |
| // type: 'int64', | |
| // size: 6, | |
| // } | |
| ``` | |
| * [tokenizers](#module_tokenizers) | |
| * _static_ | |
| * [.TokenizerModel](#module_tokenizers.TokenizerModel) ⇐ [<code>Callable</code>](#Callable) | |
| * [`new TokenizerModel(config)`](#new_module_tokenizers.TokenizerModel_new) | |
| * _instance_ | |
| * [`.vocab`](#module_tokenizers.TokenizerModel+vocab) : <code>Array.<string></code> | |
| * [`.tokens_to_ids`](#module_tokenizers.TokenizerModel+tokens_to_ids) : <code>Map.<string, number></code> | |
| * [`.fuse_unk`](#module_tokenizers.TokenizerModel+fuse_unk) : <code>boolean</code> | |
| * [`._call(tokens)`](#module_tokenizers.TokenizerModel+_call) ⇒ <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers.TokenizerModel+encode) ⇒ <code>Array.<string></code> | |
| * [`.convert_tokens_to_ids(tokens)`](#module_tokenizers.TokenizerModel+convert_tokens_to_ids) ⇒ <code>Array.<number></code> | |
| * [`.convert_ids_to_tokens(ids)`](#module_tokenizers.TokenizerModel+convert_ids_to_tokens) ⇒ <code>Array.<string></code> | |
| * _static_ | |
| * [`.fromConfig(config, ...args)`](#module_tokenizers.TokenizerModel.fromConfig) ⇒ <code>TokenizerModel</code> | |
| * [.PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * [`new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.PreTrainedTokenizer_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers.PreTrainedTokenizer+added_tokens) : <code>Array.<AddedToken></code> | |
| * [`.added_tokens_map`](#module_tokenizers.PreTrainedTokenizer+added_tokens_map) : <code>Map.<string, AddedToken></code> | |
| * [`.remove_space`](#module_tokenizers.PreTrainedTokenizer+remove_space) : <code>boolean</code> | |
| * [`._call(text, options)`](#module_tokenizers.PreTrainedTokenizer+_call) ⇒ <code>BatchEncoding</code> | |
| * [`._encode_text(text)`](#module_tokenizers.PreTrainedTokenizer+_encode_text) ⇒ <code>Array<string></code> | <code>null</code> | |
| * [`._tokenize_helper(text, options)`](#module_tokenizers.PreTrainedTokenizer+_tokenize_helper) ⇒ <code>*</code> | |
| * [`.tokenize(text, options)`](#module_tokenizers.PreTrainedTokenizer+tokenize) ⇒ <code>Array.<string></code> | |
| * [`.encode(text, options)`](#module_tokenizers.PreTrainedTokenizer+encode) ⇒ <code>Array.<number></code> | |
| * [`.batch_decode(batch, decode_args)`](#module_tokenizers.PreTrainedTokenizer+batch_decode) ⇒ <code>Array.<string></code> | |
| * [`.decode(token_ids, [decode_args])`](#module_tokenizers.PreTrainedTokenizer+decode) ⇒ <code>string</code> | |
| * [`.decode_single(token_ids, decode_args)`](#module_tokenizers.PreTrainedTokenizer+decode_single) ⇒ <code>string</code> | |
| * [`.get_chat_template(options)`](#module_tokenizers.PreTrainedTokenizer+get_chat_template) ⇒ <code>string</code> | |
| * [`.apply_chat_template(conversation, options)`](#module_tokenizers.PreTrainedTokenizer+apply_chat_template) ⇒ <code>string</code> | [<code>Tensor</code>](#Tensor) | <code>Array<number></code> | <code>Array<Array<number>></code> | <code>BatchEncoding</code> | |
| * _static_ | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.PreTrainedTokenizer.from_pretrained) ⇒ <code>Promise.<PreTrainedTokenizer></code> | |
| * [.BertTokenizer](#module_tokenizers.BertTokenizer) ⇐ <code>PreTrainedTokenizer</code> | |
| * [.AlbertTokenizer](#module_tokenizers.AlbertTokenizer) ⇐ <code>PreTrainedTokenizer</code> | |
| * [.NllbTokenizer](#module_tokenizers.NllbTokenizer) | |
| * [`._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)`](#module_tokenizers.NllbTokenizer+_build_translation_inputs) ⇒ <code>Object</code> | |
| * [.M2M100Tokenizer](#module_tokenizers.M2M100Tokenizer) | |
| * [`._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)`](#module_tokenizers.M2M100Tokenizer+_build_translation_inputs) ⇒ <code>Object</code> | |
| * [.WhisperTokenizer](#module_tokenizers.WhisperTokenizer) ⇐ <code>PreTrainedTokenizer</code> | |
| * [`._decode_asr(sequences, options)`](#module_tokenizers.WhisperTokenizer+_decode_asr) ⇒ <code>*</code> | |
| * [`.decode()`](#module_tokenizers.WhisperTokenizer+decode) : <code>*</code> | |
| * [.MarianTokenizer](#module_tokenizers.MarianTokenizer) | |
| * [`new MarianTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.MarianTokenizer_new) | |
| * [`._encode_text(text)`](#module_tokenizers.MarianTokenizer+_encode_text) ⇒ <code>Array</code> | |
| * [.AutoTokenizer](#module_tokenizers.AutoTokenizer) | |
| * [`new AutoTokenizer()`](#new_module_tokenizers.AutoTokenizer_new) | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.AutoTokenizer.from_pretrained) ⇒ <code>Promise.<PreTrainedTokenizer></code> | |
| * [`.is_chinese_char(cp)`](#module_tokenizers.is_chinese_char) ⇒ <code>boolean</code> | |
| * _inner_ | |
| * [~AddedToken](#module_tokenizers..AddedToken) | |
| * [`new AddedToken(config)`](#new_module_tokenizers..AddedToken_new) | |
| * [~WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) ⇐ <code>TokenizerModel</code> | |
| * [`new WordPieceTokenizer(config)`](#new_module_tokenizers..WordPieceTokenizer_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..WordPieceTokenizer+tokens_to_ids) : <code>Map.<string, number></code> | |
| * [`.unk_token_id`](#module_tokenizers..WordPieceTokenizer+unk_token_id) : <code>number</code> | |
| * [`.unk_token`](#module_tokenizers..WordPieceTokenizer+unk_token) : <code>string</code> | |
| * [`.max_input_chars_per_word`](#module_tokenizers..WordPieceTokenizer+max_input_chars_per_word) : <code>number</code> | |
| * [`.vocab`](#module_tokenizers..WordPieceTokenizer+vocab) : <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers..WordPieceTokenizer+encode) ⇒ <code>Array.<string></code> | |
| * [~Unigram](#module_tokenizers..Unigram) ⇐ <code>TokenizerModel</code> | |
| * [`new Unigram(config, moreConfig)`](#new_module_tokenizers..Unigram_new) | |
| * [`.scores`](#module_tokenizers..Unigram+scores) : <code>Array.<number></code> | |
| * [`.populateNodes(lattice)`](#module_tokenizers..Unigram+populateNodes) | |
| * [`.tokenize(normalized)`](#module_tokenizers..Unigram+tokenize) ⇒ <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers..Unigram+encode) ⇒ <code>Array.<string></code> | |
| * [~BPE](#module_tokenizers..BPE) ⇐ <code>TokenizerModel</code> | |
| * [`new BPE(config)`](#new_module_tokenizers..BPE_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..BPE+tokens_to_ids) : <code>Map.<string, number></code> | |
| * [`.merges`](#module_tokenizers..BPE+merges) : <code>*</code> | |
| * [`.config.merges`](#module_tokenizers..BPE+merges.config.merges) : <code>*</code> | |
| * [`.max_length_to_cache`](#module_tokenizers..BPE+max_length_to_cache) | |
| * [`.cache_capacity`](#module_tokenizers..BPE+cache_capacity) | |
| * [`.clear_cache()`](#module_tokenizers..BPE+clear_cache) | |
| * [`.bpe(token)`](#module_tokenizers..BPE+bpe) ⇒ <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers..BPE+encode) ⇒ <code>Array.<string></code> | |
| * [~LegacyTokenizerModel](#module_tokenizers..LegacyTokenizerModel) | |
| * [`new LegacyTokenizerModel(config, moreConfig)`](#new_module_tokenizers..LegacyTokenizerModel_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..LegacyTokenizerModel+tokens_to_ids) : <code>Map.<string, number></code> | |
| * *[~Normalizer](#module_tokenizers..Normalizer)* | |
| * *[`new Normalizer(config)`](#new_module_tokenizers..Normalizer_new)* | |
| * _instance_ | |
| * **[`.normalize(text)`](#module_tokenizers..Normalizer+normalize) ⇒ <code>string</code>** | |
| * *[`._call(text)`](#module_tokenizers..Normalizer+_call) ⇒ <code>string</code>* | |
| * _static_ | |
| * *[`.fromConfig(config)`](#module_tokenizers..Normalizer.fromConfig) ⇒ <code>Normalizer</code>* | |
| * [~Replace](#module_tokenizers..Replace) ⇐ <code>Normalizer</code> | |
| * [`.normalize(text)`](#module_tokenizers..Replace+normalize) ⇒ <code>string</code> | |
| * *[~UnicodeNormalizer](#module_tokenizers..UnicodeNormalizer) ⇐ <code>Normalizer</code>* | |
| * *[`.form`](#module_tokenizers..UnicodeNormalizer+form) : <code>string</code>* | |
| * *[`.normalize(text)`](#module_tokenizers..UnicodeNormalizer+normalize) ⇒ <code>string</code>* | |
| * [~NFC](#module_tokenizers..NFC) ⇐ <code>UnicodeNormalizer</code> | |
| * [~NFD](#module_tokenizers..NFD) ⇐ <code>UnicodeNormalizer</code> | |
| * [~NFKC](#module_tokenizers..NFKC) ⇐ <code>UnicodeNormalizer</code> | |
| * [~NFKD](#module_tokenizers..NFKD) ⇐ <code>UnicodeNormalizer</code> | |
| * [~StripNormalizer](#module_tokenizers..StripNormalizer) | |
| * [`.normalize(text)`](#module_tokenizers..StripNormalizer+normalize) ⇒ <code>string</code> | |
| * [~StripAccents](#module_tokenizers..StripAccents) ⇐ <code>Normalizer</code> | |
| * [`.normalize(text)`](#module_tokenizers..StripAccents+normalize) ⇒ <code>string</code> | |
| * [~Lowercase](#module_tokenizers..Lowercase) ⇐ <code>Normalizer</code> | |
| * [`.normalize(text)`](#module_tokenizers..Lowercase+normalize) ⇒ <code>string</code> | |
| * [~Prepend](#module_tokenizers..Prepend) ⇐ <code>Normalizer</code> | |
| * [`.normalize(text)`](#module_tokenizers..Prepend+normalize) ⇒ <code>string</code> | |
| * [~NormalizerSequence](#module_tokenizers..NormalizerSequence) ⇐ <code>Normalizer</code> | |
| * [`new NormalizerSequence(config)`](#new_module_tokenizers..NormalizerSequence_new) | |
| * [`.normalize(text)`](#module_tokenizers..NormalizerSequence+normalize) ⇒ <code>string</code> | |
| * [~BertNormalizer](#module_tokenizers..BertNormalizer) ⇐ <code>Normalizer</code> | |
| * [`._tokenize_chinese_chars(text)`](#module_tokenizers..BertNormalizer+_tokenize_chinese_chars) ⇒ <code>string</code> | |
| * [`.stripAccents(text)`](#module_tokenizers..BertNormalizer+stripAccents) ⇒ <code>string</code> | |
| * [`.normalize(text)`](#module_tokenizers..BertNormalizer+normalize) ⇒ <code>string</code> | |
| * [~PreTokenizer](#module_tokenizers..PreTokenizer) ⇐ [<code>Callable</code>](#Callable) | |
| * _instance_ | |
| * *[`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code>* | |
| * [`.pre_tokenize(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize) ⇒ <code>Array.<string></code> | |
| * [`._call(text, [options])`](#module_tokenizers..PreTokenizer+_call) ⇒ <code>Array.<string></code> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PreTokenizer.fromConfig) ⇒ <code>PreTokenizer</code> | |
| * [~BertPreTokenizer](#module_tokenizers..BertPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new BertPreTokenizer(config)`](#new_module_tokenizers..BertPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..BertPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new ByteLevelPreTokenizer(config)`](#new_module_tokenizers..ByteLevelPreTokenizer_new) | |
| * [`.add_prefix_space`](#module_tokenizers..ByteLevelPreTokenizer+add_prefix_space) : <code>boolean</code> | |
| * [`.trim_offsets`](#module_tokenizers..ByteLevelPreTokenizer+trim_offsets) : <code>boolean</code> | |
| * [`.use_regex`](#module_tokenizers..ByteLevelPreTokenizer+use_regex) : <code>boolean</code> | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ByteLevelPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~SplitPreTokenizer](#module_tokenizers..SplitPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new SplitPreTokenizer(config)`](#new_module_tokenizers..SplitPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..SplitPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~PunctuationPreTokenizer](#module_tokenizers..PunctuationPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new PunctuationPreTokenizer(config)`](#new_module_tokenizers..PunctuationPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PunctuationPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~DigitsPreTokenizer](#module_tokenizers..DigitsPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new DigitsPreTokenizer(config)`](#new_module_tokenizers..DigitsPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..DigitsPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~PostProcessor](#module_tokenizers..PostProcessor) ⇐ [<code>Callable</code>](#Callable) | |
| * [`new PostProcessor(config)`](#new_module_tokenizers..PostProcessor_new) | |
| * _instance_ | |
| * [`.post_process(tokens, ...args)`](#module_tokenizers..PostProcessor+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * [`._call(tokens, ...args)`](#module_tokenizers..PostProcessor+_call) ⇒ <code>PostProcessedOutput</code> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PostProcessor.fromConfig) ⇒ <code>PostProcessor</code> | |
| * [~BertProcessing](#module_tokenizers..BertProcessing) | |
| * [`new BertProcessing(config)`](#new_module_tokenizers..BertProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..BertProcessing+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * [~TemplateProcessing](#module_tokenizers..TemplateProcessing) ⇐ <code>PostProcessor</code> | |
| * [`new TemplateProcessing(config)`](#new_module_tokenizers..TemplateProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..TemplateProcessing+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * [~ByteLevelPostProcessor](#module_tokenizers..ByteLevelPostProcessor) ⇐ <code>PostProcessor</code> | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..ByteLevelPostProcessor+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * [~PostProcessorSequence](#module_tokenizers..PostProcessorSequence) | |
| * [`new PostProcessorSequence(config)`](#new_module_tokenizers..PostProcessorSequence_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..PostProcessorSequence+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * [~Decoder](#module_tokenizers..Decoder) ⇐ [<code>Callable</code>](#Callable) | |
| * [`new Decoder(config)`](#new_module_tokenizers..Decoder_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers..Decoder+added_tokens) : <code>Array.<AddedToken></code> | |
| * [`._call(tokens)`](#module_tokenizers..Decoder+_call) ⇒ <code>string</code> | |
| * [`.decode(tokens)`](#module_tokenizers..Decoder+decode) ⇒ <code>string</code> | |
| * [`.decode_chain(tokens)`](#module_tokenizers..Decoder+decode_chain) ⇒ <code>Array.<string></code> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..Decoder.fromConfig) ⇒ <code>Decoder</code> | |
| * [~FuseDecoder](#module_tokenizers..FuseDecoder) | |
| * [`.decode_chain()`](#module_tokenizers..FuseDecoder+decode_chain) : <code>*</code> | |
| * [~WordPieceDecoder](#module_tokenizers..WordPieceDecoder) ⇐ <code>Decoder</code> | |
| * [`new WordPieceDecoder(config)`](#new_module_tokenizers..WordPieceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..WordPieceDecoder+decode_chain) : <code>*</code> | |
| * [~ByteLevelDecoder](#module_tokenizers..ByteLevelDecoder) ⇐ <code>Decoder</code> | |
| * [`new ByteLevelDecoder(config)`](#new_module_tokenizers..ByteLevelDecoder_new) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..ByteLevelDecoder+convert_tokens_to_string) ⇒ <code>string</code> | |
| * [`.decode_chain()`](#module_tokenizers..ByteLevelDecoder+decode_chain) : <code>*</code> | |
| * [~CTCDecoder](#module_tokenizers..CTCDecoder) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..CTCDecoder+convert_tokens_to_string) ⇒ <code>string</code> | |
| * [`.decode_chain()`](#module_tokenizers..CTCDecoder+decode_chain) : <code>*</code> | |
| * [~DecoderSequence](#module_tokenizers..DecoderSequence) ⇐ <code>Decoder</code> | |
| * [`new DecoderSequence(config)`](#new_module_tokenizers..DecoderSequence_new) | |
| * [`.decode_chain()`](#module_tokenizers..DecoderSequence+decode_chain) : <code>*</code> | |
| * [~MetaspacePreTokenizer](#module_tokenizers..MetaspacePreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new MetaspacePreTokenizer(config)`](#new_module_tokenizers..MetaspacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..MetaspacePreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~MetaspaceDecoder](#module_tokenizers..MetaspaceDecoder) ⇐ <code>Decoder</code> | |
| * [`new MetaspaceDecoder(config)`](#new_module_tokenizers..MetaspaceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..MetaspaceDecoder+decode_chain) : <code>*</code> | |
| * [~Precompiled](#module_tokenizers..Precompiled) ⇐ <code>Normalizer</code> | |
| * [`new Precompiled(config)`](#new_module_tokenizers..Precompiled_new) | |
| * [`.normalize(text)`](#module_tokenizers..Precompiled+normalize) ⇒ <code>string</code> | |
| * [~PreTokenizerSequence](#module_tokenizers..PreTokenizerSequence) ⇐ <code>PreTokenizer</code> | |
| * [`new PreTokenizerSequence(config)`](#new_module_tokenizers..PreTokenizerSequence_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizerSequence+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~WhitespacePreTokenizer](#module_tokenizers..WhitespacePreTokenizer) | |
| * [`new WhitespacePreTokenizer(config)`](#new_module_tokenizers..WhitespacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespacePreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~WhitespaceSplit](#module_tokenizers..WhitespaceSplit) ⇐ <code>PreTokenizer</code> | |
| * [`new WhitespaceSplit(config)`](#new_module_tokenizers..WhitespaceSplit_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespaceSplit+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [~ReplacePreTokenizer](#module_tokenizers..ReplacePreTokenizer) | |
| * [`new ReplacePreTokenizer(config)`](#new_module_tokenizers..ReplacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ReplacePreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * [`~BYTES_TO_UNICODE`](#module_tokenizers..BYTES_TO_UNICODE) ⇒ <code>Object</code> | |
| * [`~loadTokenizer(pretrained_model_name_or_path, options)`](#module_tokenizers..loadTokenizer) ⇒ <code>Promise.<Array<any>></code> | |
| * [`~regexSplit(text, regex)`](#module_tokenizers..regexSplit) ⇒ <code>Array.<string></code> | |
| * [`~createPattern(pattern, invert)`](#module_tokenizers..createPattern) ⇒ <code>RegExp</code> | <code>null</code> | |
| * [`~objectToMap(obj)`](#module_tokenizers..objectToMap) ⇒ <code>Map.<string, any></code> | |
| * [`~prepareTensorForDecode(tensor)`](#module_tokenizers..prepareTensorForDecode) ⇒ <code>Array.<number></code> | |
| * [`~clean_up_tokenization(text)`](#module_tokenizers..clean_up_tokenization) ⇒ <code>string</code> | |
| * [`~remove_accents(text)`](#module_tokenizers..remove_accents) ⇒ <code>string</code> | |
| * [`~lowercase_and_remove_accent(text)`](#module_tokenizers..lowercase_and_remove_accent) ⇒ <code>string</code> | |
| * [`~whitespace_split(text)`](#module_tokenizers..whitespace_split) ⇒ <code>Array.<string></code> | |
| * [`~PretrainedTokenizerOptions`](#module_tokenizers..PretrainedTokenizerOptions) : <code>Object</code> | |
| * [`~BPENode`](#module_tokenizers..BPENode) : <code>Object</code> | |
| * [`~SplitDelimiterBehavior`](#module_tokenizers..SplitDelimiterBehavior) : <code>'removed'</code> | <code>'isolated'</code> | <code>'mergedWithPrevious'</code> | <code>'mergedWithNext'</code> | <code>'contiguous'</code> | |
| * [`~PostProcessedOutput`](#module_tokenizers..PostProcessedOutput) : <code>Object</code> | |
| * [`~EncodingSingle`](#module_tokenizers..EncodingSingle) : <code>Object</code> | |
| * [`~Message`](#module_tokenizers..Message) : <code>Object</code> | |
| * [`~BatchEncoding`](#module_tokenizers..BatchEncoding) : <code>Array<number></code> | <code>Array<Array<number>></code> | [<code>Tensor</code>](#Tensor) | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel" class="group"></a> | |
| ## tokenizers.TokenizerModel ⇐ [<code>Callable</code>](#Callable) | |
| Abstract base class for tokenizer models. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: [<code>Callable</code>](#Callable) | |
| * [.TokenizerModel](#module_tokenizers.TokenizerModel) ⇐ [<code>Callable</code>](#Callable) | |
| * [`new TokenizerModel(config)`](#new_module_tokenizers.TokenizerModel_new) | |
| * _instance_ | |
| * [`.vocab`](#module_tokenizers.TokenizerModel+vocab) : <code>Array.<string></code> | |
| * [`.tokens_to_ids`](#module_tokenizers.TokenizerModel+tokens_to_ids) : <code>Map.<string, number></code> | |
| * [`.fuse_unk`](#module_tokenizers.TokenizerModel+fuse_unk) : <code>boolean</code> | |
| * [`._call(tokens)`](#module_tokenizers.TokenizerModel+_call) ⇒ <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers.TokenizerModel+encode) ⇒ <code>Array.<string></code> | |
| * [`.convert_tokens_to_ids(tokens)`](#module_tokenizers.TokenizerModel+convert_tokens_to_ids) ⇒ <code>Array.<number></code> | |
| * [`.convert_ids_to_tokens(ids)`](#module_tokenizers.TokenizerModel+convert_ids_to_tokens) ⇒ <code>Array.<string></code> | |
| * _static_ | |
| * [`.fromConfig(config, ...args)`](#module_tokenizers.TokenizerModel.fromConfig) ⇒ <code>TokenizerModel</code> | |
| * * * | |
| <a id="new_module_tokenizers.TokenizerModel_new" class="group"></a> | |
| ### `new TokenizerModel(config)` | |
| Creates a new instance of TokenizerModel. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the TokenizerModel.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+vocab" class="group"></a> | |
| ### `tokenizerModel.vocab` : <code>Array.<string></code> | |
| **Kind**: instance property of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+tokens_to_ids" class="group"></a> | |
| ### `tokenizerModel.tokens_to_ids` : <code>Map.<string, number></code> | |
| A mapping of tokens to ids. | |
| **Kind**: instance property of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+fuse_unk" class="group"></a> | |
| ### `tokenizerModel.fuse_unk` : <code>boolean</code> | |
| Whether to fuse unknown tokens when encoding. Defaults to false. | |
| **Kind**: instance property of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+_call" class="group"></a> | |
| ### `tokenizerModel._call(tokens)` ⇒ <code>Array.<string></code> | |
| Internal function to call the TokenizerModel instance. | |
| **Kind**: instance method of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| **Overrides**: [<code>_call</code>](#Callable+_call) | |
| **Returns**: <code>Array.<string></code> - The encoded tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The tokens to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+encode" class="group"></a> | |
| ### `tokenizerModel.encode(tokens)` ⇒ <code>Array.<string></code> | |
| Encodes a list of tokens into a list of encoded tokens. The result is an array of strings; use `convert_tokens_to_ids` to obtain token IDs. | |
| **Kind**: instance method of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| **Returns**: <code>Array.<string></code> - The encoded tokens. | |
| **Throws**: | |
| - Will throw an error if not implemented in a subclass. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The tokens to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+convert_tokens_to_ids" class="group"></a> | |
| ### `tokenizerModel.convert_tokens_to_ids(tokens)` ⇒ <code>Array.<number></code> | |
| Converts a list of tokens into a list of token IDs. | |
| **Kind**: instance method of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| **Returns**: <code>Array.<number></code> - The converted token IDs. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The tokens to convert.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.TokenizerModel+convert_ids_to_tokens" class="group"></a> | |
| ### `tokenizerModel.convert_ids_to_tokens(ids)` ⇒ <code>Array.<string></code> | |
| Converts a list of token IDs into a list of tokens. | |
| **Kind**: instance method of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| **Returns**: <code>Array.<string></code> - The converted tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>ids</td><td><code>Array<number></code> | <code>Array<bigint></code></td><td><p>The token IDs to convert.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
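The two conversion methods above are inverses over a vocabulary. A minimal sketch of that round trip follows. It is not the library's implementation: the vocabulary, the helper names `convertTokensToIds` and `convertIdsToTokens`, and the fallback to an unknown token for missing entries are all assumptions made for illustration.

```javascript
// Hypothetical vocabulary for illustration only; a real one is loaded from the tokenizer files.
const vocab = ['[UNK]', 'i', 'love', 'transformers', '!'];
const tokensToIds = new Map(vocab.map((token, id) => [token, id]));
const UNK_ID = 0;
const UNK_TOKEN = vocab[UNK_ID];

// Tokens missing from the map fall back to the unknown-token ID (an assumption of this sketch).
function convertTokensToIds(tokens) {
  return tokens.map((token) => tokensToIds.get(token) ?? UNK_ID);
}

// Accepts numbers or bigints, mirroring the `Array<number> | Array<bigint>` parameter type.
function convertIdsToTokens(ids) {
  return ids.map((id) => vocab[Number(id)] ?? UNK_TOKEN);
}

const ids = convertTokensToIds(['i', 'love', 'tokenizers']); // 'tokenizers' is not in the vocabulary
const tokens = convertIdsToTokens([1n, 2n, 3n]);
console.log(ids, tokens);
```

Converting bigints through `Number(id)` is only safe for IDs within the safe-integer range, which vocabulary indices normally are.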
| * * * | |
| <a id="module_tokenizers.TokenizerModel.fromConfig" class="group"></a> | |
| ### `TokenizerModel.fromConfig(config, ...args)` ⇒ <code>TokenizerModel</code> | |
| Instantiates a new TokenizerModel instance based on the configuration object provided. | |
| **Kind**: static method of [<code>TokenizerModel</code>](#module_tokenizers.TokenizerModel) | |
| **Returns**: <code>TokenizerModel</code> - A new instance of a TokenizerModel. | |
| **Throws**: | |
| - Will throw an error if the TokenizerModel type in the config is not recognized. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the TokenizerModel.</p> | |
| </td> | |
| </tr><tr> | |
| <td>...args</td><td><code>*</code></td><td><p>Optional arguments to pass to the specific TokenizerModel constructor.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
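A factory such as `fromConfig` typically selects a subclass from a field in the configuration and rejects unrecognized values. The sketch below shows that pattern under stated assumptions: the classes `SketchModel`, `SketchWordPiece`, and `SketchBPE`, the function `modelFromConfig`, and the use of a `type` key are hypothetical and are not taken from the library source.

```javascript
// Hypothetical stand-in classes; they only record the config they were given.
class SketchModel {
  constructor(config, ...args) {
    this.config = config;
    this.extraArgs = args;
  }
}
class SketchWordPiece extends SketchModel {}
class SketchBPE extends SketchModel {}

// Registry mapping a config `type` value (assumed key name) to a class.
const MODEL_TYPES = new Map([
  ['WordPiece', SketchWordPiece],
  ['BPE', SketchBPE],
]);

function modelFromConfig(config, ...args) {
  const ModelClass = MODEL_TYPES.get(config.type);
  if (!ModelClass) {
    // Mirrors the documented behavior of throwing on an unrecognized type.
    throw new Error(`Unknown TokenizerModel type: ${config.type}`);
  }
  // Extra arguments are forwarded unchanged to the selected constructor.
  return new ModelClass(config, ...args);
}

const model = modelFromConfig({ type: 'BPE' }, { some_option: true });
console.log(model instanceof SketchBPE);
```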
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer" class="group"></a> | |
| ## tokenizers.PreTrainedTokenizer | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [.PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * [`new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.PreTrainedTokenizer_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers.PreTrainedTokenizer+added_tokens) : <code>Array.<AddedToken></code> | |
| * [`.added_tokens_map`](#module_tokenizers.PreTrainedTokenizer+added_tokens_map) : <code>Map.<string, AddedToken></code> | |
| * [`.remove_space`](#module_tokenizers.PreTrainedTokenizer+remove_space) : <code>boolean</code> | |
| * [`._call(text, options)`](#module_tokenizers.PreTrainedTokenizer+_call) ⇒ <code>BatchEncoding</code> | |
| * [`._encode_text(text)`](#module_tokenizers.PreTrainedTokenizer+_encode_text) ⇒ <code>Array<string></code> | <code>null</code> | |
| * [`._tokenize_helper(text, options)`](#module_tokenizers.PreTrainedTokenizer+_tokenize_helper) ⇒ <code>*</code> | |
| * [`.tokenize(text, options)`](#module_tokenizers.PreTrainedTokenizer+tokenize) ⇒ <code>Array.<string></code> | |
| * [`.encode(text, options)`](#module_tokenizers.PreTrainedTokenizer+encode) ⇒ <code>Array.<number></code> | |
| * [`.batch_decode(batch, decode_args)`](#module_tokenizers.PreTrainedTokenizer+batch_decode) ⇒ <code>Array.<string></code> | |
| * [`.decode(token_ids, [decode_args])`](#module_tokenizers.PreTrainedTokenizer+decode) ⇒ <code>string</code> | |
| * [`.decode_single(token_ids, decode_args)`](#module_tokenizers.PreTrainedTokenizer+decode_single) ⇒ <code>string</code> | |
| * [`.get_chat_template(options)`](#module_tokenizers.PreTrainedTokenizer+get_chat_template) ⇒ <code>string</code> | |
| * [`.apply_chat_template(conversation, options)`](#module_tokenizers.PreTrainedTokenizer+apply_chat_template) ⇒ <code>string</code> | [<code>Tensor</code>](#Tensor) | <code>Array<number></code> | <code>Array<Array<number>></code> | <code>BatchEncoding</code> | |
| * _static_ | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.PreTrainedTokenizer.from_pretrained) ⇒ <code>Promise.<PreTrainedTokenizer></code> | |
| * * * | |
| <a id="new_module_tokenizers.PreTrainedTokenizer_new" class="group"></a> | |
| ### `new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)` | |
| Create a new PreTrainedTokenizer instance. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokenizerJSON</td><td><code>Object</code></td><td><p>The JSON of the tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>tokenizerConfig</td><td><code>Object</code></td><td><p>The config of the tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+added_tokens" class="group"></a> | |
| ### `preTrainedTokenizer.added_tokens` : <code>Array.<AddedToken></code> | |
| **Kind**: instance property of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+added_tokens_map" class="group"></a> | |
| ### `preTrainedTokenizer.added_tokens_map` : <code>Map.<string, AddedToken></code> | |
| **Kind**: instance property of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+remove_space" class="group"></a> | |
| ### `preTrainedTokenizer.remove_space` : <code>boolean</code> | |
| Whether or not to strip the text when tokenizing (removing excess spaces before and after the string). | |
| **Kind**: instance property of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+_call" class="group"></a> | |
| ### `preTrainedTokenizer._call(text, options)` ⇒ <code>BatchEncoding</code> | |
| Encode/tokenize the given text(s). | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>BatchEncoding</code> - Object to be passed to the model. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code> | <code>Array<string></code></td><td></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>An optional object containing the following properties:</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.text_pair]</td><td><code>string</code> | <code>Array<string></code></td><td><code>null</code></td><td><p>Optional second sequence to be encoded. If set, must be the same type as text.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.padding]</td><td><code>boolean</code> | <code>'max_length'</code></td><td><code>false</code></td><td><p>Whether to pad the input sequences.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.add_special_tokens]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether or not to add the special tokens associated with the corresponding model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.truncation]</td><td><code>boolean</code></td><td></td><td><p>Whether to truncate the input sequences.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.max_length]</td><td><code>number</code></td><td></td><td><p>Maximum length of the returned list and, optionally, the padding length.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.return_tensor]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to return the results as Tensors or arrays.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.return_token_type_ids]</td><td><code>boolean</code></td><td></td><td><p>Whether to return the token type ids.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
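The interaction between `padding`, `truncation` and `max_length` can be easier to follow with a small model of it. The sketch below is a hypothetical helper (`padAndTruncate` and `pad_token_id` are illustrative names, not part of the library) and assumes right-side padding; the tokenizer's actual padding side and pad token depend on its configuration.

```javascript
// Hypothetical helper, NOT part of @huggingface/transformers: it only models how
// `padding`, `truncation` and `max_length` are commonly interpreted.
function padAndTruncate(batch, { padding = false, truncation = false, max_length = null, pad_token_id = 0 } = {}) {
    // 1. Truncate each sequence to `max_length` when requested.
    const sequences = batch.map((ids) =>
        truncation && max_length !== null ? ids.slice(0, max_length) : [...ids],
    );

    // 2. Work out the target length for padding, if any.
    let target = null;
    if (padding === 'max_length' && max_length !== null) {
        target = max_length;                                      // pad to a fixed length
    } else if (padding === true) {
        target = Math.max(...sequences.map((ids) => ids.length)); // pad to the longest sequence
    }

    // 3. Right-pad (assumption) and build the attention mask (1 = real token, 0 = padding).
    const padCount = (ids) => (target === null ? 0 : Math.max(0, target - ids.length));
    const input_ids = sequences.map((ids) => [...ids, ...Array(padCount(ids)).fill(pad_token_id)]);
    const attention_mask = sequences.map((ids) => [...ids.map(() => 1), ...Array(padCount(ids)).fill(0)]);
    return { input_ids, attention_mask };
}

const padded = padAndTruncate([[5, 6, 7, 8], [9]], { padding: true, truncation: true, max_length: 3 });
console.log(padded.input_ids);      // [[5, 6, 7], [9, 0, 0]]
console.log(padded.attention_mask); // [[1, 1, 1], [1, 0, 0]]
```

With `padding: 'max_length'` the same call would pad every row to `max_length` instead of to the longest row in the batch.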
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+_encode_text" class="group"></a> | |
| ### `preTrainedTokenizer._encode_text(text)` ⇒ <code>Array<string></code> | <code>null</code> | |
| Encodes a single text using the preprocessor pipeline of the tokenizer. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>Array<string></code> | <code>null</code> - The encoded tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code> | <code>null</code></td><td><p>The text to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+_tokenize_helper" class="group"></a> | |
| ### `preTrainedTokenizer._tokenize_helper(text, options)` ⇒ <code>*</code> | |
| Internal helper function to tokenize a text, and optionally a pair of texts. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>*</code> - An object containing the tokens and optionally the token type IDs. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>An optional object containing the following properties:</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.pair]</td><td><code>string</code></td><td><code>null</code></td><td><p>The optional second text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.add_special_tokens]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether or not to add the special tokens associated with the corresponding model.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+tokenize" class="group"></a> | |
| ### `preTrainedTokenizer.tokenize(text, options)` ⇒ <code>Array.<string></code> | |
| Converts a string into a sequence of tokens. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>Array.<string></code> - The list of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td></td><td><p>The sequence to be encoded.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>An optional object containing the following properties:</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.pair]</td><td><code>string</code></td><td></td><td><p>A second sequence to be encoded with the first.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.add_special_tokens]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether or not to add the special tokens associated with the corresponding model.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+encode" class="group"></a> | |
| ### `preTrainedTokenizer.encode(text, options)` ⇒ <code>Array.<number></code> | |
| Encodes a single text or a pair of texts using the model's tokenizer. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>Array.<number></code> - An array of token IDs representing the encoded text(s). | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td></td><td><p>The text to encode.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>An optional object containing the following properties:</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.text_pair]</td><td><code>string</code></td><td><code>null</code></td><td><p>The optional second text to encode.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.add_special_tokens]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether or not to add the special tokens associated with the corresponding model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.return_token_type_ids]</td><td><code>boolean</code></td><td></td><td><p>Whether to return token_type_ids.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+batch_decode" class="group"></a> | |
| ### `preTrainedTokenizer.batch_decode(batch, decode_args)` ⇒ <code>Array.<string></code> | |
| Decode a batch of tokenized sequences. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>Array.<string></code> - List of decoded sequences. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>batch</td><td><code>Array<Array<number>></code> | <code><a href="#Tensor">Tensor</a></code></td><td><p>List/Tensor of tokenized input sequences.</p> | |
| </td> | |
| </tr><tr> | |
| <td>decode_args</td><td><code>Object</code></td><td><p>(Optional) Object with decoding arguments.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
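Conceptually, batch decoding applies single-sequence decoding to each row with the same arguments. The following is a minimal sketch of that idea using a toy vocabulary; `decodeToy`, `batchDecodeToy` and all ids are made up for illustration and do not reflect the library's internals.

```javascript
// Toy vocabulary and special tokens; every id here is invented for illustration.
const idToToken = new Map([[0, '[CLS]'], [1, '[SEP]'], [2, 'hello'], [3, 'world'], [4, 'again']]);
const specialIds = new Set([0, 1]);

// Hypothetical stand-in for `decode`: maps ids to tokens and joins them with spaces.
function decodeToy(tokenIds, { skip_special_tokens = false } = {}) {
    return tokenIds
        .map((id) => Number(id)) // accept bigint ids as well as numbers
        .filter((id) => !(skip_special_tokens && specialIds.has(id)))
        .map((id) => idToToken.get(id) ?? '[UNK]')
        .join(' ');
}

// Batch decoding: decode every row with the same decoding arguments.
function batchDecodeToy(batch, decodeArgs = {}) {
    return batch.map((ids) => decodeToy(ids, decodeArgs));
}

console.log(batchDecodeToy([[0, 2, 3, 1], [0, 2, 4, 1]], { skip_special_tokens: true }));
// [ 'hello world', 'hello again' ]
```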
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+decode" class="group"></a> | |
| ### `preTrainedTokenizer.decode(token_ids, [decode_args])` ⇒ <code>string</code> | |
| Decodes a sequence of token IDs back to a string. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>string</code> - The decoded string. | |
| **Throws**: | |
| - <code>Error</code> If `token_ids` is not a non-empty array of integers. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>token_ids</td><td><code>Array<number></code> | <code>Array<bigint></code> | <code><a href="#Tensor">Tensor</a></code></td><td></td><td><p>List/Tensor of token IDs to decode.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[decode_args]</td><td><code>Object</code></td><td><code>{}</code></td><td></td> | |
| </tr><tr> | |
| <td>[decode_args.skip_special_tokens]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>If true, special tokens are removed from the output string.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[decode_args.clean_up_tokenization_spaces]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>If true, spaces before punctuation and abbreviated forms are removed.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
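To make the effect of `clean_up_tokenization_spaces` concrete, here is a rough approximation of such a clean-up step. `cleanUpTokenizationSketch` is a hypothetical function written for this page; the exact set of rules the library applies may differ.

```javascript
// Hypothetical approximation of the clean-up step; the library's exact rules may differ.
function cleanUpTokenizationSketch(text) {
    return text
        .replace(/ ([.?!,])/g, '$1')             // "hello ," -> "hello,"
        .replace(/ (n't|'m|'s|'ve|'re)/g, '$1'); // "do n't" -> "don't", "it 's" -> "it's"
}

console.log(cleanUpTokenizationSketch("hello , world !"));             // "hello, world!"
console.log(cleanUpTokenizationSketch("i do n't think it 's done .")); // "i don't think it's done."
```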
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+decode_single" class="group"></a> | |
| ### `preTrainedTokenizer.decode_single(token_ids, decode_args)` ⇒ <code>string</code> | |
| Decode a single list of token ids to a string. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>string</code> - The decoded string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>token_ids</td><td><code>Array<number></code> | <code>Array<bigint></code></td><td></td><td><p>List of token ids to decode</p> | |
| </td> | |
| </tr><tr> | |
| <td>decode_args</td><td><code>Object</code></td><td></td><td><p>Optional arguments for decoding</p> | |
| </td> | |
| </tr><tr> | |
| <td>[decode_args.skip_special_tokens]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether to skip special tokens during decoding</p> | |
| </td> | |
| </tr><tr> | |
| <td>[decode_args.clean_up_tokenization_spaces]</td><td><code>boolean</code></td><td></td><td><p>Whether to clean up tokenization spaces during decoding. | |
| If null, the value is set to <code>this.decoder.cleanup</code> if it exists, falling back to <code>this.clean_up_tokenization_spaces</code> if it exists, falling back to <code>true</code>.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
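The fallback order described for `clean_up_tokenization_spaces` can be written as a chain of nullish-coalescing operators. This is a sketch under the assumption that "if it exists" means "is neither `null` nor `undefined`"; `resolveCleanUp` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper showing the documented fallback order:
// explicit argument -> decoder.cleanup -> tokenizer.clean_up_tokenization_spaces -> true.
function resolveCleanUp(tokenizer, requested = null) {
    return requested
        ?? tokenizer.decoder?.cleanup
        ?? tokenizer.clean_up_tokenization_spaces
        ?? true;
}

console.log(resolveCleanUp({}));                                      // true (final default)
console.log(resolveCleanUp({ decoder: { cleanup: false } }));         // false (from the decoder)
console.log(resolveCleanUp({ clean_up_tokenization_spaces: false })); // false (from the tokenizer)
console.log(resolveCleanUp({ decoder: { cleanup: false } }, true));   // true (explicit argument wins)
```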
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+get_chat_template" class="group"></a> | |
| ### `preTrainedTokenizer.get_chat_template(options)` ⇒ <code>string</code> | |
| Retrieve the chat template string used for tokenizing chat messages. This template is used | |
| internally by the `apply_chat_template` method and can also be used externally to retrieve the model's chat | |
| template for better generation tracking. | |
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>string</code> - The chat template string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>An optional object containing the following properties:</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.chat_template]</td><td><code>string</code></td><td><code>null</code></td><td><p>A Jinja template or the name of a template to use for this conversion. | |
| It is usually not necessary to pass anything to this argument, | |
| as the model's template will be used by default.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.tools]</td><td><code>Array.<Object></code></td><td></td><td><p>A list of tools (callable functions) that will be accessible to the model. If the template does not | |
| support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, | |
| giving the name, description and argument types for the tool. See our | |
| <a href="https://huggingface.co/docs/transformers/main/en/chat_templating#automated-function-conversion-for-tool-use">chat templating guide</a> | |
| for more information.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer+apply_chat_template" class="group"></a> | |
| ### `preTrainedTokenizer.apply_chat_template(conversation, options)` ⇒ <code>string</code> | [<code>Tensor</code>](#Tensor) | <code>Array<number></code> | <code>Array<Array<number>></code> | <code>BatchEncoding</code> | |
| Converts a list of message objects with `"role"` and `"content"` keys to a list of token | |
| ids. This method is intended for use with chat models, and will read the tokenizer's chat_template attribute to | |
| determine the format and control tokens to use when converting. | |
| See [here](https://huggingface.co/docs/transformers/chat_templating) for more information. | |
| **Example:** Applying a chat template to a conversation. | |
| ```javascript | |
| import { AutoTokenizer } from "@huggingface/transformers"; | |
| const tokenizer = await AutoTokenizer.from_pretrained("Xenova/mistral-tokenizer-v1"); | |
| const chat = [ | |
| { "role": "user", "content": "Hello, how are you?" }, | |
| { "role": "assistant", "content": "I'm doing great. How can I help you today?" }, | |
| { "role": "user", "content": "I'd like to show off how chat templating works!" }, | |
| ]; | |
| const text = tokenizer.apply_chat_template(chat, { tokenize: false }); | |
| // "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]" | |
| const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false }); | |
| // [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793] | |
| ``` | |
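To show what a chat template contributes, the sketch below reproduces the string output of the example above with a hand-written function. It is only a stand-in: the real template is a Jinja string stored with the tokenizer, and `renderMistralStyle` and its `bos_token`/`eos_token` defaults are assumptions made for this illustration.

```javascript
// Hand-written stand-in for a Mistral-style chat template (illustration only).
function renderMistralStyle(conversation, { bos_token = '<s>', eos_token = '</s>' } = {}) {
    let text = bos_token;
    for (const { role, content } of conversation) {
        if (role === 'user') {
            text += `[INST] ${content} [/INST]`; // user turns are wrapped in control tokens
        } else if (role === 'assistant') {
            text += `${content}${eos_token} `;   // assistant turns end with the EOS token
        } else {
            throw new Error(`Unsupported role in this sketch: ${role}`);
        }
    }
    return text;
}

const sketchChat = [
    { role: 'user', content: 'Hello, how are you?' },
    { role: 'assistant', content: "I'm doing great. How can I help you today?" },
    { role: 'user', content: "I'd like to show off how chat templating works!" },
];
console.log(renderMistralStyle(sketchChat));
```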
| **Kind**: instance method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>string</code> | [<code>Tensor</code>](#Tensor) | <code>Array<number></code> | <code>Array<Array<number>></code> | <code>BatchEncoding</code> - The tokenized output. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>conversation</td><td><code>Array.<Message></code></td><td></td><td><p>A list of message objects with <code>"role"</code> and <code>"content"</code> keys, | |
| representing the chat history so far.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>An optional object containing the following properties:</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.chat_template]</td><td><code>string</code></td><td><code>null</code></td><td><p>A Jinja template to use for this conversion. If | |
| this is not passed, the model's chat template will be used instead.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.tools]</td><td><code>Array.<Object></code></td><td></td><td><p>A list of tools (callable functions) that will be accessible to the model. If the template does not | |
| support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, | |
| giving the name, description and argument types for the tool. See our | |
| <a href="https://huggingface.co/docs/transformers/main/en/chat_templating#automated-function-conversion-for-tool-use">chat templating guide</a> | |
| for more information.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.documents]</td><td><code>*</code></td><td></td><td><p>A list of objects representing documents that will be accessible to the model if it is performing RAG | |
| (retrieval-augmented generation). If the template does not support RAG, this argument will have no | |
| effect. We recommend that each document be an object containing <code>"title"</code> and <code>"text"</code> keys. Please | |
| see the RAG section of the <a href="https://huggingface.co/docs/transformers/main/en/chat_templating#arguments-for-RAG">chat templating guide</a> | |
| for examples of passing documents with chat templates.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.add_generation_prompt]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether to end the prompt with the token(s) that indicate | |
| the start of an assistant message. This is useful when you want to generate a response from the model. | |
| Note that this argument will be passed to the chat template, and so it must be supported in the | |
| template for this argument to have any effect.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.tokenize]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to tokenize the output. If false, the output will be a string.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.padding]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether to pad sequences to the maximum length. Has no effect if tokenize is false.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.truncation]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether to truncate sequences to the maximum length. Has no effect if tokenize is false.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.max_length]</td><td><code>number</code></td><td></td><td><p>Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is false. | |
| If not specified, the tokenizer's <code>max_length</code> attribute will be used as a default.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.return_tensor]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to return the output as a Tensor or an Array. Has no effect if tokenize is false.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.return_dict]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to return a dictionary with named outputs. Has no effect if tokenize is false.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.tokenizer_kwargs]</td><td><code>Object</code></td><td><code>{}</code></td><td><p>Additional options to pass to the tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
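As an illustration of the `tools` and `documents` options, the sketch below builds an options object in the shapes described above. The tool name, its parameters and the document contents are hypothetical; whether they affect the output depends on the model's chat template.

```javascript
// Hypothetical tool description in JSON Schema form; the name and fields are made up.
const tools = [
    {
        type: 'function',
        function: {
            name: 'get_current_temperature',
            description: 'Get the current temperature at a location.',
            parameters: {
                type: 'object',
                properties: {
                    location: { type: 'string', description: 'City and country, e.g. "Paris, France".' },
                    unit: { type: 'string', enum: ['celsius', 'fahrenheit'] },
                },
                required: ['location'],
            },
        },
    },
];

// Documents for RAG-style templates: plain objects with "title" and "text" keys.
const documents = [
    { title: 'Tokenizers', text: 'Tokenizers prepare textual inputs for a model.' },
];

// Options object that could be passed as the second argument of `apply_chat_template`.
const chatOptions = { tools, documents, add_generation_prompt: true, tokenize: false };
console.log(JSON.stringify(chatOptions, null, 2));
```

It would then be passed as `tokenizer.apply_chat_template(conversation, chatOptions)`. With `tokenize: false` the result is the rendered prompt string rather than token ids.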
| * * * | |
| <a id="module_tokenizers.PreTrainedTokenizer.from_pretrained" class="group"></a> | |
| ### `PreTrainedTokenizer.from_pretrained(pretrained_model_name_or_path, options)` ⇒ <code>Promise.<PreTrainedTokenizer></code> | |
| Loads a pre-trained tokenizer from the given `pretrained_model_name_or_path`. | |
| **Kind**: static method of [<code>PreTrainedTokenizer</code>](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: <code>Promise.<PreTrainedTokenizer></code> - A new instance of the `PreTrainedTokenizer` class. | |
| **Throws**: | |
| - <code>Error</code> Throws an error if the tokenizer.json or tokenizer_config.json files are not found in the `pretrained_model_name_or_path`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>pretrained_model_name_or_path</td><td><code>string</code></td><td><p>The path to the pre-trained tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>PretrainedTokenizerOptions</code></td><td><p>Additional options for loading the tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.BertTokenizer" class="group"></a> | |
| ## tokenizers.BertTokenizer ⇐ <code>PreTrainedTokenizer</code> | |
| BertTokenizer is a class used to tokenize text for BERT models. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTrainedTokenizer</code> | |
| * * * | |
| <a id="module_tokenizers.AlbertTokenizer" class="group"></a> | |
| ## tokenizers.AlbertTokenizer ⇐ <code>PreTrainedTokenizer</code> | |
| AlbertTokenizer is a class used to tokenize text for ALBERT models. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTrainedTokenizer</code> | |
| * * * | |
| <a id="module_tokenizers.NllbTokenizer" class="group"></a> | |
| ## tokenizers.NllbTokenizer | |
| The NllbTokenizer class is used to tokenize text for NLLB ("No Language Left Behind") models. | |
| No Language Left Behind (NLLB) is a first-of-its-kind, AI breakthrough project | |
| that open-sources models capable of delivering high-quality translations directly | |
| between any pair of 200+ languages — including low-resource languages like Asturian, | |
| Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, | |
| regardless of their language preferences. For more information, check out their | |
| [paper](https://huggingface.co/papers/2207.04672). | |
| For a list of supported languages (along with their language codes), see the link below. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **See**: [https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) | |
| * * * | |
| <a id="module_tokenizers.NllbTokenizer+_build_translation_inputs" class="group"></a> | |
| ### `nllbTokenizer._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)` ⇒ <code>Object</code> | |
| Helper function to build translation inputs for an `NllbTokenizer`. | |
| **Kind**: instance method of [<code>NllbTokenizer</code>](#module_tokenizers.NllbTokenizer) | |
| **Returns**: <code>Object</code> - Object to be passed to the model. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>raw_inputs</td><td><code>string</code> | <code>Array<string></code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>tokenizer_options</td><td><code>Object</code></td><td><p>Options to be sent to the tokenizer</p> | |
| </td> | |
| </tr><tr> | |
| <td>generate_kwargs</td><td><code>Object</code></td><td><p>Generation options.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.M2M100Tokenizer" class="group"></a> | |
| ## tokenizers.M2M100Tokenizer | |
| The M2M100Tokenizer class is used to tokenize text for M2M100 ("Many-to-Many") models. | |
| M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many | |
| multilingual translation. It was introduced in this [paper](https://huggingface.co/papers/2010.11125) | |
| and first released in [this](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) repository. | |
| For a list of supported languages (along with their language codes), see the link below. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **See**: [https://huggingface.co/facebook/m2m100_418M#languages-covered](https://huggingface.co/facebook/m2m100_418M#languages-covered) | |
| * * * | |
| <a id="module_tokenizers.M2M100Tokenizer+_build_translation_inputs" class="group"></a> | |
| ### `m2M100Tokenizer._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)` ⇒ <code>Object</code> | |
| Helper function to build translation inputs for an `M2M100Tokenizer`. | |
| **Kind**: instance method of [<code>M2M100Tokenizer</code>](#module_tokenizers.M2M100Tokenizer) | |
| **Returns**: <code>Object</code> - Object to be passed to the model. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>raw_inputs</td><td><code>string</code> | <code>Array<string></code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>tokenizer_options</td><td><code>Object</code></td><td><p>Options to be sent to the tokenizer</p> | |
| </td> | |
| </tr><tr> | |
| <td>generate_kwargs</td><td><code>Object</code></td><td><p>Generation options.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.WhisperTokenizer" class="group"></a> | |
| ## tokenizers.WhisperTokenizer ⇐ <code>PreTrainedTokenizer</code> | |
| WhisperTokenizer is a class used to tokenize text for Whisper models. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTrainedTokenizer</code> | |
| * [.WhisperTokenizer](#module_tokenizers.WhisperTokenizer) ⇐ <code>PreTrainedTokenizer</code> | |
| * [`._decode_asr(sequences, options)`](#module_tokenizers.WhisperTokenizer+_decode_asr) ⇒ <code>*</code> | |
| * [`.decode()`](#module_tokenizers.WhisperTokenizer+decode) : <code>*</code> | |
| * * * | |
| <a id="module_tokenizers.WhisperTokenizer+_decode_asr" class="group"></a> | |
| ### `whisperTokenizer._decode_asr(sequences, options)` ⇒ <code>*</code> | |
| Decodes automatic speech recognition (ASR) sequences. | |
| **Kind**: instance method of [<code>WhisperTokenizer</code>](#module_tokenizers.WhisperTokenizer) | |
| **Returns**: <code>*</code> - The decoded sequences. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>sequences</td><td><code>*</code></td><td><p>The sequences to decode.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td><p>The options to use for decoding.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.WhisperTokenizer+decode" class="group"></a> | |
| ### `whisperTokenizer.decode()` : <code>*</code> | |
| **Kind**: instance method of [<code>WhisperTokenizer</code>](#module_tokenizers.WhisperTokenizer) | |
| * * * | |
| <a id="module_tokenizers.MarianTokenizer" class="group"></a> | |
| ## tokenizers.MarianTokenizer | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Todo** | |
| - This model is not yet supported by Hugging Face's "fast" tokenizers library (https://github.com/huggingface/tokenizers). | |
| Therefore, this implementation (which is based on fast tokenizers) may produce slightly inaccurate results. | |
| * [.MarianTokenizer](#module_tokenizers.MarianTokenizer) | |
| * [`new MarianTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.MarianTokenizer_new) | |
| * [`._encode_text(text)`](#module_tokenizers.MarianTokenizer+_encode_text) ⇒ <code>Array</code> | |
| * * * | |
| <a id="new_module_tokenizers.MarianTokenizer_new" class="group"></a> | |
| ### `new MarianTokenizer(tokenizerJSON, tokenizerConfig)` | |
| Create a new MarianTokenizer instance. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokenizerJSON</td><td><code>Object</code></td><td><p>The JSON of the tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>tokenizerConfig</td><td><code>Object</code></td><td><p>The config of the tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.MarianTokenizer+_encode_text" class="group"></a> | |
| ### `marianTokenizer._encode_text(text)` ⇒ <code>Array</code> | |
| Encodes a single text. Overriding this method is necessary since the language codes | |
| must be removed before encoding with the SentencePiece model. | |
| **Kind**: instance method of [<code>MarianTokenizer</code>](#module_tokenizers.MarianTokenizer) | |
| **Returns**: <code>Array</code> - The encoded tokens. | |
| **See**: https://github.com/huggingface/transformers/blob/12d51db243a00726a548a43cc333390ebae731e3/src/transformers/models/marian/tokenization_marian.py#L204-L213 | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code> | <code>null</code></td><td><p>The text to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers.AutoTokenizer" class="group"></a> | |
| ## tokenizers.AutoTokenizer | |
| Helper class which is used to instantiate pretrained tokenizers with the `from_pretrained` function. | |
| The chosen tokenizer class is determined by the type specified in the tokenizer config. | |
| **Kind**: static class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [.AutoTokenizer](#module_tokenizers.AutoTokenizer) | |
| * [`new AutoTokenizer()`](#new_module_tokenizers.AutoTokenizer_new) | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.AutoTokenizer.from_pretrained) ⇒ <code>Promise.<PreTrainedTokenizer></code> | |
| * * * | |
| <a id="new_module_tokenizers.AutoTokenizer_new" class="group"></a> | |
| ### `new AutoTokenizer()` | |
| **Example** | |
| ```js | |
| const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased'); | |
| ``` | |
| * * * | |
| <a id="module_tokenizers.AutoTokenizer.from_pretrained" class="group"></a> | |
| ### `AutoTokenizer.from_pretrained(pretrained_model_name_or_path, options)` ⇒ <code>Promise.<PreTrainedTokenizer></code> | |
| Instantiate one of the tokenizer classes of the library from a pretrained model. | |
| The tokenizer class to instantiate is selected based on the `tokenizer_class` property of the config object | |
| (either passed as an argument or loaded from `pretrained_model_name_or_path` if possible). | |
| **Kind**: static method of [<code>AutoTokenizer</code>](#module_tokenizers.AutoTokenizer) | |
| **Returns**: <code>Promise.<PreTrainedTokenizer></code> - A new instance of the PreTrainedTokenizer class. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>pretrained_model_name_or_path</td><td><code>string</code></td><td><p>The name or path of the pretrained model. Can be either:</p> | |
| <ul> | |
| <li>A string, the <em>model id</em> of a pretrained tokenizer hosted inside a model repo on huggingface.co. | |
| Valid model ids can be located at the root-level, like <code>bert-base-uncased</code>, or namespaced under a | |
| user or organization name, like <code>dbmdz/bert-base-german-cased</code>.</li> | |
| <li>A path to a <em>directory</em> containing tokenizer files, e.g., <code>./my_model_directory/</code>.</li> | |
| </ul> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>PretrainedTokenizerOptions</code></td><td><p>Additional options for loading the tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
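The selection step can be pictured as a lookup from the `tokenizer_class` string to a class, with a fallback to the base class. The sketch below uses hypothetical stand-in classes; treating a trailing `Fast` in the class name as equivalent to the plain name is an assumption of this sketch, not something stated above.

```javascript
// Hypothetical stand-ins for tokenizer classes; the real classes live in the library.
class PreTrainedTokenizerSketch {}
class BertTokenizerSketch extends PreTrainedTokenizerSketch {}
class WhisperTokenizerSketch extends PreTrainedTokenizerSketch {}

const TOKENIZER_CLASSES = new Map([
    ['BertTokenizer', BertTokenizerSketch],
    ['WhisperTokenizer', WhisperTokenizerSketch],
]);

// Pick a class from `tokenizer_class`, mapping "XxxTokenizerFast" to "XxxTokenizer"
// (assumption) and falling back to the base class for unknown or missing names.
function selectTokenizerClass(tokenizerConfig) {
    const name = tokenizerConfig.tokenizer_class?.replace(/Fast$/, '') ?? 'PreTrainedTokenizer';
    return TOKENIZER_CLASSES.get(name) ?? PreTrainedTokenizerSketch;
}

console.log(selectTokenizerClass({ tokenizer_class: 'BertTokenizerFast' }).name); // "BertTokenizerSketch"
console.log(selectTokenizerClass({}).name);                                       // "PreTrainedTokenizerSketch"
```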
| * * * | |
| <a id="module_tokenizers.is_chinese_char" class="group"></a> | |
| ## `tokenizers.is_chinese_char(cp)` ⇒ <code>boolean</code> | |
| Checks whether the given Unicode codepoint represents a CJK (Chinese, Japanese, or Korean) character. | |
| A "Chinese character" is defined as anything in the CJK Unicode block: | |
| https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) | |
| Note that the CJK Unicode block does NOT contain all Japanese and Korean characters, despite its name. | |
| The modern Korean Hangul alphabet is a different block, as are Japanese Hiragana and Katakana. | |
| Those alphabets are used to write space-separated words, so they are not treated specially | |
| and are handled like all other languages. | |
| **Kind**: static method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>boolean</code> - True if the codepoint represents a CJK character, false otherwise. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>cp</td><td><code>number</code> | <code>bigint</code></td><td><p>The Unicode codepoint to check.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
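As a rough illustration of the range check described above, the following sketch tests a codepoint against the CJK ideograph ranges listed in the original BERT tokenizer. The helper name `isCjkCodepoint` and the exact list of ranges are assumptions made for this example; the library's own implementation may differ.

```javascript
// Hypothetical helper (not the library's implementation): CJK ideograph
// ranges as listed in the original BERT tokenizer.
const CJK_RANGES = [
  [0x4e00, 0x9fff],   // CJK Unified Ideographs
  [0x3400, 0x4dbf],   // Extension A
  [0x20000, 0x2a6df], // Extension B
  [0x2a700, 0x2b73f], // Extension C
  [0x2b740, 0x2b81f], // Extension D
  [0x2b820, 0x2ceaf], // Extension E
  [0xf900, 0xfaff],   // CJK Compatibility Ideographs
  [0x2f800, 0x2fa1f], // CJK Compatibility Ideographs Supplement
];

function isCjkCodepoint(cp) {
  const n = Number(cp); // accepts both number and bigint inputs
  return CJK_RANGES.some(([lo, hi]) => n >= lo && n <= hi);
}

console.log(isCjkCodepoint('\u4e2d'.codePointAt(0))); // true (ideograph)
console.log(isCjkCodepoint('\u3042'.codePointAt(0))); // false (Hiragana)
```

Hiragana and Hangul fall outside every listed range, which matches the note that those scripts are not treated specially.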
| * * * | |
| <a id="module_tokenizers..AddedToken" class="group"></a> | |
| ## tokenizers~AddedToken | |
| Represents a token added by the user on top of the existing model vocabulary. | |
| An `AddedToken` can be configured to specify how it should behave in various situations, such as: | |
| - Whether it should only match single words | |
| - Whether to include any whitespace on its left or right | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * * * | |
| <a id="new_module_tokenizers..AddedToken_new" class="group"></a> | |
| ### `new AddedToken(config)` | |
| Creates a new instance of AddedToken. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td></td><td><p>Added token configuration object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.content</td><td><code>string</code></td><td></td><td><p>The content of the added token.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.id</td><td><code>number</code></td><td></td><td><p>The id of the added token.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.single_word]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether this token must be a single word or can break words.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.lstrip]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether this token should strip whitespaces on its left.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.rstrip]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether this token should strip whitespaces on its right.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.normalized]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether this token should be normalized.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.special]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether this token is special.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
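To make the effect of `lstrip` and `rstrip` more concrete, here is a minimal sketch of one way added tokens could be located in raw text with a regular expression. The helpers `escapeRegExp`, `buildAddedTokensPattern`, and `splitOnAddedTokens` are hypothetical names for this example, and the sketch ignores `single_word`, `normalized`, and `special`; it should not be read as the library's actual matching logic.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds one pattern matching any added token.
// Longer tokens come first so they win over shorter tokens sharing a prefix.
// Assumes a non-empty list of tokens.
function buildAddedTokensPattern(addedTokens) {
  const alternatives = [...addedTokens]
    .sort((a, b) => b.content.length - a.content.length)
    .map((t) => {
      const left = t.lstrip ? '\\s*' : '';  // swallow whitespace on the left
      const right = t.rstrip ? '\\s*' : ''; // swallow whitespace on the right
      return `${left}(${escapeRegExp(t.content)})${right}`;
    });
  return new RegExp(alternatives.join('|'));
}

// Splits text into plain segments and added tokens, dropping empty pieces.
function splitOnAddedTokens(text, addedTokens) {
  return text.split(buildAddedTokensPattern(addedTokens)).filter(Boolean);
}

const addedTokens = [{ content: '<mask>', id: 103, lstrip: true, rstrip: false }];
console.log(splitOnAddedTokens('Hello <mask> world', addedTokens));
// [ 'Hello', '<mask>', ' world' ]
```

With `lstrip: true` the space before `<mask>` is consumed by the match, while the space after it stays with the following segment.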
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer" class="group"></a> | |
| ## tokenizers~WordPieceTokenizer ⇐ <code>TokenizerModel</code> | |
| A subclass of TokenizerModel that uses WordPiece encoding to encode tokens. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>TokenizerModel</code> | |
| * [~WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) ⇐ <code>TokenizerModel</code> | |
| * [`new WordPieceTokenizer(config)`](#new_module_tokenizers..WordPieceTokenizer_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..WordPieceTokenizer+tokens_to_ids) : <code>Map.<string, number></code> | |
| * [`.unk_token_id`](#module_tokenizers..WordPieceTokenizer+unk_token_id) : <code>number</code> | |
| * [`.unk_token`](#module_tokenizers..WordPieceTokenizer+unk_token) : <code>string</code> | |
| * [`.max_input_chars_per_word`](#module_tokenizers..WordPieceTokenizer+max_input_chars_per_word) : <code>number</code> | |
| * [`.vocab`](#module_tokenizers..WordPieceTokenizer+vocab) : <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers..WordPieceTokenizer+encode) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..WordPieceTokenizer_new" class="group"></a> | |
| ### `new WordPieceTokenizer(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.vocab</td><td><code>Object</code></td><td></td><td><p>A mapping of tokens to ids.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.unk_token</td><td><code>string</code></td><td></td><td><p>The unknown token string.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.continuing_subword_prefix</td><td><code>string</code></td><td></td><td><p>The prefix to use for continuing subwords.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.max_input_chars_per_word]</td><td><code>number</code></td><td><code>100</code></td><td><p>The maximum number of characters per word.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer+tokens_to_ids" class="group"></a> | |
| ### `wordPieceTokenizer.tokens_to_ids` : <code>Map.<string, number></code> | |
| A mapping of tokens to ids. | |
| **Kind**: instance property of [<code>WordPieceTokenizer</code>](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer+unk_token_id" class="group"></a> | |
| ### `wordPieceTokenizer.unk_token_id` : <code>number</code> | |
| The id of the unknown token. | |
| **Kind**: instance property of [<code>WordPieceTokenizer</code>](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer+unk_token" class="group"></a> | |
| ### `wordPieceTokenizer.unk_token` : <code>string</code> | |
| The unknown token string. | |
| **Kind**: instance property of [<code>WordPieceTokenizer</code>](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer+max_input_chars_per_word" class="group"></a> | |
| ### `wordPieceTokenizer.max_input_chars_per_word` : <code>number</code> | |
| The maximum number of characters allowed per word. | |
| **Kind**: instance property of [<code>WordPieceTokenizer</code>](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer+vocab" class="group"></a> | |
| ### `wordPieceTokenizer.vocab` : <code>Array.<string></code> | |
| An array of tokens. | |
| **Kind**: instance property of [<code>WordPieceTokenizer</code>](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| <a id="module_tokenizers..WordPieceTokenizer+encode" class="group"></a> | |
| ### `wordPieceTokenizer.encode(tokens)` ⇒ <code>Array.<string></code> | |
| Encodes an array of tokens using WordPiece encoding. | |
| **Kind**: instance method of [<code>WordPieceTokenizer</code>](#module_tokenizers..WordPieceTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of encoded tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The tokens to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
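The following is a minimal sketch of the greedy longest-match-first strategy commonly associated with WordPiece, assuming that is how `encode` behaves. The function `wordPieceEncode` and its option names are hypothetical; they loosely mirror the documented `unk_token`, `continuing_subword_prefix`, and `max_input_chars_per_word` settings.

```javascript
// Hypothetical sketch of greedy longest-match-first WordPiece encoding.
// `vocab` is a Set of known pieces; continuation pieces carry the prefix.
function wordPieceEncode(tokens, vocab, options = {}) {
  const { unkToken = '[UNK]', prefix = '##', maxChars = 100 } = options;
  const output = [];
  for (const token of tokens) {
    const chars = [...token];
    if (chars.length > maxChars) {
      output.push(unkToken); // overly long words are not split at all
      continue;
    }
    const pieces = [];
    let start = 0;
    let unknown = false;
    while (start < chars.length) {
      let end = chars.length;
      let match = null;
      // Shrink the candidate from the right until it is in the vocabulary.
      while (start < end) {
        let candidate = chars.slice(start, end).join('');
        if (start > 0) candidate = prefix + candidate;
        if (vocab.has(candidate)) {
          match = candidate;
          break;
        }
        --end;
      }
      if (match === null) {
        unknown = true; // no piece fits, so the whole word is unknown
        break;
      }
      pieces.push(match);
      start = end;
    }
    if (unknown) output.push(unkToken);
    else output.push(...pieces);
  }
  return output;
}

const vocab = new Set(['un', '##aff', '##able']);
console.log(wordPieceEncode(['unaffable', 'xyz'], vocab));
// [ 'un', '##aff', '##able', '[UNK]' ]
```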
| * * * | |
| <a id="module_tokenizers..Unigram" class="group"></a> | |
| ## tokenizers~Unigram ⇐ <code>TokenizerModel</code> | |
| Class representing a Unigram tokenizer model. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>TokenizerModel</code> | |
| * [~Unigram](#module_tokenizers..Unigram) ⇐ <code>TokenizerModel</code> | |
| * [`new Unigram(config, moreConfig)`](#new_module_tokenizers..Unigram_new) | |
| * [`.scores`](#module_tokenizers..Unigram+scores) : <code>Array.<number></code> | |
| * [`.populateNodes(lattice)`](#module_tokenizers..Unigram+populateNodes) | |
| * [`.tokenize(normalized)`](#module_tokenizers..Unigram+tokenize) ⇒ <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers..Unigram+encode) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..Unigram_new" class="group"></a> | |
| ### `new Unigram(config, moreConfig)` | |
| Create a new Unigram tokenizer model. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the Unigram model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.unk_id</td><td><code>number</code></td><td><p>The ID of the unknown token.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.vocab</td><td><code>*</code></td><td><p>A 2D array representing a mapping of tokens to scores.</p> | |
| </td> | |
| </tr><tr> | |
| <td>moreConfig</td><td><code>Object</code></td><td><p>Additional configuration object for the Unigram model.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Unigram+scores" class="group"></a> | |
| ### `unigram.scores` : <code>Array.<number></code> | |
| **Kind**: instance property of [<code>Unigram</code>](#module_tokenizers..Unigram) | |
| * * * | |
| <a id="module_tokenizers..Unigram+populateNodes" class="group"></a> | |
| ### `unigram.populateNodes(lattice)` | |
| Populates lattice nodes. | |
| **Kind**: instance method of [<code>Unigram</code>](#module_tokenizers..Unigram) | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>lattice</td><td><code>TokenLattice</code></td><td><p>The token lattice to populate with nodes.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Unigram+tokenize" class="group"></a> | |
| ### `unigram.tokenize(normalized)` ⇒ <code>Array.<string></code> | |
| Tokenizes a normalized string into an array of subtokens using the unigram model. | |
| **Kind**: instance method of [<code>Unigram</code>](#module_tokenizers..Unigram) | |
| **Returns**: <code>Array.<string></code> - An array of subtokens obtained by tokenizing the normalized string using the unigram model. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>normalized</td><td><code>string</code></td><td><p>The normalized string.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
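A unigram model typically picks the segmentation with the highest total score. The sketch below shows that idea with a small dynamic program over all substrings, assuming `vocab` is a 2D array of `[token, score]` pairs as documented for the constructor. The function `unigramTokenize` and the `unkScore` fallback for unseen single characters are assumptions of this example, and the lattice classes used by the library are not modelled.

```javascript
// Hypothetical sketch: best-scoring segmentation under a unigram vocabulary.
function unigramTokenize(text, vocab, unkScore = -10) {
  const scores = new Map(vocab); // token -> log-probability-like score
  const chars = [...text];
  const n = chars.length;
  const best = new Array(n + 1).fill(-Infinity); // best score ending at i
  const back = new Array(n + 1).fill(-1);        // start of the last piece
  best[0] = 0;
  for (let end = 1; end <= n; ++end) {
    for (let start = 0; start < end; ++start) {
      if (best[start] === -Infinity) continue;
      const piece = chars.slice(start, end).join('');
      let score = scores.get(piece);
      if (score === undefined) {
        // Unknown single characters stay reachable with a penalty score.
        if (end - start === 1) score = unkScore;
        else continue;
      }
      if (best[start] + score > best[end]) {
        best[end] = best[start] + score;
        back[end] = start;
      }
    }
  }
  // Walk the back-pointers from the end to recover the pieces.
  const pieces = [];
  for (let end = n; end > 0; end = back[end]) {
    pieces.unshift(chars.slice(back[end], end).join(''));
  }
  return pieces;
}

const vocab = [['a', -1], ['b', -1], ['ab', -1.5], ['c', -1]];
console.log(unigramTokenize('abc', vocab)); // [ 'ab', 'c' ]
```

Here `ab` + `c` scores -2.5, which beats `a` + `b` + `c` at -3, so the two-piece segmentation is returned.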
| * * * | |
| <a id="module_tokenizers..Unigram+encode" class="group"></a> | |
| ### `unigram.encode(tokens)` ⇒ <code>Array.<string></code> | |
| Encodes an array of tokens using Unigram encoding. | |
| **Kind**: instance method of [<code>Unigram</code>](#module_tokenizers..Unigram) | |
| **Returns**: <code>Array.<string></code> - An array of encoded tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The tokens to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BPE" class="group"></a> | |
| ## tokenizers~BPE ⇐ <code>TokenizerModel</code> | |
| BPE class for encoding text into Byte-Pair-Encoding (BPE) tokens. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>TokenizerModel</code> | |
| * [~BPE](#module_tokenizers..BPE) ⇐ <code>TokenizerModel</code> | |
| * [`new BPE(config)`](#new_module_tokenizers..BPE_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..BPE+tokens_to_ids) : <code>Map.<string, number></code> | |
| * [`.merges`](#module_tokenizers..BPE+merges) : <code>*</code> | |
| * [`.config.merges`](#module_tokenizers..BPE+merges.config.merges) : <code>*</code> | |
| * [`.max_length_to_cache`](#module_tokenizers..BPE+max_length_to_cache) | |
| * [`.cache_capacity`](#module_tokenizers..BPE+cache_capacity) | |
| * [`.clear_cache()`](#module_tokenizers..BPE+clear_cache) | |
| * [`.bpe(token)`](#module_tokenizers..BPE+bpe) ⇒ <code>Array.<string></code> | |
| * [`.encode(tokens)`](#module_tokenizers..BPE+encode) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..BPE_new" class="group"></a> | |
| ### `new BPE(config)` | |
| Create a BPE instance. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td></td><td><p>The configuration object for BPE.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.vocab</td><td><code>Object</code></td><td></td><td><p>A mapping of tokens to ids.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.merges</td><td><code>*</code></td><td></td><td><p>An array of BPE merges as strings.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.unk_token</td><td><code>string</code></td><td></td><td><p>The unknown token used for out of vocabulary words.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.end_of_word_suffix</td><td><code>string</code></td><td></td><td><p>The suffix to place at the end of each word.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.continuing_subword_suffix]</td><td><code>string</code></td><td></td><td><p>The suffix to insert between words.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.byte_fallback]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether to use the SentencePiece (spm) byte-fallback trick.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.ignore_merges]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether or not to match tokens with the vocab before using merges.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BPE+tokens_to_ids" class="group"></a> | |
| ### `bpE.tokens_to_ids` : <code>Map.<string, number></code> | |
| **Kind**: instance property of [<code>BPE</code>](#module_tokenizers..BPE) | |
| * * * | |
| <a id="module_tokenizers..BPE+merges" class="group"></a> | |
| ### `bpE.merges` : <code>*</code> | |
| **Kind**: instance property of [<code>BPE</code>](#module_tokenizers..BPE) | |
| * * * | |
| <a id="module_tokenizers..BPE+merges.config.merges" class="group"></a> | |
| #### `merges.config.merges` : <code>*</code> | |
| **Kind**: static property of [<code>merges</code>](#module_tokenizers..BPE+merges) | |
| * * * | |
| <a id="module_tokenizers..BPE+max_length_to_cache" class="group"></a> | |
| ### `bpE.max_length_to_cache` | |
| The maximum length of a string that will be cached by the model. | |
| Strings that are too long have a minimal chance of a cache hit anyway. | |
| **Kind**: instance property of [<code>BPE</code>](#module_tokenizers..BPE) | |
| * * * | |
| <a id="module_tokenizers..BPE+cache_capacity" class="group"></a> | |
| ### `bpE.cache_capacity` | |
| The default capacity for a `BPE`'s internal cache. | |
| **Kind**: instance property of [<code>BPE</code>](#module_tokenizers..BPE) | |
| * * * | |
| <a id="module_tokenizers..BPE+clear_cache" class="group"></a> | |
| ### `bpE.clear_cache()` | |
| Clears the cache. | |
| **Kind**: instance method of [<code>BPE</code>](#module_tokenizers..BPE) | |
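The interplay of a length limit, a capacity, and a clear operation can be sketched as a small bounded cache. Everything here is an assumption made for illustration: the factory name `createBoundedCache`, the default numbers, and the evict-oldest policy are not taken from the library.

```javascript
// Hypothetical sketch of a bounded cache with a per-key length limit.
function createBoundedCache({ capacity = 10000, maxLengthToCache = 256 } = {}) {
  const entries = new Map(); // a Map iterates in insertion order
  return {
    lookup(key) {
      return entries.get(key);
    },
    store(key, value) {
      if (key.length >= maxLengthToCache) return; // long strings rarely repeat
      if (!entries.has(key) && entries.size >= capacity) {
        entries.delete(entries.keys().next().value); // evict the oldest entry
      }
      entries.set(key, value);
    },
    clear() {
      entries.clear();
    },
    get size() {
      return entries.size;
    },
  };
}

const cache = createBoundedCache({ capacity: 2, maxLengthToCache: 5 });
cache.store('a', ['a']);
cache.store('b', ['b']);
cache.store('c', ['c']);        // evicts 'a'
cache.store('toolong', ['x']);  // skipped: key is too long
console.log(cache.size);        // 2
```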
| * * * | |
| <a id="module_tokenizers..BPE+bpe" class="group"></a> | |
| ### `bpE.bpe(token)` ⇒ <code>Array.<string></code> | |
| Applies Byte-Pair-Encoding (BPE) to a given token. The efficient heap-based priority | |
| queue implementation is adapted from https://github.com/belladoreai/llama-tokenizer-js. | |
| **Kind**: instance method of [<code>BPE</code>](#module_tokenizers..BPE) | |
| **Returns**: <code>Array.<string></code> - The BPE encoded tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>token</td><td><code>string</code></td><td><p>The token to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
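For intuition about what the merge step computes, here is a deliberately naive sketch that repeatedly merges the adjacent pair with the lowest merge rank. It uses a simple quadratic loop rather than the heap-based priority queue mentioned above, and the function name `naiveBpe` and the `[left, right]` pair format for `merges` are assumptions of this example.

```javascript
// Hypothetical sketch: naive BPE merging for a single token.
// `merges` is an ordered list of [left, right] pairs; earlier means higher priority.
function naiveBpe(token, merges) {
  // Pairs are keyed with a space separator, so pieces are assumed not to contain spaces.
  const ranks = new Map(merges.map(([left, right], i) => [`${left} ${right}`, i]));
  const word = [...token]; // start from individual characters
  while (word.length > 1) {
    let bestRank = Infinity;
    let bestIndex = -1;
    for (let i = 0; i < word.length - 1; ++i) {
      const rank = ranks.get(`${word[i]} ${word[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no applicable merge is left
    word.splice(bestIndex, 2, word[bestIndex] + word[bestIndex + 1]);
  }
  return word;
}

const merges = [['l', 'o'], ['lo', 'w'], ['e', 'r']];
console.log(naiveBpe('lower', merges)); // [ 'low', 'er' ]
```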
| * * * | |
| <a id="module_tokenizers..BPE+encode" class="group"></a> | |
| ### `bpE.encode(tokens)` ⇒ <code>Array.<string></code> | |
| Encodes the input sequence of tokens using the BPE algorithm and returns the resulting subword tokens. | |
| **Kind**: instance method of [<code>BPE</code>](#module_tokenizers..BPE) | |
| **Returns**: <code>Array.<string></code> - The resulting subword tokens after applying the BPE algorithm to the input sequence of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The input sequence of tokens to encode.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..LegacyTokenizerModel" class="group"></a> | |
| ## tokenizers~LegacyTokenizerModel | |
| Legacy tokenizer class for tokenizers with only a vocabulary. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [~LegacyTokenizerModel](#module_tokenizers..LegacyTokenizerModel) | |
| * [`new LegacyTokenizerModel(config, moreConfig)`](#new_module_tokenizers..LegacyTokenizerModel_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..LegacyTokenizerModel+tokens_to_ids) : <code>Map.<string, number></code> | |
| * * * | |
| <a id="new_module_tokenizers..LegacyTokenizerModel_new" class="group"></a> | |
| ### `new LegacyTokenizerModel(config, moreConfig)` | |
| Create a LegacyTokenizerModel instance. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for LegacyTokenizerModel.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.vocab</td><td><code>Object</code></td><td><p>A (possibly nested) mapping of tokens to ids.</p> | |
| </td> | |
| </tr><tr> | |
| <td>moreConfig</td><td><code>Object</code></td><td><p>Additional configuration object for the LegacyTokenizerModel model.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..LegacyTokenizerModel+tokens_to_ids" class="group"></a> | |
| ### `legacyTokenizerModel.tokens_to_ids` : <code>Map.<string, number></code> | |
| **Kind**: instance property of [<code>LegacyTokenizerModel</code>](#module_tokenizers..LegacyTokenizerModel) | |
| * * * | |
| <a id="module_tokenizers..Normalizer" class="group"></a> | |
| ## *tokenizers~Normalizer* | |
| A base class for text normalization. | |
| **Kind**: inner abstract class of [<code>tokenizers</code>](#module_tokenizers) | |
| * *[~Normalizer](#module_tokenizers..Normalizer)* | |
| * *[`new Normalizer(config)`](#new_module_tokenizers..Normalizer_new)* | |
| * _instance_ | |
| * **[`.normalize(text)`](#module_tokenizers..Normalizer+normalize) ⇒ <code>string</code>** | |
| * *[`._call(text)`](#module_tokenizers..Normalizer+_call) ⇒ <code>string</code>* | |
| * _static_ | |
| * *[`.fromConfig(config)`](#module_tokenizers..Normalizer.fromConfig) ⇒ <code>Normalizer</code>* | |
| * * * | |
| <a id="new_module_tokenizers..Normalizer_new" class="group"></a> | |
| ### *`new Normalizer(config)`* | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the normalizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Normalizer+normalize" class="group"></a> | |
| ### **`normalizer.normalize(text)` ⇒ <code>string</code>** | |
| Normalize the input text. | |
| **Kind**: instance abstract method of [<code>Normalizer</code>](#module_tokenizers..Normalizer) | |
| **Returns**: <code>string</code> - The normalized text. | |
| **Throws**: | |
| - <code>Error</code> If this method is not implemented in a subclass. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Normalizer+_call" class="group"></a> | |
| ### *`normalizer._call(text)` ⇒ <code>string</code>* | |
| Alias for [Normalizer#normalize](#module_tokenizers..Normalizer+normalize). | |
| **Kind**: instance method of [<code>Normalizer</code>](#module_tokenizers..Normalizer) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Normalizer.fromConfig" class="group"></a> | |
| ### *`Normalizer.fromConfig(config)` ⇒ <code>Normalizer</code>* | |
| Factory method for creating normalizers from config objects. | |
| **Kind**: static method of [<code>Normalizer</code>](#module_tokenizers..Normalizer) | |
| **Returns**: <code>Normalizer</code> - A Normalizer object. | |
| **Throws**: | |
| - <code>Error</code> If an unknown Normalizer type is specified in the config. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the normalizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
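The combination of an abstract `normalize` method and a `fromConfig` factory can be sketched as follows. The class names are stand-ins invented for this example, and dispatching on a `type` field is an assumption about the config shape rather than a statement of the library's code.

```javascript
// Illustrative stand-ins for the documented classes (names are hypothetical).
class SketchNormalizer {
  constructor(config) {
    this.config = config;
  }

  // Abstract: concrete subclasses must override this.
  normalize(text) {
    throw new Error('normalize(text) must be implemented in a subclass.');
  }

  // Factory: picks the subclass from the `type` field of the config.
  static fromConfig(config) {
    switch (config.type) {
      case 'Lowercase':
        return new SketchLowercase(config);
      case 'Prepend':
        return new SketchPrepend(config);
      default:
        throw new Error(`Unknown Normalizer type: ${config.type}`);
    }
  }
}

class SketchLowercase extends SketchNormalizer {
  normalize(text) {
    return text.toLowerCase();
  }
}

class SketchPrepend extends SketchNormalizer {
  normalize(text) {
    return this.config.prepend + text;
  }
}

const normalizer = SketchNormalizer.fromConfig({ type: 'Lowercase' });
console.log(normalizer.normalize('HeLLo')); // hello
```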
| * * * | |
| <a id="module_tokenizers..Replace" class="group"></a> | |
| ## tokenizers~Replace ⇐ <code>Normalizer</code> | |
| A normalizer that replaces occurrences of a pattern (a string or a regular expression) with a given string. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * * * | |
| <a id="module_tokenizers..Replace+normalize" class="group"></a> | |
| ### `replace.normalize(text)` ⇒ <code>string</code> | |
| Normalize the input text by replacing the pattern with the content. | |
| **Kind**: instance method of [<code>Replace</code>](#module_tokenizers..Replace) | |
| **Returns**: <code>string</code> - The normalized text after replacing the pattern with the content. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The input text to be normalized.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
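A minimal sketch of pattern replacement follows, assuming a config of the form `{ pattern: { Regex: ... } | { String: ... }, content: ... }`. That shape and the helper names `toPattern` and `replaceNormalize` are assumptions for this example.

```javascript
// Hypothetical helper: turns a serialized pattern into a value usable with replaceAll.
function toPattern(pattern) {
  if (pattern.Regex !== undefined) return new RegExp(pattern.Regex, 'gu');
  if (pattern.String !== undefined) return pattern.String;
  throw new Error('Unsupported pattern');
}

// Replaces every occurrence of the pattern with `config.content`.
// Note: `$` sequences in `content` keep their usual replacement meaning.
function replaceNormalize(text, config) {
  return text.replaceAll(toPattern(config.pattern), config.content);
}

console.log(replaceNormalize('a   b  c', { pattern: { Regex: ' {2,}' }, content: ' ' }));
// a b c
```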
| * * * | |
| <a id="module_tokenizers..UnicodeNormalizer" class="group"></a> | |
| ## *tokenizers~UnicodeNormalizer ⇐ <code>Normalizer</code>* | |
| A normalizer that applies Unicode normalization to the input text. | |
| **Kind**: inner abstract class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * *[~UnicodeNormalizer](#module_tokenizers..UnicodeNormalizer) ⇐ <code>Normalizer</code>* | |
| * *[`.form`](#module_tokenizers..UnicodeNormalizer+form) : <code>string</code>* | |
| * *[`.normalize(text)`](#module_tokenizers..UnicodeNormalizer+normalize) ⇒ <code>string</code>* | |
| * * * | |
| <a id="module_tokenizers..UnicodeNormalizer+form" class="group"></a> | |
| ### *`unicodeNormalizer.form` : <code>string</code>* | |
| The Unicode normalization form to apply. Should be one of: 'NFC', 'NFD', 'NFKC', or 'NFKD'. | |
| **Kind**: instance property of [<code>UnicodeNormalizer</code>](#module_tokenizers..UnicodeNormalizer) | |
| * * * | |
| <a id="module_tokenizers..UnicodeNormalizer+normalize" class="group"></a> | |
| ### *`unicodeNormalizer.normalize(text)` ⇒ <code>string</code>* | |
| Normalize the input text by applying Unicode normalization. | |
| **Kind**: instance method of [<code>UnicodeNormalizer</code>](#module_tokenizers..UnicodeNormalizer) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The input text to be normalized.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
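In JavaScript, the four forms correspond to the `form` argument of the built-in `String.prototype.normalize`. The sketch below shows how the forms differ on two characters; the wrapper `applyUnicodeForm` is a hypothetical helper, and whether the library calls the built-in in exactly this way is an assumption.

```javascript
const FORMS = ['NFC', 'NFD', 'NFKC', 'NFKD'];

// Hypothetical wrapper around the built-in normalization.
function applyUnicodeForm(text, form) {
  if (!FORMS.includes(form)) throw new Error(`Unknown form: ${form}`);
  return text.normalize(form);
}

const composed = '\u00e9'; // "e with acute" as a single precomposed codepoint
const ligature = '\ufb01'; // "fi" ligature, a compatibility character

console.log(applyUnicodeForm(composed, 'NFD').length);         // 2 (base letter + combining accent)
console.log(applyUnicodeForm(ligature, 'NFC') === ligature);   // true (canonical forms keep the ligature)
console.log(applyUnicodeForm(ligature, 'NFKC'));               // fi
```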
| * * * | |
| <a id="module_tokenizers..NFC" class="group"></a> | |
| ## tokenizers~NFC ⇐ <code>UnicodeNormalizer</code> | |
| A normalizer that applies Unicode normalization form C (NFC) to the input text. | |
| Canonical Decomposition, followed by Canonical Composition. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>UnicodeNormalizer</code> | |
| * * * | |
| <a id="module_tokenizers..NFD" class="group"></a> | |
| ## tokenizers~NFD ⇐ <code>UnicodeNormalizer</code> | |
| A normalizer that applies Unicode normalization form D (NFD) to the input text. | |
| Canonical Decomposition. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>UnicodeNormalizer</code> | |
| * * * | |
| <a id="module_tokenizers..NFKC" class="group"></a> | |
| ## tokenizers~NFKC ⇐ <code>UnicodeNormalizer</code> | |
| A normalizer that applies Unicode normalization form KC (NFKC) to the input text. | |
| Compatibility Decomposition, followed by Canonical Composition. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>UnicodeNormalizer</code> | |
| * * * | |
| <a id="module_tokenizers..NFKD" class="group"></a> | |
| ## tokenizers~NFKD ⇐ <code>UnicodeNormalizer</code> | |
| A normalizer that applies Unicode normalization form KD (NFKD) to the input text. | |
| Compatibility Decomposition. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>UnicodeNormalizer</code> | |
| * * * | |
| <a id="module_tokenizers..StripNormalizer" class="group"></a> | |
| ## tokenizers~StripNormalizer | |
| A normalizer that strips leading and/or trailing whitespace from the input text. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * * * | |
| <a id="module_tokenizers..StripNormalizer+normalize" class="group"></a> | |
| ### `stripNormalizer.normalize(text)` ⇒ <code>string</code> | |
| Strip leading and/or trailing whitespace from the input text. | |
| **Kind**: instance method of [<code>StripNormalizer</code>](#module_tokenizers..StripNormalizer) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The input text.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
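The "leading and/or trailing" behaviour can be sketched with two flags. The option names `strip_left` and `strip_right` and their defaults are assumptions for this example, as is the function name `stripNormalize`.

```javascript
// Hypothetical sketch: strips whitespace on the sides selected by the flags.
function stripNormalize(text, { strip_left = true, strip_right = true } = {}) {
  if (strip_left && strip_right) return text.trim();
  if (strip_left) return text.trimStart();
  if (strip_right) return text.trimEnd();
  return text;
}

console.log(JSON.stringify(stripNormalize('  hi  ', { strip_left: true, strip_right: false })));
// "hi  "
```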
| * * * | |
| <a id="module_tokenizers..StripAccents" class="group"></a> | |
| ## tokenizers~StripAccents ⇐ <code>Normalizer</code> | |
| StripAccents normalizer removes all accents from the text. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * * * | |
| <a id="module_tokenizers..StripAccents+normalize" class="group"></a> | |
| ### `stripAccents.normalize(text)` ⇒ <code>string</code> | |
| Remove all accents from the text. | |
| **Kind**: instance method of [<code>StripAccents</code>](#module_tokenizers..StripAccents) | |
| **Returns**: <code>string</code> - The normalized text without accents. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The input text.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
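One common way to remove accents is to decompose the text and then drop combining marks. The sketch below takes that approach; the function name `stripAccentMarks` is hypothetical, and the initial NFD step is an assumption of this example (a real pipeline might rely on an earlier normalizer to decompose the text).

```javascript
// Hypothetical sketch: decompose, then remove Unicode combining marks.
function stripAccentMarks(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripAccentMarks('caf\u00e9 na\u00efve')); // cafe naive
```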
| * * * | |
| <a id="module_tokenizers..Lowercase" class="group"></a> | |
| ## tokenizers~Lowercase ⇐ <code>Normalizer</code> | |
| A Normalizer that lowercases the input string. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * * * | |
| <a id="module_tokenizers..Lowercase+normalize" class="group"></a> | |
| ### `lowercase.normalize(text)` ⇒ <code>string</code> | |
| Lowercases the input string. | |
| **Kind**: instance method of [<code>Lowercase</code>](#module_tokenizers..Lowercase) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Prepend" class="group"></a> | |
| ## tokenizers~Prepend ⇐ <code>Normalizer</code> | |
| A Normalizer that prepends a string to the input string. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * * * | |
| <a id="module_tokenizers..Prepend+normalize" class="group"></a> | |
| ### `prepend.normalize(text)` ⇒ <code>string</code> | |
| Prepends the configured string to the input text. | |
| **Kind**: instance method of [<code>Prepend</code>](#module_tokenizers..Prepend) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..NormalizerSequence" class="group"></a> | |
| ## tokenizers~NormalizerSequence ⇐ <code>Normalizer</code> | |
| A Normalizer that applies a sequence of Normalizers. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * [~NormalizerSequence](#module_tokenizers..NormalizerSequence) ⇐ <code>Normalizer</code> | |
| * [`new NormalizerSequence(config)`](#new_module_tokenizers..NormalizerSequence_new) | |
| * [`.normalize(text)`](#module_tokenizers..NormalizerSequence+normalize) ⇒ <code>string</code> | |
| * * * | |
| <a id="new_module_tokenizers..NormalizerSequence_new" class="group"></a> | |
| ### `new NormalizerSequence(config)` | |
| Create a new instance of NormalizerSequence. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.normalizers</td><td><code>Array.<Object></code></td><td><p>An array of Normalizer configuration objects.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..NormalizerSequence+normalize" class="group"></a> | |
| ### `normalizerSequence.normalize(text)` ⇒ <code>string</code> | |
| Apply a sequence of Normalizers to the input text. | |
| **Kind**: instance method of [<code>NormalizerSequence</code>](#module_tokenizers..NormalizerSequence) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
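Applying normalizers in sequence amounts to feeding each step's output into the next. In this minimal sketch, plain functions stand in for the `normalize` methods of individual normalizers; the function name `normalizeInSequence` and the chosen steps are illustrative assumptions.

```javascript
// Each step is a plain function standing in for a Normalizer's `normalize` method.
function normalizeInSequence(text, steps) {
  return steps.reduce((current, step) => step(current), text);
}

const steps = [
  (t) => t.trim(),                    // strip surrounding whitespace
  (t) => t.toLowerCase(),             // lowercase
  (t) => '\u2581' + t,                // prepend a marker character
  (t) => t.replaceAll(' ', '\u2581'), // replace spaces with the marker
];

console.log(normalizeInSequence('  Hello World ', steps)); // ▁hello▁world
```

Order matters: moving the prepend step before the trim step would leave the result unchanged here, but moving the replace step before trim would turn the surrounding spaces into markers.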
| * * * | |
| <a id="module_tokenizers..BertNormalizer" class="group"></a> | |
| ## tokenizers~BertNormalizer ⇐ <code>Normalizer</code> | |
| A class representing a normalizer used in BERT tokenization. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * [~BertNormalizer](#module_tokenizers..BertNormalizer) ⇐ <code>Normalizer</code> | |
| * [`._tokenize_chinese_chars(text)`](#module_tokenizers..BertNormalizer+_tokenize_chinese_chars) ⇒ <code>string</code> | |
| * [`.stripAccents(text)`](#module_tokenizers..BertNormalizer+stripAccents) ⇒ <code>string</code> | |
| * [`.normalize(text)`](#module_tokenizers..BertNormalizer+normalize) ⇒ <code>string</code> | |
| * * * | |
| <a id="module_tokenizers..BertNormalizer+_tokenize_chinese_chars" class="group"></a> | |
| ### `bertNormalizer._tokenize_chinese_chars(text)` ⇒ <code>string</code> | |
| Adds whitespace around any CJK (Chinese, Japanese, or Korean) character in the input text. | |
| **Kind**: instance method of [<code>BertNormalizer</code>](#module_tokenizers..BertNormalizer) | |
| **Returns**: <code>string</code> - The tokenized text with whitespace added around CJK characters. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The input text to tokenize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BertNormalizer+stripAccents" class="group"></a> | |
| ### `bertNormalizer.stripAccents(text)` ⇒ <code>string</code> | |
| Strips accents from the given text. | |
| **Kind**: instance method of [<code>BertNormalizer</code>](#module_tokenizers..BertNormalizer) | |
| **Returns**: <code>string</code> - The text with accents removed. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to strip accents from.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
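A common way to strip accents in JavaScript is to decompose the text and then drop the combining marks. The sketch below assumes that approach; the function name `stripAccentsSketch` is hypothetical and the library's own method may differ in detail.

```javascript
function stripAccentsSketch(text) {
  // NFD splits "é" into "e" + U+0301; \p{Mn} matches nonspacing marks such as U+0301.
  return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}

console.log(stripAccentsSketch('Café déjà vu')); // "Cafe deja vu"
```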
| * * * | |
| <a id="module_tokenizers..BertNormalizer+normalize" class="group"></a> | |
| ### `bertNormalizer.normalize(text)` ⇒ <code>string</code> | |
| Normalizes the given text based on the configuration. | |
| **Kind**: instance method of [<code>BertNormalizer</code>](#module_tokenizers..BertNormalizer) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PreTokenizer" class="group"></a> | |
| ## tokenizers~PreTokenizer ⇐ [<code>Callable</code>](#Callable) | |
| A callable class representing a pre-tokenizer used in tokenization. Subclasses | |
| should implement the `pre_tokenize_text` method to define the specific pre-tokenization logic. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: [<code>Callable</code>](#Callable) | |
| * [~PreTokenizer](#module_tokenizers..PreTokenizer) ⇐ [<code>Callable</code>](#Callable) | |
| * _instance_ | |
| * *[`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code>* | |
| * [`.pre_tokenize(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize) ⇒ <code>Array.<string></code> | |
| * [`._call(text, [options])`](#module_tokenizers..PreTokenizer+_call) ⇒ <code>Array.<string></code> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PreTokenizer.fromConfig) ⇒ <code>PreTokenizer</code> | |
| * * * | |
| <a id="module_tokenizers..PreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### *`preTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code>* | |
| Method that should be implemented by subclasses to define the specific pre-tokenization logic. | |
| **Kind**: instance abstract method of [<code>PreTokenizer</code>](#module_tokenizers..PreTokenizer) | |
| **Returns**: <code>Array.<string></code> - The pre-tokenized text. | |
| **Throws**: | |
| - <code>Error</code> If the method is not implemented in the subclass. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to pre-tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PreTokenizer+pre_tokenize" class="group"></a> | |
| ### `preTokenizer.pre_tokenize(text, [options])` ⇒ <code>Array.<string></code> | |
| Tokenizes the given text into pre-tokens. | |
| **Kind**: instance method of [<code>PreTokenizer</code>](#module_tokenizers..PreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of pre-tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code> | <code>Array.<string></code></td><td><p>The text or array of texts to pre-tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PreTokenizer+_call" class="group"></a> | |
| ### `preTokenizer._call(text, [options])` ⇒ <code>Array.<string></code> | |
| Alias for [PreTokenizer#pre_tokenize](#module_tokenizers..PreTokenizer+pre_tokenize). | |
| **Kind**: instance method of [<code>PreTokenizer</code>](#module_tokenizers..PreTokenizer) | |
| **Overrides**: [<code>_call</code>](#Callable+_call) | |
| **Returns**: <code>Array.<string></code> - An array of pre-tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code> | <code>Array.<string></code></td><td><p>The text or array of texts to pre-tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
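The relationship between the three instance methods can be sketched as below. The class names are hypothetical, and flattening the results for array input is an assumption inferred from the documented `Array.<string>` return type.

```javascript
class PreTokenizerSketch {
  // Abstract: subclasses supply the actual splitting logic.
  pre_tokenize_text(text, options) {
    throw new Error('pre_tokenize_text should be implemented in subclass.');
  }

  // Accepts one string or an array of strings; array results are flattened.
  pre_tokenize(text, options) {
    return Array.isArray(text)
      ? text.flatMap((item) => this.pre_tokenize_text(item, options))
      : this.pre_tokenize_text(text, options);
  }

  // Alias, so that an instance can be invoked through a callable wrapper.
  _call(text, options) {
    return this.pre_tokenize(text, options);
  }
}

// Hypothetical subclass that splits on runs of whitespace.
class WhitespaceSketch extends PreTokenizerSketch {
  pre_tokenize_text(text) {
    return text.match(/\S+/g) ?? [];
  }
}

const pre = new WhitespaceSketch();
console.log(pre.pre_tokenize(['a b', 'c'])); // [ 'a', 'b', 'c' ]
```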
| * * * | |
| <a id="module_tokenizers..PreTokenizer.fromConfig" class="group"></a> | |
| ### `PreTokenizer.fromConfig(config)` ⇒ <code>PreTokenizer</code> | |
| Factory method that returns an instance of a subclass of `PreTokenizer` based on the provided configuration. | |
| **Kind**: static method of [<code>PreTokenizer</code>](#module_tokenizers..PreTokenizer) | |
| **Returns**: <code>PreTokenizer</code> - An instance of a subclass of `PreTokenizer`. | |
| **Throws**: | |
| - <code>Error</code> If the provided configuration object does not correspond to any known pre-tokenizer. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>A configuration object for the pre-tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
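A factory of this kind typically dispatches on a discriminator field in the configuration. The sketch below assumes that field is called `type`; the registry and the classes in it are hypothetical and much smaller than the set of pre-tokenizers the library supports.

```javascript
class BasePre {
  constructor(config) {
    this.config = config;
  }
}
class WhitespacePre extends BasePre {}
class DigitsPre extends BasePre {}

// Hypothetical registry mapping a type name to a class.
const PRE_TOKENIZERS = new Map([
  ['Whitespace', WhitespacePre],
  ['Digits', DigitsPre],
]);

function preTokenizerFromConfig(config) {
  const cls = PRE_TOKENIZERS.get(config.type);
  if (cls === undefined) {
    throw new Error(`Unknown PreTokenizer type: ${config.type}`);
  }
  return new cls(config);
}

const digits = preTokenizerFromConfig({ type: 'Digits', individual_digits: true });
console.log(digits instanceof DigitsPre); // true
```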
| * * * | |
| <a id="module_tokenizers..BertPreTokenizer" class="group"></a> | |
| ## tokenizers~BertPreTokenizer ⇐ <code>PreTokenizer</code> | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~BertPreTokenizer](#module_tokenizers..BertPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new BertPreTokenizer(config)`](#new_module_tokenizers..BertPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..BertPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..BertPreTokenizer_new" class="group"></a> | |
| ### `new BertPreTokenizer(config)` | |
| A PreTokenizer that splits text into wordpieces using a basic tokenization scheme | |
| similar to that used in the original implementation of BERT. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BertPreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `bertPreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Tokenizes a single text using the BERT pre-tokenization scheme. | |
| **Kind**: instance method of [<code>BertPreTokenizer</code>](#module_tokenizers..BertPreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
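BERT-style basic tokenization splits on whitespace and isolates each punctuation character. The following is a rough sketch under the assumption that Unicode `\p{P}` is a good enough definition of punctuation; the original BERT scheme also treats some ASCII symbols as punctuation, which this sketch does not model.

```javascript
// Either a run of non-space, non-punctuation characters, or one punctuation character.
const BERT_PATTERN_SKETCH = /[^\s\p{P}]+|\p{P}/gu;

function bertPreTokenizeSketch(text) {
  return text.trim().match(BERT_PATTERN_SKETCH) ?? [];
}

console.log(bertPreTokenizeSketch('Hello, world!')); // [ 'Hello', ',', 'world', '!' ]
```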
| * * * | |
| <a id="module_tokenizers..ByteLevelPreTokenizer" class="group"></a> | |
| ## tokenizers~ByteLevelPreTokenizer ⇐ <code>PreTokenizer</code> | |
| A pre-tokenizer that splits text into Byte-Pair-Encoding (BPE) subwords. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new ByteLevelPreTokenizer(config)`](#new_module_tokenizers..ByteLevelPreTokenizer_new) | |
| * [`.add_prefix_space`](#module_tokenizers..ByteLevelPreTokenizer+add_prefix_space) : <code>boolean</code> | |
| * [`.trim_offsets`](#module_tokenizers..ByteLevelPreTokenizer+trim_offsets) : <code>boolean</code> | |
| * [`.use_regex`](#module_tokenizers..ByteLevelPreTokenizer+use_regex) : <code>boolean</code> | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ByteLevelPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..ByteLevelPreTokenizer_new" class="group"></a> | |
| ### `new ByteLevelPreTokenizer(config)` | |
| Creates a new instance of the `ByteLevelPreTokenizer` class. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..ByteLevelPreTokenizer+add_prefix_space" class="group"></a> | |
| ### `byteLevelPreTokenizer.add_prefix_space` : <code>boolean</code> | |
| Whether to add a leading space to the first word. This allows the leading word to be treated like any other word. | |
| **Kind**: instance property of [<code>ByteLevelPreTokenizer</code>](#module_tokenizers..ByteLevelPreTokenizer) | |
| * * * | |
| <a id="module_tokenizers..ByteLevelPreTokenizer+trim_offsets" class="group"></a> | |
| ### `byteLevelPreTokenizer.trim_offsets` : <code>boolean</code> | |
| Whether the post-processing step should trim offsets to avoid including whitespace. | |
| **Kind**: instance property of [<code>ByteLevelPreTokenizer</code>](#module_tokenizers..ByteLevelPreTokenizer) | |
| **Todo** | |
| - Use this in the pretokenization step. | |
| * * * | |
| <a id="module_tokenizers..ByteLevelPreTokenizer+use_regex" class="group"></a> | |
| ### `byteLevelPreTokenizer.use_regex` : <code>boolean</code> | |
| Whether to use the standard GPT2 regex for whitespace splitting. Set it to `false` if you want to use your own splitting. Defaults to `true`. | |
| **Kind**: instance property of [<code>ByteLevelPreTokenizer</code>](#module_tokenizers..ByteLevelPreTokenizer) | |
| * * * | |
| <a id="module_tokenizers..ByteLevelPreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `byteLevelPreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Tokenizes a single piece of text using byte-level tokenization. | |
| **Kind**: instance method of [<code>ByteLevelPreTokenizer</code>](#module_tokenizers..ByteLevelPreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
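Byte-level pre-tokenization in the GPT-2 style can be sketched in three steps: optionally add a prefix space, split with the GPT-2 regex, then map each UTF-8 byte to a printable character. The helper names below are hypothetical, and the code is an illustration of the technique rather than the library's implementation.

```javascript
// GPT-2 byte-to-character table: printable bytes map to themselves,
// every other byte is shifted to a code point of 256 or above.
function bytesToUnicode() {
  const bytes = [];
  const pushRange = (from, to) => { for (let b = from; b <= to; ++b) bytes.push(b); };
  pushRange(0x21, 0x7e);
  pushRange(0xa1, 0xac);
  pushRange(0xae, 0xff);
  const codePoints = bytes.slice();
  let n = 0;
  for (let b = 0; b < 256; ++b) {
    if (!bytes.includes(b)) {
      bytes.push(b);
      codePoints.push(256 + n);
      ++n;
    }
  }
  return new Map(bytes.map((b, i) => [b, String.fromCodePoint(codePoints[i])]));
}

const BYTE_TO_UNICODE = bytesToUnicode();
const GPT2_PATTERN = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;

function byteLevelPreTokenizeSketch(text, { add_prefix_space = false, use_regex = true } = {}) {
  if (add_prefix_space && !text.startsWith(' ')) text = ' ' + text;
  const pieces = use_regex ? (text.match(GPT2_PATTERN) ?? []) : [text];
  const encoder = new TextEncoder(); // encodes to UTF-8 bytes
  return pieces.map((piece) =>
    Array.from(encoder.encode(piece), (byte) => BYTE_TO_UNICODE.get(byte)).join(''));
}

console.log(byteLevelPreTokenizeSketch('Hello world')); // [ 'Hello', 'Ġworld' ]
console.log(byteLevelPreTokenizeSketch('Hello world', { add_prefix_space: true })); // [ 'ĠHello', 'Ġworld' ]
```

The space byte (0x20) is not in the printable ranges, so it is shifted to U+0120 (`Ġ`), which is why GPT-2 style vocabularies show that character at word starts.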
| * * * | |
| <a id="module_tokenizers..SplitPreTokenizer" class="group"></a> | |
| ## tokenizers~SplitPreTokenizer ⇐ <code>PreTokenizer</code> | |
| Splits text using a given pattern. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~SplitPreTokenizer](#module_tokenizers..SplitPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new SplitPreTokenizer(config)`](#new_module_tokenizers..SplitPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..SplitPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..SplitPreTokenizer_new" class="group"></a> | |
| ### `new SplitPreTokenizer(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration options for the pre-tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.pattern</td><td><code>Object</code></td><td><p>The pattern used to split the text. Can be a string or a regex object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.pattern.String</td><td><code>string</code> | <code>undefined</code></td><td><p>The string to use for splitting. Only defined if the pattern is a string.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.pattern.Regex</td><td><code>string</code> | <code>undefined</code></td><td><p>The regex to use for splitting. Only defined if the pattern is a regex.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.behavior</td><td><code>SplitDelimiterBehavior</code></td><td><p>The behavior to use when splitting.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.invert</td><td><code>boolean</code></td><td><p>Whether to split (invert=false) or match (invert=true) the pattern.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..SplitPreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `splitPreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Tokenizes text by splitting it using the given pattern. | |
| **Kind**: instance method of [<code>SplitPreTokenizer</code>](#module_tokenizers..SplitPreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
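The interaction of `pattern`, `behavior` and `invert` can be sketched as below. This is a simplified illustration: it assumes the behavior names `'removed'` and `'isolated'` and handles only those two, and it compiles `config.pattern.Regex` directly with JavaScript's `RegExp`, which may not accept every pattern found in a real tokenizer file.

```javascript
// Escape a literal string so that it can be used inside a RegExp.
function escapeRegex(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function splitPreTokenizeSketch(text, { pattern, behavior = 'removed', invert = false }) {
  const source = pattern.Regex !== undefined ? pattern.Regex : escapeRegex(pattern.String);
  const regex = new RegExp(source, 'gu');

  // invert=true: keep the matches themselves.
  if (invert) return text.match(regex) ?? [];

  // invert=false: split on the matches; 'isolated' keeps each delimiter as its own piece.
  const pieces = [];
  let last = 0;
  for (const match of text.matchAll(regex)) {
    if (match[0].length === 0) continue; // ignore empty matches
    if (match.index > last) pieces.push(text.slice(last, match.index));
    if (behavior === 'isolated') pieces.push(match[0]);
    last = match.index + match[0].length;
  }
  if (last < text.length) pieces.push(text.slice(last));
  return pieces;
}

console.log(splitPreTokenizeSketch('a-b-c', { pattern: { String: '-' } })); // [ 'a', 'b', 'c' ]
console.log(splitPreTokenizeSketch('a-b-c', { pattern: { String: '-' }, behavior: 'isolated' })); // [ 'a', '-', 'b', '-', 'c' ]
console.log(splitPreTokenizeSketch('ab12cd345', { pattern: { Regex: '\\d+' }, invert: true })); // [ '12', '345' ]
```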
| * * * | |
| <a id="module_tokenizers..PunctuationPreTokenizer" class="group"></a> | |
| ## tokenizers~PunctuationPreTokenizer ⇐ <code>PreTokenizer</code> | |
| Splits text based on punctuation. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~PunctuationPreTokenizer](#module_tokenizers..PunctuationPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new PunctuationPreTokenizer(config)`](#new_module_tokenizers..PunctuationPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PunctuationPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..PunctuationPreTokenizer_new" class="group"></a> | |
| ### `new PunctuationPreTokenizer(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration options for the pre-tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.behavior</td><td><code>SplitDelimiterBehavior</code></td><td><p>The behavior to use when splitting.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PunctuationPreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `punctuationPreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Tokenizes text by splitting it on punctuation. | |
| **Kind**: instance method of [<code>PunctuationPreTokenizer</code>](#module_tokenizers..PunctuationPreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
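Splitting on punctuation can be sketched with a single alternation that matches either a run of punctuation or a run of everything else. The sketch assumes Unicode `\p{P}` as the definition of punctuation and does not model `config.behavior`; the function name is hypothetical.

```javascript
function punctuationPreTokenizeSketch(text) {
  // A run of non-punctuation characters, or a run of punctuation characters.
  return text.match(/[^\p{P}]+|\p{P}+/gu) ?? [];
}

console.log(punctuationPreTokenizeSketch('Hi!!! there.')); // [ 'Hi', '!!!', ' there', '.' ]
```

Whitespace is left attached to the neighbouring text in this sketch; a whitespace-aware pre-tokenizer would normally run alongside it.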
| * * * | |
| <a id="module_tokenizers..DigitsPreTokenizer" class="group"></a> | |
| ## tokenizers~DigitsPreTokenizer ⇐ <code>PreTokenizer</code> | |
| Splits text based on digits. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~DigitsPreTokenizer](#module_tokenizers..DigitsPreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new DigitsPreTokenizer(config)`](#new_module_tokenizers..DigitsPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..DigitsPreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..DigitsPreTokenizer_new" class="group"></a> | |
| ### `new DigitsPreTokenizer(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration options for the pre-tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.individual_digits</td><td><code>boolean</code></td><td><p>Whether to split on individual digits.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..DigitsPreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `digitsPreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Tokenizes text by splitting it on digits. | |
| **Kind**: instance method of [<code>DigitsPreTokenizer</code>](#module_tokenizers..DigitsPreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
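The effect of `individual_digits` can be illustrated with two regular expressions that differ only in whether digits are matched one at a time or as a run. This is a sketch with a hypothetical function name, not the library's code.

```javascript
function digitsPreTokenizeSketch(text, { individual_digits = false } = {}) {
  // Non-digit runs are kept whole; digits are matched singly or as a run.
  const pattern = individual_digits ? /[^\d]+|\d/gu : /[^\d]+|\d+/gu;
  return text.match(pattern) ?? [];
}

console.log(digitsPreTokenizeSketch('Call 911 now')); // [ 'Call ', '911', ' now' ]
console.log(digitsPreTokenizeSketch('Call 911 now', { individual_digits: true })); // [ 'Call ', '9', '1', '1', ' now' ]
```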
| * * * | |
| <a id="module_tokenizers..PostProcessor" class="group"></a> | |
| ## tokenizers~PostProcessor ⇐ [<code>Callable</code>](#Callable) | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: [<code>Callable</code>](#Callable) | |
| * [~PostProcessor](#module_tokenizers..PostProcessor) ⇐ [<code>Callable</code>](#Callable) | |
| * [`new PostProcessor(config)`](#new_module_tokenizers..PostProcessor_new) | |
| * _instance_ | |
| * [`.post_process(tokens, ...args)`](#module_tokenizers..PostProcessor+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * [`._call(tokens, ...args)`](#module_tokenizers..PostProcessor+_call) ⇒ <code>PostProcessedOutput</code> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PostProcessor.fromConfig) ⇒ <code>PostProcessor</code> | |
| * * * | |
| <a id="new_module_tokenizers..PostProcessor_new" class="group"></a> | |
| ### `new PostProcessor(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration for the post-processor.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PostProcessor+post_process" class="group"></a> | |
| ### `postProcessor.post_process(tokens, ...args)` ⇒ <code>PostProcessedOutput</code> | |
| Method to be implemented in subclass to apply post-processing on the given tokens. | |
| **Kind**: instance method of [<code>PostProcessor</code>](#module_tokenizers..PostProcessor) | |
| **Returns**: <code>PostProcessedOutput</code> - The post-processed tokens. | |
| **Throws**: | |
| - <code>Error</code> If the method is not implemented in subclass. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array</code></td><td><p>The input tokens to be post-processed.</p> | |
| </td> | |
| </tr><tr> | |
| <td>...args</td><td><code>*</code></td><td><p>Additional arguments required by the post-processing logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PostProcessor+_call" class="group"></a> | |
| ### `postProcessor._call(tokens, ...args)` ⇒ <code>PostProcessedOutput</code> | |
| Alias for [PostProcessor#post_process](#module_tokenizers..PostProcessor+post_process). | |
| **Kind**: instance method of [<code>PostProcessor</code>](#module_tokenizers..PostProcessor) | |
| **Overrides**: [<code>_call</code>](#Callable+_call) | |
| **Returns**: <code>PostProcessedOutput</code> - The post-processed tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array</code></td><td><p>The input tokens to be post-processed.</p> | |
| </td> | |
| </tr><tr> | |
| <td>...args</td><td><code>*</code></td><td><p>Additional arguments required by the post-processing logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PostProcessor.fromConfig" class="group"></a> | |
| ### `PostProcessor.fromConfig(config)` ⇒ <code>PostProcessor</code> | |
| Factory method to create a PostProcessor object from a configuration object. | |
| **Kind**: static method of [<code>PostProcessor</code>](#module_tokenizers..PostProcessor) | |
| **Returns**: <code>PostProcessor</code> - A PostProcessor object created from the given configuration. | |
| **Throws**: | |
| - <code>Error</code> If an unknown PostProcessor type is encountered. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>Configuration object representing a PostProcessor.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BertProcessing" class="group"></a> | |
| ## tokenizers~BertProcessing | |
| A post-processor that adds special tokens to the beginning and end of the input. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [~BertProcessing](#module_tokenizers..BertProcessing) | |
| * [`new BertProcessing(config)`](#new_module_tokenizers..BertProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..BertProcessing+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * * * | |
| <a id="new_module_tokenizers..BertProcessing_new" class="group"></a> | |
| ### `new BertProcessing(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration for the post-processor.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.cls</td><td><code>Array.<string></code></td><td><p>The special tokens to add to the beginning of the input.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.sep</td><td><code>Array.<string></code></td><td><p>The special tokens to add to the end of the input.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BertProcessing+post_process" class="group"></a> | |
| ### `bertProcessing.post_process(tokens, [tokens_pair])` ⇒ <code>PostProcessedOutput</code> | |
| Adds the special tokens to the beginning and end of the input. | |
| **Kind**: instance method of [<code>BertProcessing</code>](#module_tokenizers..BertProcessing) | |
| **Returns**: <code>PostProcessedOutput</code> - The post-processed tokens with the special tokens added to the beginning and end. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td></td><td><p>The input tokens.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[tokens_pair]</td><td><code>Array.<string></code></td><td><code></code></td><td><p>An optional second set of input tokens.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
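The BERT layout for one or two sequences can be sketched as follows. Several details are assumptions made for illustration: the first element of `config.cls` and `config.sep` is taken to be the token string, and the returned object is given a `token_type_ids` field (0 for the first segment, 1 for the second) alongside `tokens`.

```javascript
function bertPostProcessSketch(config, tokens, tokens_pair = null) {
  const cls = config.cls[0]; // assumption: element 0 holds the token string
  const sep = config.sep[0];

  let out = [cls, ...tokens, sep];
  let token_type_ids = new Array(out.length).fill(0);

  if (tokens_pair !== null) {
    const second = [...tokens_pair, sep];
    out = [...out, ...second];
    token_type_ids = [...token_type_ids, ...new Array(second.length).fill(1)];
  }
  return { tokens: out, token_type_ids };
}

const bertConfig = { cls: ['[CLS]'], sep: ['[SEP]'] };
console.log(bertPostProcessSketch(bertConfig, ['hello'], ['world']));
// { tokens: [ '[CLS]', 'hello', '[SEP]', 'world', '[SEP]' ], token_type_ids: [ 0, 0, 0, 1, 1 ] }
```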
| * * * | |
| <a id="module_tokenizers..TemplateProcessing" class="group"></a> | |
| ## tokenizers~TemplateProcessing ⇐ <code>PostProcessor</code> | |
| Post processor that replaces special tokens in a template with actual tokens. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PostProcessor</code> | |
| * [~TemplateProcessing](#module_tokenizers..TemplateProcessing) ⇐ <code>PostProcessor</code> | |
| * [`new TemplateProcessing(config)`](#new_module_tokenizers..TemplateProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..TemplateProcessing+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * * * | |
| <a id="new_module_tokenizers..TemplateProcessing_new" class="group"></a> | |
| ### `new TemplateProcessing(config)` | |
| Creates a new instance of `TemplateProcessing`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration options for the post processor.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.single</td><td><code>Array</code></td><td><p>The template for a single sequence of tokens.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.pair</td><td><code>Array</code></td><td><p>The template for a pair of sequences of tokens.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..TemplateProcessing+post_process" class="group"></a> | |
| ### `templateProcessing.post_process(tokens, [tokens_pair])` ⇒ <code>PostProcessedOutput</code> | |
| Replaces special tokens in the template with actual tokens. | |
| **Kind**: instance method of [<code>TemplateProcessing</code>](#module_tokenizers..TemplateProcessing) | |
| **Returns**: <code>PostProcessedOutput</code> - An object containing the list of tokens with the special tokens replaced with actual tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td></td><td><p>The list of tokens for the first sequence.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[tokens_pair]</td><td><code>Array.<string></code></td><td><code></code></td><td><p>The list of tokens for the second sequence (optional).</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
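Template substitution can be sketched as a walk over the template items, emitting either a special token or the contents of a sequence. The item shapes used here (`{ SpecialToken: { id } }` and `{ Sequence: { id: 'A' | 'B' } }`) are assumptions about the configuration format, and the function name is hypothetical.

```javascript
function templatePostProcessSketch(config, tokens, tokens_pair = null) {
  // Pick the single-sequence or pair template depending on the input.
  const template = tokens_pair === null ? config.single : config.pair;

  const out = [];
  for (const item of template) {
    if ('SpecialToken' in item) {
      out.push(item.SpecialToken.id);
    } else if ('Sequence' in item) {
      if (item.Sequence.id === 'A') out.push(...tokens);
      else if (item.Sequence.id === 'B') out.push(...tokens_pair);
    }
  }
  return { tokens: out };
}

const templateConfig = {
  single: [{ SpecialToken: { id: '[CLS]' } }, { Sequence: { id: 'A' } }, { SpecialToken: { id: '[SEP]' } }],
  pair: [
    { SpecialToken: { id: '[CLS]' } }, { Sequence: { id: 'A' } }, { SpecialToken: { id: '[SEP]' } },
    { Sequence: { id: 'B' } }, { SpecialToken: { id: '[SEP]' } },
  ],
};

console.log(templatePostProcessSketch(templateConfig, ['a', 'b']).tokens);
// [ '[CLS]', 'a', 'b', '[SEP]' ]
```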
| * * * | |
| <a id="module_tokenizers..ByteLevelPostProcessor" class="group"></a> | |
| ## tokenizers~ByteLevelPostProcessor ⇐ <code>PostProcessor</code> | |
| A PostProcessor that returns the given tokens as is. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PostProcessor</code> | |
| * * * | |
| <a id="module_tokenizers..ByteLevelPostProcessor+post_process" class="group"></a> | |
| ### `byteLevelPostProcessor.post_process(tokens, [tokens_pair])` ⇒ <code>PostProcessedOutput</code> | |
| Post process the given tokens. | |
| **Kind**: instance method of [<code>ByteLevelPostProcessor</code>](#module_tokenizers..ByteLevelPostProcessor) | |
| **Returns**: <code>PostProcessedOutput</code> - An object containing the post-processed tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td></td><td><p>The list of tokens for the first sequence.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[tokens_pair]</td><td><code>Array.<string></code></td><td><code></code></td><td><p>The list of tokens for the second sequence (optional).</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PostProcessorSequence" class="group"></a> | |
| ## tokenizers~PostProcessorSequence | |
| A post-processor that applies multiple post-processors in sequence. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [~PostProcessorSequence](#module_tokenizers..PostProcessorSequence) | |
| * [`new PostProcessorSequence(config)`](#new_module_tokenizers..PostProcessorSequence_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..PostProcessorSequence+post_process) ⇒ <code>PostProcessedOutput</code> | |
| * * * | |
| <a id="new_module_tokenizers..PostProcessorSequence_new" class="group"></a> | |
| ### `new PostProcessorSequence(config)` | |
| Creates a new instance of PostProcessorSequence. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.processors</td><td><code>Array.<Object></code></td><td><p>The list of post-processors to apply.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PostProcessorSequence+post_process" class="group"></a> | |
| ### `postProcessorSequence.post_process(tokens, [tokens_pair])` ⇒ <code>PostProcessedOutput</code> | |
| Post process the given tokens. | |
| **Kind**: instance method of [<code>PostProcessorSequence</code>](#module_tokenizers..PostProcessorSequence) | |
| **Returns**: <code>PostProcessedOutput</code> - An object containing the post-processed tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td></td><td><p>The list of tokens for the first sequence.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[tokens_pair]</td><td><code>Array.<string></code></td><td><code></code></td><td><p>The list of tokens for the second sequence (optional).</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Decoder" class="group"></a> | |
| ## tokenizers~Decoder ⇐ [<code>Callable</code>](#Callable) | |
| The base class for token decoders. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: [<code>Callable</code>](#Callable) | |
| * [~Decoder](#module_tokenizers..Decoder) ⇐ [<code>Callable</code>](#Callable) | |
| * [`new Decoder(config)`](#new_module_tokenizers..Decoder_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers..Decoder+added_tokens) : <code>Array.<AddedToken></code> | |
| * [`._call(tokens)`](#module_tokenizers..Decoder+_call) ⇒ <code>string</code> | |
| * [`.decode(tokens)`](#module_tokenizers..Decoder+decode) ⇒ <code>string</code> | |
| * [`.decode_chain(tokens)`](#module_tokenizers..Decoder+decode_chain) ⇒ <code>Array.<string></code> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..Decoder.fromConfig) ⇒ <code>Decoder</code> | |
| * * * | |
| <a id="new_module_tokenizers..Decoder_new" class="group"></a> | |
| ### `new Decoder(config)` | |
| Creates an instance of `Decoder`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Decoder+added_tokens" class="group"></a> | |
| ### `decoder.added_tokens` : <code>Array.<AddedToken></code> | |
| **Kind**: instance property of [<code>Decoder</code>](#module_tokenizers..Decoder) | |
| * * * | |
| <a id="module_tokenizers..Decoder+_call" class="group"></a> | |
| ### `decoder._call(tokens)` ⇒ <code>string</code> | |
| Calls the `decode` method. | |
| **Kind**: instance method of [<code>Decoder</code>](#module_tokenizers..Decoder) | |
| **Overrides**: [<code>_call</code>](#Callable+_call) | |
| **Returns**: <code>string</code> - The decoded string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The list of tokens.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Decoder+decode" class="group"></a> | |
| ### `decoder.decode(tokens)` ⇒ <code>string</code> | |
| Decodes a list of tokens. | |
| **Kind**: instance method of [<code>Decoder</code>](#module_tokenizers..Decoder) | |
| **Returns**: <code>string</code> - The decoded string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The list of tokens.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Decoder+decode_chain" class="group"></a> | |
| ### `decoder.decode_chain(tokens)` ⇒ <code>Array.<string></code> | |
| Apply the decoder to a list of tokens. | |
| **Kind**: instance method of [<code>Decoder</code>](#module_tokenizers..Decoder) | |
| **Returns**: <code>Array.<string></code> - The decoded list of tokens. | |
| **Throws**: | |
| - <code>Error</code> If the `decode_chain` method is not implemented in the subclass. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>The list of tokens.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
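The relationship between `decode` and `decode_chain` can be sketched as follows. This is a minimal illustration, not the library's source: `SketchDecoder` and `UppercaseDecoder` are hypothetical names, and joining the chain output with an empty string is an assumption.

```javascript
// Hypothetical base class mirroring the documented contract.
class SketchDecoder {
  // Assumption: `decode` joins whatever `decode_chain` returns.
  decode(tokens) {
    return this.decode_chain(tokens).join('');
  }

  // Subclasses must override this, as the Throws note describes.
  decode_chain(tokens) {
    throw new Error('`decode_chain` should be implemented in subclass.');
  }
}

// Hypothetical subclass used only to exercise the base class.
class UppercaseDecoder extends SketchDecoder {
  decode_chain(tokens) {
    return tokens.map((token) => token.toUpperCase());
  }
}

const decoded = new UppercaseDecoder().decode(['ab', 'cd']);
console.log(decoded); // 'ABCD'
```

Calling `decode` on the base class directly throws, because `decode_chain` has no default implementation.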
| * * * | |
| <a id="module_tokenizers..Decoder.fromConfig" class="group"></a> | |
| ### `Decoder.fromConfig(config)` ⇒ <code>Decoder</code> | |
| Creates a decoder instance based on the provided configuration. | |
| **Kind**: static method of [<code>Decoder</code>](#module_tokenizers..Decoder) | |
| **Returns**: <code>Decoder</code> - A decoder instance. | |
| **Throws**: | |
| - <code>Error</code> If an unknown decoder type is provided. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..FuseDecoder" class="group"></a> | |
| ## tokenizers~FuseDecoder | |
| Fuse simply fuses all tokens into one big string. | |
| It's usually the last decoding step anyway, but this decoder | |
| exists in case some decoders need to run after that step. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * * * | |
| <a id="module_tokenizers..FuseDecoder+decode_chain" class="group"></a> | |
| ### `fuseDecoder.decode_chain()` : <code>*</code> | |
| **Kind**: instance method of [<code>FuseDecoder</code>](#module_tokenizers..FuseDecoder) | |
| * * * | |
| <a id="module_tokenizers..WordPieceDecoder" class="group"></a> | |
| ## tokenizers~WordPieceDecoder ⇐ <code>Decoder</code> | |
| A decoder that decodes a list of WordPiece tokens into a single string. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Decoder</code> | |
| * [~WordPieceDecoder](#module_tokenizers..WordPieceDecoder) ⇐ <code>Decoder</code> | |
| * [`new WordPieceDecoder(config)`](#new_module_tokenizers..WordPieceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..WordPieceDecoder+decode_chain) : <code>*</code> | |
| * * * | |
| <a id="new_module_tokenizers..WordPieceDecoder_new" class="group"></a> | |
| ### `new WordPieceDecoder(config)` | |
| Creates a new instance of WordPieceDecoder. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.prefix</td><td><code>string</code></td><td><p>The prefix used for WordPiece encoding.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.cleanup</td><td><code>boolean</code></td><td><p>Whether to cleanup the decoded string.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
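How a continuation prefix drives WordPiece decoding can be sketched as below. This is a simplified illustration under stated assumptions: `wordPieceDecodeSketch` is a hypothetical helper, the default prefix `##` is assumed (it is the prefix commonly used by BERT vocabularies), and the `cleanup` step is omitted.

```javascript
// Hypothetical helper: joins WordPiece tokens into a string.
// Tokens that start with the prefix continue the previous word;
// every other token (except the first) starts a new word.
function wordPieceDecodeSketch(tokens, prefix = '##') {
  return tokens
    .map((token, i) => {
      if (i === 0) return token;
      return token.startsWith(prefix)
        ? token.slice(prefix.length) // continuation piece: drop the prefix
        : ' ' + token; // new word: separate with a space
    })
    .join('');
}

const sentence = wordPieceDecodeSketch(['trans', '##form', '##ers', 'rock']);
console.log(sentence); // 'transformers rock'
```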
| * * * | |
| <a id="module_tokenizers..WordPieceDecoder+decode_chain" class="group"></a> | |
| ### `wordPieceDecoder.decode_chain()` : <code>*</code> | |
| **Kind**: instance method of [<code>WordPieceDecoder</code>](#module_tokenizers..WordPieceDecoder) | |
| * * * | |
| <a id="module_tokenizers..ByteLevelDecoder" class="group"></a> | |
| ## tokenizers~ByteLevelDecoder ⇐ <code>Decoder</code> | |
| Byte-level decoder for tokenization output. Inherits from the `Decoder` class. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Decoder</code> | |
| * [~ByteLevelDecoder](#module_tokenizers..ByteLevelDecoder) ⇐ <code>Decoder</code> | |
| * [`new ByteLevelDecoder(config)`](#new_module_tokenizers..ByteLevelDecoder_new) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..ByteLevelDecoder+convert_tokens_to_string) ⇒ <code>string</code> | |
| * [`.decode_chain()`](#module_tokenizers..ByteLevelDecoder+decode_chain) : <code>*</code> | |
| * * * | |
| <a id="new_module_tokenizers..ByteLevelDecoder_new" class="group"></a> | |
| ### `new ByteLevelDecoder(config)` | |
| Create a `ByteLevelDecoder` object. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>Configuration object.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..ByteLevelDecoder+convert_tokens_to_string" class="group"></a> | |
| ### `byteLevelDecoder.convert_tokens_to_string(tokens)` ⇒ <code>string</code> | |
| Convert an array of tokens to string by decoding each byte. | |
| **Kind**: instance method of [<code>ByteLevelDecoder</code>](#module_tokenizers..ByteLevelDecoder) | |
| **Returns**: <code>string</code> - The decoded string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>Array of tokens to be decoded.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
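Decoding each byte can be illustrated with the sketch below. It assumes the GPT-2 style byte-to-unicode table (printable bytes map to themselves, the rest are shifted to code points from 256 upward); `buildUnicodeToBytes` and `byteLevelDecodeSketch` are hypothetical names, not library functions.

```javascript
// Build the inverse table: unicode character -> original byte value.
function buildUnicodeToBytes() {
  const bytes = [];
  for (let b = 33; b <= 126; ++b) bytes.push(b); // '!' .. '~'
  for (let b = 161; b <= 172; ++b) bytes.push(b);
  for (let b = 174; b <= 255; ++b) bytes.push(b);
  const codePoints = bytes.slice();
  let extra = 0;
  for (let b = 0; b < 256; ++b) {
    if (!bytes.includes(b)) {
      bytes.push(b);
      codePoints.push(256 + extra); // remap whitespace/control bytes
      extra += 1;
    }
  }
  const table = new Map();
  bytes.forEach((b, i) => table.set(String.fromCharCode(codePoints[i]), b));
  return table;
}

// Map every character back to its byte, then decode the bytes as UTF-8.
function byteLevelDecodeSketch(tokens) {
  const table = buildUnicodeToBytes();
  const chars = [...tokens.join('')];
  const byteArray = new Uint8Array(chars.map((ch) => table.get(ch)));
  return new TextDecoder('utf-8').decode(byteArray);
}

// '\u0120' is the character that stands in for the space byte (0x20).
console.log(byteLevelDecodeSketch(['Hello', '\u0120world'])); // 'Hello world'
```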
| * * * | |
| <a id="module_tokenizers..ByteLevelDecoder+decode_chain" class="group"></a> | |
| ### `byteLevelDecoder.decode_chain()` : <code>*</code> | |
| **Kind**: instance method of [<code>ByteLevelDecoder</code>](#module_tokenizers..ByteLevelDecoder) | |
| * * * | |
| <a id="module_tokenizers..CTCDecoder" class="group"></a> | |
| ## tokenizers~CTCDecoder | |
| The CTC (Connectionist Temporal Classification) decoder. | |
| See https://github.com/huggingface/tokenizers/blob/bb38f390a61883fc2f29d659af696f428d1cda6b/tokenizers/src/decoders/ctc.rs | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [~CTCDecoder](#module_tokenizers..CTCDecoder) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..CTCDecoder+convert_tokens_to_string) ⇒ <code>string</code> | |
| * [`.decode_chain()`](#module_tokenizers..CTCDecoder+decode_chain) : <code>*</code> | |
| * * * | |
| <a id="module_tokenizers..CTCDecoder+convert_tokens_to_string" class="group"></a> | |
| ### `ctcDecoder.convert_tokens_to_string(tokens)` ⇒ <code>string</code> | |
| Converts connectionist temporal classification (CTC) output tokens into a single string. | |
| **Kind**: instance method of [<code>CTCDecoder</code>](#module_tokenizers..CTCDecoder) | |
| **Returns**: <code>string</code> - The decoded string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>Array of tokens to be decoded.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
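The usual CTC post-processing steps (collapse consecutive repeats, drop padding tokens, turn word delimiters into spaces) can be sketched as follows. `ctcDecodeSketch` is a hypothetical helper, and the `<pad>` and `|` defaults are assumptions chosen for illustration.

```javascript
// Hypothetical helper illustrating CTC-style decoding of character tokens.
function ctcDecodeSketch(tokens, padToken = '<pad>', wordDelimiter = '|') {
  if (tokens.length === 0) return '';

  // 1. Collapse runs of identical consecutive tokens.
  const grouped = [tokens[0]];
  for (let i = 1; i < tokens.length; ++i) {
    if (tokens[i] !== grouped[grouped.length - 1]) grouped.push(tokens[i]);
  }

  // 2. Remove padding (blank) tokens.
  const filtered = grouped.filter((token) => token !== padToken);

  // 3. Join and replace the word delimiter with a space.
  return filtered.join('').split(wordDelimiter).join(' ').trim();
}

const transcript = ctcDecodeSketch(
  ['h', 'h', '<pad>', 'e', 'l', 'l', '<pad>', 'l', 'o', '|', '|', 'h', 'i']
);
console.log(transcript); // 'hello hi'
```

The padding token between the two runs of `l` is what allows a genuine double letter to survive the collapsing step.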
| * * * | |
| <a id="module_tokenizers..CTCDecoder+decode_chain" class="group"></a> | |
| ### `ctcDecoder.decode_chain()` : <code>*</code> | |
| **Kind**: instance method of [<code>CTCDecoder</code>](#module_tokenizers..CTCDecoder) | |
| * * * | |
| <a id="module_tokenizers..DecoderSequence" class="group"></a> | |
| ## tokenizers~DecoderSequence ⇐ <code>Decoder</code> | |
| Apply a sequence of decoders. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Decoder</code> | |
| * [~DecoderSequence](#module_tokenizers..DecoderSequence) ⇐ <code>Decoder</code> | |
| * [`new DecoderSequence(config)`](#new_module_tokenizers..DecoderSequence_new) | |
| * [`.decode_chain()`](#module_tokenizers..DecoderSequence+decode_chain) : <code>*</code> | |
| * * * | |
| <a id="new_module_tokenizers..DecoderSequence_new" class="group"></a> | |
| ### `new DecoderSequence(config)` | |
| Creates a new instance of DecoderSequence. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.decoders</td><td><code>Array.<Object></code></td><td><p>The list of decoders to apply.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
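Applying a sequence of decoders amounts to feeding the output of one `decode_chain` into the next. The sketch below assumes that behaviour; `replaceMetaspace`, `fuse`, and `decodeSequenceSketch` are hypothetical stand-ins rather than library objects.

```javascript
// Two hypothetical decoders exposing only `decode_chain`.
const replaceMetaspace = {
  // Turn the metaspace character (U+2581) back into a space.
  decode_chain: (tokens) => tokens.map((t) => t.replace(/\u2581/g, ' ')),
};
const fuse = {
  // Fuse all tokens into a single string.
  decode_chain: (tokens) => [tokens.join('')],
};

// Run each decoder over the result of the previous one.
function decodeSequenceSketch(decoders, tokens) {
  return decoders.reduce((current, decoder) => decoder.decode_chain(current), tokens);
}

const fused = decodeSequenceSketch(
  [replaceMetaspace, fuse],
  ['\u2581Hello', '\u2581world']
);
console.log(fused); // [ ' Hello world' ]
```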
| * * * | |
| <a id="module_tokenizers..DecoderSequence+decode_chain" class="group"></a> | |
| ### `decoderSequence.decode_chain()` : <code>*</code> | |
| **Kind**: instance method of [<code>DecoderSequence</code>](#module_tokenizers..DecoderSequence) | |
| * * * | |
| <a id="module_tokenizers..MetaspacePreTokenizer" class="group"></a> | |
| ## tokenizers~MetaspacePreTokenizer ⇐ <code>PreTokenizer</code> | |
| This PreTokenizer replaces spaces with the given replacement character, adds a prefix space if requested, | |
| and returns a list of tokens. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~MetaspacePreTokenizer](#module_tokenizers..MetaspacePreTokenizer) ⇐ <code>PreTokenizer</code> | |
| * [`new MetaspacePreTokenizer(config)`](#new_module_tokenizers..MetaspacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..MetaspacePreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..MetaspacePreTokenizer_new" class="group"></a> | |
| ### `new MetaspacePreTokenizer(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td></td><td><p>The configuration object for the MetaspacePreTokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.replacement</td><td><code>string</code></td><td></td><td><p>The character to replace spaces with.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.str_rep]</td><td><code>string</code></td><td><code>config.replacement</code></td><td><p>An optional string representation of the replacement character. Defaults to the value of <code>config.replacement</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[config.prepend_scheme]</td><td><code>'first'</code> | <code>'never'</code> | <code>'always'</code></td><td><code>'always'</code></td><td><p>The metaspace prepending scheme.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..MetaspacePreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `metaspacePreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| This method takes a string, replaces spaces with the replacement character, | |
| adds a prefix space if requested, and returns a new list of tokens. | |
| **Kind**: instance method of [<code>MetaspacePreTokenizer</code>](#module_tokenizers..MetaspacePreTokenizer) | |
| **Returns**: <code>Array.<string></code> - A new list of pre-tokenized tokens. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to pre-tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>The options for the pre-tokenization.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.section_index]</td><td><code>number</code></td><td><p>The index of the section to pre-tokenize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
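The effect of the replacement character and the prepend scheme can be sketched as below. This is an illustration under assumptions, not the library's implementation: `metaspacePreTokenizeSketch` is a hypothetical helper, and the default replacement `U+2581` is assumed because it is the character conventionally used by SentencePiece-style tokenizers.

```javascript
// Hypothetical helper mirroring the documented behaviour.
function metaspacePreTokenizeSketch(
  text,
  { replacement = '\u2581', prependScheme = 'always', sectionIndex = undefined } = {}
) {
  // Replace every space with the replacement character.
  let normalized = text.split(' ').join(replacement);

  // Decide whether a prefix is requested for this section.
  const shouldPrepend =
    prependScheme === 'always' || (prependScheme === 'first' && sectionIndex === 0);

  if (shouldPrepend && !normalized.startsWith(replacement)) {
    normalized = replacement + normalized;
  }
  return [normalized];
}

console.log(metaspacePreTokenizeSketch('Hello world'));
// [ '\u2581Hello\u2581world' ]
console.log(metaspacePreTokenizeSketch('Hello world', { prependScheme: 'never' }));
// [ 'Hello\u2581world' ]
```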
| * * * | |
| <a id="module_tokenizers..MetaspaceDecoder" class="group"></a> | |
| ## tokenizers~MetaspaceDecoder ⇐ <code>Decoder</code> | |
| MetaspaceDecoder class extends the Decoder class and decodes Metaspace tokenization. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Decoder</code> | |
| * [~MetaspaceDecoder](#module_tokenizers..MetaspaceDecoder) ⇐ <code>Decoder</code> | |
| * [`new MetaspaceDecoder(config)`](#new_module_tokenizers..MetaspaceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..MetaspaceDecoder+decode_chain) : <code>*</code> | |
| * * * | |
| <a id="new_module_tokenizers..MetaspaceDecoder_new" class="group"></a> | |
| ### `new MetaspaceDecoder(config)` | |
| Constructs a new MetaspaceDecoder object. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the MetaspaceDecoder.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.replacement</td><td><code>string</code></td><td><p>The string that stands in for spaces in the tokens; decoding turns it back into spaces.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..MetaspaceDecoder+decode_chain" class="group"></a> | |
| ### `metaspaceDecoder.decode_chain()` : <code>*</code> | |
| **Kind**: instance method of [<code>MetaspaceDecoder</code>](#module_tokenizers..MetaspaceDecoder) | |
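Metaspace decoding reverses the pre-tokenizer's substitution. The sketch below is a hypothetical illustration (`metaspaceDecodeSketch` is not a library function); dropping the leading space of the first token is an assumption about how the prefix is undone.

```javascript
// Hypothetical helper: map the replacement character back to spaces.
function metaspaceDecodeSketch(tokens, replacement = '\u2581') {
  return tokens.map((token, i) => {
    let normalized = token.split(replacement).join(' ');
    // Assumption: the space produced by the prefix on the first token is dropped.
    if (i === 0 && normalized.startsWith(' ')) {
      normalized = normalized.slice(1);
    }
    return normalized;
  });
}

const pieces = metaspaceDecodeSketch(['\u2581Hello', '\u2581world']);
console.log(pieces); // [ 'Hello', ' world' ]
console.log(pieces.join('')); // 'Hello world'
```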
| * * * | |
| <a id="module_tokenizers..Precompiled" class="group"></a> | |
| ## tokenizers~Precompiled ⇐ <code>Normalizer</code> | |
| A normalizer that applies a precompiled charsmap. | |
| This is useful for applying complex normalizations in C++ and exposing them to JavaScript. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>Normalizer</code> | |
| * [~Precompiled](#module_tokenizers..Precompiled) ⇐ <code>Normalizer</code> | |
| * [`new Precompiled(config)`](#new_module_tokenizers..Precompiled_new) | |
| * [`.normalize(text)`](#module_tokenizers..Precompiled+normalize) ⇒ <code>string</code> | |
| * * * | |
| <a id="new_module_tokenizers..Precompiled_new" class="group"></a> | |
| ### `new Precompiled(config)` | |
| Create a new instance of Precompiled normalizer. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the Precompiled normalizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.precompiled_charsmap</td><td><code>Object</code></td><td><p>The precompiled charsmap object.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..Precompiled+normalize" class="group"></a> | |
| ### `precompiled.normalize(text)` ⇒ <code>string</code> | |
| Normalizes the given text by applying the precompiled charsmap. | |
| **Kind**: instance method of [<code>Precompiled</code>](#module_tokenizers..Precompiled) | |
| **Returns**: <code>string</code> - The normalized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to normalize.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PreTokenizerSequence" class="group"></a> | |
| ## tokenizers~PreTokenizerSequence ⇐ <code>PreTokenizer</code> | |
| A pre-tokenizer that applies a sequence of pre-tokenizers to the input text. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~PreTokenizerSequence](#module_tokenizers..PreTokenizerSequence) ⇐ <code>PreTokenizer</code> | |
| * [`new PreTokenizerSequence(config)`](#new_module_tokenizers..PreTokenizerSequence_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizerSequence+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..PreTokenizerSequence_new" class="group"></a> | |
| ### `new PreTokenizerSequence(config)` | |
| Creates an instance of PreTokenizerSequence. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the pre-tokenizer sequence.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.pretokenizers</td><td><code>Array.<Object></code></td><td><p>An array of pre-tokenizer configurations.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..PreTokenizerSequence+pre_tokenize_text" class="group"></a> | |
| ### `preTokenizerSequence.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Applies each pre-tokenizer in the sequence to the input text in turn. | |
| **Kind**: instance method of [<code>PreTokenizerSequence</code>](#module_tokenizers..PreTokenizerSequence) | |
| **Returns**: <code>Array.<string></code> - The pre-tokenized text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to pre-tokenize.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
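Applying pre-tokenizers "in turn" means each one re-splits the pieces produced by the previous one. A minimal sketch, assuming each pre-tokenizer can be modelled as a function from a string to an array of strings (`splitOnWhitespace`, `splitPunctuation`, and `preTokenizeSequenceSketch` are hypothetical names):

```javascript
// Two hypothetical pre-tokenizers modelled as plain functions.
const splitOnWhitespace = (text) => text.match(/\S+/g) || [];
const splitPunctuation = (text) => text.match(/\w+|[^\w\s]+/g) || [];

// Start from the whole text and let every stage split each current piece.
function preTokenizeSequenceSketch(preTokenizers, text) {
  return preTokenizers.reduce(
    (current, preTokenize) => current.flatMap((piece) => preTokenize(piece)),
    [text]
  );
}

const result = preTokenizeSequenceSketch(
  [splitOnWhitespace, splitPunctuation],
  'Hello, world!'
);
console.log(result); // [ 'Hello', ',', 'world', '!' ]
```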
| * * * | |
| <a id="module_tokenizers..WhitespacePreTokenizer" class="group"></a> | |
| ## tokenizers~WhitespacePreTokenizer | |
| Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`). | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [~WhitespacePreTokenizer](#module_tokenizers..WhitespacePreTokenizer) | |
| * [`new WhitespacePreTokenizer(config)`](#new_module_tokenizers..WhitespacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespacePreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..WhitespacePreTokenizer_new" class="group"></a> | |
| ### `new WhitespacePreTokenizer(config)` | |
| Creates an instance of WhitespacePreTokenizer. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the pre-tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..WhitespacePreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `whitespacePreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Pre-tokenizes the input text by splitting it on word boundaries. | |
| **Kind**: instance method of [<code>WhitespacePreTokenizer</code>](#module_tokenizers..WhitespacePreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens produced by splitting the input text on word boundaries. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to be pre-tokenized.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
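The regular expression quoted in the class description can be tried directly with `String.prototype.match`. The helper name `whitespacePreTokenizeSketch` is hypothetical; only the pattern comes from the description above.

```javascript
// Runs of word characters, or runs of characters that are neither
// word characters nor whitespace, each become one token.
function whitespacePreTokenizeSketch(text) {
  return text.match(/\w+|[^\w\s]+/g) || [];
}

console.log(whitespacePreTokenizeSketch('Hello, world!'));
// [ 'Hello', ',', 'world', '!' ]
```

Note that punctuation is separated from the adjacent word, which is the difference from a plain whitespace split.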
| * * * | |
| <a id="module_tokenizers..WhitespaceSplit" class="group"></a> | |
| ## tokenizers~WhitespaceSplit ⇐ <code>PreTokenizer</code> | |
| Splits a string of text by whitespace characters into individual tokens. | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| **Extends**: <code>PreTokenizer</code> | |
| * [~WhitespaceSplit](#module_tokenizers..WhitespaceSplit) ⇐ <code>PreTokenizer</code> | |
| * [`new WhitespaceSplit(config)`](#new_module_tokenizers..WhitespaceSplit_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespaceSplit+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..WhitespaceSplit_new" class="group"></a> | |
| ### `new WhitespaceSplit(config)` | |
| Creates an instance of WhitespaceSplit. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration object for the pre-tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..WhitespaceSplit+pre_tokenize_text" class="group"></a> | |
| ### `whitespaceSplit.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Pre-tokenizes the input text by splitting it on whitespace characters. | |
| **Kind**: instance method of [<code>WhitespaceSplit</code>](#module_tokenizers..WhitespaceSplit) | |
| **Returns**: <code>Array.<string></code> - An array of tokens produced by splitting the input text on whitespace. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to be pre-tokenized.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
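Splitting on whitespace alone keeps punctuation attached to the neighbouring word. A minimal sketch, assuming runs of non-whitespace characters are the tokens (`whitespaceSplitSketch` is a hypothetical name):

```javascript
// Every maximal run of non-whitespace characters is one token.
function whitespaceSplitSketch(text) {
  return text.match(/\S+/g) || [];
}

console.log(whitespaceSplitSketch('  Hello,   world!\n'));
// [ 'Hello,', 'world!' ]
```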
| * * * | |
| <a id="module_tokenizers..ReplacePreTokenizer" class="group"></a> | |
| ## tokenizers~ReplacePreTokenizer | |
| **Kind**: inner class of [<code>tokenizers</code>](#module_tokenizers) | |
| * [~ReplacePreTokenizer](#module_tokenizers..ReplacePreTokenizer) | |
| * [`new ReplacePreTokenizer(config)`](#new_module_tokenizers..ReplacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ReplacePreTokenizer+pre_tokenize_text) ⇒ <code>Array.<string></code> | |
| * * * | |
| <a id="new_module_tokenizers..ReplacePreTokenizer_new" class="group"></a> | |
| ### `new ReplacePreTokenizer(config)` | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>config</td><td><code>Object</code></td><td><p>The configuration options for the pre-tokenizer.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.pattern</td><td><code>Object</code></td><td><p>The pattern to match in the text. Can be a string or a regex object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>config.content</td><td><code>string</code></td><td><p>What to replace the pattern with.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..ReplacePreTokenizer+pre_tokenize_text" class="group"></a> | |
| ### `replacePreTokenizer.pre_tokenize_text(text, [options])` ⇒ <code>Array.<string></code> | |
| Pre-tokenizes the input text by replacing certain characters. | |
| **Kind**: instance method of [<code>ReplacePreTokenizer</code>](#module_tokenizers..ReplacePreTokenizer) | |
| **Returns**: <code>Array.<string></code> - An array of tokens produced by replacing certain characters. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to be pre-tokenized.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options]</td><td><code>Object</code></td><td><p>Additional options for the pre-tokenization logic.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
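A replacement-based pre-tokenization step can be sketched as follows. `replacePreTokenizeSketch` is a hypothetical helper that takes an already compiled global `RegExp`; returning the rewritten text as a single-element array is an assumption made to match the documented `Array.<string>` return type.

```javascript
// Replace every match of `pattern` with `content`, keeping the text as one piece.
function replacePreTokenizeSketch(text, pattern, content) {
  return [text.replace(pattern, content)];
}

// Collapse runs of spaces into a single space.
console.log(replacePreTokenizeSketch('a  b   c', / +/g, ' '));
// [ 'a b c' ]
```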
| * * * | |
| <a id="module_tokenizers..BYTES_TO_UNICODE" class="group"></a> | |
| ## `tokenizers~BYTES_TO_UNICODE` ⇒ <code>Object</code> | |
| Returns a list of UTF-8 bytes and a mapping to Unicode strings. | |
| Specifically avoids mapping to whitespace/control characters, which the BPE code cannot handle. | |
| **Kind**: inner constant of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>Object</code> - Object with utf-8 byte keys and unicode string values. | |
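The mapping can be reproduced with the widely used GPT-2 construction, sketched below. Whether the library builds its table in exactly this way is an assumption, and `bytesToUnicodeSketch` is a hypothetical name.

```javascript
// Printable bytes keep their own code point; all other bytes
// (whitespace, control characters, ...) are shifted to 256 and above.
function bytesToUnicodeSketch() {
  const bytes = [];
  for (let b = 33; b <= 126; ++b) bytes.push(b); // '!' .. '~'
  for (let b = 161; b <= 172; ++b) bytes.push(b);
  for (let b = 174; b <= 255; ++b) bytes.push(b);

  const codePoints = bytes.slice();
  let extra = 0;
  for (let b = 0; b < 256; ++b) {
    if (!bytes.includes(b)) {
      bytes.push(b);
      codePoints.push(256 + extra);
      extra += 1;
    }
  }
  return Object.fromEntries(
    bytes.map((b, i) => [b, String.fromCharCode(codePoints[i])])
  );
}

const byteTable = bytesToUnicodeSketch();
console.log(byteTable[65]); // 'A' (printable bytes map to themselves)
console.log(byteTable[32] === '\u0120'); // true: the space byte is remapped
```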
| * * * | |
| <a id="module_tokenizers..loadTokenizer" class="group"></a> | |
| ## `tokenizers~loadTokenizer(pretrained_model_name_or_path, options)` ⇒ <code>Promise.<Array<any>></code> | |
| Loads a tokenizer from the specified path. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>Promise.<Array<any>></code> - A promise that resolves with information about the loaded tokenizer. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>pretrained_model_name_or_path</td><td><code>string</code></td><td><p>The path to the tokenizer directory.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>PretrainedTokenizerOptions</code></td><td><p>Additional options for loading the tokenizer.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..regexSplit" class="group"></a> | |
| ## `tokenizers~regexSplit(text, regex)` ⇒ <code>Array.<string></code> | |
| Helper function to split a string on a regex, but keep the delimiters. | |
| This is required, because the JavaScript `.split()` method does not keep the delimiters, | |
| and wrapping in a capturing group causes issues with existing capturing groups (due to nesting). | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>Array.<string></code> - The split string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to split.</p> | |
| </td> | |
| </tr><tr> | |
| <td>regex</td><td><code>RegExp</code></td><td><p>The regex to split on.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
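A delimiter-preserving split can be sketched with `String.prototype.matchAll`. `regexSplitSketch` is a hypothetical re-implementation for illustration; it assumes the regex carries the global flag, which `matchAll` requires.

```javascript
// Split `text` on `regex`, keeping the matched delimiters as separate items.
function regexSplitSketch(text, regex) {
  const result = [];
  let prev = 0;
  for (const match of text.matchAll(regex)) {
    const fullMatch = match[0];
    // Text between the previous delimiter and this one.
    if (prev < match.index) result.push(text.slice(prev, match.index));
    // The delimiter itself (skip empty matches).
    if (fullMatch.length > 0) result.push(fullMatch);
    prev = match.index + fullMatch.length;
  }
  // Trailing text after the last delimiter.
  if (prev < text.length) result.push(text.slice(prev));
  return result;
}

console.log(regexSplitSketch('a,b;c', /[,;]/g));
// [ 'a', ',', 'b', ';', 'c' ]
```

Because the delimiters are collected from the matches themselves, no extra capturing group has to be added to the caller's pattern.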
| * * * | |
| <a id="module_tokenizers..createPattern" class="group"></a> | |
| ## `tokenizers~createPattern(pattern, invert)` ⇒ <code>RegExp</code> | <code>null</code> | |
| Helper method to construct a pattern from a config object. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>RegExp</code> | <code>null</code> - The compiled pattern. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>pattern</td><td><code>Object</code></td><td></td><td><p>The pattern object.</p> | |
| </td> | |
| </tr><tr> | |
| <td>invert</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to invert the pattern.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..objectToMap" class="group"></a> | |
| ## `tokenizers~objectToMap(obj)` ⇒ <code>Map.<string, any></code> | |
| Helper function to convert an Object to a Map. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>Map.<string, any></code> - The map. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>obj</td><td><code>Object</code></td><td><p>The object to convert.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
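In plain JavaScript this conversion is typically a one-liner built on `Object.entries`. The sketch below uses a hypothetical name, `objectToMapSketch`, and is not taken from the library's source.

```javascript
// Build a Map from an object's own enumerable string-keyed properties.
function objectToMapSketch(obj) {
  return new Map(Object.entries(obj));
}

const vocab = objectToMapSketch({ hello: 0, world: 1 });
console.log(vocab.get('world')); // 1
console.log(vocab.size); // 2
```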
| * * * | |
| <a id="module_tokenizers..prepareTensorForDecode" class="group"></a> | |
| ## `tokenizers~prepareTensorForDecode(tensor)` ⇒ <code>Array.<number></code> | |
| Helper function to convert a tensor to a list before decoding. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>Array.<number></code> - The tensor as a list. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tensor</td><td><code><a href="#Tensor">Tensor</a></code></td><td><p>The tensor to convert.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..clean_up_tokenization" class="group"></a> | |
| ## `tokenizers~clean_up_tokenization(text)` ⇒ <code>string</code> | |
| Cleans up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>string</code> - The cleaned up text. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to clean up.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
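The kind of cleanup described here can be sketched with a few chained replacements. `cleanUpTokenizationSketch` is a hypothetical helper, and the rule list below is an illustrative subset chosen as an assumption, not the library's exact set.

```javascript
// Remove spaces that tokenization leaves before punctuation and contractions.
function cleanUpTokenizationSketch(text) {
  return text
    .replace(/ \./g, '.')
    .replace(/ \?/g, '?')
    .replace(/ !/g, '!')
    .replace(/ ,/g, ',')
    .replace(/ n't/g, "n't")
    .replace(/ 'm/g, "'m")
    .replace(/ 's/g, "'s")
    .replace(/ 've/g, "'ve")
    .replace(/ 're/g, "'re");
}

console.log(cleanUpTokenizationSketch("I do n't know , it 's fine ."));
// "I don't know, it's fine."
```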
| * * * | |
| <a id="module_tokenizers..remove_accents" class="group"></a> | |
| ## `tokenizers~remove_accents(text)` ⇒ <code>string</code> | |
| Helper function to remove accents from a string. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>string</code> - The text with accents removed. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to remove accents from.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..lowercase_and_remove_accent" class="group"></a> | |
| ## `tokenizers~lowercase_and_remove_accent(text)` ⇒ <code>string</code> | |
| Helper function to lowercase a string and remove accents. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>string</code> - The lowercased text with accents removed. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to lowercase and remove accents from.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
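
A common way to implement the two accent helpers above is to strip Unicode combining marks. The sketch below assumes the input should first be decomposed with NFD normalization so that precomposed characters such as `é` are handled; the library's helpers may expect already-normalized input, and the function names here are hypothetical.

```javascript
// Hypothetical sketch: decompose characters (NFD), then drop every
// combining mark (Unicode category M), which removes the accents.
function removeAccentsSketch(text) {
    return text.normalize('NFD').replace(/\p{M}/gu, '');
}

// Hypothetical sketch: lowercase first, then remove accents.
function lowercaseAndRemoveAccentSketch(text) {
    return removeAccentsSketch(text.toLowerCase());
}

console.log(removeAccentsSketch('Crème Brûlée'));            // Creme Brulee
console.log(lowercaseAndRemoveAccentSketch('Crème Brûlée')); // creme brulee
```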
| * * * | |
| <a id="module_tokenizers..whitespace_split" class="group"></a> | |
| ## `tokenizers~whitespace_split(text)` ⇒ <code>Array.<string></code> | |
| Split a string on whitespace. | |
| **Kind**: inner method of [<code>tokenizers</code>](#module_tokenizers) | |
| **Returns**: <code>Array.<string></code> - The split string. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>text</td><td><code>string</code></td><td><p>The text to split.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
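
Splitting on whitespace can be sketched by matching runs of non-whitespace characters, which also discards leading, trailing, and repeated whitespace. The name `whitespaceSplitSketch` is hypothetical, and returning an empty array for blank input is an assumption of this example.

```javascript
// Hypothetical sketch: collect every run of non-whitespace characters.
// String.prototype.match returns null when nothing matches, so fall
// back to an empty array for empty or all-whitespace input.
function whitespaceSplitSketch(text) {
    return text.match(/\S+/g) || [];
}

console.log(whitespaceSplitSketch('  I love\ttransformers!\n'));
// [ 'I', 'love', 'transformers!' ]
```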
| * * * | |
| <a id="module_tokenizers..PretrainedTokenizerOptions" class="group"></a> | |
| ## `tokenizers~PretrainedTokenizerOptions` : <code>Object</code> | |
| Additional tokenizer-specific properties. | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| **Properties** | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Name</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>[legacy]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>Whether or not the <code>legacy</code> behavior of the tokenizer should be used.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BPENode" class="group"></a> | |
| ## `tokenizers~BPENode` : <code>Object</code> | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| **Properties** | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Name</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>token</td><td><code>string</code></td><td><p>The token associated with the node.</p> | |
| </td> | |
| </tr><tr> | |
| <td>bias</td><td><code>number</code></td><td><p>A positional bias for the node.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[score]</td><td><code>number</code></td><td><p>The score of the node.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[prev]</td><td><code>BPENode</code></td><td><p>The previous node in the linked list.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[next]</td><td><code>BPENode</code></td><td><p>The next node in the linked list.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
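
To illustrate how nodes with these properties form a doubly linked list, the sketch below builds a list from symbols and merges one adjacent pair, as a BPE merge step would. The helper names and the choice of `bias = index / length` are assumptions made for this example, not a description of the library's internals.

```javascript
// Hypothetical helper: build a doubly linked list of BPENode-like objects.
function buildNodeListSketch(symbols) {
    let head = null;
    let prev = null;
    symbols.forEach((token, i) => {
        // Assumption: use the relative position as the positional bias.
        const node = { token, bias: i / symbols.length, prev, next: null };
        if (prev) prev.next = node; else head = node;
        prev = node;
    });
    return head;
}

// Hypothetical helper: merge `node` with its successor and relink neighbors.
function mergeWithNextSketch(node) {
    const next = node.next;
    if (!next) return node; // nothing to merge with
    const merged = {
        token: node.token + next.token,
        bias: node.bias,
        prev: node.prev,
        next: next.next,
    };
    if (merged.prev) merged.prev.next = merged;
    if (merged.next) merged.next.prev = merged;
    return merged;
}

// Hypothetical helper: walk the list and collect the tokens.
function listTokensSketch(head) {
    const tokens = [];
    for (let node = head; node; node = node.next) tokens.push(node.token);
    return tokens;
}

const head = buildNodeListSketch(['l', 'o', 'w']);
const mergedHead = mergeWithNextSketch(head);
console.log(listTokensSketch(mergedHead)); // [ 'lo', 'w' ]
```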
| * * * | |
| <a id="module_tokenizers..SplitDelimiterBehavior" class="group"></a> | |
| ## `tokenizers~SplitDelimiterBehavior` : <code>'removed'</code> | <code>'isolated'</code> | <code>'mergedWithPrevious'</code> | <code>'mergedWithNext'</code> | <code>'contiguous'</code> | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| * * * | |
| <a id="module_tokenizers..PostProcessedOutput" class="group"></a> | |
| ## `tokenizers~PostProcessedOutput` : <code>Object</code> | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| **Properties** | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Name</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>tokens</td><td><code>Array.<string></code></td><td><p>List of tokens produced by the post-processor.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[token_type_ids]</td><td><code>Array.<number></code></td><td><p>List of token type ids produced by the post-processor.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..EncodingSingle" class="group"></a> | |
| ## `tokenizers~EncodingSingle` : <code>Object</code> | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| **Properties** | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Name</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>input_ids</td><td><code>Array.<number></code></td><td><p>List of token ids to be fed to a model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>attention_mask</td><td><code>Array.<number></code></td><td><p>List of indices specifying which tokens should be attended to by the model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[token_type_ids]</td><td><code>Array.<number></code></td><td><p>List of token type ids to be fed to a model.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
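
As an illustration of how the three fields relate, the sketch below pads an `EncodingSingle`-shaped object to a fixed length: `attention_mask` is `1` for real tokens and `0` for padding, while `token_type_ids` stays aligned with `input_ids`. The helper `padEncodingSketch` and the default `padId = 0` are assumptions for this example; the token ids are taken from the `AutoTokenizer` example at the top of this page.

```javascript
// Hypothetical helper: right-pad an EncodingSingle-like object to `length`.
// Assumption: `padId` defaults to 0; real vocabularies may use another id.
function padEncodingSketch(encoding, length, padId = 0) {
    const missing = Math.max(0, length - encoding.input_ids.length);
    const fill = (value) => new Array(missing).fill(value);
    const padded = {
        input_ids: [...encoding.input_ids, ...fill(padId)],
        // Padding positions get 0 so the model does not attend to them.
        attention_mask: [...encoding.attention_mask, ...fill(0)],
    };
    if (encoding.token_type_ids) {
        padded.token_type_ids = [...encoding.token_type_ids, ...fill(0)];
    }
    return padded;
}

const encoding = {
    input_ids: [101, 1045, 2293, 19081, 999, 102],
    attention_mask: [1, 1, 1, 1, 1, 1],
    token_type_ids: [0, 0, 0, 0, 0, 0],
};
const padded = padEncodingSketch(encoding, 8);
console.log(padded.attention_mask); // [ 1, 1, 1, 1, 1, 1, 0, 0 ]
```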
| * * * | |
| <a id="module_tokenizers..Message" class="group"></a> | |
| ## `tokenizers~Message` : <code>Object</code> | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| **Properties** | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Name</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>role</td><td><code>string</code></td><td><p>The role of the message (e.g., "user" or "assistant" or "system").</p> | |
| </td> | |
| </tr><tr> | |
| <td>content</td><td><code>string</code></td><td><p>The content of the message.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_tokenizers..BatchEncoding" class="group"></a> | |
| ## `tokenizers~BatchEncoding` : <code>Array<number></code> | <code>Array<Array<number>></code> | [<code>Tensor</code>](#Tensor) | |
| Holds the output of the tokenizer's call function. | |
| **Kind**: inner typedef of [<code>tokenizers</code>](#module_tokenizers) | |
| **Properties** | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Name</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>input_ids</td><td><code>BatchEncodingItem</code></td><td><p>List of token ids to be fed to a model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>attention_mask</td><td><code>BatchEncodingItem</code></td><td><p>List of indices specifying which tokens should be attended to by the model.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[token_type_ids]</td><td><code>BatchEncodingItem</code></td><td><p>List of token type ids to be fed to a model.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |