| # tokenizers | |
| Tokenizers are used to prepare textual inputs for a model. | |
| **Example:** Create an `AutoTokenizer` and use it to tokenize a sentence. | |
| `AutoTokenizer` automatically selects the tokenizer class from the `tokenizer_class` field in `tokenizer_config.json`. | |
| ```javascript | |
| import { AutoTokenizer } from '@huggingface/transformers'; | |
| const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased'); | |
| const { input_ids } = await tokenizer('I love transformers!'); | |
| // Tensor { | |
| // data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n], | |
| // dims: [1, 6], | |
| // type: 'int64', | |
| // size: 6, | |
| // } | |
| ``` | |
| * [tokenizers](#module_tokenizers) | |
| * _static_ | |
| * [.TokenizerModel](#module_tokenizers.TokenizerModel) ⇐ [Callable](#Callable) | |
| * [`new TokenizerModel(config)`](#new_module_tokenizers.TokenizerModel_new) | |
| * _instance_ | |
| * [`.vocab`](#module_tokenizers.TokenizerModel+vocab) : Array.<string> | |
| * [`.tokens_to_ids`](#module_tokenizers.TokenizerModel+tokens_to_ids) : Map.<string, number> | |
| * [`.fuse_unk`](#module_tokenizers.TokenizerModel+fuse_unk) : boolean | |
| * [`._call(tokens)`](#module_tokenizers.TokenizerModel+_call) ⇒ Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers.TokenizerModel+encode) ⇒ Array.<string> | |
| * [`.convert_tokens_to_ids(tokens)`](#module_tokenizers.TokenizerModel+convert_tokens_to_ids) ⇒ Array.<number> | |
| * [`.convert_ids_to_tokens(ids)`](#module_tokenizers.TokenizerModel+convert_ids_to_tokens) ⇒ Array.<string> | |
| * _static_ | |
| * [`.fromConfig(config, ...args)`](#module_tokenizers.TokenizerModel.fromConfig) ⇒ TokenizerModel | |
| * [.PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * [`new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.PreTrainedTokenizer_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers.PreTrainedTokenizer+added_tokens) : Array.<AddedToken> | |
| * [`.added_tokens_map`](#module_tokenizers.PreTrainedTokenizer+added_tokens_map) : Map.<string, AddedToken> | |
| * [`.remove_space`](#module_tokenizers.PreTrainedTokenizer+remove_space) : boolean | |
| * [`._call(text, options)`](#module_tokenizers.PreTrainedTokenizer+_call) ⇒ BatchEncoding | |
| * [`._encode_text(text)`](#module_tokenizers.PreTrainedTokenizer+_encode_text) ⇒ Array<string> | null | |
| * [`._tokenize_helper(text, options)`](#module_tokenizers.PreTrainedTokenizer+_tokenize_helper) ⇒ * | |
| * [`.tokenize(text, options)`](#module_tokenizers.PreTrainedTokenizer+tokenize) ⇒ Array.<string> | |
| * [`.encode(text, options)`](#module_tokenizers.PreTrainedTokenizer+encode) ⇒ Array.<number> | |
| * [`.batch_decode(batch, decode_args)`](#module_tokenizers.PreTrainedTokenizer+batch_decode) ⇒ Array.<string> | |
| * [`.decode(token_ids, [decode_args])`](#module_tokenizers.PreTrainedTokenizer+decode) ⇒ string | |
| * [`.decode_single(token_ids, decode_args)`](#module_tokenizers.PreTrainedTokenizer+decode_single) ⇒ string | |
| * [`.get_chat_template(options)`](#module_tokenizers.PreTrainedTokenizer+get_chat_template) ⇒ string | |
| * [`.apply_chat_template(conversation, options)`](#module_tokenizers.PreTrainedTokenizer+apply_chat_template) ⇒ string | [Tensor](#Tensor) | Array<number> | Array<Array<number>> | BatchEncoding | |
| * _static_ | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.PreTrainedTokenizer.from_pretrained) ⇒ Promise.<PreTrainedTokenizer> | |
| * [.BertTokenizer](#module_tokenizers.BertTokenizer) ⇐ PreTrainedTokenizer | |
| * [.AlbertTokenizer](#module_tokenizers.AlbertTokenizer) ⇐ PreTrainedTokenizer | |
| * [.NllbTokenizer](#module_tokenizers.NllbTokenizer) | |
| * [`._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)`](#module_tokenizers.NllbTokenizer+_build_translation_inputs) ⇒ Object | |
| * [.M2M100Tokenizer](#module_tokenizers.M2M100Tokenizer) | |
| * [`._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)`](#module_tokenizers.M2M100Tokenizer+_build_translation_inputs) ⇒ Object | |
| * [.WhisperTokenizer](#module_tokenizers.WhisperTokenizer) ⇐ PreTrainedTokenizer | |
| * [`._decode_asr(sequences, options)`](#module_tokenizers.WhisperTokenizer+_decode_asr) ⇒ * | |
| * [`.decode()`](#module_tokenizers.WhisperTokenizer+decode) : * | |
| * [.MarianTokenizer](#module_tokenizers.MarianTokenizer) | |
| * [`new MarianTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.MarianTokenizer_new) | |
| * [`._encode_text(text)`](#module_tokenizers.MarianTokenizer+_encode_text) ⇒ Array | |
| * [.AutoTokenizer](#module_tokenizers.AutoTokenizer) | |
| * [`new AutoTokenizer()`](#new_module_tokenizers.AutoTokenizer_new) | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.AutoTokenizer.from_pretrained) ⇒ Promise.<PreTrainedTokenizer> | |
| * [`.is_chinese_char(cp)`](#module_tokenizers.is_chinese_char) ⇒ boolean | |
| * _inner_ | |
| * [~AddedToken](#module_tokenizers..AddedToken) | |
| * [`new AddedToken(config)`](#new_module_tokenizers..AddedToken_new) | |
| * [~WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) ⇐ TokenizerModel | |
| * [`new WordPieceTokenizer(config)`](#new_module_tokenizers..WordPieceTokenizer_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..WordPieceTokenizer+tokens_to_ids) : Map.<string, number> | |
| * [`.unk_token_id`](#module_tokenizers..WordPieceTokenizer+unk_token_id) : number | |
| * [`.unk_token`](#module_tokenizers..WordPieceTokenizer+unk_token) : string | |
| * [`.max_input_chars_per_word`](#module_tokenizers..WordPieceTokenizer+max_input_chars_per_word) : number | |
| * [`.vocab`](#module_tokenizers..WordPieceTokenizer+vocab) : Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers..WordPieceTokenizer+encode) ⇒ Array.<string> | |
| * [~Unigram](#module_tokenizers..Unigram) ⇐ TokenizerModel | |
| * [`new Unigram(config, moreConfig)`](#new_module_tokenizers..Unigram_new) | |
| * [`.scores`](#module_tokenizers..Unigram+scores) : Array.<number> | |
| * [`.populateNodes(lattice)`](#module_tokenizers..Unigram+populateNodes) | |
| * [`.tokenize(normalized)`](#module_tokenizers..Unigram+tokenize) ⇒ Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers..Unigram+encode) ⇒ Array.<string> | |
| * [~BPE](#module_tokenizers..BPE) ⇐ TokenizerModel | |
| * [`new BPE(config)`](#new_module_tokenizers..BPE_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..BPE+tokens_to_ids) : Map.<string, number> | |
| * [`.merges`](#module_tokenizers..BPE+merges) : * | |
| * [`.config.merges`](#module_tokenizers..BPE+merges.config.merges) : * | |
| * [`.max_length_to_cache`](#module_tokenizers..BPE+max_length_to_cache) | |
| * [`.cache_capacity`](#module_tokenizers..BPE+cache_capacity) | |
| * [`.clear_cache()`](#module_tokenizers..BPE+clear_cache) | |
| * [`.bpe(token)`](#module_tokenizers..BPE+bpe) ⇒ Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers..BPE+encode) ⇒ Array.<string> | |
| * [~LegacyTokenizerModel](#module_tokenizers..LegacyTokenizerModel) | |
| * [`new LegacyTokenizerModel(config, moreConfig)`](#new_module_tokenizers..LegacyTokenizerModel_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..LegacyTokenizerModel+tokens_to_ids) : Map.<string, number> | |
| * *[~Normalizer](#module_tokenizers..Normalizer)* | |
| * *[`new Normalizer(config)`](#new_module_tokenizers..Normalizer_new)* | |
| * _instance_ | |
| * **[`.normalize(text)`](#module_tokenizers..Normalizer+normalize) ⇒ string** | |
| * *[`._call(text)`](#module_tokenizers..Normalizer+_call) ⇒ string* | |
| * _static_ | |
| * *[`.fromConfig(config)`](#module_tokenizers..Normalizer.fromConfig) ⇒ Normalizer* | |
| * [~Replace](#module_tokenizers..Replace) ⇐ Normalizer | |
| * [`.normalize(text)`](#module_tokenizers..Replace+normalize) ⇒ string | |
| * *[~UnicodeNormalizer](#module_tokenizers..UnicodeNormalizer) ⇐ Normalizer* | |
| * *[`.form`](#module_tokenizers..UnicodeNormalizer+form) : string* | |
| * *[`.normalize(text)`](#module_tokenizers..UnicodeNormalizer+normalize) ⇒ string* | |
| * [~NFC](#module_tokenizers..NFC) ⇐ UnicodeNormalizer | |
| * [~NFD](#module_tokenizers..NFD) ⇐ UnicodeNormalizer | |
| * [~NFKC](#module_tokenizers..NFKC) ⇐ UnicodeNormalizer | |
| * [~NFKD](#module_tokenizers..NFKD) ⇐ UnicodeNormalizer | |
| * [~StripNormalizer](#module_tokenizers..StripNormalizer) | |
| * [`.normalize(text)`](#module_tokenizers..StripNormalizer+normalize) ⇒ string | |
| * [~StripAccents](#module_tokenizers..StripAccents) ⇐ Normalizer | |
| * [`.normalize(text)`](#module_tokenizers..StripAccents+normalize) ⇒ string | |
| * [~Lowercase](#module_tokenizers..Lowercase) ⇐ Normalizer | |
| * [`.normalize(text)`](#module_tokenizers..Lowercase+normalize) ⇒ string | |
| * [~Prepend](#module_tokenizers..Prepend) ⇐ Normalizer | |
| * [`.normalize(text)`](#module_tokenizers..Prepend+normalize) ⇒ string | |
| * [~NormalizerSequence](#module_tokenizers..NormalizerSequence) ⇐ Normalizer | |
| * [`new NormalizerSequence(config)`](#new_module_tokenizers..NormalizerSequence_new) | |
| * [`.normalize(text)`](#module_tokenizers..NormalizerSequence+normalize) ⇒ string | |
| * [~BertNormalizer](#module_tokenizers..BertNormalizer) ⇐ Normalizer | |
| * [`._tokenize_chinese_chars(text)`](#module_tokenizers..BertNormalizer+_tokenize_chinese_chars) ⇒ string | |
| * [`.stripAccents(text)`](#module_tokenizers..BertNormalizer+stripAccents) ⇒ string | |
| * [`.normalize(text)`](#module_tokenizers..BertNormalizer+normalize) ⇒ string | |
| * [~PreTokenizer](#module_tokenizers..PreTokenizer) ⇐ [Callable](#Callable) | |
| * _instance_ | |
| * *[`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize_text) ⇒ Array.<string>* | |
| * [`.pre_tokenize(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize) ⇒ Array.<string> | |
| * [`._call(text, [options])`](#module_tokenizers..PreTokenizer+_call) ⇒ Array.<string> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PreTokenizer.fromConfig) ⇒ PreTokenizer | |
| * [~BertPreTokenizer](#module_tokenizers..BertPreTokenizer) ⇐ PreTokenizer | |
| * [`new BertPreTokenizer(config)`](#new_module_tokenizers..BertPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..BertPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) ⇐ PreTokenizer | |
| * [`new ByteLevelPreTokenizer(config)`](#new_module_tokenizers..ByteLevelPreTokenizer_new) | |
| * [`.add_prefix_space`](#module_tokenizers..ByteLevelPreTokenizer+add_prefix_space) : boolean | |
| * [`.trim_offsets`](#module_tokenizers..ByteLevelPreTokenizer+trim_offsets) : boolean | |
| * [`.use_regex`](#module_tokenizers..ByteLevelPreTokenizer+use_regex) : boolean | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ByteLevelPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~SplitPreTokenizer](#module_tokenizers..SplitPreTokenizer) ⇐ PreTokenizer | |
| * [`new SplitPreTokenizer(config)`](#new_module_tokenizers..SplitPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..SplitPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~PunctuationPreTokenizer](#module_tokenizers..PunctuationPreTokenizer) ⇐ PreTokenizer | |
| * [`new PunctuationPreTokenizer(config)`](#new_module_tokenizers..PunctuationPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PunctuationPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~DigitsPreTokenizer](#module_tokenizers..DigitsPreTokenizer) ⇐ PreTokenizer | |
| * [`new DigitsPreTokenizer(config)`](#new_module_tokenizers..DigitsPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..DigitsPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~PostProcessor](#module_tokenizers..PostProcessor) ⇐ [Callable](#Callable) | |
| * [`new PostProcessor(config)`](#new_module_tokenizers..PostProcessor_new) | |
| * _instance_ | |
| * [`.post_process(tokens, ...args)`](#module_tokenizers..PostProcessor+post_process) ⇒ PostProcessedOutput | |
| * [`._call(tokens, ...args)`](#module_tokenizers..PostProcessor+_call) ⇒ PostProcessedOutput | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PostProcessor.fromConfig) ⇒ PostProcessor | |
| * [~BertProcessing](#module_tokenizers..BertProcessing) | |
| * [`new BertProcessing(config)`](#new_module_tokenizers..BertProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..BertProcessing+post_process) ⇒ PostProcessedOutput | |
| * [~TemplateProcessing](#module_tokenizers..TemplateProcessing) ⇐ PostProcessor | |
| * [`new TemplateProcessing(config)`](#new_module_tokenizers..TemplateProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..TemplateProcessing+post_process) ⇒ PostProcessedOutput | |
| * [~ByteLevelPostProcessor](#module_tokenizers..ByteLevelPostProcessor) ⇐ PostProcessor | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..ByteLevelPostProcessor+post_process) ⇒ PostProcessedOutput | |
| * [~PostProcessorSequence](#module_tokenizers..PostProcessorSequence) | |
| * [`new PostProcessorSequence(config)`](#new_module_tokenizers..PostProcessorSequence_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..PostProcessorSequence+post_process) ⇒ PostProcessedOutput | |
| * [~Decoder](#module_tokenizers..Decoder) ⇐ [Callable](#Callable) | |
| * [`new Decoder(config)`](#new_module_tokenizers..Decoder_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers..Decoder+added_tokens) : Array.<AddedToken> | |
| * [`._call(tokens)`](#module_tokenizers..Decoder+_call) ⇒ string | |
| * [`.decode(tokens)`](#module_tokenizers..Decoder+decode) ⇒ string | |
| * [`.decode_chain(tokens)`](#module_tokenizers..Decoder+decode_chain) ⇒ Array.<string> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..Decoder.fromConfig) ⇒ Decoder | |
| * [~FuseDecoder](#module_tokenizers..FuseDecoder) | |
| * [`.decode_chain()`](#module_tokenizers..FuseDecoder+decode_chain) : * | |
| * [~WordPieceDecoder](#module_tokenizers..WordPieceDecoder) ⇐ Decoder | |
| * [`new WordPieceDecoder(config)`](#new_module_tokenizers..WordPieceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..WordPieceDecoder+decode_chain) : * | |
| * [~ByteLevelDecoder](#module_tokenizers..ByteLevelDecoder) ⇐ Decoder | |
| * [`new ByteLevelDecoder(config)`](#new_module_tokenizers..ByteLevelDecoder_new) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..ByteLevelDecoder+convert_tokens_to_string) ⇒ string | |
| * [`.decode_chain()`](#module_tokenizers..ByteLevelDecoder+decode_chain) : * | |
| * [~CTCDecoder](#module_tokenizers..CTCDecoder) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..CTCDecoder+convert_tokens_to_string) ⇒ string | |
| * [`.decode_chain()`](#module_tokenizers..CTCDecoder+decode_chain) : * | |
| * [~DecoderSequence](#module_tokenizers..DecoderSequence) ⇐ Decoder | |
| * [`new DecoderSequence(config)`](#new_module_tokenizers..DecoderSequence_new) | |
| * [`.decode_chain()`](#module_tokenizers..DecoderSequence+decode_chain) : * | |
| * [~MetaspacePreTokenizer](#module_tokenizers..MetaspacePreTokenizer) ⇐ PreTokenizer | |
| * [`new MetaspacePreTokenizer(config)`](#new_module_tokenizers..MetaspacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..MetaspacePreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~MetaspaceDecoder](#module_tokenizers..MetaspaceDecoder) ⇐ Decoder | |
| * [`new MetaspaceDecoder(config)`](#new_module_tokenizers..MetaspaceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..MetaspaceDecoder+decode_chain) : * | |
| * [~Precompiled](#module_tokenizers..Precompiled) ⇐ Normalizer | |
| * [`new Precompiled(config)`](#new_module_tokenizers..Precompiled_new) | |
| * [`.normalize(text)`](#module_tokenizers..Precompiled+normalize) ⇒ string | |
| * [~PreTokenizerSequence](#module_tokenizers..PreTokenizerSequence) ⇐ PreTokenizer | |
| * [`new PreTokenizerSequence(config)`](#new_module_tokenizers..PreTokenizerSequence_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizerSequence+pre_tokenize_text) ⇒ Array.<string> | |
| * [~WhitespacePreTokenizer](#module_tokenizers..WhitespacePreTokenizer) | |
| * [`new WhitespacePreTokenizer(config)`](#new_module_tokenizers..WhitespacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespacePreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~WhitespaceSplit](#module_tokenizers..WhitespaceSplit) ⇐ PreTokenizer | |
| * [`new WhitespaceSplit(config)`](#new_module_tokenizers..WhitespaceSplit_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespaceSplit+pre_tokenize_text) ⇒ Array.<string> | |
| * [~ReplacePreTokenizer](#module_tokenizers..ReplacePreTokenizer) | |
| * [`new ReplacePreTokenizer(config)`](#new_module_tokenizers..ReplacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ReplacePreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [~FixedLengthPreTokenizer](#module_tokenizers..FixedLengthPreTokenizer) | |
| * [`new FixedLengthPreTokenizer(config)`](#new_module_tokenizers..FixedLengthPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..FixedLengthPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * [`~BYTES_TO_UNICODE`](#module_tokenizers..BYTES_TO_UNICODE) ⇒ Object | |
| * [`~loadTokenizer(pretrained_model_name_or_path, options)`](#module_tokenizers..loadTokenizer) ⇒ Promise.<Array<any>> | |
| * [`~regexSplit(text, regex)`](#module_tokenizers..regexSplit) ⇒ Array.<string> | |
| * [`~createPattern(pattern, invert)`](#module_tokenizers..createPattern) ⇒ RegExp | null | |
| * [`~objectToMap(obj)`](#module_tokenizers..objectToMap) ⇒ Map.<string, any> | |
| * [`~prepareTensorForDecode(tensor)`](#module_tokenizers..prepareTensorForDecode) ⇒ Array.<number> | |
| * [`~clean_up_tokenization(text)`](#module_tokenizers..clean_up_tokenization) ⇒ string | |
| * [`~remove_accents(text)`](#module_tokenizers..remove_accents) ⇒ string | |
| * [`~lowercase_and_remove_accent(text)`](#module_tokenizers..lowercase_and_remove_accent) ⇒ string | |
| * [`~whitespace_split(text)`](#module_tokenizers..whitespace_split) ⇒ Array.<string> | |
| * [`~PretrainedTokenizerOptions`](#module_tokenizers..PretrainedTokenizerOptions) : Object | |
| * [`~BPENode`](#module_tokenizers..BPENode) : Object | |
| * [`~SplitDelimiterBehavior`](#module_tokenizers..SplitDelimiterBehavior) : 'removed' | 'isolated' | 'mergedWithPrevious' | 'mergedWithNext' | 'contiguous' | |
| * [`~PostProcessedOutput`](#module_tokenizers..PostProcessedOutput) : Object | |
| * [`~EncodingSingle`](#module_tokenizers..EncodingSingle) : Object | |
| * [`~Message`](#module_tokenizers..Message) : Object | |
| * [`~BatchEncoding`](#module_tokenizers..BatchEncoding) : Array<number> | Array<Array<number>> | [Tensor](#Tensor) | |
| * * * | |
| ## tokenizers.TokenizerModel ⇐ [Callable](#Callable) | |
| Abstract base class for tokenizer models. | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **Extends**: [Callable](#Callable) | |
| * [.TokenizerModel](#module_tokenizers.TokenizerModel) ⇐ [Callable](#Callable) | |
| * [`new TokenizerModel(config)`](#new_module_tokenizers.TokenizerModel_new) | |
| * _instance_ | |
| * [`.vocab`](#module_tokenizers.TokenizerModel+vocab) : Array.<string> | |
| * [`.tokens_to_ids`](#module_tokenizers.TokenizerModel+tokens_to_ids) : Map.<string, number> | |
| * [`.fuse_unk`](#module_tokenizers.TokenizerModel+fuse_unk) : boolean | |
| * [`._call(tokens)`](#module_tokenizers.TokenizerModel+_call) ⇒ Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers.TokenizerModel+encode) ⇒ Array.<string> | |
| * [`.convert_tokens_to_ids(tokens)`](#module_tokenizers.TokenizerModel+convert_tokens_to_ids) ⇒ Array.<number> | |
| * [`.convert_ids_to_tokens(ids)`](#module_tokenizers.TokenizerModel+convert_ids_to_tokens) ⇒ Array.<string> | |
| * _static_ | |
| * [`.fromConfig(config, ...args)`](#module_tokenizers.TokenizerModel.fromConfig) ⇒ TokenizerModel | |
| * * * | |
| ### `new TokenizerModel(config)` | |
| Creates a new instance of TokenizerModel. | |
| Param | Type | Description | |
| --- | --- | --- | |
| config | Object | The configuration object for the TokenizerModel. | |
| * * * | |
| ### `tokenizerModel.vocab` : Array.<string> | |
| **Kind**: instance property of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| * * * | |
| ### `tokenizerModel.tokens_to_ids` : Map.<string, number> | |
| A mapping of tokens to ids. | |
| **Kind**: instance property of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| * * * | |
| ### `tokenizerModel.fuse_unk` : boolean | |
| Whether to fuse unknown tokens when encoding. Defaults to false. | |
| **Kind**: instance property of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
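To make the idea of fusing unknown tokens concrete, here is a minimal sketch. It is a simplification written for illustration, not the library's implementation: `fuseUnknownTokens` is a hypothetical helper, and the three-entry vocabulary is made up. It collapses every run of consecutive out-of-vocabulary tokens into a single unknown token.

```javascript
// Hypothetical helper (not part of the library): collapse each run of
// consecutive out-of-vocabulary tokens into a single unknown token.
function fuseUnknownTokens(tokens, tokensToIds, unkToken) {
  const fused = [];
  let previousWasUnknown = false;
  for (const token of tokens) {
    const isUnknown = token === unkToken || !tokensToIds.has(token);
    if (isUnknown) {
      // Emit the unknown token only once per run.
      if (!previousWasUnknown) fused.push(unkToken);
    } else {
      fused.push(token);
    }
    previousWasUnknown = isUnknown;
  }
  return fused;
}

// Toy vocabulary, assumed for this example only.
const toyFuseVocab = new Map([['[UNK]', 0], ['hello', 1], ['world', 2]]);
const fusedTokens = fuseUnknownTokens(['hello', '§', '¶', 'world', '∆'], toyFuseVocab, '[UNK]');
console.log(fusedTokens); // [ 'hello', '[UNK]', 'world', '[UNK]' ]
```

Without fusing, the same input would yield two adjacent `[UNK]` tokens after `hello`.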
| * * * | |
| ### `tokenizerModel._call(tokens)` ⇒ Array.<string> | |
| Internal function to call the TokenizerModel instance. | |
| **Kind**: instance method of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| **Overrides**: [_call](#Callable+_call) | |
| **Returns**: Array.<string> - The encoded tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokens | Array.<string> | The tokens to encode. | |
| * * * | |
| ### `tokenizerModel.encode(tokens)` ⇒ Array.<string> | |
| Encodes a list of tokens into a list of model tokens. The result is still an array of strings; use `convert_tokens_to_ids` to map it to token IDs. | |
| **Kind**: instance method of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| **Returns**: Array.<string> - The encoded tokens. | |
| **Throws**: | |
| - Will throw an error if not implemented in a subclass. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokens | Array.<string> | The tokens to encode. | |
| * * * | |
| ### `tokenizerModel.convert_tokens_to_ids(tokens)` ⇒ Array.<number> | |
| Converts a list of tokens into a list of token IDs. | |
| **Kind**: instance method of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| **Returns**: Array.<number> - The converted token IDs. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokens | Array.<string> | The tokens to convert. | |
| * * * | |
| ### `tokenizerModel.convert_ids_to_tokens(ids)` ⇒ Array.<string> | |
| Converts a list of token IDs into a list of tokens. | |
| **Kind**: instance method of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| **Returns**: Array.<string> - The converted tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| ids | Array<number> \| Array<bigint> | The token IDs to convert. | |
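The two conversion methods above are inverse lookups over the vocabulary. The sketch below illustrates that relationship with a toy vocabulary. The helper names, the vocabulary, and the choice of `0` as the unknown-token ID are assumptions for this example, not the library's code.

```javascript
// Toy vocabulary, assumed for illustration: index in the array = token ID.
const toyVocabList = ['[UNK]', 'i', 'love', 'transformers', '!'];
const toyTokenToId = new Map(toyVocabList.map((token, id) => [token, id]));
const TOY_UNK_ID = 0;

// Tokens -> IDs: unknown tokens fall back to the unknown-token ID.
function convertTokensToIds(tokens) {
  return tokens.map((token) => toyTokenToId.get(token) ?? TOY_UNK_ID);
}

// IDs -> tokens: Number() accepts both number and bigint inputs,
// such as values read from an int64 tensor.
function convertIdsToTokens(ids) {
  return Array.from(ids, (id) => toyVocabList[Number(id)] ?? '[UNK]');
}

const toyIds = convertTokensToIds(['i', 'love', 'llamas', '!']);
console.log(toyIds); // [ 1, 2, 0, 4 ]
console.log(convertIdsToTokens([1n, 2n, 0n, 4n])); // [ 'i', 'love', '[UNK]', '!' ]
```

The round trip is lossy for out-of-vocabulary input: `llamas` comes back as `[UNK]`.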
| * * * | |
| ### `TokenizerModel.fromConfig(config, ...args)` ⇒ TokenizerModel | |
| Instantiates a new TokenizerModel instance based on the configuration object provided. | |
| **Kind**: static method of [TokenizerModel](#module_tokenizers.TokenizerModel) | |
| **Returns**: TokenizerModel - A new instance of a TokenizerModel. | |
| **Throws**: | |
| - Will throw an error if the TokenizerModel type in the config is not recognized. | |
| Param | Type | Description | |
| --- | --- | --- | |
| config | Object | The configuration object for the TokenizerModel. | |
| ...args | * | Optional arguments to pass to the specific TokenizerModel constructor. | |
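A factory of this kind typically dispatches on a type field in the configuration and throws when the type is unknown. The following is a minimal sketch of that pattern with stand-in classes. `ToyModel`, `ToyWordPiece`, `ToyBPE`, and `modelFromConfig` are hypothetical names, and the real method may recognize more types or apply fallbacks not shown here.

```javascript
// Stand-in classes for illustration only.
class ToyModel {
  constructor(config) {
    this.config = config;
  }
}
class ToyWordPiece extends ToyModel {}
class ToyBPE extends ToyModel {}

// Hypothetical factory: choose a class from config.type, forward extra
// arguments to the constructor, and reject unrecognized types.
function modelFromConfig(config, ...args) {
  switch (config.type) {
    case 'WordPiece':
      return new ToyWordPiece(config, ...args);
    case 'BPE':
      return new ToyBPE(config, ...args);
    default:
      throw new Error(`Unknown TokenizerModel type: ${config.type}`);
  }
}

const toyModel = modelFromConfig({ type: 'WordPiece', vocab: {} });
console.log(toyModel instanceof ToyWordPiece); // true
```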
| * * * | |
| ## tokenizers.PreTrainedTokenizer | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| * [.PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * [`new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.PreTrainedTokenizer_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers.PreTrainedTokenizer+added_tokens) : Array.<AddedToken> | |
| * [`.added_tokens_map`](#module_tokenizers.PreTrainedTokenizer+added_tokens_map) : Map.<string, AddedToken> | |
| * [`.remove_space`](#module_tokenizers.PreTrainedTokenizer+remove_space) : boolean | |
| * [`._call(text, options)`](#module_tokenizers.PreTrainedTokenizer+_call) ⇒ BatchEncoding | |
| * [`._encode_text(text)`](#module_tokenizers.PreTrainedTokenizer+_encode_text) ⇒ Array<string> | null | |
| * [`._tokenize_helper(text, options)`](#module_tokenizers.PreTrainedTokenizer+_tokenize_helper) ⇒ * | |
| * [`.tokenize(text, options)`](#module_tokenizers.PreTrainedTokenizer+tokenize) ⇒ Array.<string> | |
| * [`.encode(text, options)`](#module_tokenizers.PreTrainedTokenizer+encode) ⇒ Array.<number> | |
| * [`.batch_decode(batch, decode_args)`](#module_tokenizers.PreTrainedTokenizer+batch_decode) ⇒ Array.<string> | |
| * [`.decode(token_ids, [decode_args])`](#module_tokenizers.PreTrainedTokenizer+decode) ⇒ string | |
| * [`.decode_single(token_ids, decode_args)`](#module_tokenizers.PreTrainedTokenizer+decode_single) ⇒ string | |
| * [`.get_chat_template(options)`](#module_tokenizers.PreTrainedTokenizer+get_chat_template) ⇒ string | |
| * [`.apply_chat_template(conversation, options)`](#module_tokenizers.PreTrainedTokenizer+apply_chat_template) ⇒ string | [Tensor](#Tensor) | Array<number> | Array<Array<number>> | BatchEncoding | |
| * _static_ | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.PreTrainedTokenizer.from_pretrained) ⇒ Promise.<PreTrainedTokenizer> | |
| * * * | |
| ### `new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)` | |
| Create a new PreTrainedTokenizer instance. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokenizerJSON | Object | The JSON of the tokenizer. | |
| tokenizerConfig | Object | The config of the tokenizer. | |
| * * * | |
| ### `preTrainedTokenizer.added_tokens` : Array.<AddedToken> | |
| **Kind**: instance property of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * * * | |
| ### `preTrainedTokenizer.added_tokens_map` : Map.<string, AddedToken> | |
| **Kind**: instance property of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * * * | |
| ### `preTrainedTokenizer.remove_space` : boolean | |
| Whether or not to strip the text when tokenizing (removing excess spaces before and after the string). | |
| **Kind**: instance property of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| * * * | |
| ### `preTrainedTokenizer._call(text, options)` ⇒ BatchEncoding | |
| Encode/tokenize the given text(s). | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: BatchEncoding - Object to be passed to the model. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| text | string \| Array<string> | | The text to tokenize. | |
| options | Object | | An optional object containing the following properties: | |
| [options.text_pair] | string \| Array<string> | null | Optional second sequence to be encoded. If set, must be the same type as text. | |
| [options.padding] | boolean \| 'max_length' | false | Whether to pad the input sequences. | |
| [options.add_special_tokens] | boolean | true | Whether or not to add the special tokens associated with the corresponding model. | |
| [options.truncation] | boolean | | Whether to truncate the input sequences. | |
| [options.max_length] | number | | Maximum length of the returned list and optionally padding length. | |
| [options.return_tensor] | boolean | true | Whether to return the results as Tensors or arrays. | |
| [options.return_token_type_ids] | boolean | | Whether to return the token type ids. | |
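The interaction between `padding`, `truncation`, and `max_length` is easier to see on plain arrays. The sketch below is an illustration under stated assumptions, not the library's code: `padAndTruncate` is a hypothetical helper, padding is applied on the right, and the pad token ID is assumed to be `0`.

```javascript
// Hypothetical helper: truncate, then pad a batch of token ID arrays.
// Assumes right-side padding and a pad token ID of 0.
function padAndTruncate(batch, { padding = false, truncation = false, max_length = null, pad_token_id = 0 } = {}) {
  // 1. Truncate each sequence to max_length when requested.
  const sequences = batch.map((ids) =>
    truncation && max_length !== null ? ids.slice(0, max_length) : ids.slice());

  // 2. Pick the padded length: max_length, the longest sequence, or none.
  let targetLength = null;
  if (padding === 'max_length' && max_length !== null) {
    targetLength = max_length;
  } else if (padding === true) {
    targetLength = Math.max(...sequences.map((ids) => ids.length));
  }

  // 3. Pad and build the matching attention mask (1 = real token, 0 = padding).
  const input_ids = [];
  const attention_mask = [];
  for (const ids of sequences) {
    const padCount = targetLength === null ? 0 : Math.max(0, targetLength - ids.length);
    input_ids.push([...ids, ...Array(padCount).fill(pad_token_id)]);
    attention_mask.push([...Array(ids.length).fill(1), ...Array(padCount).fill(0)]);
  }
  return { input_ids, attention_mask };
}

const exampleBatch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]];
const paddedBatch = padAndTruncate(exampleBatch, { padding: true });
console.log(paddedBatch.input_ids);      // [ [ 101, 7592, 102, 0, 0 ], [ 101, 7592, 2088, 999, 102 ] ]
console.log(paddedBatch.attention_mask); // [ [ 1, 1, 1, 0, 0 ], [ 1, 1, 1, 1, 1 ] ]
```

With `{ padding: 'max_length', truncation: true, max_length: 4 }`, every row comes out exactly four IDs long.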
| * * * | |
| ### `preTrainedTokenizer._encode_text(text)` ⇒ Array<string> | null | |
| Encodes a single text using the preprocessor pipeline of the tokenizer. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: Array<string> | null - The encoded tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | string \| null | The text to encode. | |
| * * * | |
| ### `preTrainedTokenizer._tokenize_helper(text, options)` ⇒ * | |
| Internal helper function to tokenize a text, and optionally a pair of texts. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: * - An object containing the tokens and optionally the token type IDs. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| text | string | | The text to tokenize. | |
| options | Object | | An optional object containing the following properties: | |
| [options.pair] | string | null | The optional second text to tokenize. | |
| [options.add_special_tokens] | boolean | false | Whether or not to add the special tokens associated with the corresponding model. | |
| * * * | |
| ### `preTrainedTokenizer.tokenize(text, options)` ⇒ Array.<string> | |
| Converts a string into a sequence of tokens. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: Array.<string> - The list of tokens. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| text | string | | The sequence to be encoded. | |
| options | Object | | An optional object containing the following properties: | |
| [options.pair] | string | | A second sequence to be encoded with the first. | |
| [options.add_special_tokens] | boolean | false | Whether or not to add the special tokens associated with the corresponding model. | |
| * * * | |
| ### `preTrainedTokenizer.encode(text, options)` ⇒ Array.<number> | |
| Encodes a single text or a pair of texts using the model's tokenizer. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: Array.<number> - An array of token IDs representing the encoded text(s). | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| text | string | | The text to encode. | |
| options | Object | | An optional object containing the following properties: | |
| [options.text_pair] | string | null | The optional second text to encode. | |
| [options.add_special_tokens] | boolean | true | Whether or not to add the special tokens associated with the corresponding model. | |
| [options.return_token_type_ids] | boolean | | Whether to return token_type_ids. | |
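As an illustration of what `text_pair`, `add_special_tokens`, and `return_token_type_ids` control, the sketch below assembles BERT-style inputs from token IDs that have already been looked up. It is a sketch under assumptions: `buildBertStyleInputs` is a hypothetical helper, the `[CLS]`/`[SEP]` IDs 101 and 102 follow the BERT example at the top of this page, and other model families use different templates.

```javascript
// Assumed BERT-style special token IDs ([CLS] = 101, [SEP] = 102).
const TOY_CLS_ID = 101;
const TOY_SEP_ID = 102;

// Hypothetical helper: combine one or two ID sequences into model inputs.
// token_type_ids mark the first segment with 0 and the second with 1.
function buildBertStyleInputs(idsA, idsB = null, { add_special_tokens = true } = {}) {
  const first = add_special_tokens ? [TOY_CLS_ID, ...idsA, TOY_SEP_ID] : [...idsA];
  let second = [];
  if (idsB !== null) {
    second = add_special_tokens ? [...idsB, TOY_SEP_ID] : [...idsB];
  }
  return {
    input_ids: [...first, ...second],
    token_type_ids: [...first.map(() => 0), ...second.map(() => 1)],
  };
}

// IDs for "i love transformers !" taken from the example output above.
const singleInputs = buildBertStyleInputs([1045, 2293, 19081, 999]);
console.log(singleInputs.input_ids); // [ 101, 1045, 2293, 19081, 999, 102 ]

const pairInputs = buildBertStyleInputs([1045, 2293], [19081, 999]);
console.log(pairInputs.input_ids);      // [ 101, 1045, 2293, 102, 19081, 999, 102 ]
console.log(pairInputs.token_type_ids); // [ 0, 0, 0, 0, 1, 1, 1 ]
```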
| * * * | |
| ### `preTrainedTokenizer.batch_decode(batch, decode_args)` ⇒ Array.<string> | |
| Decode a batch of tokenized sequences. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: Array.<string> - List of decoded sequences. | |
| Param | Type | Description | |
| --- | --- | --- | |
| batch | Array<Array<number>> \| Tensor | List/Tensor of tokenized input sequences. | |
| decode_args | Object | (Optional) Object with decoding arguments. | |
| * * * | |
| ### `preTrainedTokenizer.decode(token_ids, [decode_args])` ⇒ string | |
| Decodes a sequence of token IDs back to a string. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: string - The decoded string. | |
| **Throws**: | |
| - Error If `token_ids` is not a non-empty array of integers. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| token_ids | Array<number> \| Array<bigint> \| Tensor | | List/Tensor of token IDs to decode. | |
| [decode_args] | Object | {} | | |
| [decode_args.skip_special_tokens] | boolean | false | If true, special tokens are removed from the output string. | |
| [decode_args.clean_up_tokenization_spaces] | boolean | true | If true, spaces before punctuations and abbreviated forms are removed. | |
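The effect of the two decoding flags can be sketched on a toy vocabulary. This is an illustration, not the library's decoder: `decodeIds` and `cleanUpSpaces` are hypothetical helpers, tokens are simply joined with spaces, and the six-entry ID table is taken from the BERT example at the top of this page.

```javascript
// Toy ID -> token table and special-token set, assumed for this example.
const toyIdToToken = new Map([
  [101, '[CLS]'], [102, '[SEP]'], [1045, 'i'],
  [2293, 'love'], [19081, 'transformers'], [999, '!'],
]);
const toySpecialTokens = new Set(['[CLS]', '[SEP]']);

// Remove the space before common punctuation and English contractions.
function cleanUpSpaces(text) {
  return text
    .replace(/ ([.?!,])/g, '$1')
    .replace(/ (n't|'m|'s|'ve|'re)/g, '$1');
}

// Hypothetical decoder: look up tokens, optionally drop special tokens,
// join with spaces, then optionally clean up the spacing.
function decodeIds(tokenIds, { skip_special_tokens = false, clean_up_tokenization_spaces = true } = {}) {
  if (!Array.isArray(tokenIds) || tokenIds.length === 0) {
    throw new Error('token_ids must be a non-empty array of integers.');
  }
  let tokens = tokenIds.map((id) => toyIdToToken.get(Number(id)) ?? '[UNK]');
  if (skip_special_tokens) {
    tokens = tokens.filter((token) => !toySpecialTokens.has(token));
  }
  const text = tokens.join(' ');
  return clean_up_tokenization_spaces ? cleanUpSpaces(text) : text;
}

const toyEncoded = [101, 1045, 2293, 19081, 999, 102];
console.log(decodeIds(toyEncoded)); // [CLS] i love transformers! [SEP]
console.log(decodeIds(toyEncoded, { skip_special_tokens: true })); // i love transformers!
```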
| * * * | |
| ### `preTrainedTokenizer.decode_single(token_ids, decode_args)` ⇒ string | |
| Decode a single list of token ids to a string. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: string - The decoded string | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| token_ids | Array<number> \| Array<bigint> | | List of token ids to decode | |
| decode_args | Object | | Optional arguments for decoding | |
| [decode_args.skip_special_tokens] | boolean | false | Whether to skip special tokens during decoding | |
| [decode_args.clean_up_tokenization_spaces] | boolean | | Whether to clean up tokenization spaces during decoding. If null, the value is set to `this.decoder.cleanup` if it exists, falling back to `this.clean_up_tokenization_spaces` if it exists, falling back to true. | |
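The fallback order described for `clean_up_tokenization_spaces` maps naturally onto a chain of nullish checks. The sketch below shows one way to express it. `resolveCleanUp` is a hypothetical helper, and the plain objects stand in for a tokenizer instance.

```javascript
// Hypothetical helper: resolve the effective cleanup flag.
// Order: explicit argument, then decoder.cleanup, then the tokenizer-level
// setting, then true. `??` only falls through on null or undefined,
// so an explicit `false` is respected at every level.
function resolveCleanUp(decodeArgs, tokenizer) {
  return decodeArgs.clean_up_tokenization_spaces
    ?? tokenizer.decoder?.cleanup
    ?? tokenizer.clean_up_tokenization_spaces
    ?? true;
}

console.log(resolveCleanUp({}, {})); // true
console.log(resolveCleanUp({}, { decoder: { cleanup: false } })); // false
console.log(resolveCleanUp({ clean_up_tokenization_spaces: true }, { decoder: { cleanup: false } })); // true
```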
| * * * | |
| ### `preTrainedTokenizer.get_chat_template(options)` ⇒ string | |
| Retrieve the chat template string used for tokenizing chat messages. This template is used | |
| internally by the `apply_chat_template` method and can also be used externally to retrieve the model's chat | |
| template for better generation tracking. | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: string - The chat template string. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| options | `Object` | | An optional object containing the following properties: | |
| [options.chat_template] | `string` | `null` | A Jinja template or the name of a template to use for this conversion. It is usually not necessary to pass anything to this argument, as the model's template will be used by default. | |
| [options.tools] | `Array.<Object>` | | A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information. | |
| * * * | |
| ### `preTrainedTokenizer.apply_chat_template(conversation, options)` ⇒ string | [Tensor](#Tensor) | Array<number> | Array<Array<number>> | BatchEncoding | |
| Converts a list of message objects with `"role"` and `"content"` keys to a list of token | |
| ids. This method is intended for use with chat models, and will read the tokenizer's chat_template attribute to | |
| determine the format and control tokens to use when converting. | |
| See [here](https://huggingface.co/docs/transformers/chat_templating) for more information. | |
| **Example:** Applying a chat template to a conversation. | |
| ```javascript | |
| import { AutoTokenizer } from "@huggingface/transformers"; | |
| const tokenizer = await AutoTokenizer.from_pretrained("Xenova/mistral-tokenizer-v1"); | |
| const chat = [ | |
| { "role": "user", "content": "Hello, how are you?" }, | |
| { "role": "assistant", "content": "I'm doing great. How can I help you today?" }, | |
| { "role": "user", "content": "I'd like to show off how chat templating works!" }, | |
| ] | |
| const text = tokenizer.apply_chat_template(chat, { tokenize: false }); | |
| // "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]" | |
| const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false }); | |
| // [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793] | |
| ``` | |
| **Kind**: instance method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: string | [Tensor](#Tensor) | Array<number> | Array<Array<number>> | BatchEncoding - The tokenized output. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| conversation | `Array.<Message>` | | A list of message objects with "role" and "content" keys, representing the chat history so far. | |
| options | `Object` | | An optional object containing the following properties: | |
| [options.chat_template] | `string` | `null` | A Jinja template to use for this conversion. If this is not passed, the model's chat template will be used instead. | |
| [options.tools] | `Array.<Object>` | | A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information. | |
| [options.documents] | `*` | | A list of dicts representing documents that will be accessible to the model if it is performing RAG (retrieval-augmented generation). If the template does not support RAG, this argument will have no effect. We recommend that each document should be a dict containing "title" and "text" keys. Please see the RAG section of the chat templating guide for examples of passing documents with chat templates. | |
| [options.add_generation_prompt] | `boolean` | `false` | Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect. | |
| [options.tokenize] | `boolean` | `true` | Whether to tokenize the output. If false, the output will be a string. | |
| [options.padding] | `boolean` | `false` | Whether to pad sequences to the maximum length. Has no effect if tokenize is false. | |
| [options.truncation] | `boolean` | `false` | Whether to truncate sequences to the maximum length. Has no effect if tokenize is false. | |
| [options.max_length] | `number` | | Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is false. If not specified, the tokenizer's max_length attribute will be used as a default. | |
| [options.return_tensor] | `boolean` | `true` | Whether to return the output as a Tensor or an Array. Has no effect if tokenize is false. | |
| [options.return_dict] | `boolean` | `true` | Whether to return a dictionary with named outputs. Has no effect if tokenize is false. | |
| [options.tokenizer_kwargs] | `Object` | `{}` | Additional options to pass to the tokenizer. | |
| * * * | |
| ### `PreTrainedTokenizer.from_pretrained(pretrained_model_name_or_path, options)` ⇒ Promise.<PreTrainedTokenizer> | |
| Loads a pre-trained tokenizer from the given `pretrained_model_name_or_path`. | |
| **Kind**: static method of [PreTrainedTokenizer](#module_tokenizers.PreTrainedTokenizer) | |
| **Returns**: Promise.<PreTrainedTokenizer> - A new instance of the `PreTrainedTokenizer` class. | |
| **Throws**: | |
| - Error Throws an error if the tokenizer.json or tokenizer_config.json files are not found in the `pretrained_model_name_or_path`. | |
| Param | Type | Description | |
| --- | --- | --- | |
| pretrained_model_name_or_path | `string` | The path to the pre-trained tokenizer. | |
| options | `PretrainedTokenizerOptions` | Additional options for loading the tokenizer. | |
| * * * | |
| ## tokenizers.BertTokenizer ⇐ PreTrainedTokenizer | |
| BertTokenizer is a class used to tokenize text for BERT models. | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTrainedTokenizer | |
| * * * | |
| ## tokenizers.AlbertTokenizer ⇐ PreTrainedTokenizer | |
| Albert tokenizer | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTrainedTokenizer | |
| * * * | |
| ## tokenizers.NllbTokenizer | |
| The NllbTokenizer class is used to tokenize text for NLLB ("No Language Left Behind") models. | |
| No Language Left Behind (NLLB) is a first-of-its-kind, AI breakthrough project | |
| that open-sources models capable of delivering high-quality translations directly | |
| between any pair of 200+ languages — including low-resource languages like Asturian, | |
| Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, | |
| regardless of their language preferences. For more information, check out their | |
| [paper](https://huggingface.co/papers/2207.04672). | |
| For a list of supported languages (along with their language codes), see the FLORES-200 README linked below. | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **See**: [https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) | |
| * * * | |
| ### `nllbTokenizer._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)` ⇒ Object | |
| Helper function to build translation inputs for an `NllbTokenizer`. | |
| **Kind**: instance method of [NllbTokenizer](#module_tokenizers.NllbTokenizer) | |
| **Returns**: Object - Object to be passed to the model. | |
| Param | Type | Description | |
| --- | --- | --- | |
| raw_inputs | `string \| Array<string>` | The text to tokenize. | |
| tokenizer_options | `Object` | Options to be sent to the tokenizer | |
| generate_kwargs | `Object` | Generation options. | |
| * * * | |
| ## tokenizers.M2M100Tokenizer | |
| The M2M100Tokenizer class is used to tokenize text for M2M100 ("Many-to-Many") models. | |
| M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many | |
| multilingual translation. It was introduced in this [paper](https://huggingface.co/papers/2010.11125) | |
| and first released in [this](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) repository. | |
| For a list of supported languages (along with their language codes), see the model card linked below. | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **See**: [https://huggingface.co/facebook/m2m100_418M#languages-covered](https://huggingface.co/facebook/m2m100_418M#languages-covered) | |
| * * * | |
| ### `m2M100Tokenizer._build_translation_inputs(raw_inputs, tokenizer_options, generate_kwargs)` ⇒ Object | |
| Helper function to build translation inputs for an `M2M100Tokenizer`. | |
| **Kind**: instance method of [M2M100Tokenizer](#module_tokenizers.M2M100Tokenizer) | |
| **Returns**: Object - Object to be passed to the model. | |
| Param | Type | Description | |
| --- | --- | --- | |
| raw_inputs | `string \| Array<string>` | The text to tokenize. | |
| tokenizer_options | `Object` | Options to be sent to the tokenizer | |
| generate_kwargs | `Object` | Generation options. | |
| * * * | |
| ## tokenizers.WhisperTokenizer ⇐ PreTrainedTokenizer | |
| WhisperTokenizer tokenizer | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTrainedTokenizer | |
| * [.WhisperTokenizer](#module_tokenizers.WhisperTokenizer) ⇐ PreTrainedTokenizer | |
| * [`._decode_asr(sequences, options)`](#module_tokenizers.WhisperTokenizer+_decode_asr) ⇒ * | |
| * [`.decode()`](#module_tokenizers.WhisperTokenizer+decode) : * | |
| * * * | |
| ### `whisperTokenizer._decode_asr(sequences, options)` ⇒ * | |
| Decodes automatic speech recognition (ASR) sequences. | |
| **Kind**: instance method of [WhisperTokenizer](#module_tokenizers.WhisperTokenizer) | |
| **Returns**: * - The decoded sequences. | |
| Param | Type | Description | |
| --- | --- | --- | |
| sequences | `*` | The sequences to decode. | |
| options | `Object` | The options to use for decoding. | |
| * * * | |
| ### `whisperTokenizer.decode()` : * | |
| **Kind**: instance method of [WhisperTokenizer](#module_tokenizers.WhisperTokenizer) | |
| * * * | |
| ## tokenizers.MarianTokenizer | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| **Todo** | |
| - This model is not yet supported by Hugging Face's "fast" tokenizers library (https://github.com/huggingface/tokenizers). | |
| Therefore, this implementation (which is based on fast tokenizers) may produce slightly inaccurate results. | |
| * [.MarianTokenizer](#module_tokenizers.MarianTokenizer) | |
| * [`new MarianTokenizer(tokenizerJSON, tokenizerConfig)`](#new_module_tokenizers.MarianTokenizer_new) | |
| * [`._encode_text(text)`](#module_tokenizers.MarianTokenizer+_encode_text) ⇒ Array | |
| * * * | |
| ### `new MarianTokenizer(tokenizerJSON, tokenizerConfig)` | |
| Create a new MarianTokenizer instance. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokenizerJSON | `Object` | The JSON of the tokenizer. | |
| tokenizerConfig | `Object` | The config of the tokenizer. | |
| * * * | |
| ### `marianTokenizer._encode_text(text)` ⇒ Array | |
| Encodes a single text. Overriding this method is necessary since the language codes | |
| must be removed before encoding with the SentencePiece model. | |
| **Kind**: instance method of [MarianTokenizer](#module_tokenizers.MarianTokenizer) | |
| **Returns**: Array - The encoded tokens. | |
| **See**: https://github.com/huggingface/transformers/blob/12d51db243a00726a548a43cc333390ebae731e3/src/transformers/models/marian/tokenization_marian.py#L204-L213 | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string \| null` | The text to encode. | |
| * * * | |
| ## tokenizers.AutoTokenizer | |
| Helper class which is used to instantiate pretrained tokenizers with the `from_pretrained` function. | |
| The chosen tokenizer class is determined by the type specified in the tokenizer config. | |
| **Kind**: static class of [tokenizers](#module_tokenizers) | |
| * [.AutoTokenizer](#module_tokenizers.AutoTokenizer) | |
| * [`new AutoTokenizer()`](#new_module_tokenizers.AutoTokenizer_new) | |
| * [`.from_pretrained(pretrained_model_name_or_path, options)`](#module_tokenizers.AutoTokenizer.from_pretrained) ⇒ Promise.<PreTrainedTokenizer> | |
| * * * | |
| ### `new AutoTokenizer()` | |
| **Example** | |
| ```js | |
| const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased'); | |
| ``` | |
| * * * | |
| ### `AutoTokenizer.from_pretrained(pretrained_model_name_or_path, options)` ⇒ Promise.<PreTrainedTokenizer> | |
| Instantiate one of the tokenizer classes of the library from a pretrained model. | |
| The tokenizer class to instantiate is selected based on the `tokenizer_class` property of the config object | |
| (either passed as an argument or loaded from `pretrained_model_name_or_path` if possible). | |
| **Kind**: static method of [AutoTokenizer](#module_tokenizers.AutoTokenizer) | |
| **Returns**: Promise.<PreTrainedTokenizer> - A new instance of the PreTrainedTokenizer class. | |
| Param | Type | Description | |
| --- | --- | --- | |
| pretrained_model_name_or_path | `string` | The name or path of the pretrained model. Can be either: (1) a string, the model id of a pretrained tokenizer hosted inside a model repo on huggingface.co (valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`); or (2) a path to a directory containing tokenizer files, e.g., `./my_model_directory/`. | |
| options | `PretrainedTokenizerOptions` | Additional options for loading the tokenizer. | |
| * * * | |
| ## `tokenizers.is_chinese_char(cp)` ⇒ boolean | |
| Checks whether the given Unicode codepoint represents a CJK (Chinese, Japanese, or Korean) character. | |
| A "Chinese character" is defined as anything in the CJK Unicode block: | |
| https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) | |
| Note that the CJK Unicode block is NOT all Japanese and Korean characters, despite its name. | |
| The modern Korean Hangul alphabet is a different block, as are Japanese Hiragana and Katakana. | |
| Those alphabets are used to write space-separated words, so they are not treated specially | |
| and are handled like all other languages. | |
| **Kind**: static method of [tokenizers](#module_tokenizers) | |
| **Returns**: boolean - True if the codepoint represents a CJK character, false otherwise. | |
| Param | Type | Description | |
| --- | --- | --- | |
| cp | `number \| bigint` | The Unicode codepoint to check. | |
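As an illustration of the check described above, the following sketch tests a code point against the CJK ideograph ranges. The function name `isCjkIdeograph` and the exact set of ranges (the CJK Unified Ideographs block, extensions A to E, and the two compatibility ideograph blocks) are assumptions for this example and may not match the library's list exactly.

```javascript
// Sketch of a CJK ideograph range check (hypothetical name and range list).
function isCjkIdeograph(cp) {
  cp = Number(cp); // accept bigint code points as well as numbers
  return (
    (cp >= 0x4e00 && cp <= 0x9fff) ||   // CJK Unified Ideographs
    (cp >= 0x3400 && cp <= 0x4dbf) ||   // Extension A
    (cp >= 0x20000 && cp <= 0x2a6df) || // Extension B
    (cp >= 0x2a700 && cp <= 0x2b73f) || // Extension C
    (cp >= 0x2b740 && cp <= 0x2b81f) || // Extension D
    (cp >= 0x2b820 && cp <= 0x2ceaf) || // Extension E
    (cp >= 0xf900 && cp <= 0xfaff) ||   // CJK Compatibility Ideographs
    (cp >= 0x2f800 && cp <= 0x2fa1f)    // CJK Compatibility Ideographs Supplement
  );
}

// A Han character, a Hiragana character, a Hangul syllable and a Latin letter.
const checks = ['中', 'あ', '한', 'a'].map((ch) => isCjkIdeograph(ch.codePointAt(0)));
console.log(checks);
```

Only the Han character falls inside these ranges; Hiragana and Hangul live in separate blocks, as noted above.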
| * * * | |
| ## tokenizers~AddedToken | |
| Represent a token added by the user on top of the existing Model vocabulary. | |
| An AddedToken can be configured to specify the behavior it should have in various situations, such as: | |
| - Whether they should only match single words | |
| - Whether to include any whitespace on its left or right | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * * * | |
| ### `new AddedToken(config)` | |
| Creates a new instance of AddedToken. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| config | `Object` | | Added token configuration object. | |
| config.content | `string` | | The content of the added token. | |
| config.id | `number` | | The id of the added token. | |
| [config.single_word] | `boolean` | `false` | Whether this token must be a single word or can break words. | |
| [config.lstrip] | `boolean` | `false` | Whether this token should strip whitespaces on its left. | |
| [config.rstrip] | `boolean` | `false` | Whether this token should strip whitespaces on its right. | |
| [config.normalized] | `boolean` | `false` | Whether this token should be normalized. | |
| [config.special] | `boolean` | `false` | Whether this token is special. | |
| * * * | |
| ## tokenizers~WordPieceTokenizer ⇐ TokenizerModel | |
| A subclass of TokenizerModel that uses WordPiece encoding to encode tokens. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: TokenizerModel | |
| * [~WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) ⇐ TokenizerModel | |
| * [`new WordPieceTokenizer(config)`](#new_module_tokenizers..WordPieceTokenizer_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..WordPieceTokenizer+tokens_to_ids) : Map.<string, number> | |
| * [`.unk_token_id`](#module_tokenizers..WordPieceTokenizer+unk_token_id) : number | |
| * [`.unk_token`](#module_tokenizers..WordPieceTokenizer+unk_token) : string | |
| * [`.max_input_chars_per_word`](#module_tokenizers..WordPieceTokenizer+max_input_chars_per_word) : number | |
| * [`.vocab`](#module_tokenizers..WordPieceTokenizer+vocab) : Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers..WordPieceTokenizer+encode) ⇒ Array.<string> | |
| * * * | |
| ### `new WordPieceTokenizer(config)` | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| config | `Object` | | The configuration object. | |
| config.vocab | `Object` | | A mapping of tokens to ids. | |
| config.unk_token | `string` | | The unknown token string. | |
| config.continuing_subword_prefix | `string` | | The prefix to use for continuing subwords. | |
| [config.max_input_chars_per_word] | `number` | `100` | The maximum number of characters per word. | |
| * * * | |
| ### `wordPieceTokenizer.tokens_to_ids` : Map.<string, number> | |
| A mapping of tokens to ids. | |
| **Kind**: instance property of [WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| ### `wordPieceTokenizer.unk_token_id` : number | |
| The id of the unknown token. | |
| **Kind**: instance property of [WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| ### `wordPieceTokenizer.unk_token` : string | |
| The unknown token string. | |
| **Kind**: instance property of [WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| ### `wordPieceTokenizer.max_input_chars_per_word` : number | |
| The maximum number of characters allowed per word. | |
| **Kind**: instance property of [WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| ### `wordPieceTokenizer.vocab` : Array.<string> | |
| An array of tokens. | |
| **Kind**: instance property of [WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) | |
| * * * | |
| ### `wordPieceTokenizer.encode(tokens)` ⇒ Array.<string> | |
| Encodes an array of tokens using WordPiece encoding. | |
| **Kind**: instance method of [WordPieceTokenizer](#module_tokenizers..WordPieceTokenizer) | |
| **Returns**: Array.<string> - An array of encoded tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokens | `Array.<string>` | The tokens to encode. | |
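WordPiece encoding is usually described as a greedy longest-match-first search over the vocabulary. The sketch below shows that idea on a toy vocabulary; the function `wordPieceEncode`, its parameters and the vocabulary are hypothetical and only loosely mirror the configuration fields documented above (`unk_token`, `continuing_subword_prefix`, `max_input_chars_per_word`).

```javascript
// Sketch of greedy longest-match-first WordPiece over a toy vocabulary.
function wordPieceEncode(tokens, vocab, unkToken = '[UNK]', prefix = '##', maxChars = 100) {
  const output = [];
  for (const token of tokens) {
    const chars = [...token];
    if (chars.length > maxChars) {
      output.push(unkToken); // overly long words map to the unknown token
      continue;
    }
    const pieces = [];
    let start = 0;
    let isUnknown = false;
    while (start < chars.length) {
      let end = chars.length;
      let match = null;
      // Shrink the candidate from the right until it is found in the vocabulary.
      while (start < end) {
        let piece = chars.slice(start, end).join('');
        if (start > 0) piece = prefix + piece; // mark continuing subwords
        if (vocab.has(piece)) {
          match = piece;
          break;
        }
        end -= 1;
      }
      if (match === null) {
        isUnknown = true; // no prefix of the remainder is in the vocabulary
        break;
      }
      pieces.push(match);
      start = end;
    }
    if (isUnknown) output.push(unkToken);
    else output.push(...pieces);
  }
  return output;
}

const toyVocab = new Set(['un', '##aff', '##able', 'play', '##ing', '[UNK]']);
const encoded = wordPieceEncode(['unaffable', 'playing', 'xyz'], toyVocab);
console.log(encoded);
```

A word that cannot be fully covered by the vocabulary is replaced as a whole by the unknown token rather than being partially split.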
| * * * | |
| ## tokenizers~Unigram ⇐ TokenizerModel | |
| Class representing a Unigram tokenizer model. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: TokenizerModel | |
| * [~Unigram](#module_tokenizers..Unigram) ⇐ TokenizerModel | |
| * [`new Unigram(config, moreConfig)`](#new_module_tokenizers..Unigram_new) | |
| * [`.scores`](#module_tokenizers..Unigram+scores) : Array.<number> | |
| * [`.populateNodes(lattice)`](#module_tokenizers..Unigram+populateNodes) | |
| * [`.tokenize(normalized)`](#module_tokenizers..Unigram+tokenize) ⇒ Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers..Unigram+encode) ⇒ Array.<string> | |
| * * * | |
| ### `new Unigram(config, moreConfig)` | |
| Create a new Unigram tokenizer model. | |
| Param | Type | Description | |
| --- | --- | --- | |
| config | `Object` | The configuration object for the Unigram model. | |
| config.unk_id | `number` | The ID of the unknown token | |
| config.vocab | `*` | A 2D array representing a mapping of tokens to scores. | |
| moreConfig | `Object` | Additional configuration object for the Unigram model. | |
| * * * | |
| ### `unigram.scores` : Array.<number> | |
| **Kind**: instance property of [Unigram](#module_tokenizers..Unigram) | |
| * * * | |
| ### `unigram.populateNodes(lattice)` | |
| Populates lattice nodes. | |
| **Kind**: instance method of [Unigram](#module_tokenizers..Unigram) | |
| Param | Type | Description | |
| --- | --- | --- | |
| lattice | `TokenLattice` | The token lattice to populate with nodes. | |
| * * * | |
| ### `unigram.tokenize(normalized)` ⇒ Array.<string> | |
| Tokenizes a normalized string into an array of subtokens using the unigram model. | |
| **Kind**: instance method of [Unigram](#module_tokenizers..Unigram) | |
| **Returns**: Array.<string> - An array of subtokens obtained by tokenizing the input string using the unigram model. | |
| Param | Type | Description | |
| --- | --- | --- | |
| normalized | `string` | The normalized string. | |
| * * * | |
| ### `unigram.encode(tokens)` ⇒ Array.<string> | |
| Encodes an array of tokens using Unigram encoding. | |
| **Kind**: instance method of [Unigram](#module_tokenizers..Unigram) | |
| **Returns**: Array.<string> - An array of encoded tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokens | `Array.<string>` | The tokens to encode. | |
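Unigram segmentation is commonly implemented as a Viterbi search for the split with the highest total score. The following is a minimal sketch of that search under simplifying assumptions: the function `unigramSegment` and the toy `[token, score]` pairs are hypothetical, and unlike the class documented above it does not build a token lattice or fall back to an unknown token.

```javascript
// Viterbi sketch for unigram segmentation: choose the split with the highest total score.
// `pieces` is a toy array of [token, logProbability] pairs.
function unigramSegment(text, pieces) {
  const scores = new Map(pieces);
  const chars = [...text];
  const n = chars.length;
  const best = new Array(n + 1).fill(-Infinity); // best score of a segmentation ending at i
  const back = new Array(n + 1).fill(-1);        // start index of the last piece ending at i
  best[0] = 0;
  for (let end = 1; end <= n; ++end) {
    for (let start = 0; start < end; ++start) {
      const score = scores.get(chars.slice(start, end).join(''));
      if (score === undefined || best[start] === -Infinity) continue;
      if (best[start] + score > best[end]) {
        best[end] = best[start] + score;
        back[end] = start;
      }
    }
  }
  if (best[n] === -Infinity) return null; // text cannot be covered by the vocabulary
  const result = [];
  // Walk the back-pointers from the end of the text to recover the pieces.
  for (let end = n; end > 0; end = back[end]) {
    result.unshift(chars.slice(back[end], end).join(''));
  }
  return result;
}

const toyPieces = [
  ['h', -3], ['u', -3], ['g', -3], ['s', -3],
  ['hu', -2.5], ['ug', -2], ['hug', -4], ['gs', -5],
];
const segmented = unigramSegment('hugs', toyPieces);
console.log(segmented);
```

Here `hug` + `s` scores -7, which beats `hu` + `gs` at -7.5 and every split into single characters.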
| * * * | |
| ## tokenizers~BPE ⇐ TokenizerModel | |
| BPE class for encoding text into Byte-Pair-Encoding (BPE) tokens. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: TokenizerModel | |
| * [~BPE](#module_tokenizers..BPE) ⇐ TokenizerModel | |
| * [`new BPE(config)`](#new_module_tokenizers..BPE_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..BPE+tokens_to_ids) : Map.<string, number> | |
| * [`.merges`](#module_tokenizers..BPE+merges) : * | |
| * [`.config.merges`](#module_tokenizers..BPE+merges.config.merges) : * | |
| * [`.max_length_to_cache`](#module_tokenizers..BPE+max_length_to_cache) | |
| * [`.cache_capacity`](#module_tokenizers..BPE+cache_capacity) | |
| * [`.clear_cache()`](#module_tokenizers..BPE+clear_cache) | |
| * [`.bpe(token)`](#module_tokenizers..BPE+bpe) ⇒ Array.<string> | |
| * [`.encode(tokens)`](#module_tokenizers..BPE+encode) ⇒ Array.<string> | |
| * * * | |
| ### `new BPE(config)` | |
| Create a BPE instance. | |
| Param | Type | Default | Description | |
| --- | --- | --- | --- | |
| config | `Object` | | The configuration object for BPE. | |
| config.vocab | `Object` | | A mapping of tokens to ids. | |
| config.merges | `*` | | An array of BPE merges as strings. | |
| config.unk_token | `string` | | The unknown token used for out of vocabulary words. | |
| config.end_of_word_suffix | `string` | | The suffix to place at the end of each word. | |
| [config.continuing_subword_suffix] | `string` | | The suffix to insert between words. | |
| [config.byte_fallback] | `boolean` | `false` | Whether to use spm byte-fallback trick (defaults to False) | |
| [config.ignore_merges] | `boolean` | `false` | Whether or not to match tokens with the vocab before using merges. | |
| * * * | |
| ### `bpE.tokens_to_ids` : Map.<string, number> | |
| **Kind**: instance property of [BPE](#module_tokenizers..BPE) | |
| * * * | |
| ### `bpE.merges` : * | |
| **Kind**: instance property of [BPE](#module_tokenizers..BPE) | |
| * * * | |
| #### `merges.config.merges` : * | |
| **Kind**: static property of [merges](#module_tokenizers..BPE+merges) | |
| * * * | |
| ### `bpE.max_length_to_cache` | |
| The maximum length we should cache in a model. | |
| Strings that are too long have a minimal chance of a cache hit anyway. | |
| **Kind**: instance property of [BPE](#module_tokenizers..BPE) | |
| * * * | |
| ### `bpE.cache_capacity` | |
| The default capacity for a `BPE`'s internal cache. | |
| **Kind**: instance property of [BPE](#module_tokenizers..BPE) | |
| * * * | |
| ### `bpE.clear_cache()` | |
| Clears the cache. | |
| **Kind**: instance method of [BPE](#module_tokenizers..BPE) | |
| * * * | |
| ### `bpE.bpe(token)` ⇒ Array.<string> | |
| Apply Byte-Pair-Encoding (BPE) to a given token. Efficient heap-based priority | |
| queue implementation adapted from https://github.com/belladoreai/llama-tokenizer-js. | |
| **Kind**: instance method of [BPE](#module_tokenizers..BPE) | |
| **Returns**: Array.<string> - The BPE encoded tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| token | `string` | The token to encode. | |
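For intuition about what the merge loop computes, here is a deliberately naive BPE sketch. It rescans all adjacent pairs on every iteration instead of using the heap-based priority queue mentioned above, so it illustrates the result rather than the library's implementation. The function `bpeNaive` and the toy merge list are hypothetical.

```javascript
// Naive BPE sketch: repeatedly merge the adjacent pair with the best (lowest) rank.
// `merges` is an ordered list of "left right" strings, earlier entries having priority.
function bpeNaive(token, merges) {
  const ranks = new Map(merges.map((merge, index) => [merge, index]));
  const symbols = [...token];
  while (symbols.length > 1) {
    let bestRank = Infinity;
    let bestIndex = -1;
    for (let i = 0; i < symbols.length - 1; ++i) {
      const rank = ranks.get(`${symbols[i]} ${symbols[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no applicable merge left
    // Replace the two symbols with their concatenation.
    symbols.splice(bestIndex, 2, symbols[bestIndex] + symbols[bestIndex + 1]);
  }
  return symbols;
}

const toyMerges = ['l o', 'lo w', 'e r'];
const lower = bpeNaive('lower', toyMerges);
const lowest = bpeNaive('lowest', toyMerges);
console.log(lower, lowest);
```

Each iteration costs a full scan, which is why production implementations track candidate pairs in a priority queue instead.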
| * * * | |
| ### `bpE.encode(tokens)` ⇒ Array.<string> | |
| Encodes the input sequence of tokens using the BPE algorithm and returns the resulting subword tokens. | |
| **Kind**: instance method of [BPE](#module_tokenizers..BPE) | |
| **Returns**: Array.<string> - The resulting subword tokens after applying the BPE algorithm to the input sequence of tokens. | |
| Param | Type | Description | |
| --- | --- | --- | |
| tokens | `Array.<string>` | The input sequence of tokens to encode. | |
| * * * | |
| ## tokenizers~LegacyTokenizerModel | |
| Legacy tokenizer class for tokenizers with only a vocabulary. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~LegacyTokenizerModel](#module_tokenizers..LegacyTokenizerModel) | |
| * [`new LegacyTokenizerModel(config, moreConfig)`](#new_module_tokenizers..LegacyTokenizerModel_new) | |
| * [`.tokens_to_ids`](#module_tokenizers..LegacyTokenizerModel+tokens_to_ids) : Map.<string, number> | |
| * * * | |
| ### `new LegacyTokenizerModel(config, moreConfig)` | |
| Create a LegacyTokenizerModel instance. | |
| Param | Type | Description | |
| --- | --- | --- | |
| config | `Object` | The configuration object for LegacyTokenizerModel. | |
| config.vocab | `Object` | A (possibly nested) mapping of tokens to ids. | |
| moreConfig | `Object` | Additional configuration object for the LegacyTokenizerModel model. | |
| * * * | |
| ### `legacyTokenizerModel.tokens_to_ids` : Map.<string, number> | |
| **Kind**: instance property of [LegacyTokenizerModel](#module_tokenizers..LegacyTokenizerModel) | |
| * * * | |
| ## *tokenizers~Normalizer* | |
| A base class for text normalization. | |
| **Kind**: inner abstract class of [tokenizers](#module_tokenizers) | |
| * *[~Normalizer](#module_tokenizers..Normalizer)* | |
| * *[`new Normalizer(config)`](#new_module_tokenizers..Normalizer_new)* | |
| * _instance_ | |
| * **[`.normalize(text)`](#module_tokenizers..Normalizer+normalize) ⇒ string** | |
| * *[`._call(text)`](#module_tokenizers..Normalizer+_call) ⇒ string* | |
| * _static_ | |
| * *[`.fromConfig(config)`](#module_tokenizers..Normalizer.fromConfig) ⇒ Normalizer* | |
| * * * | |
| ### *`new Normalizer(config)`* | |
| Param | Type | Description | |
| --- | --- | --- | |
| config | `Object` | The configuration object for the normalizer. | |
| * * * | |
| ### **`normalizer.normalize(text)` ⇒ string** | |
| Normalize the input text. | |
| **Kind**: instance abstract method of [Normalizer](#module_tokenizers..Normalizer) | |
| **Returns**: string - The normalized text. | |
| **Throws**: | |
| - Error If this method is not implemented in a subclass. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The text to normalize. | |
| * * * | |
| ### *`normalizer._call(text)` ⇒ string* | |
| Alias for [Normalizer#normalize](#module_tokenizers..Normalizer+normalize). | |
| **Kind**: instance method of [Normalizer](#module_tokenizers..Normalizer) | |
| **Returns**: string - The normalized text. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The text to normalize. | |
| * * * | |
| ### *`Normalizer.fromConfig(config)` ⇒ Normalizer* | |
| Factory method for creating normalizers from config objects. | |
| **Kind**: static method of [Normalizer](#module_tokenizers..Normalizer) | |
| **Returns**: Normalizer - A Normalizer object. | |
| **Throws**: | |
| - Error If an unknown Normalizer type is specified in the config. | |
| Param | Type | Description | |
| --- | --- | --- | |
| config | `Object` | The configuration object for the normalizer. | |
| * * * | |
| ## tokenizers~Replace ⇐ Normalizer | |
| A normalizer that replaces occurrences of a pattern (a string or regular expression) with a given string. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * * * | |
| ### `replace.normalize(text)` ⇒ string | |
| Normalize the input text by replacing the pattern with the content. | |
| **Kind**: instance method of [Replace](#module_tokenizers..Replace) | |
| **Returns**: string - The normalized text after replacing the pattern with the content. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The input text to be normalized. | |
| * * * | |
| ## *tokenizers~UnicodeNormalizer ⇐ Normalizer* | |
| A normalizer that applies Unicode normalization to the input text. | |
| **Kind**: inner abstract class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * *[~UnicodeNormalizer](#module_tokenizers..UnicodeNormalizer) ⇐ Normalizer* | |
| * *[`.form`](#module_tokenizers..UnicodeNormalizer+form) : string* | |
| * *[`.normalize(text)`](#module_tokenizers..UnicodeNormalizer+normalize) ⇒ string* | |
| * * * | |
| ### *`unicodeNormalizer.form` : string* | |
| The Unicode normalization form to apply. Should be one of: 'NFC', 'NFD', 'NFKC', or 'NFKD'. | |
| **Kind**: instance property of [UnicodeNormalizer](#module_tokenizers..UnicodeNormalizer) | |
| * * * | |
| ### *`unicodeNormalizer.normalize(text)` ⇒ string* | |
| Normalize the input text by applying Unicode normalization. | |
| **Kind**: instance method of [UnicodeNormalizer](#module_tokenizers..UnicodeNormalizer) | |
| **Returns**: string - The normalized text. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The input text to be normalized. | |
| * * * | |
| ## tokenizers~NFC ⇐ UnicodeNormalizer | |
| A normalizer that applies Unicode normalization form C (NFC) to the input text. | |
| Canonical Decomposition, followed by Canonical Composition. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: UnicodeNormalizer | |
| * * * | |
| ## tokenizers~NFD ⇐ UnicodeNormalizer | |
| A normalizer that applies Unicode normalization form D (NFD) to the input text. | |
| Canonical Decomposition. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: UnicodeNormalizer | |
| * * * | |
| ## tokenizers~NFKC ⇐ UnicodeNormalizer | |
| A normalizer that applies Unicode normalization form KC (NFKC) to the input text. | |
| Compatibility Decomposition, followed by Canonical Composition. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: UnicodeNormalizer | |
| * * * | |
| ## tokenizers~NFKD ⇐ UnicodeNormalizer | |
| A normalizer that applies Unicode normalization form KD (NFKD) to the input text. | |
| Compatibility Decomposition. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: UnicodeNormalizer | |
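The difference between the four forms is easiest to see on concrete strings. The snippet below uses only the built-in `String.prototype.normalize`; whether the normalizer classes above delegate to it is an assumption, and the variable names are illustrative.

```javascript
// The four Unicode normalization forms via the built-in String.prototype.normalize.
const composed = '\u00e9';    // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent
const ligature = '\ufb01';    // the "fi" ligature as a single code point

const forms = {
  nfc: decomposed.normalize('NFC'),       // recomposes to a single code point
  nfd: composed.normalize('NFD'),         // splits into base letter + combining mark
  nfkc: ligature.normalize('NFKC'),       // compatibility mapping expands the ligature
  nfkd: ligature.normalize('NFKD'),       // same expansion, left in decomposed form
  nfcLigature: ligature.normalize('NFC'), // canonical forms leave the ligature alone
};
console.log(forms);
```

The canonical forms (NFC, NFD) only change how accents are encoded, while the compatibility forms (NFKC, NFKD) also replace characters such as ligatures with their plain equivalents.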
| * * * | |
| ## tokenizers~StripNormalizer | |
| A normalizer that strips leading and/or trailing whitespace from the input text. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * * * | |
| ### `stripNormalizer.normalize(text)` ⇒ string | |
| Strip leading and/or trailing whitespace from the input text. | |
| **Kind**: instance method of [StripNormalizer](#module_tokenizers..StripNormalizer) | |
| **Returns**: string - The normalized text. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The input text. | |
| * * * | |
| ## tokenizers~StripAccents ⇐ Normalizer | |
| StripAccents normalizer removes all accents from the text. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * * * | |
| ### `stripAccents.normalize(text)` ⇒ string | |
| Remove all accents from the text. | |
| **Kind**: instance method of [StripAccents](#module_tokenizers..StripAccents) | |
| **Returns**: string - The normalized text without accents. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The input text. | |
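One common way to remove accents is to decompose the text and then drop all combining marks. The sketch below takes that approach; the function name `stripAccentsSketch` is hypothetical, and the NFD step is an assumption, since the class above may expect already-decomposed input.

```javascript
// Sketch: decompose the text, then drop combining marks (Unicode category M).
function stripAccentsSketch(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

const plain = stripAccentsSketch('Crème brûlée à la carte');
console.log(plain);
```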
| * * * | |
| ## tokenizers~Lowercase ⇐ Normalizer | |
| A Normalizer that lowercases the input string. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * * * | |
| ### `lowercase.normalize(text)` ⇒ string | |
| Lowercases the input string. | |
| **Kind**: instance method of [Lowercase](#module_tokenizers..Lowercase) | |
| **Returns**: string - The normalized text. | |
| Param | Type | Description | |
| --- | --- | --- | |
| text | `string` | The text to normalize. | |
| * * * | |
| ## tokenizers~Prepend ⇐ Normalizer | |
| A Normalizer that prepends a string to the input string. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * * * | |
| ### `prepend.normalize(text)` ⇒ string | |
| Prepends a string to the input text. | |
| **Kind**: instance method of [Prepend](#module_tokenizers..Prepend) | |
| **Returns**: string - The normalized text. | |
| ParamTypeDescription | |
| textstringThe text to normalize. | |
| * * * | |
| ## tokenizers~NormalizerSequence ⇐ Normalizer | |
| A Normalizer that applies a sequence of Normalizers. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * [~NormalizerSequence](#module_tokenizers..NormalizerSequence) ⇐ Normalizer | |
| * [`new NormalizerSequence(config)`](#new_module_tokenizers..NormalizerSequence_new) | |
| * [`.normalize(text)`](#module_tokenizers..NormalizerSequence+normalize) ⇒ string | |
| * * * | |
| ### `new NormalizerSequence(config)` | |
| Create a new instance of NormalizerSequence. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| - `config.normalizers` : Array.<Object> - An array of Normalizer configuration objects. | |
| * * * | |
| ### `normalizerSequence.normalize(text)` ⇒ string | |
| Apply a sequence of Normalizers to the input text. | |
| **Kind**: instance method of [NormalizerSequence](#module_tokenizers..NormalizerSequence) | |
| **Returns**: string - The normalized text. | |
| **Params**: | |
| - `text` : string - The text to normalize. | |
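Applying normalizers in sequence amounts to feeding each output into the next normalizer. A minimal sketch, assuming each normalizer is any object with a `normalize(text)` method; `lowercaseSketch`, `stripSketch`, and `normalizeInSequence` are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-ins for Normalizer instances: anything with normalize(text).
const lowercaseSketch = { normalize: (text) => text.toLowerCase() };
const stripSketch = { normalize: (text) => text.trim() };

function normalizeInSequence(normalizers, text) {
  // Feed the output of each normalizer into the next one, left to right.
  return normalizers.reduce((current, n) => n.normalize(current), text);
}

console.log(normalizeInSequence([stripSketch, lowercaseSketch], '  Hello World  ')); // hello world
```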
| * * * | |
| ## tokenizers~BertNormalizer ⇐ Normalizer | |
| A class representing a normalizer used in BERT tokenization. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * [~BertNormalizer](#module_tokenizers..BertNormalizer) ⇐ Normalizer | |
| * [`._tokenize_chinese_chars(text)`](#module_tokenizers..BertNormalizer+_tokenize_chinese_chars) ⇒ string | |
| * [`.stripAccents(text)`](#module_tokenizers..BertNormalizer+stripAccents) ⇒ string | |
| * [`.normalize(text)`](#module_tokenizers..BertNormalizer+normalize) ⇒ string | |
| * * * | |
| ### `bertNormalizer._tokenize_chinese_chars(text)` ⇒ string | |
| Adds whitespace around any CJK (Chinese, Japanese, or Korean) character in the input text. | |
| **Kind**: instance method of [BertNormalizer](#module_tokenizers..BertNormalizer) | |
| **Returns**: string - The tokenized text with whitespace added around CJK characters. | |
| **Params**: | |
| - `text` : string - The input text to tokenize. | |
| * * * | |
| ### `bertNormalizer.stripAccents(text)` ⇒ string | |
| Strips accents from the given text. | |
| **Kind**: instance method of [BertNormalizer](#module_tokenizers..BertNormalizer) | |
| **Returns**: string - The text with accents removed. | |
| **Params**: | |
| - `text` : string - The text to strip accents from. | |
| * * * | |
| ### `bertNormalizer.normalize(text)` ⇒ string | |
| Normalizes the given text based on the configuration. | |
| **Kind**: instance method of [BertNormalizer](#module_tokenizers..BertNormalizer) | |
| **Returns**: string - The normalized text. | |
| **Params**: | |
| - `text` : string - The text to normalize. | |
| * * * | |
| ## tokenizers~PreTokenizer ⇐ [Callable](#Callable) | |
| A callable class representing a pre-tokenizer used in tokenization. Subclasses | |
| should implement the `pre_tokenize_text` method to define the specific pre-tokenization logic. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: [Callable](#Callable) | |
| * [~PreTokenizer](#module_tokenizers..PreTokenizer) ⇐ [Callable](#Callable) | |
| * _instance_ | |
| * *[`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize_text) ⇒ Array.<string>* | |
| * [`.pre_tokenize(text, [options])`](#module_tokenizers..PreTokenizer+pre_tokenize) ⇒ Array.<string> | |
| * [`._call(text, [options])`](#module_tokenizers..PreTokenizer+_call) ⇒ Array.<string> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PreTokenizer.fromConfig) ⇒ PreTokenizer | |
| * * * | |
| ### *`preTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string>* | |
| Method that should be implemented by subclasses to define the specific pre-tokenization logic. | |
| **Kind**: instance abstract method of [PreTokenizer](#module_tokenizers..PreTokenizer) | |
| **Returns**: Array.<string> - The pre-tokenized text. | |
| **Throws**: | |
| - Error If the method is not implemented in the subclass. | |
| **Params**: | |
| - `text` : string - The text to pre-tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ### `preTokenizer.pre_tokenize(text, [options])` ⇒ Array.<string> | |
| Tokenizes the given text into pre-tokens. | |
| **Kind**: instance method of [PreTokenizer](#module_tokenizers..PreTokenizer) | |
| **Returns**: Array.<string> - An array of pre-tokens. | |
| **Params**: | |
| - `text` : string | Array<string> - The text or array of texts to pre-tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ### `preTokenizer._call(text, [options])` ⇒ Array.<string> | |
| Alias for [PreTokenizer#pre_tokenize](#module_tokenizers..PreTokenizer+pre_tokenize). | |
| **Kind**: instance method of [PreTokenizer](#module_tokenizers..PreTokenizer) | |
| **Overrides**: [_call](#Callable+_call) | |
| **Returns**: Array.<string> - An array of pre-tokens. | |
| **Params**: | |
| - `text` : string | Array<string> - The text or array of texts to pre-tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ### `PreTokenizer.fromConfig(config)` ⇒ PreTokenizer | |
| Factory method that returns an instance of a subclass of `PreTokenizer` based on the provided configuration. | |
| **Kind**: static method of [PreTokenizer](#module_tokenizers..PreTokenizer) | |
| **Returns**: PreTokenizer - An instance of a subclass of `PreTokenizer`. | |
| **Throws**: | |
| - Error If the provided configuration object does not correspond to any known pre-tokenizer. | |
| **Params**: | |
| - `config` : Object - A configuration object for the pre-tokenizer. | |
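A factory of this kind typically looks up a class by a discriminator field and throws when nothing matches. The sketch below assumes the discriminator is `config.type`; the classes, the `PRE_TOKENIZER_TYPES` map, and `preTokenizerFromConfig` are hypothetical and are not the module's own code.

```javascript
// Hypothetical pre-tokenizer classes; the real module defines its own.
class WhitespaceSketch {
  pre_tokenize_text(text) {
    return text.match(/\w+|[^\w\s]+/g) ?? [];
  }
}

class DigitsSketch {
  constructor(config) {
    // Match runs of non-digits, then either single digits or digit runs.
    this.pattern = new RegExp(`[^\\d]+|\\d${config.individual_digits ? '' : '+'}`, 'gu');
  }
  pre_tokenize_text(text) {
    return text.match(this.pattern) ?? [];
  }
}

// Assumed mapping from the config's `type` field to a class.
const PRE_TOKENIZER_TYPES = new Map([
  ['Whitespace', WhitespaceSketch],
  ['Digits', DigitsSketch],
]);

function preTokenizerFromConfig(config) {
  const Ctor = PRE_TOKENIZER_TYPES.get(config.type);
  if (Ctor === undefined) {
    throw new Error(`Unknown PreTokenizer type: ${config.type}`);
  }
  return new Ctor(config);
}

const digits = preTokenizerFromConfig({ type: 'Digits', individual_digits: true });
console.log(digits.pre_tokenize_text('a12')); // [ 'a', '1', '2' ]
```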
| * * * | |
| ## tokenizers~BertPreTokenizer ⇐ PreTokenizer | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~BertPreTokenizer](#module_tokenizers..BertPreTokenizer) ⇐ PreTokenizer | |
| * [`new BertPreTokenizer(config)`](#new_module_tokenizers..BertPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..BertPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new BertPreTokenizer(config)` | |
| A PreTokenizer that splits text into words and punctuation marks using a basic tokenization scheme | |
| similar to that used in the original implementation of BERT. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| * * * | |
| ### `bertPreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Tokenizes a single text using the BERT pre-tokenization scheme. | |
| **Kind**: instance method of [BertPreTokenizer](#module_tokenizers..BertPreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens. | |
| **Params**: | |
| - `text` : string - The text to tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
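BERT-style basic tokenization separates words from punctuation before any subword step. A rough sketch of that idea follows; `bertPreTokenizeSketch` is a hypothetical helper, and using the Unicode `\p{P}` class as the punctuation set is a simplifying assumption.

```javascript
function bertPreTokenizeSketch(text) {
  // Either a run of non-space, non-punctuation characters, or one punctuation character.
  const pattern = /[^\s\p{P}]+|\p{P}/gu;
  return text.trim().match(pattern) ?? [];
}

console.log(bertPreTokenizeSketch('Hello, world!')); // [ 'Hello', ',', 'world', '!' ]
```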
| * * * | |
| ## tokenizers~ByteLevelPreTokenizer ⇐ PreTokenizer | |
| A pre-tokenizer for byte-level Byte-Pair-Encoding (BPE) models: it splits text into chunks (by default with the GPT-2 regex) and maps each byte to a printable character. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) ⇐ PreTokenizer | |
| * [`new ByteLevelPreTokenizer(config)`](#new_module_tokenizers..ByteLevelPreTokenizer_new) | |
| * [`.add_prefix_space`](#module_tokenizers..ByteLevelPreTokenizer+add_prefix_space) : boolean | |
| * [`.trim_offsets`](#module_tokenizers..ByteLevelPreTokenizer+trim_offsets) : boolean | |
| * [`.use_regex`](#module_tokenizers..ByteLevelPreTokenizer+use_regex) : boolean | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ByteLevelPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new ByteLevelPreTokenizer(config)` | |
| Creates a new instance of the `ByteLevelPreTokenizer` class. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| * * * | |
| ### `byteLevelPreTokenizer.add_prefix_space` : boolean | |
| Whether to add a leading space to the first word. This allows the leading word to be treated just like any other word. | |
| **Kind**: instance property of [ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) | |
| * * * | |
| ### `byteLevelPreTokenizer.trim_offsets` : boolean | |
| Whether the post-processing step should trim offsets to avoid including whitespace. | |
| **Kind**: instance property of [ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) | |
| **Todo** | |
| - Use this in the pretokenization step. | |
| * * * | |
| ### `byteLevelPreTokenizer.use_regex` : boolean | |
| Whether to use the standard GPT-2 regex for whitespace splitting. Set it to `false` if you want to use your own splitting. Defaults to `true`. | |
| **Kind**: instance property of [ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) | |
| * * * | |
| ### `byteLevelPreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Tokenizes a single piece of text using byte-level tokenization. | |
| **Kind**: instance method of [ByteLevelPreTokenizer](#module_tokenizers..ByteLevelPreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens. | |
| **Params**: | |
| - `text` : string - The text to tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ## tokenizers~SplitPreTokenizer ⇐ PreTokenizer | |
| Splits text using a given pattern. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~SplitPreTokenizer](#module_tokenizers..SplitPreTokenizer) ⇐ PreTokenizer | |
| * [`new SplitPreTokenizer(config)`](#new_module_tokenizers..SplitPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..SplitPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new SplitPreTokenizer(config)` | |
| **Params**: | |
| - `config` : Object - The configuration options for the pre-tokenizer. | |
| - `config.pattern` : Object - The pattern used to split the text. Can be a string or a regex object. | |
| - `config.pattern.String` : string | undefined - The string to use for splitting. Only defined if the pattern is a string. | |
| - `config.pattern.Regex` : string | undefined - The regex to use for splitting. Only defined if the pattern is a regex. | |
| - `config.behavior` : SplitDelimiterBehavior - The behavior to use when splitting. | |
| - `config.invert` : boolean - Whether to split (invert=false) or match (invert=true) the pattern. | |
| * * * | |
| ### `splitPreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Tokenizes text by splitting it using the given pattern. | |
| **Kind**: instance method of [SplitPreTokenizer](#module_tokenizers..SplitPreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens. | |
| **Params**: | |
| - `text` : string - The text to tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ## tokenizers~PunctuationPreTokenizer ⇐ PreTokenizer | |
| Splits text based on punctuation. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~PunctuationPreTokenizer](#module_tokenizers..PunctuationPreTokenizer) ⇐ PreTokenizer | |
| * [`new PunctuationPreTokenizer(config)`](#new_module_tokenizers..PunctuationPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PunctuationPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new PunctuationPreTokenizer(config)` | |
| **Params**: | |
| - `config` : Object - The configuration options for the pre-tokenizer. | |
| - `config.behavior` : SplitDelimiterBehavior - The behavior to use when splitting. | |
| * * * | |
| ### `punctuationPreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Tokenizes text by splitting it on punctuation. | |
| **Kind**: instance method of [PunctuationPreTokenizer](#module_tokenizers..PunctuationPreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens. | |
| **Params**: | |
| - `text` : string - The text to tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ## tokenizers~DigitsPreTokenizer ⇐ PreTokenizer | |
| Splits text based on digits. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~DigitsPreTokenizer](#module_tokenizers..DigitsPreTokenizer) ⇐ PreTokenizer | |
| * [`new DigitsPreTokenizer(config)`](#new_module_tokenizers..DigitsPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..DigitsPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new DigitsPreTokenizer(config)` | |
| **Params**: | |
| - `config` : Object - The configuration options for the pre-tokenizer. | |
| - `config.individual_digits` : boolean - Whether to split on individual digits. | |
| * * * | |
| ### `digitsPreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Tokenizes text by splitting it on digits. | |
| **Kind**: instance method of [DigitsPreTokenizer](#module_tokenizers..DigitsPreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens. | |
| **Params**: | |
| - `text` : string - The text to tokenize. | |
| - `[options]` : Object - Additional options for the pre-tokenization logic. | |
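The effect of `individual_digits` can be illustrated with a single regular expression that matches either runs of non-digits or digits. This is a sketch, not necessarily the library's exact pattern; `digitsPreTokenizeSketch` is a hypothetical helper.

```javascript
function digitsPreTokenizeSketch(text, individualDigits) {
  // With individualDigits, each digit is its own token; otherwise digit runs stay together.
  const pattern = new RegExp(`[^\\d]+|\\d${individualDigits ? '' : '+'}`, 'gu');
  return text.match(pattern) ?? [];
}

console.log(digitsPreTokenizeSketch('abc123de', false)); // [ 'abc', '123', 'de' ]
console.log(digitsPreTokenizeSketch('abc123de', true)); // [ 'abc', '1', '2', '3', 'de' ]
```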
| * * * | |
| ## tokenizers~PostProcessor ⇐ [Callable](#Callable) | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: [Callable](#Callable) | |
| * [~PostProcessor](#module_tokenizers..PostProcessor) ⇐ [Callable](#Callable) | |
| * [`new PostProcessor(config)`](#new_module_tokenizers..PostProcessor_new) | |
| * _instance_ | |
| * [`.post_process(tokens, ...args)`](#module_tokenizers..PostProcessor+post_process) ⇒ PostProcessedOutput | |
| * [`._call(tokens, ...args)`](#module_tokenizers..PostProcessor+_call) ⇒ PostProcessedOutput | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..PostProcessor.fromConfig) ⇒ PostProcessor | |
| * * * | |
| ### `new PostProcessor(config)` | |
| **Params**: | |
| - `config` : Object - The configuration for the post-processor. | |
| * * * | |
| ### `postProcessor.post_process(tokens, ...args)` ⇒ PostProcessedOutput | |
| Method to be implemented in subclass to apply post-processing on the given tokens. | |
| **Kind**: instance method of [PostProcessor](#module_tokenizers..PostProcessor) | |
| **Returns**: PostProcessedOutput - The post-processed tokens. | |
| **Throws**: | |
| - Error If the method is not implemented in subclass. | |
| **Params**: | |
| - `tokens` : Array - The input tokens to be post-processed. | |
| - `...args` : * - Additional arguments required by the post-processing logic. | |
| * * * | |
| ### `postProcessor._call(tokens, ...args)` ⇒ PostProcessedOutput | |
| Alias for [PostProcessor#post_process](#module_tokenizers..PostProcessor+post_process). | |
| **Kind**: instance method of [PostProcessor](#module_tokenizers..PostProcessor) | |
| **Overrides**: [_call](#Callable+_call) | |
| **Returns**: PostProcessedOutput - The post-processed tokens. | |
| **Params**: | |
| - `tokens` : Array - The input tokens to be post-processed. | |
| - `...args` : * - Additional arguments required by the post-processing logic. | |
| * * * | |
| ### `PostProcessor.fromConfig(config)` ⇒ PostProcessor | |
| Factory method to create a PostProcessor object from a configuration object. | |
| **Kind**: static method of [PostProcessor](#module_tokenizers..PostProcessor) | |
| **Returns**: PostProcessor - A PostProcessor object created from the given configuration. | |
| **Throws**: | |
| - Error If an unknown PostProcessor type is encountered. | |
| **Params**: | |
| - `config` : Object - Configuration object representing a PostProcessor. | |
| * * * | |
| ## tokenizers~BertProcessing | |
| A post-processor that adds special tokens to the beginning and end of the input. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~BertProcessing](#module_tokenizers..BertProcessing) | |
| * [`new BertProcessing(config)`](#new_module_tokenizers..BertProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..BertProcessing+post_process) ⇒ PostProcessedOutput | |
| * * * | |
| ### `new BertProcessing(config)` | |
| **Params**: | |
| - `config` : Object - The configuration for the post-processor. | |
| - `config.cls` : Array.<string> - The special tokens to add to the beginning of the input. | |
| - `config.sep` : Array.<string> - The special tokens to add to the end of the input. | |
| * * * | |
| ### `bertProcessing.post_process(tokens, [tokens_pair])` ⇒ PostProcessedOutput | |
| Adds the special tokens to the beginning and end of the input. | |
| **Kind**: instance method of [BertProcessing](#module_tokenizers..BertProcessing) | |
| **Returns**: PostProcessedOutput - The post-processed tokens with the special tokens added to the beginning and end. | |
| **Params**: | |
| - `tokens` : Array.<string> - The input tokens. | |
| - `[tokens_pair]` : Array.<string> - An optional second set of input tokens. | |
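For BERT-style inputs, the usual layout is `[CLS] A [SEP]` for one sequence and `[CLS] A [SEP] B [SEP]` for a pair. The sketch below illustrates that layout; `bertPostProcessSketch` is a hypothetical helper, the default token strings are assumptions, and the returned `{ tokens, token_type_ids }` shape is assumed rather than taken from this reference.

```javascript
function bertPostProcessSketch(tokens, tokensPair = null, cls = '[CLS]', sep = '[SEP]') {
  // First segment: [CLS] tokens [SEP], all with segment id 0.
  let out = [cls, ...tokens, sep];
  let tokenTypeIds = new Array(out.length).fill(0);

  if (tokensPair !== null) {
    // Second segment: tokens_pair [SEP], all with segment id 1.
    const second = [...tokensPair, sep];
    out = [...out, ...second];
    tokenTypeIds = [...tokenTypeIds, ...new Array(second.length).fill(1)];
  }
  return { tokens: out, token_type_ids: tokenTypeIds };
}

console.log(bertPostProcessSketch(['hello'], ['world']));
```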
| * * * | |
| ## tokenizers~TemplateProcessing ⇐ PostProcessor | |
| Post processor that replaces special tokens in a template with actual tokens. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PostProcessor | |
| * [~TemplateProcessing](#module_tokenizers..TemplateProcessing) ⇐ PostProcessor | |
| * [`new TemplateProcessing(config)`](#new_module_tokenizers..TemplateProcessing_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..TemplateProcessing+post_process) ⇒ PostProcessedOutput | |
| * * * | |
| ### `new TemplateProcessing(config)` | |
| Creates a new instance of `TemplateProcessing`. | |
| **Params**: | |
| - `config` : Object - The configuration options for the post processor. | |
| - `config.single` : Array - The template for a single sequence of tokens. | |
| - `config.pair` : Array - The template for a pair of sequences of tokens. | |
| * * * | |
| ### `templateProcessing.post_process(tokens, [tokens_pair])` ⇒ PostProcessedOutput | |
| Replaces special tokens in the template with actual tokens. | |
| **Kind**: instance method of [TemplateProcessing](#module_tokenizers..TemplateProcessing) | |
| **Returns**: PostProcessedOutput - An object containing the list of tokens with the special tokens replaced with actual tokens. | |
| **Params**: | |
| - `tokens` : Array.<string> - The list of tokens for the first sequence. | |
| - `[tokens_pair]` : Array.<string> - The list of tokens for the second sequence (optional). | |
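Template expansion can be pictured as walking the template and emitting either a special token or one of the input sequences. The sketch below assumes template items shaped like `{ SpecialToken: { id } }` and `{ Sequence: { id: 'A' | 'B' } }`; that shape and the `applyTemplateSketch` helper are assumptions made for illustration.

```javascript
function applyTemplateSketch(template, tokens, tokensPair = []) {
  const out = [];
  for (const item of template) {
    if ('SpecialToken' in item) {
      // Special tokens are emitted as-is.
      out.push(item.SpecialToken.id);
    } else if ('Sequence' in item) {
      // 'A' refers to the first sequence, anything else to the second.
      out.push(...(item.Sequence.id === 'A' ? tokens : tokensPair));
    }
  }
  return out;
}

const singleTemplate = [
  { SpecialToken: { id: '[CLS]' } },
  { Sequence: { id: 'A' } },
  { SpecialToken: { id: '[SEP]' } },
];
console.log(applyTemplateSketch(singleTemplate, ['hello', 'world']));
// [ '[CLS]', 'hello', 'world', '[SEP]' ]
```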
| * * * | |
| ## tokenizers~ByteLevelPostProcessor ⇐ PostProcessor | |
| A PostProcessor that returns the given tokens as is. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PostProcessor | |
| * * * | |
| ### `byteLevelPostProcessor.post_process(tokens, [tokens_pair])` ⇒ PostProcessedOutput | |
| Post process the given tokens. | |
| **Kind**: instance method of [ByteLevelPostProcessor](#module_tokenizers..ByteLevelPostProcessor) | |
| **Returns**: PostProcessedOutput - An object containing the post-processed tokens. | |
| **Params**: | |
| - `tokens` : Array.<string> - The list of tokens for the first sequence. | |
| - `[tokens_pair]` : Array.<string> - The list of tokens for the second sequence (optional). | |
| * * * | |
| ## tokenizers~PostProcessorSequence | |
| A post-processor that applies multiple post-processors in sequence. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~PostProcessorSequence](#module_tokenizers..PostProcessorSequence) | |
| * [`new PostProcessorSequence(config)`](#new_module_tokenizers..PostProcessorSequence_new) | |
| * [`.post_process(tokens, [tokens_pair])`](#module_tokenizers..PostProcessorSequence+post_process) ⇒ PostProcessedOutput | |
| * * * | |
| ### `new PostProcessorSequence(config)` | |
| Creates a new instance of PostProcessorSequence. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| - `config.processors` : Array.<Object> - The list of post-processors to apply. | |
| * * * | |
| ### `postProcessorSequence.post_process(tokens, [tokens_pair])` ⇒ PostProcessedOutput | |
| Post process the given tokens. | |
| **Kind**: instance method of [PostProcessorSequence](#module_tokenizers..PostProcessorSequence) | |
| **Returns**: PostProcessedOutput - An object containing the post-processed tokens. | |
| **Params**: | |
| - `tokens` : Array.<string> - The list of tokens for the first sequence. | |
| - `[tokens_pair]` : Array.<string> - The list of tokens for the second sequence (optional). | |
| * * * | |
| ## tokenizers~Decoder ⇐ [Callable](#Callable) | |
| The base class for token decoders. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: [Callable](#Callable) | |
| * [~Decoder](#module_tokenizers..Decoder) ⇐ [Callable](#Callable) | |
| * [`new Decoder(config)`](#new_module_tokenizers..Decoder_new) | |
| * _instance_ | |
| * [`.added_tokens`](#module_tokenizers..Decoder+added_tokens) : Array.<AddedToken> | |
| * [`._call(tokens)`](#module_tokenizers..Decoder+_call) ⇒ string | |
| * [`.decode(tokens)`](#module_tokenizers..Decoder+decode) ⇒ string | |
| * [`.decode_chain(tokens)`](#module_tokenizers..Decoder+decode_chain) ⇒ Array.<string> | |
| * _static_ | |
| * [`.fromConfig(config)`](#module_tokenizers..Decoder.fromConfig) ⇒ Decoder | |
| * * * | |
| ### `new Decoder(config)` | |
| Creates an instance of `Decoder`. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| * * * | |
| ### `decoder.added_tokens` : Array.<AddedToken> | |
| **Kind**: instance property of [Decoder](#module_tokenizers..Decoder) | |
| * * * | |
| ### `decoder._call(tokens)` ⇒ string | |
| Calls the `decode` method. | |
| **Kind**: instance method of [Decoder](#module_tokenizers..Decoder) | |
| **Overrides**: [_call](#Callable+_call) | |
| **Returns**: string - The decoded string. | |
| **Params**: | |
| - `tokens` : Array.<string> - The list of tokens. | |
| * * * | |
| ### `decoder.decode(tokens)` ⇒ string | |
| Decodes a list of tokens. | |
| **Kind**: instance method of [Decoder](#module_tokenizers..Decoder) | |
| **Returns**: string - The decoded string. | |
| **Params**: | |
| - `tokens` : Array.<string> - The list of tokens. | |
| * * * | |
| ### `decoder.decode_chain(tokens)` ⇒ Array.<string> | |
| Apply the decoder to a list of tokens. | |
| **Kind**: instance method of [Decoder](#module_tokenizers..Decoder) | |
| **Returns**: Array.<string> - The decoded list of tokens. | |
| **Throws**: | |
| - Error If the `decode_chain` method is not implemented in the subclass. | |
| **Params**: | |
| - `tokens` : Array.<string> - The list of tokens. | |
| * * * | |
| ### `Decoder.fromConfig(config)` ⇒ Decoder | |
| Creates a decoder instance based on the provided configuration. | |
| **Kind**: static method of [Decoder](#module_tokenizers..Decoder) | |
| **Returns**: Decoder - A decoder instance. | |
| **Throws**: | |
| - Error If an unknown decoder type is provided. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| * * * | |
| ## tokenizers~FuseDecoder | |
| Fuse simply fuses all tokens into one big string. | |
| It's usually the last decoding step anyway, but this decoder | |
| exists in case some decoders need to run after that step. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * * * | |
| ### `fuseDecoder.decode_chain()` : * | |
| **Kind**: instance method of [FuseDecoder](#module_tokenizers..FuseDecoder) | |
| * * * | |
| ## tokenizers~WordPieceDecoder ⇐ Decoder | |
| A decoder that decodes a list of WordPiece tokens into a single string. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Decoder | |
| * [~WordPieceDecoder](#module_tokenizers..WordPieceDecoder) ⇐ Decoder | |
| * [`new WordPieceDecoder(config)`](#new_module_tokenizers..WordPieceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..WordPieceDecoder+decode_chain) : * | |
| * * * | |
| ### `new WordPieceDecoder(config)` | |
| Creates a new instance of WordPieceDecoder. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| - `config.prefix` : string - The prefix used for WordPiece encoding. | |
| - `config.cleanup` : boolean - Whether to cleanup the decoded string. | |
| * * * | |
| ### `wordPieceDecoder.decode_chain()` : * | |
| **Kind**: instance method of [WordPieceDecoder](#module_tokenizers..WordPieceDecoder) | |
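WordPiece decoding joins continuation pieces to the previous token and puts a space before every other token. A minimal sketch of that rule, assuming the common `##` prefix and leaving out the optional cleanup step; `wordPieceDecodeChainSketch` is a hypothetical helper.

```javascript
function wordPieceDecodeChainSketch(tokens, prefix = '##') {
  return tokens.map((token, i) => {
    if (i === 0) return token;
    // Continuation pieces lose their prefix; new words get a leading space.
    return token.startsWith(prefix) ? token.slice(prefix.length) : ' ' + token;
  });
}

const pieces = wordPieceDecodeChainSketch(['token', '##izer', '##s', 'are', 'fun']);
console.log(pieces.join('')); // tokenizers are fun
```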
| * * * | |
| ## tokenizers~ByteLevelDecoder ⇐ Decoder | |
| Byte-level decoder for tokenization output. Inherits from the `Decoder` class. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Decoder | |
| * [~ByteLevelDecoder](#module_tokenizers..ByteLevelDecoder) ⇐ Decoder | |
| * [`new ByteLevelDecoder(config)`](#new_module_tokenizers..ByteLevelDecoder_new) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..ByteLevelDecoder+convert_tokens_to_string) ⇒ string | |
| * [`.decode_chain()`](#module_tokenizers..ByteLevelDecoder+decode_chain) : * | |
| * * * | |
| ### `new ByteLevelDecoder(config)` | |
| Create a `ByteLevelDecoder` object. | |
| **Params**: | |
| - `config` : Object - Configuration object. | |
| * * * | |
| ### `byteLevelDecoder.convert_tokens_to_string(tokens)` ⇒ string | |
| Convert an array of tokens to string by decoding each byte. | |
| **Kind**: instance method of [ByteLevelDecoder](#module_tokenizers..ByteLevelDecoder) | |
| **Returns**: string - The decoded string. | |
| **Params**: | |
| - `tokens` : Array.<string> - Array of tokens to be decoded. | |
| * * * | |
| ### `byteLevelDecoder.decode_chain()` : * | |
| **Kind**: instance method of [ByteLevelDecoder](#module_tokenizers..ByteLevelDecoder) | |
| * * * | |
| ## tokenizers~CTCDecoder | |
| The CTC (Connectionist Temporal Classification) decoder. | |
| See https://github.com/huggingface/tokenizers/blob/bb38f390a61883fc2f29d659af696f428d1cda6b/tokenizers/src/decoders/ctc.rs | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~CTCDecoder](#module_tokenizers..CTCDecoder) | |
| * [`.convert_tokens_to_string(tokens)`](#module_tokenizers..CTCDecoder+convert_tokens_to_string) ⇒ string | |
| * [`.decode_chain()`](#module_tokenizers..CTCDecoder+decode_chain) : * | |
| * * * | |
| ### `ctcDecoder.convert_tokens_to_string(tokens)` ⇒ string | |
| Converts Connectionist Temporal Classification (CTC) output tokens into a single string. | |
| **Kind**: instance method of [CTCDecoder](#module_tokenizers..CTCDecoder) | |
| **Returns**: string - The decoded string. | |
| **Params**: | |
| - `tokens` : Array.<string> - Array of tokens to be decoded. | |
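CTC decoding generally collapses consecutive repeats, drops the padding (blank) token, and then turns word delimiters into spaces. The sketch below follows those steps; `ctcTokensToStringSketch` is a hypothetical helper, and the `<pad>` and `|` defaults are assumptions.

```javascript
function ctcTokensToStringSketch(tokens, padToken = '<pad>', wordDelimiter = '|') {
  // 1. Collapse runs of identical consecutive tokens.
  const collapsed = tokens.filter((t, i) => i === 0 || t !== tokens[i - 1]);
  // 2. Drop the padding token, which separates genuine repeats.
  const kept = collapsed.filter((t) => t !== padToken);
  // 3. Join and turn word delimiters into spaces.
  return kept.join('').replaceAll(wordDelimiter, ' ').trim();
}

const ctcTokens = ['h', 'h', 'e', 'l', '<pad>', 'l', 'o', '|', '|', 'h', 'i'];
console.log(ctcTokensToStringSketch(ctcTokens)); // hello hi
```

The padding token between the two `l` tokens is what keeps the double letter in "hello" from being collapsed.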
| * * * | |
| ### `ctcDecoder.decode_chain()` : * | |
| **Kind**: instance method of [CTCDecoder](#module_tokenizers..CTCDecoder) | |
| * * * | |
| ## tokenizers~DecoderSequence ⇐ Decoder | |
| Apply a sequence of decoders. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Decoder | |
| * [~DecoderSequence](#module_tokenizers..DecoderSequence) ⇐ Decoder | |
| * [`new DecoderSequence(config)`](#new_module_tokenizers..DecoderSequence_new) | |
| * [`.decode_chain()`](#module_tokenizers..DecoderSequence+decode_chain) : * | |
| * * * | |
| ### `new DecoderSequence(config)` | |
| Creates a new instance of DecoderSequence. | |
| **Params**: | |
| - `config` : Object - The configuration object. | |
| - `config.decoders` : Array.<Object> - The list of decoders to apply. | |
| * * * | |
| ### `decoderSequence.decode_chain()` : * | |
| **Kind**: instance method of [DecoderSequence](#module_tokenizers..DecoderSequence) | |
| * * * | |
| ## tokenizers~MetaspacePreTokenizer ⇐ PreTokenizer | |
| This PreTokenizer replaces spaces with the given replacement character, adds a prefix space if requested, | |
| and returns a list of tokens. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~MetaspacePreTokenizer](#module_tokenizers..MetaspacePreTokenizer) ⇐ PreTokenizer | |
| * [`new MetaspacePreTokenizer(config)`](#new_module_tokenizers..MetaspacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..MetaspacePreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new MetaspacePreTokenizer(config)` | |
| **Params**: | |
| - `config` : Object - The configuration object for the MetaspacePreTokenizer. | |
| - `config.replacement` : string - The character to replace spaces with. | |
| - `[config.str_rep]` : string (default: `config.replacement`) - An optional string representation of the replacement character. | |
| - `[config.prepend_scheme]` : 'first' | 'never' | 'always' (default: `'always'`) - The metaspace prepending scheme. | |
| * * * | |
| ### `metaspacePreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| This method takes a string, replaces spaces with the replacement character, | |
| adds a prefix space if requested, and returns a new list of tokens. | |
| **Kind**: instance method of [MetaspacePreTokenizer](#module_tokenizers..MetaspacePreTokenizer) | |
| **Returns**: Array.<string> - A new list of pre-tokenized tokens. | |
| **Params**: | |
| - `text` : string - The text to pre-tokenize. | |
| - `[options]` : Object - The options for the pre-tokenization. | |
| - `[options.section_index]` : number - The index of the section to pre-tokenize. | |
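The replacement and prepending rules can be sketched in a few lines. This is an illustration under assumptions: `metaspacePreTokenizeSketch` and its option names are hypothetical, `▁` (U+2581) is assumed as the replacement character, and `'first'` is interpreted as prepending only when the section index is 0.

```javascript
function metaspacePreTokenizeSketch(
  text,
  { replacement = '▁', prependScheme = 'always', sectionIndex = 0 } = {},
) {
  let normalized = text.replaceAll(' ', replacement);
  // 'always' prepends to every section; 'first' only to section 0; 'never' skips it.
  const shouldPrepend =
    prependScheme === 'always' || (prependScheme === 'first' && sectionIndex === 0);
  if (shouldPrepend && !normalized.startsWith(replacement)) {
    normalized = replacement + normalized;
  }
  return [normalized];
}

console.log(metaspacePreTokenizeSketch('Hello world')); // [ '▁Hello▁world' ]
console.log(metaspacePreTokenizeSketch('Hello world', { prependScheme: 'never' })); // [ 'Hello▁world' ]
```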
| * * * | |
| ## tokenizers~MetaspaceDecoder ⇐ Decoder | |
| MetaspaceDecoder class extends the Decoder class and decodes Metaspace tokenization. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Decoder | |
| * [~MetaspaceDecoder](#module_tokenizers..MetaspaceDecoder) ⇐ Decoder | |
| * [`new MetaspaceDecoder(config)`](#new_module_tokenizers..MetaspaceDecoder_new) | |
| * [`.decode_chain()`](#module_tokenizers..MetaspaceDecoder+decode_chain) : * | |
| * * * | |
| ### `new MetaspaceDecoder(config)` | |
| Constructs a new MetaspaceDecoder object. | |
| **Params**: | |
| - `config` : Object - The configuration object for the MetaspaceDecoder. | |
| - `config.replacement` : string - The string that spaces were replaced with; it is converted back to spaces when decoding. | |
| * * * | |
| ### `metaspaceDecoder.decode_chain()` : * | |
| **Kind**: instance method of [MetaspaceDecoder](#module_tokenizers..MetaspaceDecoder) | |
| * * * | |
| ## tokenizers~Precompiled ⇐ Normalizer | |
| A normalizer that applies a precompiled charsmap. | |
| This is useful for applying complex normalizations in C++ and exposing them to JavaScript. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: Normalizer | |
| * [~Precompiled](#module_tokenizers..Precompiled) ⇐ Normalizer | |
| * [`new Precompiled(config)`](#new_module_tokenizers..Precompiled_new) | |
| * [`.normalize(text)`](#module_tokenizers..Precompiled+normalize) ⇒ string | |
| * * * | |
| ### `new Precompiled(config)` | |
| Create a new instance of Precompiled normalizer. | |
| **Params**: | |
| - `config` : Object - The configuration object for the Precompiled normalizer. | |
| - `config.precompiled_charsmap` : Object - The precompiled charsmap object. | |
| * * * | |
| ### `precompiled.normalize(text)` ⇒ string | |
| Normalizes the given text by applying the precompiled charsmap. | |
| **Kind**: instance method of [Precompiled](#module_tokenizers..Precompiled) | |
| **Returns**: string - The normalized text. | |
| **Params**: | |
| - `text` : string - The text to normalize. | |
| * * * | |
| ## tokenizers~PreTokenizerSequence ⇐ PreTokenizer | |
| A pre-tokenizer that applies a sequence of pre-tokenizers to the input text. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~PreTokenizerSequence](#module_tokenizers..PreTokenizerSequence) ⇐ PreTokenizer | |
| * [`new PreTokenizerSequence(config)`](#new_module_tokenizers..PreTokenizerSequence_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..PreTokenizerSequence+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new PreTokenizerSequence(config)` | |
| Creates an instance of PreTokenizerSequence. | |
| **Params** | |
| * `config` : Object - The configuration object for the pre-tokenizer sequence. | |
| * `config.pretokenizers` : Array.<Object> - An array of pre-tokenizer configurations. | |
| * * * | |
| ### `preTokenizerSequence.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Applies each pre-tokenizer in the sequence to the input text in turn. | |
| **Kind**: instance method of [PreTokenizerSequence](#module_tokenizers..PreTokenizerSequence) | |
| **Returns**: Array.<string> - The pre-tokenized text. | |
| **Params** | |
| * `text` : string - The text to pre-tokenize. | |
| * `[options]` : Object - Additional options for the pre-tokenization logic. | |
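The chaining described for `pre_tokenize_text` can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `applySequence` and the two stand-in pre-tokenizers are hypothetical names, and each pre-tokenizer is modelled as a plain function from a string to an array of strings.

```javascript
// Hypothetical stand-ins: each pre-tokenizer maps one string to an array of strings.
const splitOnWhitespace = (text) => text.match(/\S+/g) || [];
const splitOnPunctuation = (text) => text.match(/\w+|[^\w\s]+/g) || [];

// Apply every pre-tokenizer in order. Each one runs on all pieces produced
// by the previous one, and the results are flattened into a single array.
function applySequence(preTokenizers, text) {
  return preTokenizers.reduce(
    (pieces, preTokenize) => pieces.flatMap((piece) => preTokenize(piece)),
    [text],
  );
}

console.log(applySequence([splitOnWhitespace, splitOnPunctuation], 'Hello, world!'));
// [ 'Hello', ',', 'world', '!' ]
```

Starting from a one-element array keeps the first pre-tokenizer on the same code path as every later one.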
| * * * | |
| ## tokenizers~WhitespacePreTokenizer | |
| Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`). | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~WhitespacePreTokenizer](#module_tokenizers..WhitespacePreTokenizer) | |
| * [`new WhitespacePreTokenizer(config)`](#new_module_tokenizers..WhitespacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespacePreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new WhitespacePreTokenizer(config)` | |
| Creates an instance of WhitespacePreTokenizer. | |
| **Params** | |
| * `config` : Object - The configuration object for the pre-tokenizer. | |
| * * * | |
| ### `whitespacePreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Pre-tokenizes the input text by splitting it on word boundaries. | |
| **Kind**: instance method of [WhitespacePreTokenizer](#module_tokenizers..WhitespacePreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens produced by splitting the input text on word boundaries. | |
| **Params** | |
| * `text` : string - The text to be pre-tokenized. | |
| * `[options]` : Object - Additional options for the pre-tokenization logic. | |
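To make the regular expression `\w+|[^\w\s]+` concrete, here is a small sketch of splitting with it. The function name `splitOnWordBoundaries` is hypothetical, and the code illustrates only the pattern itself, not the class's actual implementation.

```javascript
// Sketch only: match runs of word characters, or runs of characters
// that are neither word characters nor whitespace.
function splitOnWordBoundaries(text) {
  return text.match(/\w+|[^\w\s]+/g) || [];
}

console.log(splitOnWordBoundaries("Don't stop!"));
// [ 'Don', "'", 't', 'stop', '!' ]
```

In JavaScript, `\w` matches only ASCII letters, digits, and the underscore, so with this pattern an accented letter such as `é` falls into the second alternative and is split off from its neighbours.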
| * * * | |
| ## tokenizers~WhitespaceSplit ⇐ PreTokenizer | |
| Splits a string of text by whitespace characters into individual tokens. | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| **Extends**: PreTokenizer | |
| * [~WhitespaceSplit](#module_tokenizers..WhitespaceSplit) ⇐ PreTokenizer | |
| * [`new WhitespaceSplit(config)`](#new_module_tokenizers..WhitespaceSplit_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..WhitespaceSplit+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new WhitespaceSplit(config)` | |
| Creates an instance of WhitespaceSplit. | |
| **Params** | |
| * `config` : Object - The configuration object for the pre-tokenizer. | |
| * * * | |
| ### `whitespaceSplit.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Pre-tokenizes the input text by splitting it on whitespace characters. | |
| **Kind**: instance method of [WhitespaceSplit](#module_tokenizers..WhitespaceSplit) | |
| **Returns**: Array.<string> - An array of tokens produced by splitting the input text on whitespace. | |
| **Params** | |
| * `text` : string - The text to be pre-tokenized. | |
| * `[options]` : Object - Additional options for the pre-tokenization logic. | |
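For contrast with word-boundary splitting, a whitespace-only split keeps punctuation attached to the neighbouring word. The sketch below assumes a simple `\S+` match and uses the hypothetical name `splitOnWhitespaceOnly`. It is an illustration, not the class's source.

```javascript
// Sketch only: every maximal run of non-whitespace characters is one token.
function splitOnWhitespaceOnly(text) {
  return text.match(/\S+/g) || [];
}

console.log(splitOnWhitespaceOnly('  Hello,   world!\n'));
// [ 'Hello,', 'world!' ]
```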
| * * * | |
| ## tokenizers~ReplacePreTokenizer | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~ReplacePreTokenizer](#module_tokenizers..ReplacePreTokenizer) | |
| * [`new ReplacePreTokenizer(config)`](#new_module_tokenizers..ReplacePreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..ReplacePreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new ReplacePreTokenizer(config)` | |
| **Params** | |
| * `config` : Object - The configuration options for the pre-tokenizer. | |
| * `config.pattern` : Object - The pattern used to split the text. Can be a string or a regex object. | |
| * `config.content` : string - What to replace the pattern with. | |
| * * * | |
| ### `replacePreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Pre-tokenizes the input text by replacing certain characters. | |
| **Kind**: instance method of [ReplacePreTokenizer](#module_tokenizers..ReplacePreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens produced by replacing certain characters. | |
| **Params** | |
| * `text` : string - The text to be pre-tokenized. | |
| * `[options]` : Object - Additional options for the pre-tokenization logic. | |
| * * * | |
| ## tokenizers~FixedLengthPreTokenizer | |
| **Kind**: inner class of [tokenizers](#module_tokenizers) | |
| * [~FixedLengthPreTokenizer](#module_tokenizers..FixedLengthPreTokenizer) | |
| * [`new FixedLengthPreTokenizer(config)`](#new_module_tokenizers..FixedLengthPreTokenizer_new) | |
| * [`.pre_tokenize_text(text, [options])`](#module_tokenizers..FixedLengthPreTokenizer+pre_tokenize_text) ⇒ Array.<string> | |
| * * * | |
| ### `new FixedLengthPreTokenizer(config)` | |
| **Params** | |
| * `config` : Object - The configuration options for the pre-tokenizer. | |
| * `config.length` : number - The fixed length to split the text into. | |
| * * * | |
| ### `fixedLengthPreTokenizer.pre_tokenize_text(text, [options])` ⇒ Array.<string> | |
| Pre-tokenizes the input text by splitting it into fixed-length tokens. | |
| **Kind**: instance method of [FixedLengthPreTokenizer](#module_tokenizers..FixedLengthPreTokenizer) | |
| **Returns**: Array.<string> - An array of tokens produced by splitting the input text into fixed-length tokens. | |
| **Params** | |
| * `text` : string - The text to be pre-tokenized. | |
| * `[options]` : Object - Additional options for the pre-tokenization logic. | |
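Fixed-length splitting can be sketched as below. The name `splitFixedLength` is hypothetical, and two details are assumptions of this sketch rather than documented behaviour: length is counted in UTF-16 code units, and a shorter final chunk is kept.

```javascript
// Sketch only: cut the text into consecutive chunks of `length` code units.
// The last chunk may be shorter than `length`.
function splitFixedLength(text, length) {
  const tokens = [];
  for (let i = 0; i < text.length; i += length) {
    tokens.push(text.slice(i, i + length));
  }
  return tokens;
}

console.log(splitFixedLength('abcdefg', 3));
// [ 'abc', 'def', 'g' ]
```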
| * * * | |
| ## `tokenizers~BYTES_TO_UNICODE` ⇒ Object | |
| A mapping from UTF-8 bytes to unicode strings. | |
| It specifically avoids mapping to whitespace and control characters, which the BPE code cannot handle. | |
| **Kind**: inner constant of [tokenizers](#module_tokenizers) | |
| **Returns**: Object - Object with utf-8 byte keys and unicode string values. | |
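A table of this kind can be built as in the sketch below, which follows the scheme of the GPT-2 reference encoder. That the constant here is built the same way is an assumption, and `buildBytesToUnicode` is a hypothetical name.

```javascript
// Sketch only: byte -> printable unicode character, GPT-2 style.
function buildBytesToUnicode() {
  const range = (start, end) =>
    Array.from({ length: end - start + 1 }, (_, i) => start + i);

  // Bytes that are already printable keep their own code point:
  // '!'..'~', U+00A1..U+00AC, and U+00AE..U+00FF.
  const bytes = [
    ...range(0x21, 0x7e),
    ...range(0xa1, 0xac),
    ...range(0xae, 0xff),
  ];
  const codePoints = bytes.slice();

  // Every remaining byte (whitespace, control characters, and so on)
  // is shifted to the next unused code point at 256 or above.
  let offset = 0;
  for (let b = 0; b < 256; ++b) {
    if (!bytes.includes(b)) {
      bytes.push(b);
      codePoints.push(256 + offset);
      offset += 1;
    }
  }

  return Object.fromEntries(
    bytes.map((b, i) => [b, String.fromCharCode(codePoints[i])]),
  );
}

const byteTable = buildBytesToUnicode();
console.log(byteTable[65]); // A
console.log(byteTable[32]); // Ġ (U+0120), the visible stand-in for a space
```

Because every byte maps to a distinct printable character, the table can be inverted to recover the original bytes when decoding.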
| * * * | |
| ## `tokenizers~loadTokenizer(pretrained_model_name_or_path, options)` ⇒ Promise.<Array<any>> | |
| Loads a tokenizer from the specified path. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: Promise.<Array<any>> - A promise that resolves with information about the loaded tokenizer. | |
| **Params** | |
| * `pretrained_model_name_or_path` : string - The path to the tokenizer directory. | |
| * `options` : PretrainedTokenizerOptions - Additional options for loading the tokenizer. | |
| * * * | |
| ## `tokenizers~regexSplit(text, regex)` ⇒ Array.<string> | |
| Helper function to split a string on a regex, but keep the delimiters. | |
| This is required, because the JavaScript `.split()` method does not keep the delimiters, | |
| and wrapping in a capturing group causes issues with existing capturing groups (due to nesting). | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: Array.<string> - The split string. | |
| **Params** | |
| * `text` : string - The text to split. | |
| * `regex` : RegExp - The regex to split on. | |
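One way to split on a regex while keeping the delimiters, without wrapping the pattern in a capturing group, is to walk the matches and slice the text between them. The sketch below uses the hypothetical name `splitKeepingDelimiters` and is an illustration of the idea, not the helper's source. It assumes the regex carries the `g` flag, which `String.prototype.matchAll` requires.

```javascript
// Sketch only: return the pieces between matches and the matches themselves,
// in their original order.
function splitKeepingDelimiters(text, regex) {
  const result = [];
  let prev = 0;
  for (const match of text.matchAll(regex)) {
    const delimiter = match[0];
    // Text between the previous delimiter and this one.
    if (prev < match.index) {
      result.push(text.slice(prev, match.index));
    }
    // The delimiter itself (skip empty matches).
    if (delimiter.length > 0) {
      result.push(delimiter);
    }
    prev = match.index + delimiter.length;
  }
  // Trailing text after the last delimiter.
  if (prev < text.length) {
    result.push(text.slice(prev));
  }
  return result;
}

console.log(splitKeepingDelimiters('one,two;three', /[,;]/g));
// [ 'one', ',', 'two', ';', 'three' ]
```

Joining the returned pieces reproduces the input for any pattern without lookarounds, which is a convenient sanity check.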
| * * * | |
| ## `tokenizers~createPattern(pattern, invert)` ⇒ RegExp | null | |
| Helper method to construct a pattern from a config object. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: RegExp | null - The compiled pattern. | |
| **Params** | |
| * `pattern` : Object - The pattern object. | |
| * `invert` : boolean (default: `true`) - Whether to invert the pattern. | |
| * * * | |
| ## `tokenizers~objectToMap(obj)` ⇒ Map.<string, any> | |
| Helper function to convert an Object to a Map. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: Map.<string, any> - The map. | |
| **Params** | |
| * `obj` : Object - The object to convert. | |
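A conversion like this is typically a one-liner over the object's own enumerable entries. The sketch below uses the hypothetical name `toMap` and a made-up vocabulary object.

```javascript
// Sketch only: build a Map from an object's own enumerable string-keyed entries.
function toMap(obj) {
  return new Map(Object.entries(obj));
}

const vocabMap = toMap({ hello: 0, world: 1 });
console.log(vocabMap.get('world')); // 1
console.log(vocabMap.size); // 2
```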
| * * * | |
| ## `tokenizers~prepareTensorForDecode(tensor)` ⇒ Array.<number> | |
| Helper function to convert a tensor to a list before decoding. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: Array.<number> - The tensor as a list. | |
| **Params** | |
| * `tensor` : Tensor - The tensor to convert. | |
| * * * | |
| ## `tokenizers~clean_up_tokenization(text)` ⇒ string | |
| Cleans up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: string - The cleaned up text. | |
| **Params** | |
| * `text` : string - The text to clean up. | |
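The kind of cleanup described here can be sketched as a chain of replacements. The exact set of rules below is an assumption chosen for illustration, and `cleanUpSketch` is a hypothetical name. The real helper may cover more or fewer cases.

```javascript
// Sketch only: remove the space that tokenization leaves before
// punctuation and before common English contractions.
function cleanUpSketch(text) {
  return text
    .replace(/ \./g, '.')
    .replace(/ \?/g, '?')
    .replace(/ !/g, '!')
    .replace(/ ,/g, ',')
    .replace(/ n't/g, "n't")
    .replace(/ 'm/g, "'m")
    .replace(/ 's/g, "'s")
    .replace(/ 've/g, "'ve")
    .replace(/ 're/g, "'re");
}

console.log(cleanUpSketch("I do n't know , it 's fine ."));
// I don't know, it's fine.
```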
| * * * | |
| ## `tokenizers~remove_accents(text)` ⇒ string | |
| Helper function to remove accents from a string. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: string - The text with accents removed. | |
| **Params** | |
| * `text` : string - The text to remove accents from. | |
| * * * | |
| ## `tokenizers~lowercase_and_remove_accent(text)` ⇒ string | |
| Helper function to lowercase a string and remove accents. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: string - The lowercased text with accents removed. | |
| **Params** | |
| * `text` : string - The text to lowercase and remove accents from. | |
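Accent removal and its lowercasing variant can be sketched with Unicode normalization and property escapes. The names `stripAccents` and `lowercaseAndStripAccents` are hypothetical. The explicit NFD step is an assumption of this sketch, added so that precomposed characters such as `é` are handled. The library's helpers may expect input that is already decomposed.

```javascript
// Sketch only: decompose accented characters (NFD), then drop
// every combining mark (Unicode category M).
function stripAccents(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

// Lowercase first, then strip accents.
function lowercaseAndStripAccents(text) {
  return stripAccents(text.toLowerCase());
}

console.log(stripAccents('Crème Brûlée')); // Creme Brulee
console.log(lowercaseAndStripAccents('Crème Brûlée')); // creme brulee
```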
| * * * | |
| ## `tokenizers~whitespace_split(text)` ⇒ Array.<string> | |
| Split a string on whitespace. | |
| **Kind**: inner method of [tokenizers](#module_tokenizers) | |
| **Returns**: Array.<string> - The split string. | |
| **Params** | |
| * `text` : string - The text to split. | |
| * * * | |
| ## `tokenizers~PretrainedTokenizerOptions` : Object | |
| Additional tokenizer-specific properties. | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| **Properties** | |
| * `[legacy]` : boolean (default: `false`) - Whether or not the legacy behavior of the tokenizer should be used. | |
| * * * | |
| ## `tokenizers~BPENode` : Object | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| **Properties** | |
| * `token` : string - The token associated with the node. | |
| * `bias` : number - A positional bias for the node. | |
| * `[score]` : number - The score of the node. | |
| * `[prev]` : BPENode - The previous node in the linked list. | |
| * `[next]` : BPENode - The next node in the linked list. | |
| * * * | |
| ## `tokenizers~SplitDelimiterBehavior` : 'removed' | 'isolated' | 'mergedWithPrevious' | 'mergedWithNext' | 'contiguous' | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| * * * | |
| ## `tokenizers~PostProcessedOutput` : Object | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| **Properties** | |
| * `tokens` : Array.<string> - List of tokens produced by the post-processor. | |
| * `[token_type_ids]` : Array.<number> - List of token type ids produced by the post-processor. | |
| * * * | |
| ## `tokenizers~EncodingSingle` : Object | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| **Properties** | |
| * `input_ids` : Array.<number> - List of token ids to be fed to a model. | |
| * `attention_mask` : Array.<number> - List of indices specifying which tokens should be attended to by the model. | |
| * `[token_type_ids]` : Array.<number> - List of token type ids to be fed to a model. | |
| * * * | |
| ## `tokenizers~Message` : Object | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| **Properties** | |
| * `role` : string - The role of the message (e.g., "user" or "assistant" or "system"). | |
| * `content` : string - The content of the message. | |
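As an illustration, a conversation is usually represented as an array of such `Message` objects. The contents below are made up for the example.

```javascript
// Hypothetical conversation: each entry has a string `role` and a string `content`.
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is a tokenizer?' },
  { role: 'assistant', content: 'It converts text into token ids.' },
];

console.log(messages.map((m) => m.role));
// [ 'system', 'user', 'assistant' ]
```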
| * * * | |
| ## `tokenizers~BatchEncoding` : Object | |
| Holds the output of the tokenizer's call function. Each property is a `BatchEncodingItem`, that is, Array<number> | Array<Array<number>> | [Tensor](#Tensor). | |
| **Kind**: inner typedef of [tokenizers](#module_tokenizers) | |
| **Properties** | |
| * `input_ids` : BatchEncodingItem - List of token ids to be fed to a model. | |
| * `attention_mask` : BatchEncodingItem - List of indices specifying which tokens should be attended to by the model. | |
| * `[token_type_ids]` : BatchEncodingItem - List of token type ids to be fed to a model. | |
| * * * | |