Decoders
DecodeStream[[tokenizers.decoders.DecodeStream]]
tokenizers.decoders.DecodeStream[[tokenizers.decoders.DecodeStream]]
Class needed for streaming decode
tokenizers.decoders.DecodeStream.step(tokenizer, id)
Streaming decode step
Parameters:
tokenizer (Tokenizer) : The tokenizer to use for decoding
id (int or List[int]) : The next token id or list of token ids to add to the stream
Returns:
Optional[str] : The next decoded string chunk, or None if not enough tokens have been provided yet.
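To illustrate why a streaming step can return None, here is a minimal pure-Python sketch. `ToyDecodeStream` and its byte-valued `vocab` are hypothetical stand-ins invented for this example; they are not the `tokenizers` API, and the real class takes a `Tokenizer` rather than a dict. The sketch simply buffers token bytes until they form valid UTF-8 text.

```python
from typing import Dict, Optional


class ToyDecodeStream:
    """Hypothetical stand-in for illustration, not tokenizers.decoders.DecodeStream."""

    def __init__(self) -> None:
        self._pending = b""  # bytes received but not yet emitted as text

    def step(self, vocab: Dict[int, bytes], token_id: int) -> Optional[str]:
        # Add the new token's bytes to the buffer and try to decode everything.
        self._pending += vocab[token_id]
        try:
            chunk = self._pending.decode("utf-8")
        except UnicodeDecodeError:
            # Incomplete multi-byte character: wait for more tokens.
            return None
        self._pending = b""
        return chunk


# Assumed toy vocabulary: the check mark (U+2713) is split across two tokens.
vocab = {0: b"Hi ", 1: b"\xe2\x9c", 2: b"\x93"}
stream = ToyDecodeStream()
chunks = [stream.step(vocab, token_id) for token_id in (0, 1, 2)]
print(chunks)  # ['Hi ', None, '✓']
```

A caller would typically skip the None results and print or concatenate the remaining chunks as they arrive.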
BPEDecoder[[tokenizers.decoders.BPEDecoder]]
tokenizers.decoders.BPEDecoder[[tokenizers.decoders.BPEDecoder]]
BPEDecoder Decoder
Parameters:
suffix (str, optional, defaults to </w>) : The suffix that was used to mark the end of a word. This suffix is replaced by whitespace during decoding
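The suffix replacement described above can be sketched in plain Python. The function `bpe_decode` is a hypothetical helper written for this example, not the library's implementation; dropping the trailing space after the last word is an assumption of the sketch.

```python
from typing import List


def bpe_decode(tokens: List[str], suffix: str = "</w>") -> str:
    # Concatenate the tokens, turn every end-of-word suffix into a space,
    # then drop the space left behind by the final word.
    return "".join(tokens).replace(suffix, " ").rstrip(" ")


decoded_bpe = bpe_decode(["hel", "lo</w>", "wor", "ld</w>"])
print(decoded_bpe)  # hello world
```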
ByteFallback[[tokenizers.decoders.ByteFallback]]
tokenizers.decoders.ByteFallback[[tokenizers.decoders.ByteFallback]]
ByteFallback Decoder
ByteFallback is a simple trick that converts tokens looking like `<0x61>` to pure bytes and attempts to make them into a string. If the tokens cannot be decoded, you will get � instead for each inconvertible byte token
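A rough pure-Python sketch of this idea follows. It assumes byte tokens are spelled `<0xNN>` with two hex digits, and that consecutive byte tokens are decoded together as UTF-8; `byte_fallback_decode` is a hypothetical helper for illustration, not the library's code.

```python
import re
from typing import List

# Assumed spelling of a byte token: "<0x" + two hex digits + ">"
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def byte_fallback_decode(tokens: List[str]) -> List[str]:
    out: List[str] = []
    pending = bytearray()  # run of consecutive byte tokens seen so far

    def flush() -> None:
        """Turn the buffered bytes into text, or into one U+FFFD per byte."""
        if not pending:
            return
        try:
            out.append(pending.decode("utf-8"))
        except UnicodeDecodeError:
            out.extend(["\ufffd"] * len(pending))
        pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(token)  # ordinary tokens pass through unchanged
    flush()
    return out


ok_tokens = byte_fallback_decode(["<0x61>", "b", "<0xE2>", "<0x9C>", "<0x93>"])
bad_tokens = byte_fallback_decode(["<0xE2>", "x"])
print(ok_tokens)   # ['a', 'b', '✓']
print(bad_tokens)  # ['�', 'x']
```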
ByteLevel[[tokenizers.decoders.ByteLevel]]
tokenizers.decoders.ByteLevel[[tokenizers.decoders.ByteLevel]]
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
CTC[[tokenizers.decoders.CTC]]
tokenizers.decoders.CTC[[tokenizers.decoders.CTC]]
CTC Decoder
Parameters:
pad_token (str, optional, defaults to <pad>) : The pad token used by CTC to delimit a new token.
word_delimiter_token (str, optional, defaults to |) : The word delimiter token. It will be replaced by a space
cleanup (bool, optional, defaults to True) : Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
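The roles of `pad_token` and `word_delimiter_token` can be sketched in plain Python. This assumes the usual CTC convention of collapsing repeated tokens before dropping the pad token; `ctc_decode` is a hypothetical helper, and the `cleanup` step is deliberately left out.

```python
from itertools import groupby
from typing import List


def ctc_decode(
    tokens: List[str],
    pad_token: str = "<pad>",
    word_delimiter_token: str = "|",
) -> str:
    # 1. Collapse runs of identical tokens ("h", "h" -> "h").
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Drop pad tokens; a pad between two equal tokens keeps them distinct.
    kept = [token for token in collapsed if token != pad_token]
    # 3. Join and turn word delimiters into spaces.
    return "".join(kept).replace(word_delimiter_token, " ").strip()


frames = ["h", "h", "<pad>", "e", "l", "<pad>", "l", "o", "|", "|", "h", "i"]
decoded_ctc = ctc_decode(frames)
print(decoded_ctc)  # hello hi
```

Note how the pad token between the two `l` frames is what preserves the double letter in "hello".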
Fuse[[tokenizers.decoders.Fuse]]
tokenizers.decoders.Fuse[[tokenizers.decoders.Fuse]]
Fuse Decoder
Fuse simply fuses every token into a single string. This is the last step of decoding; this decoder exists only for cases where other decoders need to be added after the fusion
Metaspace[[tokenizers.decoders.Metaspace]]
tokenizers.decoders.Metaspace[[tokenizers.decoders.Metaspace]]
Metaspace Decoder
Parameters:
replacement (str, optional, defaults to ▁) : The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece).
prepend_scheme (str, optional, defaults to "always") : Whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". Choices: "always", "never", "first". "first" means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used).
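A simplified pure-Python sketch of the decoding direction: the replacement character becomes a space again, and the space that was prepended to the first word is dropped unless the scheme is "never". `metaspace_decode` is a hypothetical helper, and edge cases may differ from the library's behavior.

```python
from typing import List


def metaspace_decode(
    tokens: List[str],
    replacement: str = "\u2581",  # the ▁ meta symbol
    prepend_scheme: str = "always",
) -> str:
    text = "".join(tokens).replace(replacement, " ")
    # Undo the space that encoding prepended to the first word.
    if prepend_scheme != "never" and text.startswith(" "):
        text = text[1:]
    return text


pieces = ["\u2581Hello", "\u2581world"]
decoded_default = metaspace_decode(pieces)
decoded_never = metaspace_decode(pieces, prepend_scheme="never")
print(repr(decoded_default))  # 'Hello world'
print(repr(decoded_never))    # ' Hello world'
```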
Replace[[tokenizers.decoders.Replace]]
tokenizers.decoders.Replace[[tokenizers.decoders.Replace]]
Replace Decoder
This decoder is to be used in tandem with the tokenizers.pre_tokenizers.Replace PreTokenizer.
Sequence[[tokenizers.decoders.Sequence]]
tokenizers.decoders.Sequence[[tokenizers.decoders.Sequence]]
Sequence Decoder
Parameters:
decoders (List[Decoder]) : The decoders that need to be chained
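Chaining can be pictured as function composition over a list of tokens. The sketch below is plain Python with hypothetical step functions (`replace_step`, `fuse_step`, `strip_step`) that loosely imitate Replace, Fuse, and Strip; it does not use the `tokenizers` API.

```python
from typing import Callable, List

DecoderStep = Callable[[List[str]], List[str]]


def sequence(decoders: List[DecoderStep]) -> DecoderStep:
    """Return a step that applies each decoder in order to the token list."""
    def run(tokens: List[str]) -> List[str]:
        for decoder in decoders:
            tokens = decoder(tokens)
        return tokens
    return run


def replace_step(tokens: List[str]) -> List[str]:
    return [token.replace("\u2581", " ") for token in tokens]


def fuse_step(tokens: List[str]) -> List[str]:
    return ["".join(tokens)]  # one fused token


def strip_step(tokens: List[str]) -> List[str]:
    # Strip one leading space from each token.
    return [token[1:] if token.startswith(" ") else token for token in tokens]


chain = sequence([replace_step, fuse_step, strip_step])
chained = chain(["\u2581Hello", "\u2581world"])
print(chained)  # ['Hello world']
```

Because the strip step runs after fusion here, it only removes the space at the start of the whole string; run before fusion, it would also remove the space between the two words.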
Strip[[tokenizers.decoders.Strip]]
tokenizers.decoders.Strip[[tokenizers.decoders.Strip]]
Strip Decoder
Strips n left characters of each token, or n right characters of each token
WordPiece[[tokenizers.decoders.WordPiece]]
tokenizers.decoders.WordPiece[[tokenizers.decoders.WordPiece]]
WordPiece Decoder
Parameters:
prefix (str, optional, defaults to ##) : The prefix to use for subwords that are not a beginning-of-word
cleanup (bool, optional, defaults to True) : Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
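The effect of `prefix` can be sketched in plain Python: tokens carrying the prefix are glued to the previous token, and all others start a new word. `wordpiece_decode` is a hypothetical helper for illustration, and the `cleanup` step is not modeled.

```python
from typing import List


def wordpiece_decode(tokens: List[str], prefix: str = "##") -> str:
    pieces: List[str] = []
    for index, token in enumerate(tokens):
        if token.startswith(prefix):
            # Continuation subword: attach directly, without the prefix.
            pieces.append(token[len(prefix):])
        elif index == 0:
            pieces.append(token)
        else:
            # Beginning of a new word: separate with a space.
            pieces.append(" " + token)
    return "".join(pieces)


decoded_wp = wordpiece_decode(["un", "##believ", "##able", "story"])
print(decoded_wp)  # unbelievable story
```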
The Rust API Reference is available directly on the Docs.rs website.
The node API has not been documented yet.