# Decoders
## DecodeStream[[tokenizers.decoders.DecodeStream]]
#### tokenizers.decoders.DecodeStream[[tokenizers.decoders.DecodeStream]]
Class needed for streaming decode
#### step[[tokenizers.decoders.DecodeStream.step]]
Streaming decode step
**Parameters:**
tokenizer ([Tokenizer](/docs/tokenizers/pr_1968/en/api/tokenizer#tokenizers.Tokenizer)) : The tokenizer to use for decoding
id (`int` or *List[int]*) : The next token id or list of token ids to add to the stream
**Returns:**
`Optional[str]`
The next decoded string chunk, or None if not enough tokens have been provided yet.
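The "chunk or None" contract can be illustrated without the library. The following is a toy sketch only: `ToyByteStream` is a hypothetical class that treats raw UTF-8 bytes as its "tokens", whereas the real `DecodeStream` works on token ids and needs a `Tokenizer`. It shows why a step may have nothing to return yet.

```python
from typing import Optional


class ToyByteStream:
    """Hypothetical stand-in for a streaming decoder (not the library class)."""

    def __init__(self) -> None:
        self._pending = bytearray()  # bytes received but not yet emitted

    def step(self, byte_id: int) -> Optional[str]:
        """Add one byte; return text once the buffer is valid UTF-8."""
        self._pending.append(byte_id)
        try:
            chunk = self._pending.decode("utf-8")
        except UnicodeDecodeError:
            # Incomplete multi-byte character: wait for more input.
            # (A toy simplification: invalid bytes would be buffered forever.)
            return None
        self._pending.clear()
        return chunk


stream = ToyByteStream()
chunks = [stream.step(b) for b in "hé".encode("utf-8")]
print(chunks)  # ['h', None, 'é']
```

With the real class the loop has the same shape: call `step(tokenizer, id)` once per generated id and skip the `None` results.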
## BPEDecoder[[tokenizers.decoders.BPEDecoder]]
#### tokenizers.decoders.BPEDecoder[[tokenizers.decoders.BPEDecoder]]
BPEDecoder Decoder
**Parameters:**
suffix (`str`, *optional*, defaults to `</w>`) : The suffix used to mark the end of a word. It is replaced by whitespace during decoding
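As a rough illustration of the suffix handling, here is a pure-Python sketch. The function name `bpe_decode` is hypothetical, and dropping the suffix on the final token (rather than turning it into a trailing space) is an assumption of this sketch, not something stated above.

```python
from typing import List


def bpe_decode(tokens: List[str], suffix: str = "</w>") -> str:
    """Join BPE tokens, turning the end-of-word suffix into a space."""
    last = len(tokens) - 1
    pieces = []
    for i, tok in enumerate(tokens):
        # Assumption: the suffix on the last token is simply removed,
        # so the result has no trailing whitespace.
        pieces.append(tok.replace(suffix, "" if i == last else " "))
    return "".join(pieces)


print(bpe_decode(["hel", "lo</w>", "wor", "ld</w>"]))  # hello world
```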
## ByteFallback[[tokenizers.decoders.ByteFallback]]
#### tokenizers.decoders.ByteFallback[[tokenizers.decoders.ByteFallback]]
ByteFallback Decoder
ByteFallback is a simple trick which converts tokens looking like `<0x61>`
to pure bytes, and attempts to make them into a string. If the tokens
cannot be decoded, you will get � instead for each inconvertible byte token.
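The idea can be sketched in plain Python. This is not the library's implementation: `byte_fallback` is a hypothetical helper, and it assumes byte tokens are written as `<0xNN>` with two hexadecimal digits and that runs of consecutive byte tokens are decoded together as UTF-8.

```python
import re
from typing import List

BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def byte_fallback(tokens: List[str]) -> List[str]:
    """Turn runs of <0xNN> tokens into text; pass other tokens through."""
    out: List[str] = []
    pending: List[int] = []  # byte values collected from the current run

    def flush() -> None:
        if not pending:
            return
        try:
            out.append(bytes(pending).decode("utf-8"))
        except UnicodeDecodeError:
            # One replacement character per byte token that failed to decode.
            out.extend("\ufffd" for _ in pending)
        pending.clear()

    for tok in tokens:
        match = BYTE_TOKEN.fullmatch(tok)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(tok)
    flush()
    return out


print(byte_fallback(["Hey", "<0x61>"]))                # ['Hey', 'a']
print(byte_fallback(["<0xE5>", "<0x8F>", "<0xAB>"]))   # ['叫']
print(byte_fallback(["<0xE5>", "<0x8F>"]))             # ['�', '�']
```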
## ByteLevel[[tokenizers.decoders.ByteLevel]]
#### tokenizers.decoders.ByteLevel[[tokenizers.decoders.ByteLevel]]
ByteLevel Decoder
This decoder is to be used in tandem with the [ByteLevel](/docs/tokenizers/pr_1968/en/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel)
[PreTokenizer](/docs/tokenizers/pr_1968/en/api/pre-tokenizers#tokenizers.pre_tokenizers.PreTokenizer).
## CTC[[tokenizers.decoders.CTC]]
#### tokenizers.decoders.CTC[[tokenizers.decoders.CTC]]
CTC Decoder
**Parameters:**
pad_token (`str`, *optional*, defaults to `<pad>`) : The pad token used by CTC to delimit a new token.
word_delimiter_token (`str`, *optional*, defaults to `|`) : The word delimiter token. It will be replaced by a space.
cleanup (`bool`, *optional*, defaults to `True`) : Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
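To make the roles of `pad_token` and `word_delimiter_token` concrete, here is a minimal pure-Python sketch. `ctc_decode` is a hypothetical name, and collapsing consecutive repeated tokens is the usual CTC convention assumed here rather than something this section states. The `cleanup` step is omitted.

```python
from itertools import groupby
from typing import List


def ctc_decode(
    tokens: List[str],
    pad_token: str = "<pad>",
    word_delimiter_token: str = "|",
) -> str:
    """Collapse repeats, drop pad tokens, and turn delimiters into spaces."""
    # Assumption: consecutive duplicates are one emission (standard CTC).
    collapsed = [tok for tok, _ in groupby(tokens)]
    # The pad token only separates emissions, so it is removed afterwards.
    pieces = [tok for tok in collapsed if tok != pad_token]
    return "".join(pieces).replace(word_delimiter_token, " ")


frames = ["<pad>", "h", "h", "e", "l", "<pad>", "l", "o", "o",
          "|", "|", "w", "o", "r", "l", "d", "<pad>"]
print(ctc_decode(frames))  # hello world
```

Note how the `<pad>` between the two `l` frames is what keeps the double letter in "hello" from being collapsed.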
## Fuse[[tokenizers.decoders.Fuse]]
#### tokenizers.decoders.Fuse[[tokenizers.decoders.Fuse]]
Fuse Decoder
Fuse simply fuses every token into a single string.
This is the last step of decoding. This decoder exists only in case
other decoders need to be added *after* the fusion.
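Why the position of the fusion matters can be shown with a small sketch. Both functions are hypothetical illustrations: `fuse` mimics the behavior described above, and `trim_leading_space` stands in for any later step that acts on each item it receives.

```python
from typing import List


def fuse(tokens: List[str]) -> List[str]:
    """Concatenate all tokens into a one-element list."""
    return ["".join(tokens)]


def trim_leading_space(tokens: List[str]) -> List[str]:
    """Hypothetical later step: drop one leading space from each item."""
    return [tok[1:] if tok.startswith(" ") else tok for tok in tokens]


tokens = [" Hello", " world"]

# Without fusing first, the step runs on every token and eats the word gap.
per_token = "".join(trim_leading_space(tokens))
# After fusing, the same step sees a single string and trims it only once.
after_fuse = "".join(trim_leading_space(fuse(tokens)))

print(repr(per_token))   # 'Helloworld'
print(repr(after_fuse))  # 'Hello world'
```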
## Metaspace[[tokenizers.decoders.Metaspace]]
#### tokenizers.decoders.Metaspace[[tokenizers.decoders.Metaspace]]
Metaspace Decoder
**Parameters:**
replacement (`str`, *optional*, defaults to `▁`) : The replacement character. Must be exactly one character. By default we use the *▁* (U+2581) meta symbol (Same as in SentencePiece).
prepend_scheme (`str`, *optional*, defaults to `"always"`) : Whether to add a space to the first word if there isn't already one. This lets us treat *hello* exactly like *say hello*. Choices: `"always"`, `"never"`, `"first"`. `"first"` means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used).
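On the decoding side, the effect of these two parameters can be sketched as follows. `metaspace_decode` is a hypothetical helper, and removing the leading marker of the first token whenever `prepend_scheme` is not `"never"` is an assumption of this sketch.

```python
from typing import List


def metaspace_decode(
    tokens: List[str],
    replacement: str = "▁",
    prepend_scheme: str = "always",
) -> str:
    """Map the meta symbol back to spaces and undo the prepended space."""
    pieces = []
    for i, tok in enumerate(tokens):
        # Assumption: the marker that was prepended to the first word
        # carries no information, so it is dropped rather than kept.
        if i == 0 and prepend_scheme != "never" and tok.startswith(replacement):
            tok = tok[len(replacement):]
        pieces.append(tok.replace(replacement, " "))
    return "".join(pieces)


print(repr(metaspace_decode(["▁Hello", "▁world"])))
# 'Hello world'
print(repr(metaspace_decode(["▁Hello", "▁world"], prepend_scheme="never")))
# ' Hello world'
```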
## Replace[[tokenizers.decoders.Replace]]
#### tokenizers.decoders.Replace[[tokenizers.decoders.Replace]]
Replace Decoder
This decoder is to be used in tandem with the `tokenizers.pre_tokenizers.Replace`
[PreTokenizer](/docs/tokenizers/pr_1968/en/api/pre-tokenizers#tokenizers.pre_tokenizers.PreTokenizer).
## Sequence[[tokenizers.decoders.Sequence]]
#### tokenizers.decoders.Sequence[[tokenizers.decoders.Sequence]]
Sequence Decoder
**Parameters:**
decoders (`List[Decoder]`) : The decoders that need to be chained
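Chaining can be pictured as function composition over token lists. In this sketch, `sequence`, `unmark`, and `drop_empty` are hypothetical names; each step is modeled as a function from a list of strings to a list of strings, which is an assumption about the shape of a decoder rather than the library's interface.

```python
from typing import Callable, List

DecodeStep = Callable[[List[str]], List[str]]


def sequence(steps: List[DecodeStep]) -> DecodeStep:
    """Return a step that applies the given steps in order."""
    def run(tokens: List[str]) -> List[str]:
        for step in steps:
            tokens = step(tokens)  # output of one step feeds the next
        return tokens
    return run


def unmark(tokens: List[str]) -> List[str]:
    """Hypothetical step: turn the meta symbol into a space."""
    return [tok.replace("▁", " ") for tok in tokens]


def drop_empty(tokens: List[str]) -> List[str]:
    """Hypothetical step: discard empty tokens."""
    return [tok for tok in tokens if tok]


chain = sequence([unmark, drop_empty])
print(chain(["▁Hello", "", "▁world"]))  # [' Hello', ' world']
```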
## Strip[[tokenizers.decoders.Strip]]
#### tokenizers.decoders.Strip[[tokenizers.decoders.Strip]]
Strip Decoder
Strips n characters from the left of each token, or n characters from the right of each token
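Read literally, the description corresponds to slicing each token. The sketch below is only that literal reading: `strip_tokens` and its `left` and `right` parameters are hypothetical names, and it removes characters unconditionally, since this section does not say whether the library restricts stripping to a particular character.

```python
from typing import List


def strip_tokens(tokens: List[str], left: int = 0, right: int = 0) -> List[str]:
    """Remove `left` leading and `right` trailing characters per token."""
    # len(tok) - right avoids the tok[:-0] pitfall when right == 0.
    return [tok[left:len(tok) - right] for tok in tokens]


print(strip_tokens(["▁Hello", "▁world"], left=1))  # ['Hello', 'world']
print(strip_tokens(["ab!", "cd!"], right=1))       # ['ab', 'cd']
```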
## WordPiece[[tokenizers.decoders.WordPiece]]
#### tokenizers.decoders.WordPiece[[tokenizers.decoders.WordPiece]]
WordPiece Decoder
**Parameters:**
prefix (`str`, *optional*, defaults to `##`) : The prefix to use for subwords that are not a beginning-of-word
cleanup (`bool`, *optional*, defaults to `True`) : Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
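The two parameters can be illustrated with a pure-Python sketch. `wordpiece_decode` is a hypothetical helper, and the cleanup rules listed in `CLEANUP_RULES` are an illustrative subset chosen for this example, not the library's exact list.

```python
from typing import List

# Illustrative subset of cleanup rules (an assumption, not the full list).
CLEANUP_RULES = [
    (" .", "."), (" ,", ","), (" ?", "?"), (" !", "!"),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
]


def wordpiece_decode(
    tokens: List[str], prefix: str = "##", cleanup: bool = True
) -> str:
    """Glue continuation pieces to the previous token, space the rest."""
    pieces = []
    for i, tok in enumerate(tokens):
        if i > 0:
            if tok.startswith(prefix):
                tok = tok[len(prefix):]   # continuation: no space
            else:
                tok = " " + tok           # new word: separate with a space
        pieces.append(tok)
    text = "".join(pieces)
    if cleanup:
        for before, after in CLEANUP_RULES:
            text = text.replace(before, after)
    return text


tokens = ["token", "##izer", "##s", "are", "fun", "!"]
print(wordpiece_decode(tokens))                  # tokenizers are fun!
print(wordpiece_decode(tokens, cleanup=False))   # tokenizers are fun !
```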
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
The node API has not been documented yet.
