| # Decoders | |
| ## DecodeStream[[tokenizers.decoders.DecodeStream]] | |
| #### tokenizers.decoders.DecodeStream[[tokenizers.decoders.DecodeStream]] | |
| Class needed for streaming decode | |
| `step(tokenizer, id)` | |
| Streaming decode step | |
| **Parameters:** | |
| tokenizer ([Tokenizer](/docs/tokenizers/pr_1968/en/api/tokenizer#tokenizers.Tokenizer)) : The tokenizer to use for decoding | |
| id (`int` or `List[int]`) : The next token id or list of token ids to add to the stream | |
| **Returns:** | |
| `Optional[str]` : The next decoded string chunk, or `None` if not enough tokens have been provided yet. | |
| ## BPEDecoder[[tokenizers.decoders.BPEDecoder]] | |
| #### tokenizers.decoders.BPEDecoder[[tokenizers.decoders.BPEDecoder]] | |
| BPEDecoder Decoder | |
| **Parameters:** | |
| suffix (`str`, *optional*, defaults to `</w>`) : The suffix that was used to mark the end of a word. This suffix is replaced by whitespace during decoding | |
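
As a small illustration of the suffix handling, the sketch below decodes a hand-written token list. The tokens are illustrative and not the output of a trained model.

```python
from tokenizers import decoders

decoder = decoders.BPEDecoder(suffix="</w>")

# Illustrative tokens: "</w>" marks the end of each word.
tokens = ["hel", "lo</w>", "wor", "ld</w>"]
text = decoder.decode(tokens)
print(text)  # hello world
```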
| ## ByteFallback[[tokenizers.decoders.ByteFallback]] | |
| #### tokenizers.decoders.ByteFallback[[tokenizers.decoders.ByteFallback]] | |
| ByteFallback Decoder | |
| ByteFallback is a simple trick which converts tokens looking like `<0x61>` | |
| to pure bytes, and attempts to make them into a string. If the tokens | |
| cannot be decoded, you will get � instead for each inconvertible byte token | |
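
The behaviour described above can be sketched with a few hand-written byte tokens. The token lists are illustrative.

```python
from tokenizers import decoders

decoder = decoders.ByteFallback()

# 0x61 is the ASCII byte for "a".
single_byte = decoder.decode(["<0x61>"])

# Three byte tokens that together form one valid UTF-8 character.
multi_byte = decoder.decode(["<0xE5>", "<0x8f>", "<0xab>"])

# A lone lead byte is not valid UTF-8, so it becomes the replacement character.
incomplete = decoder.decode(["<0xE5>"])

print(single_byte, multi_byte, incomplete)
```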
| ## ByteLevel[[tokenizers.decoders.ByteLevel]] | |
| #### tokenizers.decoders.ByteLevel[[tokenizers.decoders.ByteLevel]] | |
| ByteLevel Decoder | |
| This decoder is to be used in tandem with the [ByteLevel](/docs/tokenizers/pr_1968/en/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel) | |
| [PreTokenizer](/docs/tokenizers/pr_1968/en/api/pre-tokenizers#tokenizers.pre_tokenizers.PreTokenizer). | |
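
A short sketch of decoding byte-level tokens. The tokens are written by hand for illustration, using the `Ġ` symbol that the byte-level alphabet uses for a space.

```python
from tokenizers import decoders

decoder = decoders.ByteLevel()

# "Ġ" is the byte-level stand-in for a leading space.
text = decoder.decode(["Hello", "Ġworld", "!"])
print(text)  # Hello world!
```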
| ## CTC[[tokenizers.decoders.CTC]] | |
| #### tokenizers.decoders.CTC[[tokenizers.decoders.CTC]] | |
| CTC Decoder | |
| **Parameters:** | |
| pad_token (`str`, *optional*, defaults to `<pad>`) : The pad token used by CTC to delimit a new token. | |
| word_delimiter_token (`str`, *optional*, defaults to `|`) : The word delimiter token. It will be replaced by a space | |
| cleanup (`bool`, *optional*, defaults to `True`) : Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms. | |
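
To make the role of the pad token concrete, here is a sketch that decodes a hand-written sequence of per-frame predictions. Repeated tokens collapse into one, and the pad token separates genuine repeats such as the double "l". The frame sequence is illustrative.

```python
from tokenizers import decoders

decoder = decoders.CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)

# Illustrative frame-level output: consecutive duplicates are merged,
# then pad tokens are dropped.
frames = ["<pad>", "<pad>", "h", "e", "e", "l", "l", "<pad>", "l", "o", "o", "o", "<pad>"]
text = decoder.decode(frames)
print(text)  # hello
```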
| ## Fuse[[tokenizers.decoders.Fuse]] | |
| #### tokenizers.decoders.Fuse[[tokenizers.decoders.Fuse]] | |
| Fuse Decoder | |
| Fuse simply fuses every token into a single string. | |
| This is normally the last step of decoding. This decoder exists only for cases | |
| where other decoders need to be added *after* the fusion | |
| ## Metaspace[[tokenizers.decoders.Metaspace]] | |
| #### tokenizers.decoders.Metaspace[[tokenizers.decoders.Metaspace]] | |
| Metaspace Decoder | |
| **Parameters:** | |
| replacement (`str`, *optional*, defaults to `▁`) : The replacement character. Must be exactly one character. By default we use the *▁* (U+2581) meta symbol (Same as in SentencePiece). | |
| prepend_scheme (`str`, *optional*, defaults to `"always"`) : Whether to add a space to the first word if there isn't already one. This lets us treat *hello* exactly like *say hello*. Choices: `"always"`, `"never"`, `"first"`. `"first"` means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used). | |
| ## Replace[[tokenizers.decoders.Replace]] | |
| #### tokenizers.decoders.Replace[[tokenizers.decoders.Replace]] | |
| Replace Decoder | |
| This decoder is to be used in tandem with the `tokenizers.pre_tokenizers.Replace` | |
| [PreTokenizer](/docs/tokenizers/pr_1968/en/api/pre-tokenizers#tokenizers.pre_tokenizers.PreTokenizer). | |
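
The constructor arguments are not listed above. The sketch below assumes the form `Replace(pattern, content)`, where every match of `pattern` in a token is replaced by `content`. The tokens are illustrative.

```python
from tokenizers import decoders

# Assumed signature: Replace(pattern, content).
decoder = decoders.Replace("_", " ")

text = decoder.decode(["hello", "_world"])
print(text)  # hello world
```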
| ## Sequence[[tokenizers.decoders.Sequence]] | |
| #### tokenizers.decoders.Sequence[[tokenizers.decoders.Sequence]] | |
| Sequence Decoder | |
| **Parameters:** | |
| decoders (`List[Decoder]`) : The decoders that need to be chained | |
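
As an illustration of chaining, the sketch below combines several decoders from this page in one possible order: replace the meta symbol, convert byte tokens, fuse everything into one string, then strip the leading space. The order and the tokens are choices made for this example, and the `Replace` and `Strip` arguments are assumptions, since their signatures are not listed on this page.

```python
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.Replace("▁", " "),                    # meta symbol -> space
    decoders.ByteFallback(),                       # "<0x21>" -> "!"
    decoders.Fuse(),                               # join all tokens into one string
    decoders.Strip(content=" ", left=1, right=0),  # drop the leading space
])

text = decoder.decode(["▁Hello", "▁world", "<0x21>"])
print(text)
```

`Fuse` sits before `Strip` here so that the strip applies once to the whole string rather than to every token.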
| ## Strip[[tokenizers.decoders.Strip]] | |
| #### tokenizers.decoders.Strip[[tokenizers.decoders.Strip]] | |
| Strip Decoder | |
| Strips n characters from the left of each token, or n characters from the right of each token | |
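
The parameters are not listed above. The sketch below assumes the keyword arguments `content`, `left` and `right`, and assumes that a character is removed only when it equals `content`. The tokens are illustrative.

```python
from tokenizers import decoders

# Assumed arguments: the character to strip, and how many to strip on each side.
decoder = decoders.Strip(content="_", left=1, right=1)

# "_hi_" loses one "_" on each side; "__ok" loses one on the left only,
# because its last character is not "_".
text = decoder.decode(["_hi_", "__ok"])
print(text)
```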
| ## WordPiece[[tokenizers.decoders.WordPiece]] | |
| #### tokenizers.decoders.WordPiece[[tokenizers.decoders.WordPiece]] | |
| WordPiece Decoder | |
| **Parameters:** | |
| prefix (`str`, *optional*, defaults to `##`) : The prefix to use for subwords that are not a beginning-of-word | |
| cleanup (`bool`, *optional*, defaults to `True`) : Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms. | |
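
A short sketch of both parameters at work on hand-written tokens: subword pieces carrying the prefix are glued to the previous token, other tokens get a leading space, and the cleanup step removes the space before the punctuation mark.

```python
from tokenizers import decoders

decoder = decoders.WordPiece(prefix="##", cleanup=True)

# Illustrative tokens, not the output of a trained model.
text = decoder.decode(["un", "##believ", "##able", "results", "!"])
print(text)  # unbelievable results!
```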
| The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website. | |
| The node API has not been documented yet. | |