| # Normalizers | |
| ## ByteLevel[[tokenizers.normalizers.ByteLevel]] | |
| #### tokenizers.normalizers.ByteLevel[[tokenizers.normalizers.ByteLevel]] | |
| Bytelevel Normalizer | |
| ## Lowercase[[tokenizers.normalizers.Lowercase]] | |
| #### tokenizers.normalizers.Lowercase[[tokenizers.normalizers.Lowercase]] | |
| Lowercase Normalizer | |
| ## NFC[[tokenizers.normalizers.NFC]] | |
| #### tokenizers.normalizers.NFC[[tokenizers.normalizers.NFC]] | |
| NFC Unicode Normalizer | |
| ## NFD[[tokenizers.normalizers.NFD]] | |
| #### tokenizers.normalizers.NFD[[tokenizers.normalizers.NFD]] | |
| NFD Unicode Normalizer | |
| ## NFKC[[tokenizers.normalizers.NFKC]] | |
| #### tokenizers.normalizers.NFKC[[tokenizers.normalizers.NFKC]] | |
| NFKC Unicode Normalizer | |
| ## NFKD[[tokenizers.normalizers.NFKD]] | |
| #### tokenizers.normalizers.NFKD[[tokenizers.normalizers.NFKD]] | |
| NFKD Unicode Normalizer | |
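The four Unicode forms differ along two axes: canonical versus compatibility mapping, and composed versus decomposed output. As a rough illustration, assuming these normalizers follow the standard Unicode normalization forms, Python's built-in `unicodedata` module shows the difference. This sketch does not use the `tokenizers` classes, and the sample string is arbitrary.

```python
import unicodedata

# "\ufb01" is the "fi" ligature; "\u00e9" is the precomposed "e with acute".
text = "\ufb01anc\u00e9"

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)
nfkc = unicodedata.normalize("NFKC", text)
nfkd = unicodedata.normalize("NFKD", text)

# Canonical forms (NFC, NFD) keep the ligature; compatibility forms
# (NFKC, NFKD) expand it to "f" + "i". Decomposed forms (NFD, NFKD)
# split the accented letter into "e" + U+0301 (combining acute accent).
for name, value in [("NFC", nfc), ("NFD", nfd), ("NFKC", nfkc), ("NFKD", nfkd)]:
    print(name, len(value), [f"U+{ord(c):04X}" for c in value])
```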
| ## Nmt[[tokenizers.normalizers.Nmt]] | |
| #### tokenizers.normalizers.Nmt[[tokenizers.normalizers.Nmt]] | |
| Nmt normalizer | |
| ## Normalizer[[tokenizers.normalizers.Normalizer]] | |
| #### tokenizers.normalizers.Normalizer[[tokenizers.normalizers.Normalizer]] | |
| Base class for all normalizers | |
| This class is not supposed to be instantiated directly. Instead, any implementation of a | |
| Normalizer will return an instance of this class when instantiated. | |
| #### normalize[[tokenizers.normalizers.Normalizer.normalize]] | |
| Normalize a `NormalizedString` in-place. | |
| This method modifies the `NormalizedString` directly, which | |
| keeps its alignment information intact. If you only want to see the result | |
| of the normalization on a raw string, use | |
| `normalize_str()`. | |
| **Parameters:** | |
| normalized (`NormalizedString`) : The normalized string on which to apply this [Normalizer](/docs/tokenizers/pr_1968/en/api/normalizers#tokenizers.normalizers.Normalizer) | |
| #### normalize_str[[tokenizers.normalizers.Normalizer.normalize_str]] | |
| Normalize the given string | |
| This method provides a way to visualize the effect of a | |
| [Normalizer](/docs/tokenizers/pr_1968/en/api/normalizers#tokenizers.normalizers.Normalizer) but it does not keep track of the alignment | |
| information. If you need to get/convert offsets, you can use | |
| `normalize()` | |
| **Parameters:** | |
| sequence (`str`) : A string to normalize | |
| **Returns:** | |
| `str` | |
| A string after normalization | |
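As a quick illustration of `normalize_str()`, the sketch below applies a single `Lowercase` normalizer to a raw string. It assumes the `tokenizers` package is installed; the input string is arbitrary.

```python
from tokenizers.normalizers import Lowercase

# Any concrete normalizer exposes normalize_str(); Lowercase is the simplest.
normalizer = Lowercase()

# normalize_str() takes a plain str and returns a plain str.
# No alignment or offset information is kept.
result = normalizer.normalize_str("Hello WORLD")
print(result)  # hello world
```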
| ## Precompiled[[tokenizers.normalizers.Precompiled]] | |
| #### tokenizers.normalizers.Precompiled[[tokenizers.normalizers.Precompiled]] | |
| Precompiled normalizer | |
| Do not use this normalizer manually; it exists for compatibility with SentencePiece. | |
| ## Replace[[tokenizers.normalizers.Replace]] | |
| #### tokenizers.normalizers.Replace[[tokenizers.normalizers.Replace]] | |
| Replace normalizer | |
| ## Sequence[[tokenizers.normalizers.Sequence]] | |
| #### tokenizers.normalizers.Sequence[[tokenizers.normalizers.Sequence]] | |
| Combines multiple normalizers into a single `Sequence`. | |
| The normalizers run one after another, in the given order. | |
| **Parameters:** | |
| normalizers (`List[Normalizer]`) : A list of normalizers to run in sequence | |
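A minimal sketch of chaining normalizers with `Sequence`, assuming the `tokenizers` package is installed. The order matters here: `NFD` first decomposes accented letters into a base letter plus combining marks, `StripAccents` then removes those marks, and `Lowercase` runs last. The sample sentence is arbitrary.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Each normalizer receives the output of the previous one.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

result = normalizer.normalize_str("Héllò hôw are ü?")
print(result)  # hello how are u?
```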
| ## Strip[[tokenizers.normalizers.Strip]] | |
| #### tokenizers.normalizers.Strip[[tokenizers.normalizers.Strip]] | |
| Strip normalizer | |
| ## StripAccents[[tokenizers.normalizers.StripAccents]] | |
| #### tokenizers.normalizers.StripAccents[[tokenizers.normalizers.StripAccents]] | |
| StripAccents normalizer | |
| ## BertNormalizer[[tokenizers.normalizers.BertNormalizer]] | |
| #### tokenizers.normalizers.BertNormalizer[[tokenizers.normalizers.BertNormalizer]] | |
| BertNormalizer | |
| Normalizes raw text before it is passed to a BERT model. | |
| This includes cleaning the text, handling accents and Chinese characters, and lowercasing. | |
| **Parameters:** | |
| clean_text (`bool`, *optional*, defaults to `True`) : Whether to clean the text by removing control characters and replacing every whitespace character with a regular space. | |
| handle_chinese_chars (`bool`, *optional*, defaults to `True`) : Whether to handle Chinese characters by putting spaces around them. | |
| strip_accents (`bool`, *optional*) : Whether to strip all accents. If this option is not specified (i.e. `None`), it is determined by the value of *lowercase* (as in the original BERT). | |
| lowercase (`bool`, *optional*, defaults to `True`) : Whether to lowercase. | |
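The interaction between `lowercase` and an unspecified `strip_accents` can be sketched as follows, assuming the `tokenizers` package is installed. The sample strings are arbitrary, and the printed results reflect the parameter descriptions above.

```python
from tokenizers.normalizers import BertNormalizer

# Defaults: lowercase=True and strip_accents=None, so accents are
# stripped because strip_accents follows the value of lowercase.
uncased = BertNormalizer()

# With lowercase=False and strip_accents left unspecified,
# accents and case are both preserved.
cased = BertNormalizer(lowercase=False)

uncased_result = uncased.normalize_str("Héllo Wörld")
cased_result = cased.normalize_str("Héllo Wörld")
print(uncased_result)  # hello world
print(cased_result)    # Héllo Wörld

# handle_chinese_chars=True puts spaces around each Chinese character,
# so splitting on whitespace yields one character per item.
chinese_parts = uncased.normalize_str("你好").split()
print(chinese_parts)
```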
| The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website. | |
| The Node API has not been documented yet. | |