# Normalizers
## ByteLevel[[tokenizers.normalizers.ByteLevel]]
#### tokenizers.normalizers.ByteLevel[[tokenizers.normalizers.ByteLevel]]
ByteLevel Normalizer
## Lowercase[[tokenizers.normalizers.Lowercase]]
#### tokenizers.normalizers.Lowercase[[tokenizers.normalizers.Lowercase]]
Lowercase Normalizer
## NFC[[tokenizers.normalizers.NFC]]
#### tokenizers.normalizers.NFC[[tokenizers.normalizers.NFC]]
NFC Unicode Normalizer
## NFD[[tokenizers.normalizers.NFD]]
#### tokenizers.normalizers.NFD[[tokenizers.normalizers.NFD]]
NFD Unicode Normalizer
## NFKC[[tokenizers.normalizers.NFKC]]
#### tokenizers.normalizers.NFKC[[tokenizers.normalizers.NFKC]]
NFKC Unicode Normalizer
## NFKD[[tokenizers.normalizers.NFKD]]
#### tokenizers.normalizers.NFKD[[tokenizers.normalizers.NFKD]]
NFKD Unicode Normalizer
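As a hedged illustration of how the four Unicode normalizers above differ, the sketch below runs each one on a short input through `normalize_str()`. The sample strings are illustrative choices, not taken from the library's documentation.

```python
from tokenizers.normalizers import NFC, NFD, NFKC, NFKD

composed = "\u00e9"      # 'é' stored as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent
ligature = "\ufb01"      # the 'ﬁ' ligature (a compatibility character)

# Canonical forms: NFD decomposes, NFC recomposes.
nfd_result = NFD().normalize_str(composed)
nfc_result = NFC().normalize_str(decomposed)

# Compatibility forms also replace compatibility characters such as ligatures.
nfkc_result = NFKC().normalize_str(ligature)
nfkd_result = NFKD().normalize_str(ligature)

print(len(nfd_result), len(nfc_result), nfkc_result, nfkd_result)
```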
## Nmt[[tokenizers.normalizers.Nmt]]
#### tokenizers.normalizers.Nmt[[tokenizers.normalizers.Nmt]]
Nmt normalizer
## Normalizer[[tokenizers.normalizers.Normalizer]]
#### tokenizers.normalizers.Normalizer[[tokenizers.normalizers.Normalizer]]
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a
Normalizer will return an instance of this class when instantiated.
#### normalize[[tokenizers.normalizers.Normalizer.normalize]]
Normalize a `NormalizedString` in place.
This method modifies the `NormalizedString` directly so that the alignment
information is preserved. If you just want to see the result
of the normalization on a raw string, use
`normalize_str()` instead.
**Parameters:**
normalized (`NormalizedString`) : The normalized string on which to apply this [Normalizer](/docs/tokenizers/pr_1968/en/api/normalizers#tokenizers.normalizers.Normalizer)
#### normalize_str[[tokenizers.normalizers.Normalizer.normalize_str]]
Normalize the given string
This method provides a way to visualize the effect of a
[Normalizer](/docs/tokenizers/pr_1968/en/api/normalizers#tokenizers.normalizers.Normalizer) but it does not keep track of the alignment
information. If you need to get/convert offsets, you can use
`normalize()`
**Parameters:**
sequence (`str`) : A string to normalize
**Returns:**
``str``
A string after normalization
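The difference between `normalize()` and `normalize_str()` can be sketched as follows. This is a minimal example that assumes `NormalizedString` can be imported from the top-level `tokenizers` package and exposes `original` and `normalized` attributes; the input text is illustrative.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

normalizer = Lowercase()

# normalize_str(): takes a plain string and returns a new one, without alignment tracking.
plain_result = normalizer.normalize_str("Hello World")

# normalize(): mutates a NormalizedString in place, so the original text stays available.
tracked = NormalizedString("Hello World")
normalizer.normalize(tracked)

print(plain_result)
print(tracked.original, "->", tracked.normalized)
```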
## Precompiled[[tokenizers.normalizers.Precompiled]]
#### tokenizers.normalizers.Precompiled[[tokenizers.normalizers.Precompiled]]
Precompiled normalizer
Do not use this normalizer manually; it exists for compatibility with SentencePiece.
## Replace[[tokenizers.normalizers.Replace]]
#### tokenizers.normalizers.Replace[[tokenizers.normalizers.Replace]]
Replace normalizer
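A minimal sketch of `Replace`, assuming it is constructed with a pattern followed by the replacement content, which this page does not spell out. The pattern and input string here are illustrative.

```python
from tokenizers.normalizers import Replace

# Replace every literal space with an underscore.
replacer = Replace(" ", "_")
replaced = replacer.normalize_str("a b c")

print(replaced)
```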
## Sequence[[tokenizers.normalizers.Sequence]]
#### tokenizers.normalizers.Sequence[[tokenizers.normalizers.Sequence]]
Chains multiple normalizers together as a single `Sequence`.
The normalizers run one after another, in the given order.
**Parameters:**
normalizers (`List[Normalizer]`) : A list of normalizers to run in sequence
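As a sketch of how ordering matters in practice, the example below chains three normalizers documented on this page. `NFD` runs first so that accents become separate combining marks that `StripAccents` can then remove. The input string is illustrative.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Decompose, drop the combining marks, then lowercase, in that order.
chain = Sequence([NFD(), StripAccents(), Lowercase()])
chained_result = chain.normalize_str("Héllò hôw are ü?")

print(chained_result)
```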
## Strip[[tokenizers.normalizers.Strip]]
#### tokenizers.normalizers.Strip[[tokenizers.normalizers.Strip]]
Strip normalizer
## StripAccents[[tokenizers.normalizers.StripAccents]]
#### tokenizers.normalizers.StripAccents[[tokenizers.normalizers.StripAccents]]
StripAccents normalizer
## BertNormalizer[[tokenizers.normalizers.BertNormalizer]]
#### tokenizers.normalizers.BertNormalizer[[tokenizers.normalizers.BertNormalizer]]
BertNormalizer
Takes care of normalizing raw text before giving it to a BERT model.
This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
**Parameters:**
clean_text (`bool`, *optional*, defaults to `True`) : Whether to clean the text by removing any control characters and replacing all whitespace characters with a regular space.
handle_chinese_chars (`bool`, *optional*, defaults to `True`) : Whether to handle Chinese characters by putting spaces around them.
strip_accents (`bool`, *optional*) : Whether to strip all accents. If this option is not specified (i.e. it is `None`), it is determined by the value of *lowercase* (as in the original BERT).
lowercase (`bool`, *optional*, defaults to `True`) : Whether to lowercase.
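The interaction between `lowercase` and an unspecified `strip_accents` can be sketched as below. This is a small illustration with a made-up input string; it relies only on the parameters listed above.

```python
from tokenizers.normalizers import BertNormalizer

text = "Héllo WORLD"

# strip_accents is left unset, so it follows lowercase=True: accents are stripped too.
uncased = BertNormalizer(lowercase=True).normalize_str(text)

# With lowercase=False, strip_accents also falls back to False: the text is left as it is.
cased = BertNormalizer(lowercase=False).normalize_str(text)

print(uncased)
print(cased)
```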
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
The node API has not been documented yet.