Normalizers
ByteLevel[[tokenizers.normalizers.ByteLevel]]
tokenizers.normalizers.ByteLevel[[tokenizers.normalizers.ByteLevel]]
ByteLevel Normalizer
Lowercase[[tokenizers.normalizers.Lowercase]]
tokenizers.normalizers.Lowercase[[tokenizers.normalizers.Lowercase]]
Lowercase Normalizer
NFC[[tokenizers.normalizers.NFC]]
tokenizers.normalizers.NFC[[tokenizers.normalizers.NFC]]
NFC Unicode Normalizer
NFD[[tokenizers.normalizers.NFD]]
tokenizers.normalizers.NFD[[tokenizers.normalizers.NFD]]
NFD Unicode Normalizer
NFKC[[tokenizers.normalizers.NFKC]]
tokenizers.normalizers.NFKC[[tokenizers.normalizers.NFKC]]
NFKC Unicode Normalizer
NFKD[[tokenizers.normalizers.NFKD]]
tokenizers.normalizers.NFKD[[tokenizers.normalizers.NFKD]]
NFKD Unicode Normalizer
Nmt[[tokenizers.normalizers.Nmt]]
tokenizers.normalizers.Nmt[[tokenizers.normalizers.Nmt]]
Nmt normalizer
Normalizer[[tokenizers.normalizers.Normalizer]]
tokenizers.normalizers.Normalizer[[tokenizers.normalizers.Normalizer]]
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize[[tokenizers.normalizers.Normalizer.normalize]]
Normalize a NormalizedString in-place
This method modifies a NormalizedString in place and
keeps track of the alignment information. If you just want to see the result
of the normalization on a raw string, you can use
normalize_str()
Parameters:
normalized (NormalizedString) : The normalized string on which to apply this Normalizer
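As a minimal sketch of in-place normalization, the snippet below wraps a raw string in a NormalizedString and applies a Lowercase normalizer to it. It assumes that `NormalizedString` can be constructed directly from a `str` and exposes `normalized` and `original` properties in the Python bindings; the variable names are illustrative only.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

# Assumption: NormalizedString accepts a raw str in its constructor.
text = NormalizedString("Hello World")

# normalize() mutates `text` in place and returns nothing.
Lowercase().normalize(text)

result = text.normalized   # the text after normalization
original = text.original   # the untouched input, kept for alignment
print(result, "|", original)
```

Because the original text is retained alongside the normalized one, offsets can later be mapped back to the input.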
normalize_str[[tokenizers.normalizers.Normalizer.normalize_str]]
Normalize the given string
This method provides a way to visualize the effect of a
Normalizer but it does not keep track of the alignment
information. If you need to get/convert offsets, you can use
normalize()
Parameters:
sequence (str) : A string to normalize
Returns:
str
A string after normalization
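For a quick look at what a normalizer does to a plain string, `normalize_str()` can be called directly. The sketch below uses NFKC on a string that starts with the "fi" ligature (U+FB01); the input string and variable names are illustrative only.

```python
from tokenizers.normalizers import NFKC

# "\ufb01" is the single-character ligature for "fi".
ligature = "\ufb01ne"

# NFKC applies compatibility decomposition, expanding the ligature.
normalized = NFKC().normalize_str(ligature)
print(normalized)  # fine
```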
Precompiled[[tokenizers.normalizers.Precompiled]]
tokenizers.normalizers.Precompiled[[tokenizers.normalizers.Precompiled]]
Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.
Replace[[tokenizers.normalizers.Replace]]
tokenizers.normalizers.Replace[[tokenizers.normalizers.Replace]]
Replace normalizer
Sequence[[tokenizers.normalizers.Sequence]]
tokenizers.normalizers.Sequence[[tokenizers.normalizers.Sequence]]
Allows chaining multiple Normalizers as a Sequence. The normalizers run one after another, in the given order.
Parameters:
normalizers (List[Normalizer]) : A list of Normalizer to be run as a sequence
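A Sequence might be used as follows: a small sketch that decomposes characters with NFD, removes the accents with StripAccents, then lowercases. The sample sentence and variable names are illustrative only.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Order matters: NFD first splits accented letters into base letter plus
# combining mark, so StripAccents can then drop the marks.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

output = normalizer.normalize_str("Héllò hôw are ü?")
print(output)  # hello how are u?
```

Placing StripAccents before NFD would leave precomposed characters such as "é" untouched, which is why the order of the list is significant.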
Strip[[tokenizers.normalizers.Strip]]
tokenizers.normalizers.Strip[[tokenizers.normalizers.Strip]]
Strip normalizer
StripAccents[[tokenizers.normalizers.StripAccents]]
tokenizers.normalizers.StripAccents[[tokenizers.normalizers.StripAccents]]
StripAccents normalizer
BertNormalizer[[tokenizers.normalizers.BertNormalizer]]
tokenizers.normalizers.BertNormalizer[[tokenizers.normalizers.BertNormalizer]]
BertNormalizer
Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
Parameters:
clean_text (bool, optional, defaults to True) : Whether to clean the text by removing any control characters and replacing every whitespace character with a standard space.
handle_chinese_chars (bool, optional, defaults to True) : Whether to handle Chinese characters by putting spaces around them.
strip_accents (bool, optional) : Whether to strip all accents. If this option is not specified (i.e. it is None), it is determined by the value of lowercase (as in the original BERT).
lowercase (bool, optional, defaults to True) : Whether to lowercase.
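The interaction between `lowercase` and an unspecified `strip_accents` can be illustrated with a short sketch. The sample text and variable names are illustrative only; the tab in the input is there to show the whitespace cleanup done by `clean_text`.

```python
from tokenizers.normalizers import BertNormalizer

# strip_accents is left unset, so it follows the value of `lowercase`.
uncased = BertNormalizer(lowercase=True)
cased = BertNormalizer(lowercase=False)

text = "Héllo\tWORLD"

uncased_out = uncased.normalize_str(text)  # accents stripped, lowercased
cased_out = cased.normalize_str(text)      # accents and case preserved
print(uncased_out)
print(cased_out)
```

In both cases the tab is replaced by a regular space, since `clean_text` defaults to True.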
The Rust API Reference is available directly on the Docs.rs website.
The node API has not been documented yet.