Normalizers

ByteLevel[[tokenizers.normalizers.ByteLevel]]

tokenizers.normalizers.ByteLevel[[tokenizers.normalizers.ByteLevel]]

ByteLevel Normalizer

Lowercase[[tokenizers.normalizers.Lowercase]]

tokenizers.normalizers.Lowercase[[tokenizers.normalizers.Lowercase]]

Lowercase Normalizer

NFC[[tokenizers.normalizers.NFC]]

tokenizers.normalizers.NFC[[tokenizers.normalizers.NFC]]

NFC Unicode Normalizer

NFD[[tokenizers.normalizers.NFD]]

tokenizers.normalizers.NFD[[tokenizers.normalizers.NFD]]

NFD Unicode Normalizer

NFKC[[tokenizers.normalizers.NFKC]]

tokenizers.normalizers.NFKC[[tokenizers.normalizers.NFKC]]

NFKC Unicode Normalizer

NFKD[[tokenizers.normalizers.NFKD]]

tokenizers.normalizers.NFKD[[tokenizers.normalizers.NFKD]]

NFKD Unicode Normalizer
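The four Unicode normalizers above correspond to the standard Unicode normalization forms. As a rough illustration of how the forms differ, the sketch below uses Python's standard-library `unicodedata` module as a stand-in, not the tokenizers classes themselves; the sample string is an arbitrary choice.

```python
import unicodedata

# Sample text: precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01).
text = "\u00e9\ufb01"

# Apply each normalization form with the standard library (stand-in for the
# NFC, NFD, NFKC and NFKD normalizers documented above).
forms = {
    form: unicodedata.normalize(form, text)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    # Print the code points so the decomposition is visible.
    print(form, [f"U+{ord(ch):04X}" for ch in value])
```

The canonical forms (NFC, NFD) only compose or decompose the accented letter, while the compatibility forms (NFKC, NFKD) also expand the ligature into plain `f` and `i`.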

Nmt[[tokenizers.normalizers.Nmt]]

tokenizers.normalizers.Nmt[[tokenizers.normalizers.Nmt]]

Nmt normalizer

Normalizer[[tokenizers.normalizers.Normalizer]]

tokenizers.normalizers.Normalizer[[tokenizers.normalizers.Normalizer]]

Base class for all normalizers

This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.

normalize[[tokenizers.normalizers.Normalizer.normalize]]

Normalize a NormalizedString in-place

This method lets you modify a NormalizedString while keeping track of the alignment information. If you only want to see the result of normalizing a raw string, use normalize_str() instead.

Parameters:

normalized (NormalizedString) : The normalized string on which to apply this Normalizer

normalize_str[[tokenizers.normalizers.Normalizer.normalize_str]]

Normalize the given string

This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, use normalize() instead.

Parameters:

sequence (str) : A string to normalize

Returns:

str

A string after normalization
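The difference between the two methods can be sketched as follows. This is a minimal example, assuming the `tokenizers` Python package is installed and that a `NormalizedString` can be built directly from a `str`; `Lowercase` is used only because its effect is easy to see.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

normalizer = Lowercase()

# normalize_str: quick preview on a raw string, no alignment tracking.
preview = normalizer.normalize_str("Hello WORLD")

# normalize: modifies a NormalizedString in-place and keeps the original
# text alongside the normalized one.
tracked = NormalizedString("Hello WORLD")
normalizer.normalize(tracked)

print(preview)             # normalized text only
print(tracked.normalized)  # normalized text
print(tracked.original)    # original text, still available
```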

Precompiled[[tokenizers.normalizers.Precompiled]]

tokenizers.normalizers.Precompiled[[tokenizers.normalizers.Precompiled]]

Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.

Replace[[tokenizers.normalizers.Replace]]

tokenizers.normalizers.Replace[[tokenizers.normalizers.Replace]]

Replace normalizer
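A minimal usage sketch, assuming the `tokenizers` package is installed and that `Replace` takes a pattern followed by its replacement content (the arguments are not listed on this page); the input string is purely illustrative.

```python
from tokenizers.normalizers import Replace

# Replace every space with an underscore (plain string pattern).
normalizer = Replace(" ", "_")
replaced = normalizer.normalize_str("a b c")
print(replaced)
```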

Sequence[[tokenizers.normalizers.Sequence]]

tokenizers.normalizers.Sequence[[tokenizers.normalizers.Sequence]]

Allows chaining multiple Normalizers together as a Sequence. The normalizers run one after another, in the given order.

Parameters:

normalizers (List[Normalizer]) : A list of Normalizer to be run as a sequence
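As an illustration, the sketch below chains three of the normalizers documented on this page, assuming the `tokenizers` package is installed. Order matters: `NFD` must run before `StripAccents` so that accents are split off as separate combining marks that can then be removed.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Decompose accented characters, drop the accents, then lowercase.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

result = normalizer.normalize_str("H\u00e9ll\u00f2 h\u00f4w are \u00fc?")
print(result)  # → hello how are u?
```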

Strip[[tokenizers.normalizers.Strip]]

tokenizers.normalizers.Strip[[tokenizers.normalizers.Strip]]

Strip normalizer
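A short usage sketch, assuming the `tokenizers` package is installed and that `Strip` accepts `left` and `right` boolean flags (they are not listed on this page) controlling which side of the string is trimmed.

```python
from tokenizers.normalizers import Strip

# Default: trim whitespace on both sides.
both = Strip().normalize_str("  hi  ")

# Trim only the left side, keep trailing whitespace.
left_only = Strip(left=True, right=False).normalize_str("  hi  ")

print(repr(both))
print(repr(left_only))
```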

StripAccents[[tokenizers.normalizers.StripAccents]]

tokenizers.normalizers.StripAccents[[tokenizers.normalizers.StripAccents]]

StripAccents normalizer

BertNormalizer[[tokenizers.normalizers.BertNormalizer]]

tokenizers.normalizers.BertNormalizer[[tokenizers.normalizers.BertNormalizer]]

BertNormalizer

Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.

Parameters:

clean_text (bool, optional, defaults to True) : Whether to clean the text by removing any control characters and replacing every whitespace character with a regular space.

handle_chinese_chars (bool, optional, defaults to True) : Whether to handle Chinese characters by putting spaces around them.

strip_accents (bool, optional) : Whether to strip all accents. If this option is not specified (i.e. left as None), it is determined by the value of lowercase (as in the original BERT).

lowercase (bool, optional, defaults to True) : Whether to lowercase.
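The interaction between `strip_accents` and `lowercase` described above can be sketched as follows, assuming the `tokenizers` package is installed; the sample word is an arbitrary choice.

```python
from tokenizers.normalizers import BertNormalizer

word = "H\u00e9llo"  # "Héllo" with a precomposed "é"

# Defaults: lowercase=True, so strip_accents (left unset) follows it.
default_out = BertNormalizer().normalize_str(word)

# Explicitly keep accents while still lowercasing.
keep_accents_out = BertNormalizer(strip_accents=False).normalize_str(word)

# lowercase=False: strip_accents (left unset) follows it, so nothing changes.
cased_out = BertNormalizer(lowercase=False).normalize_str(word)

print(default_out, keep_accents_out, cased_out)
```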

The Rust API Reference is available directly on the Docs.rs website.

The node API has not been documented yet.
