Components
====================================================================================================
When building a Tokenizer, you can attach various types of components to customize its
behavior. This page lists most of the provided components.
.. entities:: python

    BertNormalizer.clean_text
        clean_text
    BertNormalizer.handle_chinese_chars
        handle_chinese_chars
    BertNormalizer.strip_accents
        strip_accents
    BertNormalizer.lowercase
        lowercase
    Normalizer.Sequence
        ``Sequence([NFKC(), Lowercase()])``
    PreTokenizer.Sequence
        ``Sequence([Punctuation(), WhitespaceSplit()])``
    SplitDelimiterBehavior.removed
        :obj:`removed`
    SplitDelimiterBehavior.isolated
        :obj:`isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`merged_with_previous`
    SplitDelimiterBehavior.merged_with_next
        :obj:`merged_with_next`
    SplitDelimiterBehavior.contiguous
        :obj:`contiguous`

.. entities:: rust

    BertNormalizer.clean_text
        clean_text
    BertNormalizer.handle_chinese_chars
        handle_chinese_chars
    BertNormalizer.strip_accents
        strip_accents
    BertNormalizer.lowercase
        lowercase
    Normalizer.Sequence
        ``Sequence::new(vec![NFKC, Lowercase])``
    PreTokenizer.Sequence
        ``Sequence::new(vec![Punctuation, WhitespaceSplit])``
    SplitDelimiterBehavior.removed
        :obj:`Removed`
    SplitDelimiterBehavior.isolated
        :obj:`Isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`MergedWithPrevious`
    SplitDelimiterBehavior.merged_with_next
        :obj:`MergedWithNext`
    SplitDelimiterBehavior.contiguous
        :obj:`Contiguous`

.. entities:: node

    BertNormalizer.clean_text
        cleanText
    BertNormalizer.handle_chinese_chars
        handleChineseChars
    BertNormalizer.strip_accents
        stripAccents
    BertNormalizer.lowercase
        lowercase
    Normalizer.Sequence
        ..
    PreTokenizer.Sequence
        ..
    SplitDelimiterBehavior.removed
        :obj:`removed`
    SplitDelimiterBehavior.isolated
        :obj:`isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`mergedWithPrevious`
    SplitDelimiterBehavior.merged_with_next
        :obj:`mergedWithNext`
    SplitDelimiterBehavior.contiguous
        :obj:`contiguous`

.. _normalizers:
Normalizers
----------------------------------------------------------------------------------------------------
A ``Normalizer`` is in charge of pre-processing the input string in order to normalize it as
relevant for a given use case. Some common examples of normalization are the Unicode normalization
algorithms (NFD, NFKD, NFC & NFKC), lowercasing, etc.

The specificity of ``tokenizers`` is that we keep track of the alignment while normalizing. This
is essential to allow mapping from the generated tokens back to the input text.

The ``Normalizer`` is optional.
.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - NFD
     - NFD unicode normalization
     -

   * - NFKD
     - NFKD unicode normalization
     -

   * - NFC
     - NFC unicode normalization
     -

   * - NFKC
     - NFKC unicode normalization
     -

   * - Lowercase
     - Replaces all uppercase characters with lowercase
     - Input: ``HELLO ὈΔΥΣΣΕΎΣ``

       Output: ``hello ὀδυσσεύς``

   * - Strip
     - Removes all whitespace characters on the specified sides (left, right or both) of the input
     - Input: ``" hi "``

       Output: ``"hi"``

   * - StripAccents
     - Removes all accent symbols in Unicode (to be used with NFD for consistency)
     - Input: ``é``

       Output: ``e``

   * - Replace
     - Replaces a custom string or regexp with the given content
     - ``Replace("a", "e")`` will behave like this:

       Input: ``"banana"``

       Output: ``"benene"``

   * - BertNormalizer
     - Provides an implementation of the Normalizer used in the original BERT. Options
       that can be set are:

       - :entity:`BertNormalizer.clean_text`
       - :entity:`BertNormalizer.handle_chinese_chars`
       - :entity:`BertNormalizer.strip_accents`
       - :entity:`BertNormalizer.lowercase`

     -

   * - Sequence
     - Composes multiple normalizers that will run in the provided order
     - :entity:`Normalizer.Sequence`
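As a quick illustration, here is a minimal sketch using the Python bindings that composes several
of the normalizers above with ``Sequence`` and applies the result to a raw string:

.. code-block:: python

    from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

    # The normalizers run in the provided order: decompose, drop accents, lowercase
    normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

    print(normalizer.normalize_str("Héllò hôw are ü?"))
    # "hello how are u?"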
.. _pre-tokenizers:
Pre tokenizers
----------------------------------------------------------------------------------------------------
The ``PreTokenizer`` takes care of splitting the input according to a set of rules. This
pre-processing lets you ensure that the underlying ``Model`` does not build tokens across multiple
"splits". For example, if you don't want to have whitespace inside a token, then you can have a
``PreTokenizer`` that splits on whitespace.

You can easily combine multiple ``PreTokenizer`` together using a ``Sequence`` (see below).

The ``PreTokenizer`` is also allowed to modify the string, just like a ``Normalizer`` does. This
is necessary to allow some complicated algorithms that require splitting before normalizing (e.g.
the ByteLevel).
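As a quick illustration before the list below, here is a minimal sketch using the Python bindings:
each "split" produced by a ``PreTokenizer`` comes with its offsets into the original string, which
is how the alignment is preserved.

.. code-block:: python

    from tokenizers.pre_tokenizers import Whitespace

    pre_tokenizer = Whitespace()

    # Each split comes with its offsets into the original string
    print(pre_tokenizer.pre_tokenize_str("Hello there!"))
    # [("Hello", (0, 5)), ("there", (6, 11)), ("!", (11, 12))]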
.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - ByteLevel
     - Splits on whitespaces while remapping all the bytes to a set of visible characters. This
       technique was introduced by OpenAI with GPT-2 and has some more or less nice properties:

       - Since it maps on bytes, a tokenizer using this only requires **256** characters as initial
         alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode
         characters.
       - A consequence of the previous point is that it is absolutely unnecessary to have an
         unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)
       - For non-ASCII characters, it gets completely unreadable, but it works nonetheless!

     - Input: ``"Hello my friend, how are you?"``

       Output: ``"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``

   * - Whitespace
     - Splits on word boundaries (using the following regular expression: ``\w+|[^\w\s]+``)
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there", "!"``

   * - WhitespaceSplit
     - Splits on any whitespace character
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there!"``

   * - Punctuation
     - Will isolate all punctuation characters
     - Input: ``"Hello?"``

       Output: ``"Hello", "?"``

   * - Metaspace
     - Splits on whitespaces and replaces them with a special character "▁" (U+2581)
     - Input: ``"Hello there"``

       Output: ``"Hello", "▁there"``

   * - CharDelimiterSplit
     - Splits on a given character
     - Example with ``x``:

       Input: ``"Helloxthere"``

       Output: ``"Hello", "there"``

   * - Digits
     - Splits the numbers from any other characters.
     - Input: ``"Hello123there"``

       Output: ``"Hello", "123", "there"``

   * - Split
     - Versatile pre-tokenizer that splits on the provided pattern according to the provided
       behavior. The pattern can be inverted if necessary.

       - pattern should be either a custom string or regexp.
       - behavior should be one of:

         * :entity:`SplitDelimiterBehavior.removed`
         * :entity:`SplitDelimiterBehavior.isolated`
         * :entity:`SplitDelimiterBehavior.merged_with_previous`
         * :entity:`SplitDelimiterBehavior.merged_with_next`
         * :entity:`SplitDelimiterBehavior.contiguous`

       - invert should be a boolean flag.

     - Example with `pattern` = :obj:`" "`, `behavior` = :obj:`"isolated"`, `invert` = :obj:`False`:

       Input: ``"Hello, how are you?"``

       Output: ``"Hello,", " ", "how", " ", "are", " ", "you?"``

   * - Sequence
     - Lets you compose multiple ``PreTokenizer`` that will be run in the given order
     - :entity:`PreTokenizer.Sequence`
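Here is a minimal sketch (again using the Python bindings) composing two of the pre-tokenizers
above with ``Sequence``; the ``individual_digits`` parameter of ``Digits`` controls whether each
digit becomes its own split:

.. code-block:: python

    from tokenizers.pre_tokenizers import Digits, Sequence, Whitespace

    # Run Whitespace first, then isolate groups of digits
    pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False)])

    print(pre_tokenizer.pre_tokenize_str("Hello123there"))
    # [("Hello", (0, 5)), ("123", (5, 8)), ("there", (8, 13))]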
.. _models:
Models
----------------------------------------------------------------------------------------------------
Models are the core algorithms used to actually tokenize, and therefore, they are the only mandatory
component of a Tokenizer.
.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - WordLevel
     - This is the "classic" tokenization algorithm. It lets you simply map words to IDs
       without anything fancy. This has the advantage of being really simple to use and
       understand, but it requires extremely large vocabularies for good coverage.

       *Using this* ``Model`` *requires the use of a* ``PreTokenizer``. *No choice will be made by
       this model directly, it simply maps input tokens to IDs.*

   * - BPE
     - One of the most popular subword tokenization algorithms. Byte-Pair-Encoding works by
       starting with characters, while merging those that are the most frequently seen together,
       thus creating new tokens. It then works iteratively to build new tokens out of the most
       frequent pairs it sees in a corpus.

       BPE is able to build words it has never seen by using multiple subword tokens, and thus
       requires smaller vocabularies, with fewer chances of having "unk" (unknown) tokens.

   * - WordPiece
     - This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in
       models like BERT. It uses a greedy algorithm that tries to build long words first, splitting
       into multiple tokens when entire words don't exist in the vocabulary. This is different from
       BPE, which starts from characters, building tokens as big as possible.

       It uses the famous ``##`` prefix to identify tokens that are part of a word (i.e. not
       starting a word).

   * - Unigram
     - Unigram is also a subword tokenization algorithm, and works by trying to identify the best
       set of subword tokens to maximize the probability for a given sentence. This is different
       from BPE in that it is not deterministic, based on a set of rules applied sequentially.
       Instead, Unigram is able to compute multiple ways of tokenizing, while choosing the most
       probable one.
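Since a ``Model`` is the only mandatory component, the smallest possible ``Tokenizer`` is just a
model. Here is a minimal sketch using the Python bindings; the ``"[UNK]"`` value chosen for
``unk_token`` is an arbitrary convention:

.. code-block:: python

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # The Model is the only mandatory component: this alone is a valid Tokenizer,
    # although it still needs to be trained (or loaded) before it can tokenize
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))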
.. _post-processors:
PostProcessor
----------------------------------------------------------------------------------------------------
After the whole pipeline, we sometimes want to insert some special tokens before feeding a
tokenized string into a model, like "[CLS] My horse is amazing [SEP]". The ``PostProcessor``
is the component doing just that.
.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - TemplateProcessing
     - Lets you easily template the post processing, adding special tokens and specifying
       the ``type_id`` for each sequence/special token. The template is given two strings
       representing the single sequence and the pair of sequences, as well as a set of
       special tokens to use.
     - Example, when specifying a template with these values:

       - single: ``"[CLS] $A [SEP]"``
       - pair: ``"[CLS] $A [SEP] $B [SEP]"``
       - special tokens:

         - ``"[CLS]"``
         - ``"[SEP]"``

       Input: ``("I like this", "but not this")``

       Output: ``"[CLS] I like this [SEP] but not this [SEP]"``
.. _decoders:
Decoders
----------------------------------------------------------------------------------------------------
The Decoder knows how to go from the IDs used by the Tokenizer back to a readable piece of text.
For example, some ``Normalizer`` and ``PreTokenizer`` use special characters or identifiers that
need to be reverted.
.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - ByteLevel
     - Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using
       a set of visible Unicode characters to represent each byte, so we need a Decoder to
       revert this process and get something readable again.

   * - Metaspace
     - Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier ``▁`` to
       identify whitespaces, and so this Decoder helps with decoding these.

   * - WordPiece
     - Reverts the WordPiece Model. This model uses a special identifier ``##`` for continuing
       subwords, and so this Decoder helps with decoding these.
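To close the loop, here is a minimal sketch (Python bindings) attaching a decoder to an existing
``tokenizer``; calling ``decode`` directly on a decoder instance is assumed to be available, as in
recent versions of ``tokenizers``:

.. code-block:: python

    from tokenizers import decoders

    # Attach to an existing Tokenizer so tokenizer.decode() strips the ``##`` markers
    tokenizer.decoder = decoders.WordPiece()

    # Decoders can also be applied to a list of tokens directly
    print(decoders.WordPiece().decode(["Hel", "##lo", "there"]))
    # "Hello there"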