Components
====================================================================================================

When building a Tokenizer, you can attach various types of components to customize its
behavior. This page lists most of the provided components.

.. _normalizers:


.. entities:: python

    BertNormalizer.clean_text
        clean_text
    BertNormalizer.handle_chinese_chars
        handle_chinese_chars
    BertNormalizer.strip_accents
        strip_accents
    BertNormalizer.lowercase
        lowercase
    Normalizer.Sequence
        ``Sequence([NFKC(), Lowercase()])``
    PreTokenizer.Sequence
        ``Sequence([Punctuation(), WhitespaceSplit()])``
    SplitDelimiterBehavior.removed
        :obj:`removed`
    SplitDelimiterBehavior.isolated
        :obj:`isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`merged_with_previous`
    SplitDelimiterBehavior.merged_with_next
        :obj:`merged_with_next`
    SplitDelimiterBehavior.contiguous
        :obj:`contiguous`

.. entities:: rust

    BertNormalizer.clean_text
        clean_text
    BertNormalizer.handle_chinese_chars
        handle_chinese_chars
    BertNormalizer.strip_accents
        strip_accents
    BertNormalizer.lowercase
        lowercase
    Normalizer.Sequence
        ``Sequence::new(vec![NFKC, Lowercase])``
    PreTokenizer.Sequence
        ``Sequence::new(vec![Punctuation, WhitespaceSplit])``
    SplitDelimiterBehavior.removed
        :obj:`Removed`
    SplitDelimiterBehavior.isolated
        :obj:`Isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`MergedWithPrevious`
    SplitDelimiterBehavior.merged_with_next
        :obj:`MergedWithNext`
    SplitDelimiterBehavior.contiguous
        :obj:`Contiguous`

.. entities:: node

    BertNormalizer.clean_text
        cleanText
    BertNormalizer.handle_chinese_chars
        handleChineseChars
    BertNormalizer.strip_accents
        stripAccents
    BertNormalizer.lowercase
        lowercase
    Normalizer.Sequence
        ..
    PreTokenizer.Sequence
        ..
    SplitDelimiterBehavior.removed
        :obj:`removed`
    SplitDelimiterBehavior.isolated
        :obj:`isolated`
    SplitDelimiterBehavior.merged_with_previous
        :obj:`mergedWithPrevious`
    SplitDelimiterBehavior.merged_with_next
        :obj:`mergedWithNext`
    SplitDelimiterBehavior.contiguous
        :obj:`contiguous`

Normalizers
----------------------------------------------------------------------------------------------------

A ``Normalizer`` is in charge of pre-processing the input string in order to normalize it as
relevant for a given use case. Some common examples of normalization are the Unicode normalization
algorithms (NFD, NFKD, NFC & NFKC), lowercasing, etc.
The specificity of ``tokenizers`` is that we keep track of the alignment while normalizing. This
is essential to allow mapping from the generated tokens back to the input text.

The ``Normalizer`` is optional.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - NFD
     - NFD unicode normalization
     -

   * - NFKD
     - NFKD unicode normalization
     -

   * - NFC
     - NFC unicode normalization
     -

   * - NFKC
     - NFKC unicode normalization
     -

   * - Lowercase
     - Replaces all uppercase characters with their lowercase equivalents
     - Input: ``HELLO ὈΔΥΣΣΕΎΣ``

       Output: ``hello ὀδυσσεύς``

   * - Strip
     - Removes all whitespace characters on the specified sides (left, right or both) of the input
     - Input: ``" hi "``

       Output: ``"hi"``

   * - StripAccents
     - Removes all accent symbols in Unicode (to be used with NFD for consistency)
     - Input: ``é``

       Output: ``e``

   * - Replace
     - Replaces matches of a custom string or regexp with the given content
     - ``Replace("a", "e")`` will behave like this:

       Input: ``"banana"``

       Output: ``"benene"``

   * - BertNormalizer
     - Provides an implementation of the Normalizer used in the original BERT. Options
       that can be set are:

            - :entity:`BertNormalizer.clean_text`
            - :entity:`BertNormalizer.handle_chinese_chars`
            - :entity:`BertNormalizer.strip_accents`
            - :entity:`BertNormalizer.lowercase`

     -

   * - Sequence
     - Composes multiple normalizers that will run in the provided order
     - :entity:`Normalizer.Sequence`
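
The combined behavior of a normalizer sequence such as NFD + StripAccents + Lowercase can be
sketched with Python's standard ``unicodedata`` module. This is a minimal sketch of the behavior
only (not the ``tokenizers`` API), and it does not perform the alignment tracking the library
provides:

```python
import unicodedata

def normalize(text: str) -> str:
    # NFD: decompose characters into base characters plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # StripAccents: drop the combining marks (Unicode category "Mn")
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Lowercase
    return stripped.lower()

print(normalize("Héllò HOW"))  # hello how
```

Running NFD first is what makes StripAccents work here: decomposition separates each accent into
its own combining character, which can then be filtered out.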


.. _pre-tokenizers:

Pre tokenizers
----------------------------------------------------------------------------------------------------

The ``PreTokenizer`` takes care of splitting the input according to a set of rules. This
pre-processing lets you ensure that the underlying ``Model`` does not build tokens across multiple
"splits".
For example if you don't want to have whitespaces inside a token, then you can have a
``PreTokenizer`` that splits on these whitespaces.

You can easily combine multiple ``PreTokenizer`` together using a ``Sequence`` (see below).
The ``PreTokenizer`` is also allowed to modify the string, just like a ``Normalizer`` does. This
is necessary for some complicated algorithms that require splitting before normalizing (e.g.
the ByteLevel).

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example

   * - ByteLevel
     - Splits on whitespaces while remapping all the bytes to a set of visible characters. This
       technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties:

        - Since it maps on bytes, a tokenizer using this only requires **256** characters as initial
          alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode
          characters.
        - A consequence of the previous point is that it is absolutely unnecessary to have an
          unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)
        - For non-ASCII characters, it gets completely unreadable, but it works nonetheless!

     - Input: ``"Hello my friend, how are you?"``

       Output: ``"Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"``

   * - Whitespace
     - Splits on word boundaries (using the following regular expression: ``\w+|[^\w\s]+``)
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there", "!"``

   * - WhitespaceSplit
     - Splits on any whitespace character
     - Input: ``"Hello there!"``

       Output: ``"Hello", "there!"``

   * - Punctuation
     - Will isolate all punctuation characters
     - Input: ``"Hello?"``

       Output: ``"Hello", "?"``

   * - Metaspace
     - Splits on whitespaces and replaces them with a special char "▁" (U+2581)
     - Input: ``"Hello there"``

       Output: ``"Hello", "▁there"``

   * - CharDelimiterSplit
     - Splits on a given character
     - Example with ``x``:

       Input: ``"Helloxthere"``

       Output: ``"Hello", "there"``

   * - Digits
     - Splits the numbers from any other characters.
     - Input: ``"Hello123there"``

       Output: ``"Hello", "123", "there"``

   * - Split
     - Versatile pre-tokenizer that splits on provided pattern and according to provided behavior.
       The pattern can be inverted if necessary.

         - pattern should be either a custom string or regexp.
         - behavior should be one of:

            * :entity:`SplitDelimiterBehavior.removed`
            * :entity:`SplitDelimiterBehavior.isolated`
            * :entity:`SplitDelimiterBehavior.merged_with_previous`
            * :entity:`SplitDelimiterBehavior.merged_with_next`
            * :entity:`SplitDelimiterBehavior.contiguous`

         - invert should be a boolean flag.

     - Example with `pattern` = :obj:`" "`, `behavior` = :obj:`"isolated"`, `invert` = :obj:`False`:

        Input: ``"Hello, how are you?"``

        Output: ``"Hello,", " ", "how", " ", "are", " ", "you?"``

   * - Sequence
     - Lets you compose multiple ``PreTokenizer`` that will be run in the given order
     - :entity:`PreTokenizer.Sequence`
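
The splitting rule of the ``Whitespace`` pre-tokenizer above can be reproduced with the standard
``re`` module. This is a sketch of the splitting behavior only; the real pre-tokenizer also tracks
character offsets for alignment:

```python
import re

# Runs of word characters, or runs of punctuation, as in the Whitespace rule
PATTERN = re.compile(r"\w+|[^\w\s]+")

def whitespace_pretokenize(text: str) -> list[str]:
    return PATTERN.findall(text)

print(whitespace_pretokenize("Hello there!"))  # ['Hello', 'there', '!']
```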


.. _models:

Models
----------------------------------------------------------------------------------------------------

Models are the core algorithms used to actually tokenize, and therefore, they are the only mandatory
component of a Tokenizer.

.. list-table::
   :header-rows: 1

   * - Name
     - Description

   * - WordLevel
     - This is the "classic" tokenization algorithm. It lets you simply map words to IDs
       without anything fancy. This has the advantage of being really simple to use and
       understand, but it requires extremely large vocabularies for a good coverage.

       *Using this* ``Model`` *requires the use of a* ``PreTokenizer``. *No choice will be made by
       this model directly: it simply maps input tokens to IDs.*

   * - BPE
     - One of the most popular subword tokenization algorithms. Byte-Pair Encoding works by
       starting from characters and merging those that are most frequently seen together, thus
       creating new tokens. It then works iteratively to build new tokens out of the most
       frequent pairs it sees in a corpus.

       BPE is able to build words it has never seen by using multiple subword tokens, and thus
       requires smaller vocabularies, with fewer chances of having "unk" (unknown) tokens.

   * - WordPiece
     - This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in
       models like BERT. It uses a greedy algorithm that tries to build long words first, splitting
       into multiple tokens when entire words don't exist in the vocabulary. This is different from
       BPE, which starts from characters, building tokens as big as possible.

       It uses the famous ``##`` prefix to identify tokens that are part of a word (i.e. not
       starting a word).

   * - Unigram
     - Unigram is also a subword tokenization algorithm, and works by trying to identify the best
       set of subword tokens to maximize the probability of a given sentence. It differs from BPE
       in that it is not deterministic, based on a set of rules applied sequentially. Instead,
       Unigram computes multiple ways of tokenizing and chooses the most probable one.
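
The greedy longest-match-first idea behind WordPiece can be sketched in a few lines of Python.
The vocabulary below is a toy example, and this is an illustration of the matching strategy only,
not the library's implementation:

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a vocab hit
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return [unk]  # no substring matched: the whole word is unknown
        start = end
    return tokens

vocab = {"un", "##affable", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##affable']
```

Note how the greedy strategy prefers ``##affable`` over the shorter ``##aff`` + ``##able`` split,
which is exactly the "build long words first" behavior described above.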


.. _post-processors:

PostProcessor
----------------------------------------------------------------------------------------------------

After the whole pipeline, we sometimes want to insert some special tokens before feeding
a tokenized string into a model, as in "[CLS] My horse is amazing [SEP]". The ``PostProcessor``
is the component doing just that.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
     - Example
   * - TemplateProcessing
     - Lets you easily template the post processing, adding special tokens, and specifying
       the ``type_id`` for each sequence/special token. The template is given two strings
       representing the single sequence and the pair of sequences, as well as a set of
       special tokens to use.
     - Example, when specifying a template with these values:

            - single: ``"[CLS] $A [SEP]"``
            - pair: ``"[CLS] $A [SEP] $B [SEP]"``
            - special tokens:

                - ``"[CLS]"``
                - ``"[SEP]"``

       Input: ``("I like this", "but not this")``

       Output: ``"[CLS] I like this [SEP] but not this [SEP]"``
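
The templating idea can be sketched with plain string substitution. The helper below is
hypothetical and only produces the token string; the real ``TemplateProcessing`` also assigns
token IDs and ``type_id`` values:

```python
def apply_template(single, pair, seq_a, seq_b=None):
    # Fill $A (and $B, for a pair of sequences) into the chosen template
    if seq_b is None:
        return single.replace("$A", seq_a)
    return pair.replace("$A", seq_a).replace("$B", seq_b)

single = "[CLS] $A [SEP]"
pair = "[CLS] $A [SEP] $B [SEP]"
print(apply_template(single, pair, "I like this", "but not this"))
# [CLS] I like this [SEP] but not this [SEP]
```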


.. _decoders:

Decoders
----------------------------------------------------------------------------------------------------

The Decoder knows how to go from the IDs used by the Tokenizer back to a readable piece of text.
For example, some ``Normalizer`` and ``PreTokenizer`` components use special characters or
identifiers that need to be reverted.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
   * - ByteLevel
     - Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using
       a set of visible Unicode characters to represent each byte, so we need a Decoder to
       revert this process and get something readable again.
   * - Metaspace
     - Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier ``▁`` to
       identify whitespaces, and so this Decoder helps with decoding these.
   * - WordPiece
     - Reverts the WordPiece Model. This model uses a special identifier ``##`` for continuing
       subwords, and so this Decoder helps with decoding these.
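
For example, the WordPiece decoding step can be sketched as joining tokens with spaces while
gluing ``##``-prefixed continuation pieces onto the previous token (an illustration of the
behavior, not the library API):

```python
def wordpiece_decode(tokens: list[str]) -> str:
    out = []
    for i, token in enumerate(tokens):
        if token.startswith("##"):
            out.append(token[2:])  # glue continuation pieces to the previous token
        else:
            if i > 0:
                out.append(" ")  # a non-## token starts a new word
            out.append(token)
    return "".join(out)

print(wordpiece_decode(["un", "##affable", "hello"]))  # unaffable hello
```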