michaelmuellersmao committed on
Commit
474b76e
·
verified ·
1 Parent(s): ecdd04d

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,435 @@
1
+ ---
2
+ language:
3
+ - de
4
+ - en
5
+ - fr
6
+ - it
7
+ - pt
8
+ - es
9
+ - tr
10
+ - sv
11
+ license: cc-by-nc-4.0
12
+ tags:
13
+ - text-normalization
14
+ - tts
15
+ - text-to-speech
16
+ - byt5
17
+ - multilingual
18
+ - seq2seq
19
+ - speech-processing
20
+ datasets:
21
+ - custom
22
+ metrics:
23
+ - accuracy
24
+ pipeline_tag: text2text-generation
25
+ model-index:
26
+ - name: saytext
27
+ results:
28
+ - task:
29
+ type: text2text-generation
30
+ name: Text Normalization
31
+ metrics:
32
+ - type: accuracy
33
+ value: 94.2
34
+ name: Sentence Accuracy (test set)
35
+ - type: accuracy
36
+ value: 22.6
37
+ name: Sentence Accuracy (PolyNorm benchmark)
38
+ ---
39
+
40
+ # SayText — Multilingual Text Normalization for TTS
41
+
42
+ **A multilingual neural text normalization model for TTS (text-to-speech) pipelines.** Converts written text to spoken form across 8 European languages using a fine-tuned ByT5-Base (580M parameters).
43
+
44
+ > `"Das kostet 12,50 €."` → `"Das kostet zwölf Euro fünfzig."`
45
+
46
+ ## Key Features
47
+
48
+ - **8 languages**: German, English, French, Italian, Portuguese, Spanish, Turkish, Swedish
49
+ - **24 semiotic classes**: cardinals, money, dates, time, phone numbers, percentages, units, passthrough, and more
50
+ - **Passthrough-aware**: learns when NOT to normalize (plain text, already-spoken forms, technical identifiers)
51
+ - **Voice-agent optimized**: designed for LLM → TN → TTS pipelines where input is always well-formatted
52
+ - **Byte-level**: ByT5 processes raw UTF-8 bytes — no tokenizer vocabulary limitations, handles `€`, `₺`, `°C` natively
53
+
54
+ ## Quick Start
55
+
56
+ ```python
57
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
58
+ import torch
59
+
60
+ model = AutoModelForSeq2SeqLM.from_pretrained("smaoai/saytext")
61
+ tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
62
+ model.eval()
63
+
64
+ def normalize(text: str, language: str) -> str:
65
+     """Normalize text for TTS. Language: de, en, fr, it, pt, es, tr, sv"""
66
+     input_text = f"<{language}> {text}"
67
+     inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
68
+     with torch.no_grad():
69
+         output = model.generate(**inputs, max_new_tokens=512, num_beams=1)
70
+     return tokenizer.decode(output[0], skip_special_tokens=True)
71
+
72
+ # Examples
73
+ print(normalize("Das kostet 12,50 €.", "de"))
74
+ # → Das kostet zwölf Euro fünfzig.
75
+
76
+ print(normalize("The flight departs at 8:45 AM.", "en"))
77
+ # → The flight departs at eight forty-five AM.
78
+
79
+ print(normalize("Le prix est de 327,67 €.", "fr"))
80
+ # → Le prix est de trois cent vingt-sept euros et soixante-sept centimes.
81
+
82
+ print(normalize("Ich helfe Ihnen gerne weiter.", "de"))
83
+ # → Ich helfe Ihnen gerne weiter. (passthrough — no normalization needed)
84
+
85
+ print(normalize("Wir verwenden Python 3.10 in unserem Projekt.", "de"))
86
+ # → Wir verwenden Python 3.10 in unserem Projekt. (technical identifier preserved)
87
+ ```
88
+
89
+ ## Available Formats
90
+
91
+ This repo includes the model in three formats:
92
+
93
+ | Format | Path | Size | Use case |
94
+ |--------|------|------|----------|
95
+ | **PyTorch** (default) | `model.safetensors` | 2.2 GB | Development, fine-tuning, HuggingFace `pipeline()` |
96
+ | **CTranslate2 FP16** | `ct2_float16/` | 1.1 GB | GPU production inference (T4, A10G, A100) |
97
+ | **CTranslate2 INT8** | `ct2_int8/` | 556 MB | CPU production inference, edge deployment |
98
+
99
+ ```
100
+ repo/
101
+   model.safetensors          # PyTorch weights
102
+   config.json                # Model architecture
103
+   tokenizer_config.json      # Tokenizer config
104
+   added_tokens.json          # Additional tokens
105
+   generation_config.json     # Default generation parameters
106
+   handler.py                 # HuggingFace Inference API handler
107
+   ct2_float16/               # CTranslate2 FP16 (GPU)
108
+     model.bin
109
+     config.json
110
+     shared_vocabulary.json
111
+   ct2_int8/                  # CTranslate2 INT8 (CPU)
112
+     model.bin
113
+     config.json
114
+     shared_vocabulary.json
115
+ ```
116
+
117
+ ## Model Description
118
+
119
+ ### Why ByT5?
120
+
121
+ Text normalization deals with symbols, digits, and special characters (`€12,50`, `+49(0)30`, `info@web.de`) that subword tokenizers fragment unpredictably. ByT5 operates on raw UTF-8 bytes — every character is processed individually with no tokenizer artifacts. This is critical for:
122
+
123
+ - **Locale-sensitive formats**: `1.500` means "one thousand five hundred" in German but "one point five" in English
124
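+ - **Special symbols**: `€`, `£`, `₺`, and `°C` take two or three bytes each in UTF-8 (`€` is three bytes), but every symbol always maps to the same fixed byte sequence instead of unpredictable subword fragments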
125
+ - **Phone numbers**: `+49 (0)30 12345678` stays intact byte-by-byte
126
+
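To make the byte-level view concrete, the sketch below reproduces the text-to-ID mapping without loading the tokenizer. It assumes the ID layout visible in this repo's `shared_vocabulary.json` (`<pad>` = 0, `</s>` = 1, `<unk>` = 2, followed by the 256 byte values), so a byte `b` gets ID `b + 3`. The helper name `byt5_ids` is made up for illustration and is not part of the model's API.

```python
# Minimal sketch: how text maps to ByT5 input IDs, without loading a tokenizer.
# Assumption: IDs 0-2 are <pad>, </s>, <unk>, so byte value b becomes ID b + 3.
BYTE_OFFSET = 3
EOS_ID = 1


def byt5_ids(text: str, add_eos: bool = False) -> list[int]:
    """Return one ID per UTF-8 byte of `text` (hypothetical helper)."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


# Print the byte count and IDs for a few symbols common in normalization input.
for symbol in ["%", "£", "°C", "€", "₺"]:
    n_bytes = len(symbol.encode("utf-8"))
    print(f"{symbol!r}: {n_bytes} byte(s) -> {byt5_ids(symbol)}")
```

Non-ASCII symbols occupy two or three IDs each, but the sequence is always identical for the same symbol, which is the property that matters for normalization.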
127
+ ### Architecture
128
+
129
+ | Component | Details |
130
+ |-----------|---------|
131
+ | Base model | [google/byt5-base](https://huggingface.co/google/byt5-base) |
132
+ | Parameters | 580M |
133
+ | Type | Encoder-decoder (seq2seq) |
134
+ | Tokenization | Byte-level UTF-8 (no SentencePiece) |
135
+ | Max input length | 512 bytes (~250 characters) |
136
+ | Max output length | 512 bytes |
137
+ | Language conditioning | Prefix token: `<de>`, `<en>`, `<fr>`, etc. |
138
+
139
+ ### How It Works
140
+
141
+ The model takes text with a language prefix and outputs the spoken form:
142
+
143
+ ```
144
+ Input: <de> Am 03.04.2026 um 14:30 kostet der Flug 249,99 €.
145
+ Output: Am dritte April zweitausendsechsundzwanzig um vierzehn Uhr dreißig
146
+ kostet der Flug zweihundertneunundvierzig Euro neunundneunzig.
147
+ ```
148
+
149
+ For text that doesn't need normalization, the model learns to pass it through unchanged:
150
+
151
+ ```
152
+ Input: <en> Sure, I can help you with that.
153
+ Output: Sure, I can help you with that.
154
+ ```
155
+
156
+ ## Training Data
157
+
158
+ The model was trained on **3.07M pairs** across 8 languages, generated through a two-layer pipeline:
159
+
160
+ ### Data Pipeline
161
+
162
+ 1. **Entity Sampler** (deterministic) — Generates verified (written, spoken) pairs per semiotic class using locale-aware libraries (Babel, num2words). Every pair is programmatically verified correct.
163
+
164
+ 2. **Sentence Generator** (LLM-powered) — 192,000 natural sentence templates with `{SLOT}` placeholders, generated via Nebius GLM-5. Templates are filled with entity sampler pairs at assembly time.
165
+
166
+ 3. **Real-world enrichment** — Leipzig news corpora + OPUS ECB financial text, auto-labeled by NVIDIA NeMo WFST grammars, validated and merged.
167
+
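The assembly step in the pipeline above (templates with `{SLOT}` placeholders filled from verified pairs) can be sketched as follows. This is an illustration under assumptions, not the project's generator: the slot names `MONEY` and `TIME`, the `SAMPLERS` table, the example template, and `fill_template` are all hypothetical. The two pairs are taken from the Data Composition table.

```python
import random

# Hypothetical (written, spoken) pairs standing in for the entity sampler output.
SAMPLERS = {
    "MONEY": [("12,50 €", "zwölf Euro fünfzig")],
    "TIME": [("14:30", "vierzehn Uhr dreißig")],
}


def fill_template(template: str, rng: random.Random) -> tuple[str, str]:
    """Fill every known {SLOT} and return the (written, spoken) sentence pair."""
    written, spoken = template, template
    for slot, pairs in SAMPLERS.items():
        placeholder = "{" + slot + "}"
        if placeholder in template:
            # One pair per slot type; a repeated placeholder reuses the same pair.
            written_form, spoken_form = rng.choice(pairs)
            written = written.replace(placeholder, written_form)
            spoken = spoken.replace(placeholder, spoken_form)
    return written, spoken


written, spoken = fill_template("Um {TIME} kostet das Ticket {MONEY}.", random.Random(0))
print(written)
print(spoken)
```

Because the same template string produces both sides, the surrounding words are identical by construction and only the slot contents differ between input and target.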
168
+ ### Data Composition
169
+
170
+ | Category | Count | % | Description |
171
+ |----------|-------|---|-------------|
172
+ | MIXED (multi-entity) | 434k | 14.1% | Sentences with 2-3 different entity types |
173
+ | Cardinal numbers | 373k | 12.1% | `1.523` → `eintausendfünfhundertdreiundzwanzig` |
174
+ | Money | 313k | 10.2% | `12,50 €` → `zwölf Euro fünfzig` |
175
+ | Written dates | 192k | 6.3% | `3. April 2026` → `dritter April...` |
176
+ | **Plain (passthrough)** | 166k | 5.4% | Clean text → unchanged output |
177
+ | Decimals | 161k | 5.3% | `3,14` → `drei Komma eins vier` |
178
+ | Time | 160k | 5.2% | `14:30` → `vierzehn Uhr dreißig` |
179
+ | Units | 145k | 4.7% | `3,5 kg` → `drei Komma fünf Kilogramm` |
180
+ | **Already normalized** | 145k | 4.7% | Already-spoken forms → unchanged |
181
+ | Percentages | 133k | 4.3% | `15,3%` → `fünfzehn Komma drei Prozent` |
182
+ | Ordinals | 123k | 4.0% | `3.` → `dritte` |
183
+ | Phone (local) | 108k | 3.5% | `030 12345678` → digit-by-digit |
184
+ | Numeric dates | 106k | 3.5% | `03.04.2026` → `dritter April...` |
185
+ | Years | 95k | 3.1% | `2026` → `zweitausendsechsundzwanzig` |
186
+ | Ranges | 81k | 2.7% | `10–15` → `zehn bis fünfzehn` |
187
+ | Phone (international) | 80k | 2.6% | `+49 30 12345678` → `plus vier neun...` |
188
+ | Email | 54k | 1.8% | Spelled out |
189
+ | Version | 53k | 1.7% | `v2.3.1` → `Version zwei Punkt drei Punkt eins` |
190
+ | Fractions | 51k | 1.7% | `3/4` → `drei Viertel` |
191
+ | URL | 27k | 0.9% | Spelled out |
192
+ | Abbreviations | 27k | 0.9% | `z.B.` → `zum Beispiel` |
193
+ | Postal codes | 27k | 0.9% | Digit-by-digit |
194
+ | **Don't normalize** | 8k | 0.3% | `Python 3.10`, `HTTP 404` → pass through |
195
+
196
+ ### Per-Language Distribution
197
+
198
+ | Language | Train pairs | Real-world data |
199
+ |----------|------------|-----------------|
200
+ | German (DE) | 373k | Leipzig + ECB (NeMo labeled) |
201
+ | English (EN) | 379k | Leipzig + ECB + NeMo eval pairs |
202
+ | Spanish (ES) | 364k | Leipzig + ECB (NeMo labeled) |
203
+ | French (FR) | 359k | Leipzig + ECB (NeMo labeled) |
204
+ | Italian (IT) | 359k | Leipzig + ECB (NeMo labeled) |
205
+ | Portuguese (PT) | 308k | Synthetic only (NeMo doesn't support PT) |
206
+ | Swedish (SV) | 477k | Leipzig (NeMo labeled), 1.5x oversampled |
207
+ | Turkish (TR) | 447k | Synthetic only (NeMo doesn't support TR), 1.5x oversampled |
208
+
209
+ Turkish and Swedish are oversampled 1.5x because ByT5's pretraining data has less coverage for these languages.
210
+
211
+ ## Evaluation Results
212
+
213
+ ### Internal Test Set (1,900 stratified sample)
214
+
215
+ **Overall: 94.2% sentence-level exact match accuracy**
216
+
217
+ | Language | Accuracy |
218
+ |----------|----------|
219
+ | Swedish | 96.7% |
220
+ | French | 95.4% |
221
+ | Turkish | 94.8% |
222
+ | Spanish | 94.2% |
223
+ | English | 93.8% |
224
+ | Italian | 93.3% |
225
+ | German | 92.9% |
226
+ | Portuguese | 92.6% |
227
+
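Sentence-level exact match counts a prediction as correct only if the whole output string equals the reference. A minimal sketch of the metric follows. The README does not say whether whitespace or case is normalized before comparison, so this version strips only surrounding whitespace, and the function name `sentence_accuracy` is an assumption.

```python
def sentence_accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Assumption: only leading/trailing whitespace is ignored.
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


predictions = ["Das kostet zwölf Euro fünfzig.", "Am dritte April"]
references = ["Das kostet zwölf Euro fünfzig.", "Am dritten April"]
print(sentence_accuracy(predictions, references))
```

A single wrong character anywhere in a sentence makes that sentence count as an error, which is one reason multi-entity sentences score much lower than single-entity ones under this metric.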
228
+ #### Per-Class Accuracy (selected)
229
+
230
+ | Class | Accuracy | Notes |
231
+ |-------|----------|-------|
232
+ | Cardinal | 97.5% | |
233
+ | Money | 97.5% | |
234
+ | Year | 97.5% | |
235
+ | Time | 96.2% | |
236
+ | Ordinal | 96.2% | |
237
+ | Passthrough (plain) | 100% | Correctly leaves clean text unchanged |
238
+ | Already normalized | 93.8% | |
239
+ | Don't normalize | 97.5% | Correctly preserves technical identifiers |
240
+ | Phone numbers | 90.0% | Some formatting differences |
241
+ | Multi-entity (MIXED) | 51.2% | Complex sentences with 2-3 entities |
242
+
243
+ ### Apple PolyNorm Benchmark (1,620 pairs, 5 languages)
244
+
245
+ **22.6% sentence accuracy** — lower due to:
246
+ - About 30% of PolyNorm test cases cover classes we intentionally dropped (ACRONYM, HASHTAG, ROMAN, IBAN, SCORE)
247
+ - Unseen date format variants (abbreviated months, slash dates, 2-digit years)
248
+ - German case/declension differences
249
+
250
+ This benchmark measures generalization to unseen formats, which is an area for improvement in future versions.
251
+
252
+ ## Inference
253
+
254
+ ### Input Format
255
+
256
+ The model expects a **language prefix** followed by the text:
257
+
258
+ ```
259
+ <de> Das kostet 12,50 €.
260
+ <en> The price is $99.99.
261
+ <fr> Le prix est de 327,67 €.
262
+ ```
263
+
264
+ The language prefix is **required** — it tells the model which normalization rules to apply. The same written form can normalize differently per language (e.g., `1.500` → German "eintausendfünfhundert" vs English "one point five").
265
+
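A small helper can enforce the prefix format before text reaches the model. The sketch below mirrors the `f"<{language}> {text}"` pattern from Quick Start. The names `build_input` and `fits_in_context` are hypothetical, and the length check assumes that the prefix is tokenized byte by byte (no language tags appear in `added_tokens.json`) and that the tokenizer appends one end-of-sequence token.

```python
SUPPORTED_LANGUAGES = ("de", "en", "fr", "it", "pt", "es", "tr", "sv")
MAX_INPUT_BYTES = 512  # max input length from the Architecture table


def build_input(text: str, language: str) -> str:
    """Prefix `text` with its language tag, e.g. '<de> ...' (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return f"<{language}> {text}"


def fits_in_context(model_input: str) -> bool:
    """True if the UTF-8 bytes plus one end-of-sequence token fit in 512 positions."""
    return len(model_input.encode("utf-8")) + 1 <= MAX_INPUT_BYTES


example = build_input("Das kostet 12,50 €.", "de")
print(example, fits_in_context(example))
```

Inputs that fail the length check would otherwise be silently cut off by `truncation=True`, so splitting long responses into sentences before normalization is the safer option.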
266
+ ### CPU Inference (PyTorch)
267
+
268
+ | Metric | Value |
269
+ |--------|-------|
270
+ | Average latency | 600ms per sentence |
271
+ | Passthrough (no entities) | 300ms |
272
+ | Complex multi-entity | 800-1300ms |
273
+ | Model size | 2.2 GB |
274
+ | RAM usage | ~3 GB |
275
+
276
+ **ByT5 is inherently slow on CPU** because it generates one byte at a time. A 30-character output requires at least 30 autoregressive decoder steps (more when the text contains multi-byte characters such as `ö` or `€`), and each step is a forward pass through the decoder, while the encoder runs once per input. This is the fundamental trade-off for byte-level accuracy.
277
+
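The step count can be estimated directly from the output text. This sketch assumes greedy decoding that emits one token per UTF-8 byte plus a final end-of-sequence token; `decoder_steps` is an illustrative name, not part of any library.

```python
def decoder_steps(output_text: str) -> int:
    """Estimated autoregressive steps: one per UTF-8 byte, plus one for </s>."""
    return len(output_text.encode("utf-8")) + 1


spoken = "zwölf Euro fünfzig"
print(len(spoken), "characters")
print(len(spoken.encode("utf-8")), "bytes")
print(decoder_steps(spoken), "decoder steps")
```

For this example the 18 characters become 20 bytes, because `ö` and `ü` take two bytes each, so the decoder runs 21 steps. Spoken forms are usually longer than their written input, which is why latency grows with the number of entities in a sentence.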
278
+ ### CPU Inference (CTranslate2 INT8)
279
+
280
+ | Metric | Value |
281
+ |--------|-------|
282
+ | Average latency (single) | 248ms |
283
+ | Average latency (batch of 10) | 120ms per sentence |
284
+ | Passthrough | 165ms |
285
+ | Model size | 556 MB |
286
+ | RAM usage | ~1 GB |
287
+
288
+ CTranslate2 with INT8 quantization provides a **~2.5x speedup** over PyTorch and a **4x smaller** model. Quality is largely preserved, with minor token mapping artifacts on special characters.
289
+
290
+ ```python
291
+ # One-time conversion to CTranslate2 INT8 (run in a shell, not in Python):
292
+ #   pip install ctranslate2
293
+ #   ct2-transformers-converter --model smaoai/saytext \
294
+ #     --output_dir ct2_int8 --quantization int8
295
+
296
+ # Usage
297
+ import ctranslate2
298
+ from transformers import AutoTokenizer
299
+
300
+ translator = ctranslate2.Translator("ct2_int8", device="cpu", intra_threads=4)
301
+ tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
302
+
303
+ text = "<de> Das kostet 12,50 €."
304
+ tokens = tokenizer(text, return_tensors=None)["input_ids"]
305
+ token_strs = tokenizer.convert_ids_to_tokens(tokens)  # one token string per byte; decoding single bytes loses multi-byte characters
306
+
307
+ result = translator.translate_batch([token_strs], beam_size=1, max_decoding_length=512)
308
+ output_ids = tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])
309
+ print(tokenizer.decode(output_ids, skip_special_tokens=True))
310
+ ```
311
+
312
+ ### GPU Inference
313
+
314
+ | Setup | Single sentence | Batch of 8 | Model size |
315
+ |-------|----------------|------------|------------|
316
+ | PyTorch FP32, A100 | ~280ms | ~63ms/sent | 2.2 GB |
317
+ | CTranslate2 FP16, A100 | ~15-30ms | ~8ms/sent | 1.1 GB |
318
+ | CTranslate2 FP16, T4 | ~30-50ms | ~15ms/sent | 1.1 GB |
319
+
320
+ **For production TTS pipelines, a small GPU (T4, L4) is recommended.** With CTranslate2 FP16 on a T4 ($0.20/hr on cloud), you get <50ms latency — well within real-time TTS requirements.
321
+
322
+ ### Latency vs. Quality Summary
323
+
324
+ ```
325
+ Latency Quality Cost
326
+ ─────── ─────── ────
327
+ PyTorch CPU 600ms/sent Best Free
328
+ CT2 INT8 CPU 248ms/sent Good* Free
329
+ CT2 FP16 T4 GPU 30-50ms/sent Best $0.20/hr
330
+ CT2 FP16 A100 GPU 15-30ms/sent Best $1.50/hr
331
+
332
+ * Minor token mapping artifacts on special characters (€, ₺)
333
+ ```
334
+
335
+ ## Supported Languages
336
+
337
+ | Code | Language | Example Input | Example Output |
338
+ |------|----------|---------------|----------------|
339
+ | `de` | German | `Das kostet 12,50 €.` | `Das kostet zwölf Euro fünfzig.` |
340
+ | `en` | English | `The flight departs at 8:45 AM.` | `The flight departs at eight forty-five AM.` |
341
+ | `fr` | French | `Le prix est de 327,67 €.` | `Le prix est de trois cent vingt-sept euros et soixante-sept centimes.` |
342
+ | `it` | Italian | `La riunione è alle 14:30.` | `La riunione è alle quattordici e trenta.` |
343
+ | `pt` | Portuguese | `Em 2023, o crescimento foi notável.` | `Em dois mil e vinte e três, o crescimento foi notável.` |
344
+ | `es` | Spanish | `La fecha es el 20 de enero de 2025.` | `La fecha es el veinte de enero de dos mil veinticinco.` |
345
+ | `tr` | Turkish | `Toplam 250 kişi katıldı.` | `Toplam ikiyüzelli kişi katıldı.` |
346
+ | `sv` | Swedish | `Det väger 2,5 kg.` | `Det väger två komma fem kilogram.` |
347
+
348
+ ## Limitations & Known Issues
349
+
350
+ ### Current Limitations
351
+
352
+ 1. **Multi-entity accuracy is 51%** — sentences with 2-3 different entity types (date + time + money) sometimes produce errors on one of the entities.
353
+
354
+ 2. **Date format variants** — the model handles standard formats (`03.04.2026`, `March 15, 2026`) well but struggles with abbreviated months (`5. Nov. 1990`), slash dates (`3/4/2024`), and 2-digit years (`'23`).
355
+
356
+ 3. **German declension** — sometimes produces nominative case (`fünfter`) instead of the correct accusative/dative (`fünften`). This is a systematic issue with the training data.
357
+
358
+ 4. **CPU latency** — 250-600ms per sentence on CPU. Not suitable for real-time applications without GPU. See inference section for alternatives.
359
+
360
+ 5. **Currency symbol verbalization** — the `€` symbol is sometimes passed through instead of being verbalized as "Euro". More consistent with spelled-out currency codes.
361
+
362
+ 6. **Training not complete** — the model was trained for 1.29 epochs out of 3 planned. While already performant, further training would improve accuracy, especially on edge cases.
363
+
364
+ ### Not Designed For
365
+
366
+ - **Raw user input** — the model is optimized for well-formatted LLM output, not OCR text or social media posts with typos
367
+ - **Acronym spelling** — `BMW → B M W` is intentionally not included (TTS engines handle this natively via SSML)
368
+ - **Hashtags, IBANs, Roman numerals, sports scores** — dropped from training as low-priority for voice agent use cases
369
+
370
+ ### Recommended Post-Processing
371
+
372
+ For production use, add a post-validation step:
373
+
374
+ ```python
375
+ import re
376
+
377
+ def post_validate(input_text: str, output_text: str) -> bool:
378
+     """Check if the model output is safe for TTS."""
379
+     # Flag if output still contains raw digits (possible normalization failure; preserved identifiers such as "Python 3.10" are flagged too)
380
+     if re.search(r"\d", output_text):
381
+         return False  # Fall back to rule-based normalization
382
+     # Flag if output is suspiciously short or long
383
+     ratio = len(output_text) / max(len(input_text), 1)
384
+     if ratio > 5.0 or ratio < 0.3:
385
+         return False
386
+     return True
387
+ ```
388
+
389
+ ## Training Details
390
+
391
+ | Parameter | Value |
392
+ |-----------|-------|
393
+ | Base model | google/byt5-base |
394
+ | Training pairs | 3,067,205 |
395
+ | Validation pairs | 170,317 |
396
+ | Effective batch size | 128 (16 per device × 8 accumulation) |
397
+ | Learning rate | 3e-4 (cosine schedule) |
398
+ | Precision | bf16 |
399
+ | Hardware | NVIDIA A100 80GB SXM |
400
+ | Training time | ~13 hours (31,000 steps, 1.29 epochs) |
401
+ | Best eval loss | 0.000897 |
402
+ | Framework | HuggingFace Transformers 5.5 + PyTorch 2.6 |
403
+
404
+ ## Intended Use
405
+
406
+ This model is designed for **text-to-speech preprocessing** in voice agent / conversational AI pipelines:
407
+
408
+ ```
409
+ User speaks → ASR → LLM generates response → Text Normalizer → TTS speaks
410
+ ```
411
+
412
+ The text normalizer converts written forms (numbers, dates, currencies, etc.) in the LLM's response to spoken forms that the TTS engine can pronounce naturally.
413
+
414
+ ## License
415
+
416
+ This model is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) — free for research and non-commercial use.
417
+
418
+ **For commercial licensing**, contact us at [business@smao.ai](mailto:business@smao.ai).
419
+
420
+ ## About
421
+
422
+ Built by [SMAO](https://smao.ai) — Michael Müller and team.
423
+
424
+ For questions, issues, or commercial inquiries: [business@smao.ai](mailto:business@smao.ai)
425
+
426
+ ## Citation
427
+
428
+ ```bibtex
429
+ @misc{smao-byt5-tn-2026,
430
+ title={SayText: Multilingual Text Normalization for TTS},
431
+ author={Michael Müller and SMAO AI},
432
+ year={2026},
433
+ url={https://huggingface.co/smaoai/saytext}
434
+ }
435
+ ```
__pycache__/handler.cpython-311.pyc ADDED
Binary file (2.77 kB). View file
 
added_tokens.json ADDED
@@ -0,0 +1,127 @@
1
+ {
2
+ "<extra_id_0>": 259,
3
+ "<extra_id_100>": 359,
4
+ "<extra_id_101>": 360,
5
+ "<extra_id_102>": 361,
6
+ "<extra_id_103>": 362,
7
+ "<extra_id_104>": 363,
8
+ "<extra_id_105>": 364,
9
+ "<extra_id_106>": 365,
10
+ "<extra_id_107>": 366,
11
+ "<extra_id_108>": 367,
12
+ "<extra_id_109>": 368,
13
+ "<extra_id_10>": 269,
14
+ "<extra_id_110>": 369,
15
+ "<extra_id_111>": 370,
16
+ "<extra_id_112>": 371,
17
+ "<extra_id_113>": 372,
18
+ "<extra_id_114>": 373,
19
+ "<extra_id_115>": 374,
20
+ "<extra_id_116>": 375,
21
+ "<extra_id_117>": 376,
22
+ "<extra_id_118>": 377,
23
+ "<extra_id_119>": 378,
24
+ "<extra_id_11>": 270,
25
+ "<extra_id_120>": 379,
26
+ "<extra_id_121>": 380,
27
+ "<extra_id_122>": 381,
28
+ "<extra_id_123>": 382,
29
+ "<extra_id_124>": 383,
30
+ "<extra_id_12>": 271,
31
+ "<extra_id_13>": 272,
32
+ "<extra_id_14>": 273,
33
+ "<extra_id_15>": 274,
34
+ "<extra_id_16>": 275,
35
+ "<extra_id_17>": 276,
36
+ "<extra_id_18>": 277,
37
+ "<extra_id_19>": 278,
38
+ "<extra_id_1>": 260,
39
+ "<extra_id_20>": 279,
40
+ "<extra_id_21>": 280,
41
+ "<extra_id_22>": 281,
42
+ "<extra_id_23>": 282,
43
+ "<extra_id_24>": 283,
44
+ "<extra_id_25>": 284,
45
+ "<extra_id_26>": 285,
46
+ "<extra_id_27>": 286,
47
+ "<extra_id_28>": 287,
48
+ "<extra_id_29>": 288,
49
+ "<extra_id_2>": 261,
50
+ "<extra_id_30>": 289,
51
+ "<extra_id_31>": 290,
52
+ "<extra_id_32>": 291,
53
+ "<extra_id_33>": 292,
54
+ "<extra_id_34>": 293,
55
+ "<extra_id_35>": 294,
56
+ "<extra_id_36>": 295,
57
+ "<extra_id_37>": 296,
58
+ "<extra_id_38>": 297,
59
+ "<extra_id_39>": 298,
60
+ "<extra_id_3>": 262,
61
+ "<extra_id_40>": 299,
62
+ "<extra_id_41>": 300,
63
+ "<extra_id_42>": 301,
64
+ "<extra_id_43>": 302,
65
+ "<extra_id_44>": 303,
66
+ "<extra_id_45>": 304,
67
+ "<extra_id_46>": 305,
68
+ "<extra_id_47>": 306,
69
+ "<extra_id_48>": 307,
70
+ "<extra_id_49>": 308,
71
+ "<extra_id_4>": 263,
72
+ "<extra_id_50>": 309,
73
+ "<extra_id_51>": 310,
74
+ "<extra_id_52>": 311,
75
+ "<extra_id_53>": 312,
76
+ "<extra_id_54>": 313,
77
+ "<extra_id_55>": 314,
78
+ "<extra_id_56>": 315,
79
+ "<extra_id_57>": 316,
80
+ "<extra_id_58>": 317,
81
+ "<extra_id_59>": 318,
82
+ "<extra_id_5>": 264,
83
+ "<extra_id_60>": 319,
84
+ "<extra_id_61>": 320,
85
+ "<extra_id_62>": 321,
86
+ "<extra_id_63>": 322,
87
+ "<extra_id_64>": 323,
88
+ "<extra_id_65>": 324,
89
+ "<extra_id_66>": 325,
90
+ "<extra_id_67>": 326,
91
+ "<extra_id_68>": 327,
92
+ "<extra_id_69>": 328,
93
+ "<extra_id_6>": 265,
94
+ "<extra_id_70>": 329,
95
+ "<extra_id_71>": 330,
96
+ "<extra_id_72>": 331,
97
+ "<extra_id_73>": 332,
98
+ "<extra_id_74>": 333,
99
+ "<extra_id_75>": 334,
100
+ "<extra_id_76>": 335,
101
+ "<extra_id_77>": 336,
102
+ "<extra_id_78>": 337,
103
+ "<extra_id_79>": 338,
104
+ "<extra_id_7>": 266,
105
+ "<extra_id_80>": 339,
106
+ "<extra_id_81>": 340,
107
+ "<extra_id_82>": 341,
108
+ "<extra_id_83>": 342,
109
+ "<extra_id_84>": 343,
110
+ "<extra_id_85>": 344,
111
+ "<extra_id_86>": 345,
112
+ "<extra_id_87>": 346,
113
+ "<extra_id_88>": 347,
114
+ "<extra_id_89>": 348,
115
+ "<extra_id_8>": 267,
116
+ "<extra_id_90>": 349,
117
+ "<extra_id_91>": 350,
118
+ "<extra_id_92>": 351,
119
+ "<extra_id_93>": 352,
120
+ "<extra_id_94>": 353,
121
+ "<extra_id_95>": 354,
122
+ "<extra_id_96>": 355,
123
+ "<extra_id_97>": 356,
124
+ "<extra_id_98>": 357,
125
+ "<extra_id_99>": 358,
126
+ "<extra_id_9>": 268
127
+ }
config.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "architectures": [
3
+ "T5ForConditionalGeneration"
4
+ ],
5
+ "classifier_dropout": 0.0,
6
+ "d_ff": 3968,
7
+ "d_kv": 64,
8
+ "d_model": 1536,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "gelu_new",
11
+ "dropout_rate": 0.1,
12
+ "dtype": "float32",
13
+ "eos_token_id": 1,
14
+ "feed_forward_proj": "gated-gelu",
15
+ "gradient_checkpointing": false,
16
+ "initializer_factor": 1.0,
17
+ "is_decoder": false,
18
+ "is_encoder_decoder": true,
19
+ "is_gated_act": true,
20
+ "layer_norm_epsilon": 1e-06,
21
+ "model_type": "t5",
22
+ "num_decoder_layers": 6,
23
+ "num_heads": 12,
24
+ "num_layers": 18,
25
+ "output_past": true,
26
+ "pad_token_id": 0,
27
+ "relative_attention_max_distance": 128,
28
+ "relative_attention_num_buckets": 32,
29
+ "scale_decoder_outputs": false,
30
+ "tie_word_embeddings": true,
31
+ "tokenizer_class": "ByT5Tokenizer",
32
+ "transformers_version": "5.5.0",
33
+ "use_cache": false,
34
+ "vocab_size": 384
35
+ }
ct2_float16/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "add_source_bos": false,
3
+ "add_source_eos": false,
4
+ "bos_token": "<pad>",
5
+ "decoder_start_token": "<pad>",
6
+ "eos_token": "</s>",
7
+ "layer_norm_epsilon": null,
8
+ "multi_query_attention": false,
9
+ "unk_token": "<unk>"
10
+ }
ct2_float16/model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:16ce8b9fa5b53f6cc041b18c0801bfa6e5c065fb1a0a0c467395a647406b3d13
3
+ size 1163324406
ct2_float16/shared_vocabulary.json ADDED
@@ -0,0 +1,386 @@
1
+ [
2
+ "<pad>",
3
+ "</s>",
4
+ "<unk>",
5
+ "\u0000",
6
+ "\u0001",
7
+ "\u0002",
8
+ "\u0003",
9
+ "\u0004",
10
+ "\u0005",
11
+ "\u0006",
12
+ "\u0007",
13
+ "\b",
14
+ "\t",
15
+ "\n",
16
+ "\u000b",
17
+ "\f",
18
+ "\r",
19
+ "\u000e",
20
+ "\u000f",
21
+ "\u0010",
22
+ "\u0011",
23
+ "\u0012",
24
+ "\u0013",
25
+ "\u0014",
26
+ "\u0015",
27
+ "\u0016",
28
+ "\u0017",
29
+ "\u0018",
30
+ "\u0019",
31
+ "\u001a",
32
+ "\u001b",
33
+ "\u001c",
34
+ "\u001d",
35
+ "\u001e",
36
+ "\u001f",
37
+ " ",
38
+ "!",
39
+ "\"",
40
+ "#",
41
+ "$",
42
+ "%",
43
+ "&",
44
+ "'",
45
+ "(",
46
+ ")",
47
+ "*",
48
+ "+",
49
+ ",",
50
+ "-",
51
+ ".",
52
+ "/",
53
+ "0",
54
+ "1",
55
+ "2",
56
+ "3",
57
+ "4",
58
+ "5",
59
+ "6",
60
+ "7",
61
+ "8",
62
+ "9",
63
+ ":",
64
+ ";",
65
+ "<",
66
+ "=",
67
+ ">",
68
+ "?",
69
+ "@",
70
+ "A",
71
+ "B",
72
+ "C",
73
+ "D",
74
+ "E",
75
+ "F",
76
+ "G",
77
+ "H",
78
+ "I",
79
+ "J",
80
+ "K",
81
+ "L",
82
+ "M",
83
+ "N",
84
+ "O",
85
+ "P",
86
+ "Q",
87
+ "R",
88
+ "S",
89
+ "T",
90
+ "U",
91
+ "V",
92
+ "W",
93
+ "X",
94
+ "Y",
95
+ "Z",
96
+ "[",
97
+ "\\",
98
+ "]",
99
+ "^",
100
+ "_",
101
+ "`",
102
+ "a",
103
+ "b",
104
+ "c",
105
+ "d",
106
+ "e",
107
+ "f",
108
+ "g",
109
+ "h",
110
+ "i",
111
+ "j",
112
+ "k",
113
+ "l",
114
+ "m",
115
+ "n",
116
+ "o",
117
+ "p",
118
+ "q",
119
+ "r",
120
+ "s",
121
+ "t",
122
+ "u",
123
+ "v",
124
+ "w",
125
+ "x",
126
+ "y",
127
+ "z",
128
+ "{",
129
+ "|",
130
+ "}",
131
+ "~",
132
+ "\u007f",
133
+ "\u0080",
134
+ "\u0081",
135
+ "\u0082",
136
+ "\u0083",
137
+ "\u0084",
138
+ "\u0085",
139
+ "\u0086",
140
+ "\u0087",
141
+ "\u0088",
142
+ "\u0089",
143
+ "\u008a",
144
+ "\u008b",
145
+ "\u008c",
146
+ "\u008d",
147
+ "\u008e",
148
+ "\u008f",
149
+ "\u0090",
150
+ "\u0091",
151
+ "\u0092",
152
+ "\u0093",
153
+ "\u0094",
154
+ "\u0095",
155
+ "\u0096",
156
+ "\u0097",
157
+ "\u0098",
158
+ "\u0099",
159
+ "\u009a",
160
+ "\u009b",
161
+ "\u009c",
162
+ "\u009d",
163
+ "\u009e",
164
+ "\u009f",
165
+ "\u00a0",
166
+ "\u00a1",
167
+ "\u00a2",
168
+ "\u00a3",
169
+ "\u00a4",
170
+ "\u00a5",
171
+ "\u00a6",
172
+ "\u00a7",
173
+ "\u00a8",
174
+ "\u00a9",
175
+ "\u00aa",
176
+ "\u00ab",
177
+ "\u00ac",
178
+ "\u00ad",
179
+ "\u00ae",
180
+ "\u00af",
181
+ "\u00b0",
182
+ "\u00b1",
183
+ "\u00b2",
184
+ "\u00b3",
185
+ "\u00b4",
186
+ "\u00b5",
187
+ "\u00b6",
188
+ "\u00b7",
189
+ "\u00b8",
190
+ "\u00b9",
191
+ "\u00ba",
192
+ "\u00bb",
193
+ "\u00bc",
194
+ "\u00bd",
195
+ "\u00be",
196
+ "\u00bf",
197
+ "\u00c0",
198
+ "\u00c1",
199
+ "\u00c2",
200
+ "\u00c3",
201
+ "\u00c4",
202
+ "\u00c5",
203
+ "\u00c6",
204
+ "\u00c7",
205
+ "\u00c8",
206
+ "\u00c9",
207
+ "\u00ca",
208
+ "\u00cb",
209
+ "\u00cc",
210
+ "\u00cd",
211
+ "\u00ce",
212
+ "\u00cf",
213
+ "\u00d0",
214
+ "\u00d1",
215
+ "\u00d2",
216
+ "\u00d3",
217
+ "\u00d4",
218
+ "\u00d5",
219
+ "\u00d6",
220
+ "\u00d7",
221
+ "\u00d8",
222
+ "\u00d9",
223
+ "\u00da",
224
+ "\u00db",
225
+ "\u00dc",
226
+ "\u00dd",
227
+ "\u00de",
228
+ "\u00df",
229
+ "\u00e0",
230
+ "\u00e1",
231
+ "\u00e2",
232
+ "\u00e3",
233
+ "\u00e4",
234
+ "\u00e5",
235
+ "\u00e6",
236
+ "\u00e7",
237
+ "\u00e8",
238
+ "\u00e9",
239
+ "\u00ea",
240
+ "\u00eb",
241
+ "\u00ec",
242
+ "\u00ed",
243
+ "\u00ee",
244
+ "\u00ef",
245
+ "\u00f0",
246
+ "\u00f1",
247
+ "\u00f2",
248
+ "\u00f3",
249
+ "\u00f4",
250
+ "\u00f5",
251
+ "\u00f6",
252
+ "\u00f7",
253
+ "\u00f8",
254
+ "\u00f9",
255
+ "\u00fa",
256
+ "\u00fb",
257
+ "\u00fc",
258
+ "\u00fd",
259
+ "\u00fe",
260
+ "\u00ff",
261
+ "<extra_id_0>",
262
+ "<extra_id_1>",
263
+ "<extra_id_2>",
264
+ "<extra_id_3>",
265
+ "<extra_id_4>",
266
+ "<extra_id_5>",
267
+ "<extra_id_6>",
268
+ "<extra_id_7>",
269
+ "<extra_id_8>",
270
+ "<extra_id_9>",
271
+ "<extra_id_10>",
272
+ "<extra_id_11>",
273
+ "<extra_id_12>",
274
+ "<extra_id_13>",
275
+ "<extra_id_14>",
276
+ "<extra_id_15>",
277
+ "<extra_id_16>",
278
+ "<extra_id_17>",
279
+ "<extra_id_18>",
280
+ "<extra_id_19>",
281
+ "<extra_id_20>",
282
+ "<extra_id_21>",
283
+ "<extra_id_22>",
284
+ "<extra_id_23>",
285
+ "<extra_id_24>",
286
+ "<extra_id_25>",
287
+ "<extra_id_26>",
288
+ "<extra_id_27>",
289
+ "<extra_id_28>",
290
+ "<extra_id_29>",
291
+ "<extra_id_30>",
292
+ "<extra_id_31>",
293
+ "<extra_id_32>",
294
+ "<extra_id_33>",
295
+ "<extra_id_34>",
296
+ "<extra_id_35>",
297
+ "<extra_id_36>",
298
+ "<extra_id_37>",
299
+ "<extra_id_38>",
300
+ "<extra_id_39>",
301
+ "<extra_id_40>",
302
+ "<extra_id_41>",
303
+ "<extra_id_42>",
304
+ "<extra_id_43>",
305
+ "<extra_id_44>",
306
+ "<extra_id_45>",
307
+ "<extra_id_46>",
308
+ "<extra_id_47>",
309
+ "<extra_id_48>",
310
+ "<extra_id_49>",
311
+ "<extra_id_50>",
312
+ "<extra_id_51>",
313
+ "<extra_id_52>",
314
+ "<extra_id_53>",
315
+ "<extra_id_54>",
316
+ "<extra_id_55>",
317
+ "<extra_id_56>",
318
+ "<extra_id_57>",
319
+ "<extra_id_58>",
320
+ "<extra_id_59>",
321
+ "<extra_id_60>",
322
+ "<extra_id_61>",
323
+ "<extra_id_62>",
324
+ "<extra_id_63>",
325
+ "<extra_id_64>",
326
+ "<extra_id_65>",
327
+ "<extra_id_66>",
328
+ "<extra_id_67>",
329
+ "<extra_id_68>",
330
+ "<extra_id_69>",
331
+ "<extra_id_70>",
332
+ "<extra_id_71>",
333
+ "<extra_id_72>",
334
+ "<extra_id_73>",
335
+ "<extra_id_74>",
336
+ "<extra_id_75>",
337
+ "<extra_id_76>",
338
+ "<extra_id_77>",
339
+ "<extra_id_78>",
340
+ "<extra_id_79>",
341
+ "<extra_id_80>",
342
+ "<extra_id_81>",
343
+ "<extra_id_82>",
344
+ "<extra_id_83>",
345
+ "<extra_id_84>",
346
+ "<extra_id_85>",
347
+ "<extra_id_86>",
348
+ "<extra_id_87>",
349
+ "<extra_id_88>",
350
+ "<extra_id_89>",
351
+ "<extra_id_90>",
352
+ "<extra_id_91>",
353
+ "<extra_id_92>",
354
+ "<extra_id_93>",
355
+ "<extra_id_94>",
356
+ "<extra_id_95>",
357
+ "<extra_id_96>",
358
+ "<extra_id_97>",
359
+ "<extra_id_98>",
360
+ "<extra_id_99>",
361
+ "<extra_id_100>",
362
+ "<extra_id_101>",
363
+ "<extra_id_102>",
364
+ "<extra_id_103>",
365
+ "<extra_id_104>",
366
+ "<extra_id_105>",
367
+ "<extra_id_106>",
368
+ "<extra_id_107>",
369
+ "<extra_id_108>",
370
+ "<extra_id_109>",
371
+ "<extra_id_110>",
372
+ "<extra_id_111>",
373
+ "<extra_id_112>",
374
+ "<extra_id_113>",
375
+ "<extra_id_114>",
376
+ "<extra_id_115>",
377
+ "<extra_id_116>",
378
+ "<extra_id_117>",
379
+ "<extra_id_118>",
380
+ "<extra_id_119>",
381
+ "<extra_id_120>",
382
+ "<extra_id_121>",
383
+ "<extra_id_122>",
384
+ "<extra_id_123>",
385
+ "<extra_id_124>"
386
+ ]
ct2_int8/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<pad>",
+   "decoder_start_token": "<pad>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": null,
+   "multi_query_attention": false,
+   "unk_token": "<unk>"
+ }
ct2_int8/model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9b288eabb289e368558bba02334d7d6c200609388cdc36fc19becb6d7a61fed
+ size 583313054
ct2_int8/shared_vocabulary.json ADDED
@@ -0,0 +1,386 @@
1
+ [
2
+ "<pad>",
3
+ "</s>",
4
+ "<unk>",
5
+ "\u0000",
6
+ "\u0001",
7
+ "\u0002",
8
+ "\u0003",
9
+ "\u0004",
10
+ "\u0005",
11
+ "\u0006",
12
+ "\u0007",
13
+ "\b",
14
+ "\t",
15
+ "\n",
16
+ "\u000b",
17
+ "\f",
18
+ "\r",
19
+ "\u000e",
20
+ "\u000f",
21
+ "\u0010",
22
+ "\u0011",
23
+ "\u0012",
24
+ "\u0013",
25
+ "\u0014",
26
+ "\u0015",
27
+ "\u0016",
28
+ "\u0017",
29
+ "\u0018",
30
+ "\u0019",
31
+ "\u001a",
32
+ "\u001b",
33
+ "\u001c",
34
+ "\u001d",
35
+ "\u001e",
36
+ "\u001f",
37
+ " ",
38
+ "!",
39
+ "\"",
40
+ "#",
41
+ "$",
42
+ "%",
43
+ "&",
44
+ "'",
45
+ "(",
46
+ ")",
47
+ "*",
48
+ "+",
49
+ ",",
50
+ "-",
51
+ ".",
52
+ "/",
53
+ "0",
54
+ "1",
55
+ "2",
56
+ "3",
57
+ "4",
58
+ "5",
59
+ "6",
60
+ "7",
61
+ "8",
62
+ "9",
63
+ ":",
64
+ ";",
65
+ "<",
66
+ "=",
67
+ ">",
68
+ "?",
69
+ "@",
70
+ "A",
71
+ "B",
72
+ "C",
73
+ "D",
74
+ "E",
75
+ "F",
76
+ "G",
77
+ "H",
78
+ "I",
79
+ "J",
80
+ "K",
81
+ "L",
82
+ "M",
83
+ "N",
84
+ "O",
85
+ "P",
86
+ "Q",
87
+ "R",
88
+ "S",
89
+ "T",
90
+ "U",
91
+ "V",
92
+ "W",
93
+ "X",
94
+ "Y",
95
+ "Z",
96
+ "[",
97
+ "\\",
98
+ "]",
99
+ "^",
100
+ "_",
101
+ "`",
102
+ "a",
103
+ "b",
104
+ "c",
105
+ "d",
106
+ "e",
107
+ "f",
108
+ "g",
109
+ "h",
110
+ "i",
111
+ "j",
112
+ "k",
113
+ "l",
114
+ "m",
115
+ "n",
116
+ "o",
117
+ "p",
118
+ "q",
119
+ "r",
120
+ "s",
121
+ "t",
122
+ "u",
123
+ "v",
124
+ "w",
125
+ "x",
126
+ "y",
127
+ "z",
128
+ "{",
129
+ "|",
130
+ "}",
131
+ "~",
132
+ "\u007f",
133
+ "\u0080",
134
+ "\u0081",
135
+ "\u0082",
136
+ "\u0083",
137
+ "\u0084",
138
+ "\u0085",
139
+ "\u0086",
140
+ "\u0087",
141
+ "\u0088",
142
+ "\u0089",
143
+ "\u008a",
144
+ "\u008b",
145
+ "\u008c",
146
+ "\u008d",
147
+ "\u008e",
148
+ "\u008f",
149
+ "\u0090",
150
+ "\u0091",
151
+ "\u0092",
152
+ "\u0093",
153
+ "\u0094",
154
+ "\u0095",
155
+ "\u0096",
156
+ "\u0097",
157
+ "\u0098",
158
+ "\u0099",
159
+ "\u009a",
160
+ "\u009b",
161
+ "\u009c",
162
+ "\u009d",
163
+ "\u009e",
164
+ "\u009f",
165
+ "\u00a0",
166
+ "\u00a1",
167
+ "\u00a2",
168
+ "\u00a3",
169
+ "\u00a4",
170
+ "\u00a5",
171
+ "\u00a6",
172
+ "\u00a7",
173
+ "\u00a8",
174
+ "\u00a9",
175
+ "\u00aa",
176
+ "\u00ab",
177
+ "\u00ac",
178
+ "\u00ad",
179
+ "\u00ae",
180
+ "\u00af",
181
+ "\u00b0",
182
+ "\u00b1",
183
+ "\u00b2",
184
+ "\u00b3",
185
+ "\u00b4",
186
+ "\u00b5",
187
+ "\u00b6",
188
+ "\u00b7",
189
+ "\u00b8",
190
+ "\u00b9",
191
+ "\u00ba",
192
+ "\u00bb",
193
+ "\u00bc",
194
+ "\u00bd",
195
+ "\u00be",
196
+ "\u00bf",
197
+ "\u00c0",
198
+ "\u00c1",
199
+ "\u00c2",
200
+ "\u00c3",
201
+ "\u00c4",
202
+ "\u00c5",
203
+ "\u00c6",
204
+ "\u00c7",
205
+ "\u00c8",
206
+ "\u00c9",
207
+ "\u00ca",
208
+ "\u00cb",
209
+ "\u00cc",
210
+ "\u00cd",
211
+ "\u00ce",
212
+ "\u00cf",
213
+ "\u00d0",
214
+ "\u00d1",
215
+ "\u00d2",
216
+ "\u00d3",
217
+ "\u00d4",
218
+ "\u00d5",
219
+ "\u00d6",
220
+ "\u00d7",
221
+ "\u00d8",
222
+ "\u00d9",
223
+ "\u00da",
224
+ "\u00db",
225
+ "\u00dc",
226
+ "\u00dd",
227
+ "\u00de",
228
+ "\u00df",
229
+ "\u00e0",
230
+ "\u00e1",
231
+ "\u00e2",
232
+ "\u00e3",
233
+ "\u00e4",
234
+ "\u00e5",
235
+ "\u00e6",
236
+ "\u00e7",
237
+ "\u00e8",
238
+ "\u00e9",
239
+ "\u00ea",
240
+ "\u00eb",
241
+ "\u00ec",
242
+ "\u00ed",
243
+ "\u00ee",
244
+ "\u00ef",
245
+ "\u00f0",
246
+ "\u00f1",
247
+ "\u00f2",
248
+ "\u00f3",
249
+ "\u00f4",
250
+ "\u00f5",
251
+ "\u00f6",
252
+ "\u00f7",
253
+ "\u00f8",
254
+ "\u00f9",
255
+ "\u00fa",
256
+ "\u00fb",
257
+ "\u00fc",
258
+ "\u00fd",
259
+ "\u00fe",
260
+ "\u00ff",
261
+ "<extra_id_0>",
262
+ "<extra_id_1>",
263
+ "<extra_id_2>",
264
+ "<extra_id_3>",
265
+ "<extra_id_4>",
266
+ "<extra_id_5>",
267
+ "<extra_id_6>",
268
+ "<extra_id_7>",
269
+ "<extra_id_8>",
270
+ "<extra_id_9>",
271
+ "<extra_id_10>",
272
+ "<extra_id_11>",
273
+ "<extra_id_12>",
274
+ "<extra_id_13>",
275
+ "<extra_id_14>",
276
+ "<extra_id_15>",
277
+ "<extra_id_16>",
278
+ "<extra_id_17>",
279
+ "<extra_id_18>",
280
+ "<extra_id_19>",
281
+ "<extra_id_20>",
282
+ "<extra_id_21>",
283
+ "<extra_id_22>",
284
+ "<extra_id_23>",
285
+ "<extra_id_24>",
286
+ "<extra_id_25>",
287
+ "<extra_id_26>",
288
+ "<extra_id_27>",
289
+ "<extra_id_28>",
290
+ "<extra_id_29>",
291
+ "<extra_id_30>",
292
+ "<extra_id_31>",
293
+ "<extra_id_32>",
294
+ "<extra_id_33>",
295
+ "<extra_id_34>",
296
+ "<extra_id_35>",
297
+ "<extra_id_36>",
298
+ "<extra_id_37>",
299
+ "<extra_id_38>",
300
+ "<extra_id_39>",
301
+ "<extra_id_40>",
302
+ "<extra_id_41>",
303
+ "<extra_id_42>",
304
+ "<extra_id_43>",
305
+ "<extra_id_44>",
306
+ "<extra_id_45>",
307
+ "<extra_id_46>",
308
+ "<extra_id_47>",
309
+ "<extra_id_48>",
310
+ "<extra_id_49>",
311
+ "<extra_id_50>",
312
+ "<extra_id_51>",
313
+ "<extra_id_52>",
314
+ "<extra_id_53>",
315
+ "<extra_id_54>",
316
+ "<extra_id_55>",
317
+ "<extra_id_56>",
318
+ "<extra_id_57>",
319
+ "<extra_id_58>",
320
+ "<extra_id_59>",
321
+ "<extra_id_60>",
322
+ "<extra_id_61>",
323
+ "<extra_id_62>",
324
+ "<extra_id_63>",
325
+ "<extra_id_64>",
326
+ "<extra_id_65>",
327
+ "<extra_id_66>",
328
+ "<extra_id_67>",
329
+ "<extra_id_68>",
330
+ "<extra_id_69>",
331
+ "<extra_id_70>",
332
+ "<extra_id_71>",
333
+ "<extra_id_72>",
334
+ "<extra_id_73>",
335
+ "<extra_id_74>",
336
+ "<extra_id_75>",
337
+ "<extra_id_76>",
338
+ "<extra_id_77>",
339
+ "<extra_id_78>",
340
+ "<extra_id_79>",
341
+ "<extra_id_80>",
342
+ "<extra_id_81>",
343
+ "<extra_id_82>",
344
+ "<extra_id_83>",
345
+ "<extra_id_84>",
346
+ "<extra_id_85>",
347
+ "<extra_id_86>",
348
+ "<extra_id_87>",
349
+ "<extra_id_88>",
350
+ "<extra_id_89>",
351
+ "<extra_id_90>",
352
+ "<extra_id_91>",
353
+ "<extra_id_92>",
354
+ "<extra_id_93>",
355
+ "<extra_id_94>",
356
+ "<extra_id_95>",
357
+ "<extra_id_96>",
358
+ "<extra_id_97>",
359
+ "<extra_id_98>",
360
+ "<extra_id_99>",
361
+ "<extra_id_100>",
362
+ "<extra_id_101>",
363
+ "<extra_id_102>",
364
+ "<extra_id_103>",
365
+ "<extra_id_104>",
366
+ "<extra_id_105>",
367
+ "<extra_id_106>",
368
+ "<extra_id_107>",
369
+ "<extra_id_108>",
370
+ "<extra_id_109>",
371
+ "<extra_id_110>",
372
+ "<extra_id_111>",
373
+ "<extra_id_112>",
374
+ "<extra_id_113>",
375
+ "<extra_id_114>",
376
+ "<extra_id_115>",
377
+ "<extra_id_116>",
378
+ "<extra_id_117>",
379
+ "<extra_id_118>",
380
+ "<extra_id_119>",
381
+ "<extra_id_120>",
382
+ "<extra_id_121>",
383
+ "<extra_id_122>",
384
+ "<extra_id_123>",
385
+ "<extra_id_124>"
386
+ ]
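The vocabulary above has 384 entries: three special tokens (`<pad>`, `</s>`, `<unk>`), the 256 byte values, and 125 `<extra_id_N>` sentinels. The following is a minimal sketch of the byte-to-id mapping this layout implies, assuming the usual ByT5 convention that a UTF-8 byte `b` maps to id `b + 3` and that `</s>` (id 1) is appended. The helper names `encode_byt5` and `extra_id` are hypothetical and are not part of this repository.

```python
# Illustrative only: mirrors the layout of shared_vocabulary.json
# (ids 0-2 special, ids 3-258 raw bytes, ids 259-383 <extra_id_N>).
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
BYTE_OFFSET = 3  # the first byte token ("\u0000") sits at index 3
FIRST_EXTRA_ID = 259  # index of <extra_id_0> in the list above


def encode_byt5(text: str, add_eos: bool = True) -> list:
    """Map text to byte-level token ids (hypothetical helper)."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def extra_id(n: int) -> int:
    """Return the id of <extra_id_n> in this vocabulary (0 <= n <= 124)."""
    if not 0 <= n <= 124:
        raise ValueError("this vocabulary defines <extra_id_0> .. <extra_id_124>")
    return FIRST_EXTRA_ID + n
```

Because every UTF-8 byte is its own token, a character such as `€` costs three ids, which matters when inputs are truncated to a fixed token budget.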
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "decoder_start_token_id": 0,
+   "eos_token_id": [
+     1
+   ],
+   "pad_token_id": 0,
+   "transformers_version": "5.5.0"
+ }
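The generation config uses id 0 both as the decoder start token and as padding, and id 1 as end-of-sequence. The following sketch shows how a generated id sequence could be turned back into text under those settings, assuming the ByT5 byte offset of 3. The helper `decode_byt5` is hypothetical; in practice the tokenizer's own `decode` method does this work.

```python
DECODER_START_ID = 0  # same id as <pad> in generation_config.json
EOS_ID = 1
BYTE_OFFSET = 3
NUM_BYTES = 256


def decode_byt5(ids: list) -> str:
    """Turn generated ids back into text (hypothetical helper)."""
    raw = bytearray()
    for token_id in ids:
        if token_id == EOS_ID:
            break  # everything after </s> is padding
        if BYTE_OFFSET <= token_id < BYTE_OFFSET + NUM_BYTES:
            raw.append(token_id - BYTE_OFFSET)
        # ids 0-2 and the <extra_id_N> range carry no bytes and are skipped
    # Invalid or truncated UTF-8 sequences are dropped rather than raising.
    return raw.decode("utf-8", errors="ignore")
```

The leading 0 emitted as the decoder start token falls outside the byte range, so it is skipped without special handling.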
handler.py ADDED
@@ -0,0 +1,45 @@
+ """HuggingFace Inference API handler for text normalization.
+
+ This enables the model to work with the HuggingFace Inference API
+ and the `text2text-generation` pipeline.
+ """
+
+ import torch
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+
+ class EndpointHandler:
+     def __init__(self, path: str = ""):
+         # Weights come from the repository path; the tokenizer is the stock ByT5 one.
+         self.model = AutoModelForSeq2SeqLM.from_pretrained(path)
+         self.tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
+         self.model.eval()
+
+     def __call__(self, data):
+         """Handle inference request.
+
+         Expected input format:
+             {"inputs": "<de> Das kostet 12,50 €."}
+         or:
+             {"inputs": "Das kostet 12,50 €.", "parameters": {"language": "de"}}
+         """
+         inputs = data.get("inputs", "")
+         # Treat a missing or null "parameters" field as an empty dict.
+         params = data.get("parameters") or {}
+
+         # If language is passed separately, add the prefix
+         if not inputs.startswith("<") and "language" in params:
+             inputs = f"<{params['language']}> {inputs}"
+
+         # ByT5 tokenizes bytes, so max_length counts UTF-8 bytes, not characters.
+         tokenized = self.tokenizer(
+             inputs, return_tensors="pt", max_length=512, truncation=True
+         )
+
+         with torch.no_grad():
+             output = self.model.generate(
+                 **tokenized,
+                 max_new_tokens=params.get("max_new_tokens", 512),
+                 num_beams=params.get("num_beams", 1),
+             )
+
+         result = self.tokenizer.decode(output[0], skip_special_tokens=True)
+         return [{"generated_text": result}]
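The handler decides whether to add a language tag by checking whether the input starts with `<`. The following sketch isolates that step so it can be exercised without loading the model. It is a variation, not the handler's exact logic: it assumes a tag is always a two-letter code from the model card's language list followed by whitespace, so text that merely begins with `<` still receives a prefix. The names `add_language_prefix` and `SUPPORTED_LANGUAGES` are hypothetical.

```python
import re
from typing import Optional

# Language codes listed in the model card metadata.
SUPPORTED_LANGUAGES = {"de", "en", "fr", "it", "pt", "es", "tr", "sv"}
# Assumed tag shape: "<xx>" followed by whitespace, e.g. "<de> ".
_PREFIX_RE = re.compile(r"^<([a-z]{2})>\s")


def add_language_prefix(text: str, params: Optional[dict] = None) -> str:
    """Prepend "<lang> " unless the text already carries a known tag."""
    params = params or {}
    match = _PREFIX_RE.match(text)
    if match and match.group(1) in SUPPORTED_LANGUAGES:
        return text  # already tagged, leave untouched
    language = params.get("language")
    if language is None:
        return text  # no language given, so nothing to add
    return f"<{language}> {text}"


print(add_language_prefix("Das kostet 12,50 €.", {"language": "de"}))
```

With this stricter check, an input such as `"<3 Uhr ist zu früh."` is still tagged, whereas a plain `startswith("<")` test would pass it to the model without a language prefix.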
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6d134b0cf7cd66a6b5d8bf0a9bc678ddc2179c2d453ea2a00bad27ce7591029a
+ size 2326643632
tokenizer_config.json ADDED
@@ -0,0 +1,1290 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<pad>",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "</s>",
13
+ "lstrip": false,
14
+ "normalized": true,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "<unk>",
21
+ "lstrip": false,
22
+ "normalized": true,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "259": {
28
+ "content": "<extra_id_0>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "260": {
36
+ "content": "<extra_id_1>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "261": {
44
+ "content": "<extra_id_2>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "262": {
52
+ "content": "<extra_id_3>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "263": {
60
+ "content": "<extra_id_4>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "264": {
68
+ "content": "<extra_id_5>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "265": {
76
+ "content": "<extra_id_6>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "266": {
84
+ "content": "<extra_id_7>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "267": {
92
+ "content": "<extra_id_8>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "268": {
100
+ "content": "<extra_id_9>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "269": {
108
+ "content": "<extra_id_10>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "270": {
116
+ "content": "<extra_id_11>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "271": {
124
+ "content": "<extra_id_12>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "272": {
132
+ "content": "<extra_id_13>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "273": {
140
+ "content": "<extra_id_14>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ },
147
+ "274": {
148
+ "content": "<extra_id_15>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": true
154
+ },
155
+ "275": {
156
+ "content": "<extra_id_16>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "276": {
164
+ "content": "<extra_id_17>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "277": {
172
+ "content": "<extra_id_18>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": true
178
+ },
179
+ "278": {
180
+ "content": "<extra_id_19>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": true
186
+ },
187
+ "279": {
188
+ "content": "<extra_id_20>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": true
194
+ },
195
+ "280": {
196
+ "content": "<extra_id_21>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "281": {
204
+ "content": "<extra_id_22>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "282": {
212
+ "content": "<extra_id_23>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "283": {
220
+ "content": "<extra_id_24>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "284": {
228
+ "content": "<extra_id_25>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "285": {
236
+ "content": "<extra_id_26>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "286": {
244
+ "content": "<extra_id_27>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "287": {
252
+ "content": "<extra_id_28>",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "288": {
260
+ "content": "<extra_id_29>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "289": {
268
+ "content": "<extra_id_30>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": true
274
+ },
275
+ "290": {
276
+ "content": "<extra_id_31>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": true
282
+ },
283
+ "291": {
284
+ "content": "<extra_id_32>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": true
290
+ },
291
+ "292": {
292
+ "content": "<extra_id_33>",
293
+ "lstrip": false,
294
+ "normalized": false,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": true
298
+ },
299
+ "293": {
300
+ "content": "<extra_id_34>",
301
+ "lstrip": false,
302
+ "normalized": false,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": true
306
+ },
307
+ "294": {
308
+ "content": "<extra_id_35>",
309
+ "lstrip": false,
310
+ "normalized": false,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": true
314
+ },
315
+ "295": {
316
+ "content": "<extra_id_36>",
317
+ "lstrip": false,
318
+ "normalized": false,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": true
322
+ },
323
+ "296": {
324
+ "content": "<extra_id_37>",
325
+ "lstrip": false,
326
+ "normalized": false,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": true
330
+ },
331
+ "297": {
332
+ "content": "<extra_id_38>",
333
+ "lstrip": false,
334
+ "normalized": false,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": true
338
+ },
339
+ "298": {
340
+ "content": "<extra_id_39>",
341
+ "lstrip": false,
342
+ "normalized": false,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": true
346
+ },
347
+ "299": {
348
+ "content": "<extra_id_40>",
349
+ "lstrip": false,
350
+ "normalized": false,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": true
354
+ },
355
+ "300": {
356
+ "content": "<extra_id_41>",
357
+ "lstrip": false,
358
+ "normalized": false,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": true
362
+ },
363
+ "301": {
364
+ "content": "<extra_id_42>",
365
+ "lstrip": false,
366
+ "normalized": false,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": true
370
+ },
371
+ "302": {
372
+ "content": "<extra_id_43>",
373
+ "lstrip": false,
374
+ "normalized": false,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": true
378
+ },
379
+ "303": {
380
+ "content": "<extra_id_44>",
381
+ "lstrip": false,
382
+ "normalized": false,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": true
386
+ },
387
+ "304": {
388
+ "content": "<extra_id_45>",
389
+ "lstrip": false,
390
+ "normalized": false,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": true
394
+ },
395
+ "305": {
396
+ "content": "<extra_id_46>",
397
+ "lstrip": false,
398
+ "normalized": false,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": true
402
+ },
403
+ "306": {
404
+ "content": "<extra_id_47>",
405
+ "lstrip": false,
406
+ "normalized": false,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": true
410
+ },
411
+ "307": {
412
+ "content": "<extra_id_48>",
413
+ "lstrip": false,
414
+ "normalized": false,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": true
418
+ },
419
+ "308": {
420
+ "content": "<extra_id_49>",
421
+ "lstrip": false,
422
+ "normalized": false,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": true
426
+ },
427
+ "309": {
428
+ "content": "<extra_id_50>",
429
+ "lstrip": false,
430
+ "normalized": false,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": true
434
+ },
435
+ "310": {
436
+ "content": "<extra_id_51>",
437
+ "lstrip": false,
438
+ "normalized": false,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": true
442
+ },
443
+ "311": {
444
+ "content": "<extra_id_52>",
445
+ "lstrip": false,
446
+ "normalized": false,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": true
450
+ },
451
+ "312": {
452
+ "content": "<extra_id_53>",
453
+ "lstrip": false,
454
+ "normalized": false,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": true
458
+ },
459
+ "313": {
460
+ "content": "<extra_id_54>",
461
+ "lstrip": false,
462
+ "normalized": false,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": true
466
+ },
467
+ "314": {
468
+ "content": "<extra_id_55>",
469
+ "lstrip": false,
470
+ "normalized": false,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": true
474
+ },
475
+ "315": {
476
+ "content": "<extra_id_56>",
477
+ "lstrip": false,
478
+ "normalized": false,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": true
482
+ },
483
+ "316": {
484
+ "content": "<extra_id_57>",
485
+ "lstrip": false,
486
+ "normalized": false,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": true
490
+ },
491
+ "317": {
492
+ "content": "<extra_id_58>",
493
+ "lstrip": false,
494
+ "normalized": false,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": true
498
+ },
499
+ "318": {
500
+ "content": "<extra_id_59>",
501
+ "lstrip": false,
502
+ "normalized": false,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": true
506
+ },
507
+ "319": {
508
+ "content": "<extra_id_60>",
509
+ "lstrip": false,
510
+ "normalized": false,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": true
514
+ },
515
+ "320": {
516
+ "content": "<extra_id_61>",
517
+ "lstrip": false,
518
+ "normalized": false,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": true
522
+ },
523
+ "321": {
524
+ "content": "<extra_id_62>",
525
+ "lstrip": false,
526
+ "normalized": false,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": true
530
+ },
531
+ "322": {
532
+ "content": "<extra_id_63>",
533
+ "lstrip": false,
534
+ "normalized": false,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": true
538
+ },
539
+ "323": {
540
+ "content": "<extra_id_64>",
541
+ "lstrip": false,
542
+ "normalized": false,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": true
546
+ },
547
+ "324": {
548
+ "content": "<extra_id_65>",
549
+ "lstrip": false,
550
+ "normalized": false,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": true
554
+ },
555
+ "325": {
556
+ "content": "<extra_id_66>",
557
+ "lstrip": false,
558
+ "normalized": false,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": true
562
+ },
563
+ "326": {
564
+ "content": "<extra_id_67>",
565
+ "lstrip": false,
566
+ "normalized": false,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": true
570
+ },
571
+ "327": {
572
+ "content": "<extra_id_68>",
573
+ "lstrip": false,
574
+ "normalized": false,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": true
578
+ },
579
+ "328": {
580
+ "content": "<extra_id_69>",
581
+ "lstrip": false,
582
+ "normalized": false,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": true
586
+ },
587
+ "329": {
588
+ "content": "<extra_id_70>",
589
+ "lstrip": false,
590
+ "normalized": false,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": true
594
+ },
595
+ "330": {
596
+ "content": "<extra_id_71>",
597
+ "lstrip": false,
598
+ "normalized": false,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": true
602
+ },
603
+ "331": {
604
+ "content": "<extra_id_72>",
605
+ "lstrip": false,
606
+ "normalized": false,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": true
610
+ },
611
+ "332": {
612
+ "content": "<extra_id_73>",
613
+ "lstrip": false,
614
+ "normalized": false,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": true
618
+ },
619
+ "333": {
620
+ "content": "<extra_id_74>",
621
+ "lstrip": false,
622
+ "normalized": false,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": true
626
+ },
627
+ "334": {
628
+ "content": "<extra_id_75>",
629
+ "lstrip": false,
630
+ "normalized": false,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": true
634
+ },
635
+ "335": {
636
+ "content": "<extra_id_76>",
637
+ "lstrip": false,
638
+ "normalized": false,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": true
642
+ },
643
+ "336": {
644
+ "content": "<extra_id_77>",
645
+ "lstrip": false,
646
+ "normalized": false,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": true
650
+ },
651
+ "337": {
652
+ "content": "<extra_id_78>",
653
+ "lstrip": false,
654
+ "normalized": false,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": true
658
+ },
659
+ "338": {
660
+ "content": "<extra_id_79>",
661
+ "lstrip": false,
662
+ "normalized": false,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": true
666
+ },
667
+ "339": {
668
+ "content": "<extra_id_80>",
669
+ "lstrip": false,
670
+ "normalized": false,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": true
674
+ },
675
+ "340": {
676
+ "content": "<extra_id_81>",
677
+ "lstrip": false,
678
+ "normalized": false,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": true
682
+ },
683
+ "341": {
684
+ "content": "<extra_id_82>",
685
+ "lstrip": false,
686
+ "normalized": false,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": true
690
+ },
691
+ "342": {
692
+ "content": "<extra_id_83>",
693
+ "lstrip": false,
694
+ "normalized": false,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": true
698
+ },
699
+ "343": {
700
+ "content": "<extra_id_84>",
701
+ "lstrip": false,
702
+ "normalized": false,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": true
706
+ },
707
+ "344": {
708
+ "content": "<extra_id_85>",
709
+ "lstrip": false,
710
+ "normalized": false,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": true
714
+ },
715
+ "345": {
716
+ "content": "<extra_id_86>",
717
+ "lstrip": false,
718
+ "normalized": false,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": true
722
+ },
723
+ "346": {
724
+ "content": "<extra_id_87>",
725
+ "lstrip": false,
726
+ "normalized": false,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": true
730
+ },
731
+ "347": {
732
+ "content": "<extra_id_88>",
733
+ "lstrip": false,
734
+ "normalized": false,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": true
738
+ },
739
+ "348": {
740
+ "content": "<extra_id_89>",
741
+ "lstrip": false,
742
+ "normalized": false,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": true
746
+ },
747
+ "349": {
748
+ "content": "<extra_id_90>",
749
+ "lstrip": false,
750
+ "normalized": false,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": true
754
+ },
755
+ "350": {
756
+ "content": "<extra_id_91>",
757
+ "lstrip": false,
758
+ "normalized": false,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": true
762
+ },
763
+ "351": {
764
+ "content": "<extra_id_92>",
765
+ "lstrip": false,
766
+ "normalized": false,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": true
770
+ },
771
+ "352": {
772
+ "content": "<extra_id_93>",
773
+ "lstrip": false,
774
+ "normalized": false,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": true
778
+ },
779
+ "353": {
780
+ "content": "<extra_id_94>",
781
+ "lstrip": false,
782
+ "normalized": false,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": true
786
+ },
787
+ "354": {
788
+ "content": "<extra_id_95>",
789
+ "lstrip": false,
790
+ "normalized": false,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": true
794
+ },
795
+ "355": {
796
+ "content": "<extra_id_96>",
797
+ "lstrip": false,
798
+ "normalized": false,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": true
802
+ },
803
+ "356": {
804
+ "content": "<extra_id_97>",
805
+ "lstrip": false,
806
+ "normalized": false,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": true
810
+ },
811
+ "357": {
812
+ "content": "<extra_id_98>",
813
+ "lstrip": false,
814
+ "normalized": false,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": true
818
+ },
819
+ "358": {
820
+ "content": "<extra_id_99>",
821
+ "lstrip": false,
822
+ "normalized": false,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": true
826
+ },
827
+ "359": {
828
+ "content": "<extra_id_100>",
829
+ "lstrip": false,
830
+ "normalized": false,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": true
834
+ },
835
+ "360": {
836
+ "content": "<extra_id_101>",
837
+ "lstrip": false,
838
+ "normalized": false,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": true
842
+ },
843
+ "361": {
+ "content": "<extra_id_102>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "362": {
+ "content": "<extra_id_103>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "363": {
+ "content": "<extra_id_104>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "364": {
+ "content": "<extra_id_105>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "365": {
+ "content": "<extra_id_106>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "366": {
+ "content": "<extra_id_107>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "367": {
+ "content": "<extra_id_108>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "368": {
+ "content": "<extra_id_109>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "369": {
+ "content": "<extra_id_110>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "370": {
+ "content": "<extra_id_111>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "371": {
+ "content": "<extra_id_112>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "372": {
+ "content": "<extra_id_113>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "373": {
+ "content": "<extra_id_114>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "374": {
+ "content": "<extra_id_115>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "375": {
+ "content": "<extra_id_116>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "376": {
+ "content": "<extra_id_117>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "377": {
+ "content": "<extra_id_118>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "378": {
+ "content": "<extra_id_119>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "379": {
+ "content": "<extra_id_120>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "380": {
+ "content": "<extra_id_121>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "381": {
+ "content": "<extra_id_122>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "382": {
+ "content": "<extra_id_123>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "383": {
+ "content": "<extra_id_124>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "additional_special_tokens": [
+ "<extra_id_0>",
+ "<extra_id_1>",
+ "<extra_id_2>",
+ "<extra_id_3>",
+ "<extra_id_4>",
+ "<extra_id_5>",
+ "<extra_id_6>",
+ "<extra_id_7>",
+ "<extra_id_8>",
+ "<extra_id_9>",
+ "<extra_id_10>",
+ "<extra_id_11>",
+ "<extra_id_12>",
+ "<extra_id_13>",
+ "<extra_id_14>",
+ "<extra_id_15>",
+ "<extra_id_16>",
+ "<extra_id_17>",
+ "<extra_id_18>",
+ "<extra_id_19>",
+ "<extra_id_20>",
+ "<extra_id_21>",
+ "<extra_id_22>",
+ "<extra_id_23>",
+ "<extra_id_24>",
+ "<extra_id_25>",
+ "<extra_id_26>",
+ "<extra_id_27>",
+ "<extra_id_28>",
+ "<extra_id_29>",
+ "<extra_id_30>",
+ "<extra_id_31>",
+ "<extra_id_32>",
+ "<extra_id_33>",
+ "<extra_id_34>",
+ "<extra_id_35>",
+ "<extra_id_36>",
+ "<extra_id_37>",
+ "<extra_id_38>",
+ "<extra_id_39>",
+ "<extra_id_40>",
+ "<extra_id_41>",
+ "<extra_id_42>",
+ "<extra_id_43>",
+ "<extra_id_44>",
+ "<extra_id_45>",
+ "<extra_id_46>",
+ "<extra_id_47>",
+ "<extra_id_48>",
+ "<extra_id_49>",
+ "<extra_id_50>",
+ "<extra_id_51>",
+ "<extra_id_52>",
+ "<extra_id_53>",
+ "<extra_id_54>",
+ "<extra_id_55>",
+ "<extra_id_56>",
+ "<extra_id_57>",
+ "<extra_id_58>",
+ "<extra_id_59>",
+ "<extra_id_60>",
+ "<extra_id_61>",
+ "<extra_id_62>",
+ "<extra_id_63>",
+ "<extra_id_64>",
+ "<extra_id_65>",
+ "<extra_id_66>",
+ "<extra_id_67>",
+ "<extra_id_68>",
+ "<extra_id_69>",
+ "<extra_id_70>",
+ "<extra_id_71>",
+ "<extra_id_72>",
+ "<extra_id_73>",
+ "<extra_id_74>",
+ "<extra_id_75>",
+ "<extra_id_76>",
+ "<extra_id_77>",
+ "<extra_id_78>",
+ "<extra_id_79>",
+ "<extra_id_80>",
+ "<extra_id_81>",
+ "<extra_id_82>",
+ "<extra_id_83>",
+ "<extra_id_84>",
+ "<extra_id_85>",
+ "<extra_id_86>",
+ "<extra_id_87>",
+ "<extra_id_88>",
+ "<extra_id_89>",
+ "<extra_id_90>",
+ "<extra_id_91>",
+ "<extra_id_92>",
+ "<extra_id_93>",
+ "<extra_id_94>",
+ "<extra_id_95>",
+ "<extra_id_96>",
+ "<extra_id_97>",
+ "<extra_id_98>",
+ "<extra_id_99>",
+ "<extra_id_100>",
+ "<extra_id_101>",
+ "<extra_id_102>",
+ "<extra_id_103>",
+ "<extra_id_104>",
+ "<extra_id_105>",
+ "<extra_id_106>",
+ "<extra_id_107>",
+ "<extra_id_108>",
+ "<extra_id_109>",
+ "<extra_id_110>",
+ "<extra_id_111>",
+ "<extra_id_112>",
+ "<extra_id_113>",
+ "<extra_id_114>",
+ "<extra_id_115>",
+ "<extra_id_116>",
+ "<extra_id_117>",
+ "<extra_id_118>",
+ "<extra_id_119>",
+ "<extra_id_120>",
+ "<extra_id_121>",
+ "<extra_id_122>",
+ "<extra_id_123>",
+ "<extra_id_124>"
+ ],
+ "backend": "custom",
+ "eos_token": "</s>",
+ "extra_ids": 0,
+ "extra_special_tokens": [
+ "<extra_id_0>",
+ "<extra_id_1>",
+ "<extra_id_2>",
+ "<extra_id_3>",
+ "<extra_id_4>",
+ "<extra_id_5>",
+ "<extra_id_6>",
+ "<extra_id_7>",
+ "<extra_id_8>",
+ "<extra_id_9>",
+ "<extra_id_10>",
+ "<extra_id_11>",
+ "<extra_id_12>",
+ "<extra_id_13>",
+ "<extra_id_14>",
+ "<extra_id_15>",
+ "<extra_id_16>",
+ "<extra_id_17>",
+ "<extra_id_18>",
+ "<extra_id_19>",
+ "<extra_id_20>",
+ "<extra_id_21>",
+ "<extra_id_22>",
+ "<extra_id_23>",
+ "<extra_id_24>",
+ "<extra_id_25>",
+ "<extra_id_26>",
+ "<extra_id_27>",
+ "<extra_id_28>",
+ "<extra_id_29>",
+ "<extra_id_30>",
+ "<extra_id_31>",
+ "<extra_id_32>",
+ "<extra_id_33>",
+ "<extra_id_34>",
+ "<extra_id_35>",
+ "<extra_id_36>",
+ "<extra_id_37>",
+ "<extra_id_38>",
+ "<extra_id_39>",
+ "<extra_id_40>",
+ "<extra_id_41>",
+ "<extra_id_42>",
+ "<extra_id_43>",
+ "<extra_id_44>",
+ "<extra_id_45>",
+ "<extra_id_46>",
+ "<extra_id_47>",
+ "<extra_id_48>",
+ "<extra_id_49>",
+ "<extra_id_50>",
+ "<extra_id_51>",
+ "<extra_id_52>",
+ "<extra_id_53>",
+ "<extra_id_54>",
+ "<extra_id_55>",
+ "<extra_id_56>",
+ "<extra_id_57>",
+ "<extra_id_58>",
+ "<extra_id_59>",
+ "<extra_id_60>",
+ "<extra_id_61>",
+ "<extra_id_62>",
+ "<extra_id_63>",
+ "<extra_id_64>",
+ "<extra_id_65>",
+ "<extra_id_66>",
+ "<extra_id_67>",
+ "<extra_id_68>",
+ "<extra_id_69>",
+ "<extra_id_70>",
+ "<extra_id_71>",
+ "<extra_id_72>",
+ "<extra_id_73>",
+ "<extra_id_74>",
+ "<extra_id_75>",
+ "<extra_id_76>",
+ "<extra_id_77>",
+ "<extra_id_78>",
+ "<extra_id_79>",
+ "<extra_id_80>",
+ "<extra_id_81>",
+ "<extra_id_82>",
+ "<extra_id_83>",
+ "<extra_id_84>",
+ "<extra_id_85>",
+ "<extra_id_86>",
+ "<extra_id_87>",
+ "<extra_id_88>",
+ "<extra_id_89>",
+ "<extra_id_90>",
+ "<extra_id_91>",
+ "<extra_id_92>",
+ "<extra_id_93>",
+ "<extra_id_94>",
+ "<extra_id_95>",
+ "<extra_id_96>",
+ "<extra_id_97>",
+ "<extra_id_98>",
+ "<extra_id_99>",
+ "<extra_id_100>",
+ "<extra_id_101>",
+ "<extra_id_102>",
+ "<extra_id_103>",
+ "<extra_id_104>",
+ "<extra_id_105>",
+ "<extra_id_106>",
+ "<extra_id_107>",
+ "<extra_id_108>",
+ "<extra_id_109>",
+ "<extra_id_110>",
+ "<extra_id_111>",
+ "<extra_id_112>",
+ "<extra_id_113>",
+ "<extra_id_114>",
+ "<extra_id_115>",
+ "<extra_id_116>",
+ "<extra_id_117>",
+ "<extra_id_118>",
+ "<extra_id_119>",
+ "<extra_id_120>",
+ "<extra_id_121>",
+ "<extra_id_122>",
+ "<extra_id_123>",
+ "<extra_id_124>"
+ ],
+ "is_local": false,
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "<pad>",
+ "tokenizer_class": "ByT5Tokenizer",
+ "unk_token": "<unk>"
+ }