NickMarino, hamdallah committed on
Commit 79bee84 · 0 Parent(s)

Duplicate from hamdallah/Sofelia-TTS

Co-authored-by: jodah <hamdallah@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,37 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
Sofelia.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,301 @@
---
language:
- ar
- en
license: apache-2.0
tags:
- text-to-speech
- tts
- audio
- speech
- palestinian-arabic
- arabic
- voice-cloning
- miratts
- sofelia
base_model: YatharthS/MiraTTS
library_name: transformers
pipeline_tag: text-to-speech
---

<div style="text-align: center;">
<h1>🇵🇸 Sofelia-TTS 🇵🇸</h1>
<p><strong>Palestinian Arabic Text-to-Speech Model</strong></p>
<p><em>Palestine will be free</em> 🕊️</p>
<p><img style="margin: auto; width: 500px" src="https://huggingface.co/hamdallah/Sofelia-TTS/resolve/main/Sofelia.png" /></p>
</div>

---

## 🌟 Model Description

**Sofelia-TTS** is a fine-tuned Text-to-Speech (TTS) model trained specifically for the **Palestinian Arabic dialect**. This model brings the sounds of Palestinian speech to AI, preserving and celebrating the linguistic heritage of Palestine.

Built on top of [YatharthS/MiraTTS](https://huggingface.co/YatharthS/MiraTTS), Sofelia-TTS captures the phonetic characteristics, intonation patterns, and prosody unique to Palestinian Arabic, making it ideal for:

- 🎙️ **Voice cloning** with Palestinian Arabic speech
- 📚 **Audiobook generation** in Palestinian dialect
- 🗣️ **Virtual assistants** that speak authentic Palestinian Arabic
- 🎓 **Educational tools** for learning and preserving the Palestinian dialect
- 🎬 **Content creation** for Palestinian media and storytelling

> **Dedicated to Palestine**: This model is a tribute to the resilience, culture, and spirit of the Palestinian people. May their voices be heard loud and clear across the world. 🇵🇸

---

## 🎯 Key Features

- ✅ **High-quality voice cloning**: Clone any voice with just a few seconds of reference audio
- ✅ **Palestinian Arabic dialect**: Authentic pronunciation and intonation
- ✅ **Fast inference**: Optimized for real-time generation
- ✅ **Flexible context**: Supports variable-length reference audio
- ✅ **Open source**: Free to use and improve

---

## 📊 Model Details

| **Attribute** | **Value** |
|---------------|-----------|
| **Model Type** | Text-to-Speech (TTS) |
| **Base Model** | YatharthS/MiraTTS |
| **Architecture** | Transformer-based Language Model + Audio Codec |
| **Training Language** | Palestinian Arabic (ar-PS) |
| **Dataset** | Private Dataset |
| **Sample Rate** | 16,000 Hz (input) / 48,000 Hz (generated output) |
| **License** | Apache 2.0 |
| **Model Size** | ~0.6B parameters |
| **Precision** | BF16/FP32 |
| **Framework** | PyTorch + Transformers |

---

## 🚀 Quick Start

### Installation

```bash
# Install required packages
uv pip install git+https://github.com/ysharma3501/MiraTTS.git
```

### Usage (Python)

```python
from mira.model import MiraTTS
from IPython.display import Audio

mira_tts = MiraTTS('hamdallah/Sofelia-TTS')  # downloads the model from the Hugging Face Hub

file = "reference_file.wav"  # mp3/wav/ogg, or anything librosa supports
text = "مرحبا، كيف الحال؟ هذا نموذج للهجة الفلسطينية."

context_tokens = mira_tts.encode_audio(file)
audio = mira_tts.generate(text, context_tokens)

Audio(audio, rate=48000)
```
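The tips section below recommends a reference clip of roughly 3-10 seconds. For WAV files, the clip length can be checked with the Python standard library alone; `reference_duration_s` is a hypothetical helper, not part of the MiraTTS API:

```python
import wave

def reference_duration_s(path: str) -> float:
    """Return the duration of a WAV file in seconds (stdlib-only sketch)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```

For other formats, librosa (which MiraTTS already uses for loading) can report duration as well.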

---

## 🎤 Example Prompts

Try these Palestinian Arabic phrases:

```python
# Greetings
"مرحبا، كيف حالك؟"  # Hello, how are you?
"أهلا وسهلا فيك"  # Welcome

# Common expressions
"يا سلام، هذا رائع"  # Wow, this is amazing
"ما شاء الله"  # Mashallah
"الله يعطيك العافية"  # May God grant you wellness

# About Palestine
"فلسطين حرة على طول"  # Palestine is free forever
"القدس عاصمة فلسطين الأبدية"  # Jerusalem is the eternal capital of Palestine
"سنعود يوماً إلى ديارنا"  # We will return one day to our homes
```

---

## 🎓 Training Details

### Training Data

- **Dataset**: 400 hours of Palestinian speech (private)
- **Language**: Palestinian Arabic dialect
- **Audio**: High-quality Palestinian speech recordings
- **Preprocessing**: Audio normalized and resampled to 16 kHz

### Training Configuration

| **Hyperparameter** | **Value** |
|--------------------|-----------|
| **Learning Rate** | 2e-4 (initial), 1e-5 (refinement) |
| **Batch Size** | 8 (effective: 2 per device × 4 accumulation steps) |
| **Training Steps** | 2000+ |
| **Warmup Steps** | 100 |
| **Max Audio Length** | 20-30 seconds |
| **Optimizer** | AdamW |
| **LR Scheduler** | Cosine with warmup |
| **Gradient Clipping** | 1.0 |
| **Precision** | BF16 (H100) / FP32 |
| **Hardware** | NVIDIA H100 / A100 GPU |

### Training Process

The model was trained using a two-phase approach:

1. **Foundation Phase**: High learning rate (2e-4) for initial adaptation to Palestinian Arabic
2. **Refinement Phase**: Lower learning rate (1e-5) with NEFTune noise for stability and quality
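The cosine-with-warmup schedule named in the configuration table can be sketched in a few lines. Peak learning rate and warmup steps come from the table; the `total` horizon of 2000 steps is an assumption based on the "2000+" entry:

```python
import math

def lr_at(step, peak_lr=2e-4, warmup=100, total=2000):
    """Cosine learning-rate schedule with linear warmup (illustrative sketch)."""
    if step < warmup:
        return peak_lr * step / warmup  # linear ramp from 0 to peak_lr
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # decay to 0
```

The refinement phase would then restart a similar schedule with `peak_lr=1e-5`.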

---

## 📈 Model Performance

The model achieves:

- ✅ **Natural prosody** matching Palestinian Arabic speech patterns
- ✅ **Clear pronunciation** of Arabic phonemes
- ✅ **Voice similarity** to the reference audio
- ✅ **Stable generation** without artifacts or repetitions
- ✅ **Fast inference** suitable for real-time applications

---

## 🛠️ Advanced Usage

### Batched generation

```python
file = "reference_file.wav"  # mp3/wav/ogg, or anything librosa supports
text = ["مرحبا، كيف حالك؟", "بتعرف إنه انا بقدر احكي فلسطيني و English مع بعض Without Errors."]

context_tokens = [mira_tts.encode_audio(file)]

audio = mira_tts.batch_generate(text, context_tokens)

Audio(audio, rate=48000)
```

### Adjusting Generation Parameters

```python
# Example sampling settings; raise temperature for more variation
mira_tts.set_params(
    top_p=0.95,
    top_k=20,
    temperature=0.01,  # low temperature = stable, near-deterministic output
    max_new_tokens=1024,
    repetition_penalty=2.2,
    min_p=0.05
)
```

---

## 💡 Tips for Best Results

1. **Reference Audio Quality**:
   - Use clean audio without background noise
   - 3-10 seconds of speech is ideal
   - Ensure the audio has a 16 kHz sample rate

2. **Text Input**:
   - Use proper Arabic script (not Arabizi/transliteration)
   - Palestinian dialect works best
   - Avoid very long sentences (split them into shorter segments)

3. **Generation Parameters**:
   - `temperature=0.7`: Good default for natural speech
   - `temperature=0.5`: More stable, less variation
   - `temperature=0.9`: More expressive, more variation
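For the "split long sentences" tip, a minimal splitter on Arabic and Latin sentence punctuation can be sketched with the standard library; `split_for_tts` is a hypothetical helper, not part of MiraTTS:

```python
import re

def split_for_tts(text, max_len=120):
    """Split text after sentence-final punctuation ('.', '!', '?', '؟');
    chunks still longer than max_len are split again on commas ('،', ',')."""
    parts = re.split(r'(?<=[.!?؟])\s+', text.strip())
    out = []
    for part in parts:
        if len(part) <= max_len:
            out.append(part)
        else:
            out.extend(p.strip() for p in re.split(r'[,،]\s*', part) if p.strip())
    return [p for p in out if p]
```

Each chunk can then be passed to `generate`, or the whole list to `batch_generate`.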

---

## 🌍 About Palestinian Arabic

Palestinian Arabic is a Levantine Arabic dialect spoken by the Palestinian people. It has unique characteristics:

- **Phonology**: Realization of Classical Arabic /q/ as a glottal stop [ʔ] in urban varieties
- **Vocabulary**: Rich in Levantine and uniquely Palestinian terms
- **Intonation**: Distinctive melodic patterns
- **Regional Variants**: Urban (Jerusalem, Hebron) vs. rural vs. Bedouin varieties

This model captures these linguistic features, making it authentic and representative of Palestinian speech.

---

## 🇵🇸 Message of Solidarity

This model is dedicated to the Palestinian people and their enduring struggle for freedom, dignity, and justice. Through technology, we preserve and celebrate Palestinian culture, language, and identity.

**Free Palestine** 🇵🇸

> *"We will not be erased. Our voices will echo through time, in every language model, every algorithm, every line of code. Palestine lives, and so does its voice."*

---

## 📜 License

This model is released under the **Apache 2.0 License**, which permits:
- ✅ Commercial use
- ✅ Modification and distribution
- ✅ Private use
- ✅ Patent use

---

## 🙏 Acknowledgments

- **Base Model**: [YatharthS/MiraTTS](https://huggingface.co/YatharthS/MiraTTS) - Thank you for the excellent foundation
- **Dataset**: Palestinian Arabic speakers who contributed their voices
- **Community**: The open-source AI community for tools and support
- **Palestine**: For being the inspiration and purpose behind this work

---

## 📞 Contact & Support

- **Model Repository**: [hamdallah/Sofelia-TTS](https://huggingface.co/hamdallah/Sofelia-TTS)
- **Issues & Questions**: Use the Community tab or open an issue

---

## 🔗 Related Resources

- [YatharthS/MiraTTS](https://huggingface.co/YatharthS/MiraTTS) - Base model
- [ncodec](https://github.com/YatharthS/ncodec) - Audio codec library

---

## 📚 Citation

If you use this model in your research or projects, please cite:

```bibtex
@misc{sofelia-tts-2026,
  author = {Hamdallah},
  title = {Sofelia-TTS: Palestinian Arabic Text-to-Speech Model},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/hamdallah/Sofelia-TTS}},
}
```

---

<div style="text-align: center; padding: 20px;">
<h2>🇵🇸 FREE PALESTINE 🇵🇸</h2>
<p><strong>تحيا فلسطين حرة أبية</strong></p>
<p><em>Long Live a Free Palestine</em></p>
<p>🕊️ ✊ 🇵🇸</p>
</div>

---

**Made with ❤️ for Palestine**
Sofelia.png ADDED

Git LFS Details

  • SHA256: bfafef32955049a2ff74e7f13c152712f6c04a63cec1581b40b0efbb678703ec
  • Pointer size: 131 Bytes
  • Size of remote file: 825 kB
added_tokens.json ADDED
The diff for this file is too large to render. See raw diff
 
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0]['role'] == 'system' %}
{{- messages[0]['content'] }}
{%- else %}
{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
{%- endif %}
{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0]['role'] == 'system' %}
{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{{- '<|im_start|>' + message.role }}
{%- if message.content %}
{{- '\n' + message.content }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{{- tool_call.arguments | tojson }}
{{- '}\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
config.json ADDED
@@ -0,0 +1,56 @@
{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "pad_token_id": 151643,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.53.3",
  "unsloth_version": "2026.1.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 166000
}
generation_config.json ADDED
@@ -0,0 +1,9 @@
{
  "_from_model_config": true,
  "eos_token_id": [
    151645
  ],
  "max_length": 32768,
  "pad_token_id": 151643,
  "transformers_version": "4.53.3"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7e189c4b581b10e8b0267dc66e3818ea59f1cc272442dc4ea40fe92d270de3e0
size 2026568872
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cc90330cb541542195aa2d685938e459dae10965f1de965d222de674e5f63abd
size 14092303
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff