KietRiu committed
Commit 6629804 · verified · 1 Parent(s): 17a05c2

End of training

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": true,
+ "pooling_mode_mean_tokens": false,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,442 @@
1
+ ---
2
+ language:
3
+ - vi
4
+ tags:
5
+ - sentence-transformers
6
+ - sentence-similarity
7
+ - feature-extraction
8
+ - dense
9
+ - generated_from_trainer
10
+ - dataset_size:81409
11
+ - loss:TripletLoss
12
+ base_model: dangvantuan/vietnamese-document-embedding
13
+ widget:
14
+ - source_sentence: Đâu là lập luận tồi tệ nhất trên thế giới?
15
+ sentences:
16
+ - Một số ví dụ về phương tiện giao thông cũ và hiện đại là gì?
17
+ - Trận chiến nào trong lịch sử thế giới là tồi tệ nhất?
18
+ - Cuộc tranh luận tồi tệ nhất trên thế giới là gì?
19
+ - source_sentence: Nghị quyết năm mới 2017 của bạn là gì?
20
+ sentences:
21
+ - Bạn có nghĩ việc các thẩm phán luôn thực thi nguyên tắc loại trừ là quan trọng
22
+ hơn không?
23
+ - Quyết tâm của bạn cho năm 2017 là gì?
24
+ - Quyết tâm năm mới 2016 của bạn là gì?
25
+ - source_sentence: Làm thế nào để tôi vượt qua cuộc kiểm tra ma túy đá?
26
+ sentences:
27
+ - Bạn muốn Donald Trump hay Hillary Clinton trở thành TIỀM NĂNG?
28
+ - Tập thể dục có giúp vượt qua bài kiểm tra ma túy đá không?
29
+ - Liệu 0,2 gam meth có xuất hiện trong xét nghiệm nước tiểu 99 giờ sau khi tiêu
30
+ thụ không?
31
+ - source_sentence: Loạt phim về Người ngoài hành tinh cổ đại trên Kênh Lịch sử có
32
+ độ chính xác như thế nào?
33
+ sentences:
34
+ - Bạn nghĩ gì về loạt phim Người ngoài hành tinh cổ đại?
35
+ - Nếu Bắc Ireland, xứ Wales hoặc Scotland rời khỏi Vương quốc Anh, liệu lá cờ của
36
+ Vương quốc Anh có được giữ nguyên hay trở lại phiên bản trước đó?
37
+ - Người ngoài hành tinh cổ đại được chiếu trên Kênh Lịch sử có thật đến mức nào?
38
+ - source_sentence: Antivirus có phục hồi được các tập tin đã xóa không?
39
+ sentences:
40
+ - Cảm giác là con trai/con gái của cha mẹ đồng tính như thế nào?
41
+ - Làm cách nào để khôi phục các tập tin bị xóa vĩnh viễn?
42
+ - Làm thế nào phần mềm chống vi-rút phục hồi các tập tin đã xóa?
43
+ datasets:
44
+ - NghiemAbe/QQP_triplet
45
+ pipeline_tag: sentence-similarity
46
+ library_name: sentence-transformers
47
+ metrics:
48
+ - cosine_accuracy
49
+ model-index:
50
+ - name: SentenceTransformer based on dangvantuan/vietnamese-document-embedding
51
+ results:
52
+ - task:
53
+ type: triplet
54
+ name: Triplet
55
+ dataset:
56
+ name: Unknown
57
+ type: unknown
58
+ metrics:
59
+ - type: cosine_accuracy
60
+ value: 0.6684518456459045
61
+ name: Cosine Accuracy
62
+ ---
63
+
64
+ # SentenceTransformer based on dangvantuan/vietnamese-document-embedding
65
+
66
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [dangvantuan/vietnamese-document-embedding](https://huggingface.co/dangvantuan/vietnamese-document-embedding) on the [qqp_triplet](https://huggingface.co/datasets/NghiemAbe/QQP_triplet) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
67
+
68
+ ## Model Details
69
+
70
+ ### Model Description
71
+ - **Model Type:** Sentence Transformer
72
+ - **Base model:** [dangvantuan/vietnamese-document-embedding](https://huggingface.co/dangvantuan/vietnamese-document-embedding) <!-- at revision 6fa4e2f8ed2d33120b0f4442cc81f8f973c3f56b -->
73
+ - **Maximum Sequence Length:** 8192 tokens
74
+ - **Output Dimensionality:** 768 dimensions
75
+ - **Similarity Function:** Cosine Similarity
76
+ - **Training Dataset:**
77
+ - [qqp_triplet](https://huggingface.co/datasets/NghiemAbe/QQP_triplet)
78
+ - **Language:** vi
79
+ <!-- - **License:** Unknown -->
80
+
81
+ ### Model Sources
82
+
83
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
84
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
85
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
86
+
87
+ ### Full Model Architecture
88
+
89
+ ```
90
+ SentenceTransformer(
91
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'VietnameseModel'})
92
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
93
+ (2): Normalize()
94
+ )
95
+ ```
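+
+ For illustration, roughly the same stack could be assembled by hand with `sentence_transformers.models`. This is only a sketch, not the required loading path; the `trust_remote_code=True` arguments are an assumption (the backbone ships custom `VietnameseModel` code), and the usual route is simply loading the checkpoint as shown in the Usage section below.
+
+ ```python
+ from sentence_transformers import SentenceTransformer, models
+
+ # Transformer module wrapping the custom VietnameseModel backbone (assumed to need trust_remote_code)
+ word_embedding_model = models.Transformer(
+     "KietRiu/vietnamese-document-embedding_FT_QQP",
+     max_seq_length=8192,
+     model_args={"trust_remote_code": True},
+     tokenizer_args={"trust_remote_code": True},
+     config_args={"trust_remote_code": True},
+ )
+ # CLS-token pooling (matching 1_Pooling/config.json), followed by L2 normalization
+ pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="cls")
+ normalize = models.Normalize()
+
+ model = SentenceTransformer(modules=[word_embedding_model, pooling, normalize])
+ ```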
96
+
97
+ ## Usage
98
+
99
+ ### Direct Usage (Sentence Transformers)
100
+
101
+ First install the Sentence Transformers library:
102
+
103
+ ```bash
104
+ pip install -U sentence-transformers
105
+ ```
106
+
107
+ Then you can load this model and run inference.
108
+ ```python
109
+ from sentence_transformers import SentenceTransformer
110
+
111
+ # Download from the 🤗 Hub
112
+ model = SentenceTransformer("KietRiu/vietnamese-document-embedding_FT_QQP")
113
+ # Run inference
114
+ sentences = [
115
+ 'Antivirus có phục hồi được các tập tin đã xóa không?',
116
+ 'Làm thế nào phần mềm chống vi-rút phục hồi các tập tin đã xóa?',
117
+ 'Làm cách nào để khôi phục các tập tin bị xóa vĩnh viễn?',
118
+ ]
119
+ embeddings = model.encode(sentences)
120
+ print(embeddings.shape)
121
+ # [3, 768]
122
+
123
+ # Get the similarity scores for the embeddings
124
+ similarities = model.similarity(embeddings, embeddings)
125
+ print(similarities)
126
+ # tensor([[1.0000, 0.9915, 0.7706],
127
+ # [0.9915, 1.0000, 0.8112],
128
+ # [0.7706, 0.8112, 1.0000]])
129
+ ```
130
+
131
+ <!--
132
+ ### Direct Usage (Transformers)
133
+
134
+ <details><summary>Click to see the direct usage in Transformers</summary>
135
+
136
+ </details>
137
+ -->
138
+
139
+ <!--
140
+ ### Downstream Usage (Sentence Transformers)
141
+
142
+ You can finetune this model on your own dataset.
143
+
144
+ <details><summary>Click to expand</summary>
145
+
146
+ </details>
147
+ -->
148
+
149
+ <!--
150
+ ### Out-of-Scope Use
151
+
152
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
153
+ -->
154
+
155
+ ## Evaluation
156
+
157
+ ### Metrics
158
+
159
+ #### Triplet
160
+
161
+ * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
162
+
163
+ | Metric | Value |
164
+ |:--------------------|:-----------|
165
+ | **cosine_accuracy** | **0.6685** |
166
+
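+ As a rough sketch, this number can be recomputed with `TripletEvaluator` on the held-out triplets. The split name below is an assumption (check the QQP_triplet dataset card for the split actually used), and `trust_remote_code=True` is assumed to be required for the custom backbone code.
+
+ ```python
+ from datasets import load_dataset
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import TripletEvaluator
+
+ model = SentenceTransformer("KietRiu/vietnamese-document-embedding_FT_QQP", trust_remote_code=True)
+ triplets = load_dataset("NghiemAbe/QQP_triplet", split="test")  # assumed split name
+
+ evaluator = TripletEvaluator(
+     anchors=triplets["anchor"],
+     positives=triplets["positive"],
+     negatives=triplets["negative"],
+     name="qqp-triplet-eval",
+ )
+ print(evaluator(model))  # dict containing the cosine accuracy
+ ```
+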
167
+ <!--
168
+ ## Bias, Risks and Limitations
169
+
170
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
171
+ -->
172
+
173
+ <!--
174
+ ### Recommendations
175
+
176
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
177
+ -->
178
+
179
+ ## Training Details
180
+
181
+ ### Training Dataset
182
+
183
+ #### qqp_triplet
184
+
185
+ * Dataset: [qqp_triplet](https://huggingface.co/datasets/NghiemAbe/QQP_triplet) at [a48ebfe](https://huggingface.co/datasets/NghiemAbe/QQP_triplet/tree/a48ebfea42995330c3ce7eb69f8786635d1a6494)
186
+ * Size: 81,409 training samples
187
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
188
+ * Approximate statistics based on the first 1000 samples:
189
+ | | anchor | positive | negative |
190
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
191
+ | type | string | string | string |
192
+ | details | <ul><li>min: 7 tokens</li><li>mean: 17.34 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 17.34 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 18.08 tokens</li><li>max: 66 tokens</li></ul> |
193
+ * Samples:
194
+ | anchor | positive | negative |
195
+ |:-------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
196
+ | <code>Donald Trump mong đợi Mexico trả tiền cho bức tường biên giới do ông đề xuất như thế nào?</code> | <code>Làm thế nào Donald Trump có thể khiến Mexico trả tiền cho bức tường biên giới?</code> | <code>Điều gì sẽ xảy ra nếu bức tường của Trump hoàn toàn không phải là bức tường vật lý? Ông ấy nói Mexico sẽ trả tiền. Một số biện pháp ngăn chặn tài chính mà Mỹ có thể áp đặt là gì? Có thể được không?</code> |
197
+ | <code>Sự khác biệt giữa thực phẩm Trung Quốc và thực phẩm phương Tây là gì?</code> | <code>Sự khác biệt giữa thực phẩm phương Tây và Trung Quốc là gì?</code> | <code>Sự khác biệt giữa thực phẩm Trung Quốc và thực phẩm Nhật Bản là gì?</code> |
198
+ | <code>Làm cách nào tôi có thể đặt câu hỏi cho một người cụ thể trên Quora ngoài những câu hỏi được đề xuất?</code> | <code>Tôi muốn đặt câu hỏi cho một người cụ thể trên Quora, tôi phải làm gì?</code> | <code>Câu hỏi nào bạn có thể hỏi ai đó sẽ khơi dậy cuộc trò chuyện sâu sắc và thú vị?</code> |
199
+ * Loss: [<code>TripletLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) with these parameters:
200
+ ```json
201
+ {
202
+ "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
203
+ "triplet_margin": 0.7
204
+ }
205
+ ```
206
+
207
+ ### Evaluation Dataset
208
+
209
+ #### qqp_triplet
210
+
211
+ * Dataset: [qqp_triplet](https://huggingface.co/datasets/NghiemAbe/QQP_triplet) at [a48ebfe](https://huggingface.co/datasets/NghiemAbe/QQP_triplet/tree/a48ebfea42995330c3ce7eb69f8786635d1a6494)
212
+ * Size: 20,353 evaluation samples
213
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
214
+ * Approximate statistics based on the first 1000 samples:
215
+ | | anchor | positive | negative |
216
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
217
+ | type | string | string | string |
218
+ | details | <ul><li>min: 6 tokens</li><li>mean: 17.65 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 17.49 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 18.61 tokens</li><li>max: 78 tokens</li></ul> |
219
+ * Samples:
220
+ | anchor | positive | negative |
221
+ |:---------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
222
+ | <code>Có trang web nào khác tương tự như Quora không?</code> | <code>Trang web tương tự như Quora là gì?</code> | <code>Quora có phải là một công cụ tìm kiếm (hơn một số loại trang web khác) không?</code> |
223
+ | <code>Hanuman Chalisa có thực sự hiệu quả hay chỉ đơn thuần là một hệ thống niềm tin?</code> | <code>Việc đọc và nói Hanuman Chalisa đối với tất cả những người theo đạo Hindu có hiệu quả đến mức nào?</code> | <code>Tại sao chúng ta nên đọc Hanuman chalisa? Kết quả của nó là gì?</code> |
224
+ | <code>Mục đích thực sự của cuộc sống là gì?</code> | <code>Mục đích cuộc sống của bạn nên là gì?</code> | <code>Chúng ta có thực sự có mục đích nào đó trong cuộc sống không? Hay chúng ta tạo ra một mục đích để khiến bản thân cảm thấy mình có ý nghĩa trong thế giới vô cùng rộng lớn, hay để khiến bản thân cảm thấy rằng sự tồn tại của chúng ta trong thế giới rộng lớn là cần thiết?</code> |
225
+ * Loss: [<code>TripletLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) with these parameters:
226
+ ```json
227
+ {
228
+ "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
229
+ "triplet_margin": 0.7
230
+ }
231
+ ```
232
+
233
+ ### Training Hyperparameters
234
+ #### Non-Default Hyperparameters
235
+
236
+ - `eval_strategy`: steps
237
+ - `per_device_train_batch_size`: 64
238
+ - `per_device_eval_batch_size`: 64
239
+ - `eval_accumulation_steps`: 1
240
+ - `learning_rate`: 2e-05
241
+ - `warmup_ratio`: 0.1
242
+ - `bf16`: True
243
+ - `load_best_model_at_end`: True
244
+ - `optim`: adamw_8bit
245
+ - `push_to_hub`: True
246
+ - `hub_model_id`: KietRiu/vietnamese-document-embedding_FT_QQP
247
+ - `batch_sampler`: no_duplicates
248
+
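+ As a sketch, these non-default values map onto `SentenceTransformerTrainingArguments` roughly as follows (the `output_dir` is a placeholder, not the path used for this run):
+
+ ```python
+ from sentence_transformers import SentenceTransformerTrainingArguments
+ from sentence_transformers.training_args import BatchSamplers
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="outputs/vietnamese-document-embedding_FT_QQP",  # placeholder
+     eval_strategy="steps",
+     per_device_train_batch_size=64,
+     per_device_eval_batch_size=64,
+     eval_accumulation_steps=1,
+     learning_rate=2e-5,
+     warmup_ratio=0.1,
+     bf16=True,
+     load_best_model_at_end=True,
+     optim="adamw_8bit",
+     push_to_hub=True,
+     hub_model_id="KietRiu/vietnamese-document-embedding_FT_QQP",
+     batch_sampler=BatchSamplers.NO_DUPLICATES,
+ )
+ ```
+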
249
+ #### All Hyperparameters
250
+ <details><summary>Click to expand</summary>
251
+
252
+ - `overwrite_output_dir`: False
253
+ - `do_predict`: False
254
+ - `eval_strategy`: steps
255
+ - `prediction_loss_only`: True
256
+ - `per_device_train_batch_size`: 64
257
+ - `per_device_eval_batch_size`: 64
258
+ - `per_gpu_train_batch_size`: None
259
+ - `per_gpu_eval_batch_size`: None
260
+ - `gradient_accumulation_steps`: 1
261
+ - `eval_accumulation_steps`: 1
262
+ - `torch_empty_cache_steps`: None
263
+ - `learning_rate`: 2e-05
264
+ - `weight_decay`: 0.0
265
+ - `adam_beta1`: 0.9
266
+ - `adam_beta2`: 0.999
267
+ - `adam_epsilon`: 1e-08
268
+ - `max_grad_norm`: 1.0
269
+ - `num_train_epochs`: 3
270
+ - `max_steps`: -1
271
+ - `lr_scheduler_type`: linear
272
+ - `lr_scheduler_kwargs`: {}
273
+ - `warmup_ratio`: 0.1
274
+ - `warmup_steps`: 0
275
+ - `log_level`: passive
276
+ - `log_level_replica`: warning
277
+ - `log_on_each_node`: True
278
+ - `logging_nan_inf_filter`: True
279
+ - `save_safetensors`: True
280
+ - `save_on_each_node`: False
281
+ - `save_only_model`: False
282
+ - `restore_callback_states_from_checkpoint`: False
283
+ - `no_cuda`: False
284
+ - `use_cpu`: False
285
+ - `use_mps_device`: False
286
+ - `seed`: 42
287
+ - `data_seed`: None
288
+ - `jit_mode_eval`: False
289
+ - `bf16`: True
290
+ - `fp16`: False
291
+ - `fp16_opt_level`: O1
292
+ - `half_precision_backend`: auto
293
+ - `bf16_full_eval`: False
294
+ - `fp16_full_eval`: False
295
+ - `tf32`: None
296
+ - `local_rank`: 0
297
+ - `ddp_backend`: None
298
+ - `tpu_num_cores`: None
299
+ - `tpu_metrics_debug`: False
300
+ - `debug`: []
301
+ - `dataloader_drop_last`: False
302
+ - `dataloader_num_workers`: 0
303
+ - `dataloader_prefetch_factor`: None
304
+ - `past_index`: -1
305
+ - `disable_tqdm`: False
306
+ - `remove_unused_columns`: True
307
+ - `label_names`: None
308
+ - `load_best_model_at_end`: True
309
+ - `ignore_data_skip`: False
310
+ - `fsdp`: []
311
+ - `fsdp_min_num_params`: 0
312
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
313
+ - `fsdp_transformer_layer_cls_to_wrap`: None
314
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
315
+ - `parallelism_config`: None
316
+ - `deepspeed`: None
317
+ - `label_smoothing_factor`: 0.0
318
+ - `optim`: adamw_8bit
319
+ - `optim_args`: None
320
+ - `adafactor`: False
321
+ - `group_by_length`: False
322
+ - `length_column_name`: length
323
+ - `project`: huggingface
324
+ - `trackio_space_id`: trackio
325
+ - `ddp_find_unused_parameters`: None
326
+ - `ddp_bucket_cap_mb`: None
327
+ - `ddp_broadcast_buffers`: False
328
+ - `dataloader_pin_memory`: True
329
+ - `dataloader_persistent_workers`: False
330
+ - `skip_memory_metrics`: True
331
+ - `use_legacy_prediction_loop`: False
332
+ - `push_to_hub`: True
333
+ - `resume_from_checkpoint`: None
334
+ - `hub_model_id`: KietRiu/vietnamese-document-embedding_FT_QQP
335
+ - `hub_strategy`: every_save
336
+ - `hub_private_repo`: None
337
+ - `hub_always_push`: False
338
+ - `hub_revision`: None
339
+ - `gradient_checkpointing`: False
340
+ - `gradient_checkpointing_kwargs`: None
341
+ - `include_inputs_for_metrics`: False
342
+ - `include_for_metrics`: []
343
+ - `eval_do_concat_batches`: True
344
+ - `fp16_backend`: auto
345
+ - `push_to_hub_model_id`: None
346
+ - `push_to_hub_organization`: None
347
+ - `mp_parameters`:
348
+ - `auto_find_batch_size`: False
349
+ - `full_determinism`: False
350
+ - `torchdynamo`: None
351
+ - `ray_scope`: last
352
+ - `ddp_timeout`: 1800
353
+ - `torch_compile`: False
354
+ - `torch_compile_backend`: None
355
+ - `torch_compile_mode`: None
356
+ - `include_tokens_per_second`: False
357
+ - `include_num_input_tokens_seen`: no
358
+ - `neftune_noise_alpha`: None
359
+ - `optim_target_modules`: None
360
+ - `batch_eval_metrics`: False
361
+ - `eval_on_start`: False
362
+ - `use_liger_kernel`: False
363
+ - `liger_kernel_config`: None
364
+ - `eval_use_gather_object`: False
365
+ - `average_tokens_across_devices`: True
366
+ - `prompts`: None
367
+ - `batch_sampler`: no_duplicates
368
+ - `multi_dataset_batch_sampler`: proportional
369
+ - `router_mapping`: {}
370
+ - `learning_rate_mapping`: {}
371
+
372
+ </details>
373
+
374
+ ### Training Logs
375
+ | Epoch | Step | Training Loss | Validation Loss | cosine_accuracy |
376
+ |:----------:|:-------:|:-------------:|:---------------:|:---------------:|
377
+ | -1 | -1 | - | - | 0.9535 |
378
+ | **0.3928** | **500** | **0.7236** | **0.7945** | **0.6861** |
379
+ | 0.7855 | 1000 | 0.6335 | 0.7567 | 0.6616 |
380
+ | 1.1783 | 1500 | 0.6148 | 0.7505 | 0.6670 |
381
+ | 1.5711 | 2000 | 0.6028 | 0.7680 | 0.6837 |
382
+ | 1.9639 | 2500 | 0.593 | 0.7641 | 0.6759 |
383
+ | 2.3566 | 3000 | 0.5819 | 0.7465 | 0.6632 |
384
+ | 2.7494 | 3500 | 0.5757 | 0.7529 | 0.6685 |
385
+
386
+ * The bold row denotes the saved checkpoint.
387
+
388
+ ### Framework Versions
389
+ - Python: 3.12.12
390
+ - Sentence Transformers: 5.2.0
391
+ - Transformers: 4.57.3
392
+ - PyTorch: 2.9.1+cu128
393
+ - Accelerate: 1.12.0
394
+ - Datasets: 4.3.0
395
+ - Tokenizers: 0.22.1
396
+
397
+ ## Citation
398
+
399
+ ### BibTeX
400
+
401
+ #### Sentence Transformers
402
+ ```bibtex
403
+ @inproceedings{reimers-2019-sentence-bert,
404
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
405
+ author = "Reimers, Nils and Gurevych, Iryna",
406
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
407
+ month = "11",
408
+ year = "2019",
409
+ publisher = "Association for Computational Linguistics",
410
+ url = "https://arxiv.org/abs/1908.10084",
411
+ }
412
+ ```
413
+
414
+ #### TripletLoss
415
+ ```bibtex
416
+ @misc{hermans2017defense,
417
+ title={In Defense of the Triplet Loss for Person Re-Identification},
418
+ author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
419
+ year={2017},
420
+ eprint={1703.07737},
421
+ archivePrefix={arXiv},
422
+ primaryClass={cs.CV}
423
+ }
424
+ ```
425
+
426
+ <!--
427
+ ## Glossary
428
+
429
+ *Clearly define terms in order to be accessible across audiences.*
430
+ -->
431
+
432
+ <!--
433
+ ## Model Card Authors
434
+
435
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
436
+ -->
437
+
438
+ <!--
439
+ ## Model Card Contact
440
+
441
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
442
+ -->
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+ "__version__": {
+ "sentence_transformers": "5.2.0",
+ "transformers": "4.57.3",
+ "pytorch": "2.9.1+cu128"
+ },
+ "prompts": {
+ "query": "",
+ "document": ""
+ },
+ "default_prompt_name": null,
+ "model_type": "SentenceTransformer",
+ "similarity_fn_name": "cosine"
+ }
configuration.py ADDED
@@ -0,0 +1,114 @@
1
+ # limitations under the License.
2
+ """ Vietnamese model configuration"""
3
+ from transformers.configuration_utils import PretrainedConfig
4
+ from transformers.utils import logging
5
+
6
+ logger = logging.get_logger(__name__)
7
+
8
+
9
+ class VietnameseConfig(PretrainedConfig):
10
+ r"""
11
+ This is the configuration class to store the configuration of a [`VietnameseModel`] or a [`TFVietnameseModel`]. It is used to
12
+ instantiate a Vietnamese model according to the specified arguments, defining the model architecture. Instantiating a
13
+ configuration with the defaults will yield a similar configuration to that of the Vietnamese
14
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
15
+ documentation from [`PretrainedConfig`] for more information.
16
+ Args:
17
+ vocab_size (`int`, *optional*, defaults to 30522):
18
+ Vocabulary size of the Vietnamese model. Defines the number of different tokens that can be represented by the
19
+ `inputs_ids` passed when calling [`VietnameseModel`] or [`TFVietnameseModel`].
20
+ hidden_size (`int`, *optional*, defaults to 768):
21
+ Dimensionality of the encoder layers and the pooler layer.
22
+ num_hidden_layers (`int`, *optional*, defaults to 12):
23
+ Number of hidden layers in the Transformer encoder.
24
+ num_attention_heads (`int`, *optional*, defaults to 12):
25
+ Number of attention heads for each attention layer in the Transformer encoder.
26
+ intermediate_size (`int`, *optional*, defaults to 3072):
27
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
28
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
29
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
30
+ `"relu"`, `"silu"` and `"gelu_Vietnamese"` are supported.
31
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
32
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
33
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
34
+ The dropout ratio for the attention probabilities.
35
+ max_position_embeddings (`int`, *optional*, defaults to 512):
36
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
37
+ just in case (e.g., 512 or 1024 or 2048).
38
+ type_vocab_size (`int`, *optional*, defaults to 2):
39
+ The vocabulary size of the `token_type_ids` passed when calling [`VietnameseModel`] or [`TFVietnameseModel`].
40
+ initializer_range (`float`, *optional*, defaults to 0.02):
41
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
42
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
43
+ The epsilon used by the layer normalization layers.
44
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
45
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
46
+ rope_theta (`float`, *optional*, defaults to 10000.0):
47
+ The base period of the RoPE embeddings.
48
+ rope_scaling (`Dict`, *optional*):
49
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
50
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
51
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
52
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
53
+ these scaling strategies behave:
54
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
55
+ experimental feature, subject to breaking API changes in future versions.
56
+ classifier_dropout (`float`, *optional*):
57
+ The dropout ratio for the classification head.
58
+ Examples:
59
+ """
60
+
61
+ model_type = "Vietnamese"
62
+
63
+ def __init__(
64
+ self,
65
+ vocab_size=30528,
66
+ hidden_size=768,
67
+ num_hidden_layers=12,
68
+ num_attention_heads=12,
69
+ intermediate_size=3072,
70
+ hidden_act="gelu",
71
+ hidden_dropout_prob=0.1,
72
+ attention_probs_dropout_prob=0.0,
73
+ max_position_embeddings=2048,
74
+ type_vocab_size=1,
75
+ initializer_range=0.02,
76
+ layer_norm_type='layer_norm',
77
+ layer_norm_eps=1e-12,
78
+ # pad_token_id=0,
79
+ position_embedding_type="rope",
80
+ rope_theta=10000.0,
81
+ rope_scaling=None,
82
+ classifier_dropout=None,
83
+ pack_qkv=True,
84
+ unpad_inputs=False,
85
+ use_memory_efficient_attention=False,
86
+ logn_attention_scale=False,
87
+ logn_attention_clip1=False,
88
+ **kwargs,
89
+ ):
90
+ super().__init__(**kwargs)
91
+
92
+ self.vocab_size = vocab_size
93
+ self.hidden_size = hidden_size
94
+ self.num_hidden_layers = num_hidden_layers
95
+ self.num_attention_heads = num_attention_heads
96
+ self.hidden_act = hidden_act
97
+ self.intermediate_size = intermediate_size
98
+ self.hidden_dropout_prob = hidden_dropout_prob
99
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
100
+ self.max_position_embeddings = max_position_embeddings
101
+ self.type_vocab_size = type_vocab_size
102
+ self.initializer_range = initializer_range
103
+ self.layer_norm_type = layer_norm_type
104
+ self.layer_norm_eps = layer_norm_eps
105
+ self.position_embedding_type = position_embedding_type
106
+ self.rope_theta = rope_theta
107
+ self.rope_scaling = rope_scaling
108
+ self.classifier_dropout = classifier_dropout
109
+
110
+ self.pack_qkv = pack_qkv
111
+ self.unpad_inputs = unpad_inputs
112
+ self.use_memory_efficient_attention = use_memory_efficient_attention
113
+ self.logn_attention_scale = logn_attention_scale
114
+ self.logn_attention_clip1 = logn_attention_clip1
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:19be82f92d8337349792ce619003cd738412395d40933bec470320ccfe5c33e4
+ oid sha256:129e01cfccbb835ca34ae8e83b7a16b8c81373225f0edffc11cda8bcb7d32694
  size 1221487872
modeling.py ADDED
@@ -0,0 +1,1319 @@
1
+ """PyTorch Vietnamese model."""
2
+ import math
3
+ from dataclasses import dataclass
4
+ from typing import List, Optional, Tuple, Union
5
+
6
+ import torch
7
+ import torch.utils.checkpoint
8
+ from torch import nn
9
+
10
+ from transformers.activations import ACT2FN
11
+ from transformers.modeling_outputs import (
12
+ BaseModelOutput,
13
+ BaseModelOutputWithPooling,
14
+ MaskedLMOutput,
15
+ MultipleChoiceModelOutput,
16
+ QuestionAnsweringModelOutput,
17
+ SequenceClassifierOutput,
18
+ ModelOutput,
19
+ )
20
+ from transformers.modeling_utils import PreTrainedModel
21
+ from transformers.utils import logging
22
+
23
+ try:
24
+ import xformers.ops as xops
25
+ except ImportError as e:
26
+ xops = None
27
+
28
+ from .configuration import VietnameseConfig
29
+
30
+
31
+ logger = logging.get_logger(__name__)
32
+
33
+
34
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
35
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
36
+ class IndexFirstAxis(torch.autograd.Function):
37
+ @staticmethod
38
+ def forward(ctx, input, indices):
39
+ ctx.save_for_backward(indices)
40
+ assert input.ndim >= 2
41
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
42
+ second_dim = other_shape.numel()
43
+ return torch.gather(
44
+ input.view(ctx.first_axis_dim, second_dim),
45
+ 0,
46
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
47
+ ).reshape(-1, *other_shape)
48
+
49
+ @staticmethod
50
+ def backward(ctx, grad_output):
51
+ (indices,) = ctx.saved_tensors
52
+ assert grad_output.ndim >= 2
53
+ other_shape = grad_output.shape[1:]
54
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
55
+ grad_input = torch.zeros(
56
+ [ctx.first_axis_dim, grad_output.shape[1]],
57
+ device=grad_output.device,
58
+ dtype=grad_output.dtype,
59
+ )
60
+ grad_input.scatter_(
61
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
62
+ )
63
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
64
+
65
+
66
+ index_first_axis = IndexFirstAxis.apply
67
+
68
+
69
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
70
+ """
71
+ Arguments:
72
+ hidden_states: (batch, seqlen, ...)
73
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
74
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
75
+ Return:
76
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
77
+ """
78
+ if indices is None:
79
+ assert attention_mask is not None
80
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
81
+
82
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
83
+ return index_first_axis(hidden_states, indices)
84
+
85
+
86
+ class IndexPutFirstAxis(torch.autograd.Function):
87
+ @staticmethod
88
+ def forward(
89
+ ctx,
90
+ values: torch.Tensor,
91
+ indices: torch.Tensor,
92
+ first_axis_dim
93
+ ) -> torch.Tensor:
94
+ ctx.save_for_backward(indices)
95
+ assert indices.ndim == 1
96
+ assert values.ndim >= 2
97
+ output = torch.zeros(
98
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
99
+ )
100
+ output[indices] = values
101
+ return output
102
+
103
+ @staticmethod
104
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
105
+ indices, = ctx.saved_tensors
106
+ grad_values = grad_output[indices]
107
+ return grad_values, None, None
108
+
109
+
110
+ index_put_first_axis = IndexPutFirstAxis.apply
111
+
112
+
113
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
114
+ """Add padding to sequences.
115
+ Arguments:
116
+ inputs: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
117
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
118
+ batch: int batch_size
119
+ seqlen: int max sequence length
120
+ Returns:
121
+ inputs: (batch, seqlen, ...)
122
+ """
123
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
124
+ return output.view(batch, seqlen, *inputs.shape[1:])
125
+
126
+
127
+ def rotate_half(x):
128
+ """Rotates half the hidden dims of the input."""
129
+ x1 = x[..., : x.shape[-1] // 2]
130
+ x2 = x[..., x.shape[-1] // 2 :]
131
+ return torch.cat((-x2, x1), dim=-1)
132
+
133
+
134
+ def apply_rotary_pos_emb(q, k, cos, sin):
135
+ """Applies Rotary Position Embedding to the query and key tensors.
136
+ Args:
137
+ q (`torch.Tensor`): The query tensor.
138
+ k (`torch.Tensor`): The key tensor.
139
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
140
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
141
+ Returns:
142
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
143
+ """
144
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
145
+ q_embed = (q * cos) + (rotate_half(q) * sin)
146
+ k_embed = (k * cos) + (rotate_half(k) * sin)
147
+ return q_embed, k_embed
148
+
149
+
150
+ class RotaryEmbedding(torch.nn.Module):
151
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
152
+ super().__init__()
153
+
154
+ self.dim = dim
155
+ self.max_position_embeddings = max_position_embeddings
156
+ self.base = base
157
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
158
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
159
+
160
+ self._set_cos_sin_cache(
161
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
162
+ )
163
+
164
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
165
+ self.max_seq_len_cached = seq_len
166
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
167
+
168
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
169
+ emb = torch.cat((freqs, freqs), dim=-1)
170
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
171
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
172
+
173
+ def forward(self, x, seq_len=None):
174
+ if seq_len > self.max_seq_len_cached:
175
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
176
+
177
+ return (
178
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
179
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
180
+ )
181
+
182
+
183
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
184
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
185
+
186
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
187
+ self.scaling_factor = scaling_factor
188
+ self.mixed_b = mixed_b
189
+ super().__init__(dim, max_position_embeddings, base, device)
190
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
191
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
192
+
193
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
194
+ self.max_seq_len_cached = seq_len
195
+
196
+ if seq_len > self.max_position_embeddings:
197
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
198
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
199
+
200
+ if self.mixed_b is None:
201
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim)
202
+ else:
203
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b
204
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp()
205
+ inv_freq = inv_freq / lambda_1_m
206
+
207
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
208
+
209
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
210
+
211
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
212
+ emb = torch.cat((freqs, freqs), dim=-1)
213
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
214
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
215
+
216
+
217
+ class RMSNorm(nn.Module):
218
+ def __init__(self, hidden_size, eps=1e-6):
219
+ """
220
+ RMSNorm is equivalent to T5LayerNorm
221
+ """
222
+ super().__init__()
223
+ self.weight = nn.Parameter(torch.ones(hidden_size))
224
+ self.variance_epsilon = eps
225
+
226
+ def forward(self, hidden_states):
227
+ input_dtype = hidden_states.dtype
228
+ hidden_states = hidden_states.to(torch.float32)
229
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
230
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
231
+ return self.weight * hidden_states.to(input_dtype)
232
+
233
+
234
+ LAYER_NORM = {
235
+ 'layer_norm': nn.LayerNorm,
236
+ 'rms_norm': RMSNorm
237
+ }
238
+
239
+
240
+ class VietnameseEmbeddings(nn.Module):
241
+ """
242
+ Embedding and Unpadding.
243
+ """
244
+
245
+ def __init__(self, config: VietnameseConfig):
246
+ super().__init__()
247
+ self.padding_idx = config.pad_token_id
248
+ self.word_embeddings = nn.Embedding(
249
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
250
+ )
251
+
252
+ self.position_embedding_type = config.position_embedding_type
253
+ if self.position_embedding_type == 'absolute':
254
+ self.position_embeddings = nn.Embedding(
255
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
256
+ )
257
+ elif self.position_embedding_type == 'rope':
258
+ self._init_rope(config)
259
+ else:
260
+ raise ValueError
261
+
262
+ self.type_vocab_size = config.type_vocab_size
263
+ if self.type_vocab_size > 0:
264
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
265
+
266
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
267
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
268
+ self.register_buffer(
269
+ "position_ids", torch.arange(config.max_position_embeddings), persistent=False
270
+ )
271
+
272
+ def _init_rope(self, config):
273
+ kwargs = dict(
274
+ dim=int(config.hidden_size / config.num_attention_heads),
275
+ max_position_embeddings=config.max_position_embeddings,
276
+ base=config.rope_theta
277
+ )
278
+ if config.rope_scaling is None:
279
+ self.rotary_emb = RotaryEmbedding(**kwargs)
280
+ else:
281
+ kwargs.update(scaling_factor=config.rope_scaling["factor"])
282
+ scaling_type = config.rope_scaling["type"]
283
+ if scaling_type == 'ntk':
284
+ kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
285
+ self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
286
+ else:
287
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
288
+
289
+ def forward(
290
+ self,
291
+ unpad_inputs: bool,
292
+ input_ids: Optional[torch.Tensor] = None,
293
+ attention_mask: Optional[torch.Tensor] = None,
294
+ length: Optional[List[int]] = None,
295
+ token_type_ids: Optional[torch.Tensor] = None,
296
+ position_ids: Optional[torch.Tensor] = None,
297
+ inputs_embeds: Optional[torch.Tensor] = None,
298
+ ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
299
+ if inputs_embeds is None:
300
+ device, input_shape = input_ids.device, input_ids.shape
301
+ else:
302
+ device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
303
+ batch_size, seq_length = input_shape
304
+
305
+ if attention_mask is None:
306
+ attention_mask = torch.ones(input_shape, device=device)
307
+ if length is not None:
308
+ for i, l in enumerate(length):
309
+ attention_mask[i, l:] = 0
310
+
311
+ if unpad_inputs:
312
+ attention_mask_bool = attention_mask.bool()
313
+ if length is None:
314
+ length = attention_mask.sum(-1).tolist()
315
+
316
+ if inputs_embeds is None:
317
+ if unpad_inputs:
318
+ input_ids = input_ids[attention_mask_bool].unsqueeze(0)
319
+ inputs_embeds = self.word_embeddings(input_ids)
320
+ else:
321
+ if unpad_inputs:
322
+ inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
323
+ embeddings = inputs_embeds
324
+
325
+ if position_ids is None:
326
+ if seq_length > self.position_ids.size(0):
327
+ self.register_buffer(
328
+ "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
329
+ )
330
+ if unpad_inputs:
331
+ position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
332
+ else:
333
+ position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
334
+ elif unpad_inputs:
335
+ position_ids = position_ids[attention_mask_bool].unsqueeze(0)
336
+
337
+ if self.position_embedding_type == 'rope':
338
+ rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
339
+ rope_cos = rope_cos[position_ids].unsqueeze(2)
340
+ rope_sin = rope_sin[position_ids].unsqueeze(2)
341
+ rope_embeds = rope_cos, rope_sin
342
+ else:
343
+ rope_embeds = None
344
+
345
+ if self.type_vocab_size > 0:
346
+ if token_type_ids is None:
347
+ token_type_ids = position_ids.mul(0)
348
+ else:
349
+ if self.type_vocab_size < 2:
350
+ token_type_ids.mul_(0)
351
+ if unpad_inputs:
352
+ token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
353
+
354
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
355
+ embeddings = embeddings + token_type_embeddings
356
+
357
+ if self.position_embedding_type == "absolute":
358
+ position_embeddings = self.position_embeddings(position_ids)
359
+ embeddings = embeddings + position_embeddings
360
+
361
+ embeddings = self.LayerNorm(embeddings)
362
+ embeddings = self.dropout(embeddings)
363
+
364
+ return embeddings, attention_mask, rope_embeds, length
365
+
366
+
367
+ class VietnameseAttention(nn.Module):
368
+ def __init__(self, config: VietnameseConfig, pack_qkv=None, use_memory_efficient_attention=None):
369
+ super().__init__()
370
+ self.config = config
371
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
372
+ raise ValueError(
373
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
374
+ f"heads ({config.num_attention_heads})"
375
+ )
376
+
377
+ self.hidden_size = config.hidden_size
378
+ self.num_attention_heads = config.num_attention_heads
379
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
380
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
381
+
382
+ if pack_qkv is None:
383
+ pack_qkv = config.pack_qkv
384
+ self.pack_qkv = pack_qkv
385
+
386
+ if self.pack_qkv:
387
+ self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
388
+ else:
389
+ self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
390
+ self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
391
+ self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
392
+
393
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
394
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
395
+
396
+ if use_memory_efficient_attention is None:
397
+ use_memory_efficient_attention = self.config.use_memory_efficient_attention
398
+ self.use_memory_efficient_attention = use_memory_efficient_attention
399
+ self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
400
+ if self.use_memory_efficient_attention:
401
+ assert self.memory_efficient_attention is not None, 'please install xformers'
402
+
403
+ def forward(
404
+ self,
405
+ hidden_states: torch.Tensor,
406
+ attention_bias: torch.FloatTensor,
407
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
408
+ padding_inputs: Optional[Tuple] = None,
409
+ attention_scale: Optional[torch.FloatTensor] = None,
410
+ head_mask: Optional[torch.FloatTensor] = None,
411
+ output_attentions: Optional[bool] = False,
412
+ qkv_inputs: Optional[Tuple] = None,
413
+ ) -> Tuple[torch.Tensor, ...]:
414
+ shape_hd = (self.num_attention_heads, self.attention_head_size)
415
+ if self.pack_qkv and qkv_inputs is None:
416
+ qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
417
+ else:
418
+ if qkv_inputs is None:
419
+ qkv_inputs = (hidden_states, hidden_states, hidden_states)
420
+ qkv_pack = [
421
+ getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
422
+ ]
423
+ query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
424
+
425
+ if self.config.position_embedding_type == 'rope':
426
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
427
+
428
+ dtype = query_states.dtype
429
+
430
+ if self.config.logn_attention_scale and attention_scale is not None:
431
+ query_states = query_states * attention_scale.to(dtype)
432
+
433
+ if padding_inputs is not None:
434
+ query_states = pad_input(query_states.squeeze(), *padding_inputs)
435
+ key_states = pad_input(key_states.squeeze(), *padding_inputs)
436
+ value_states = pad_input(value_states.squeeze(), *padding_inputs)
437
+
438
+ if self.use_memory_efficient_attention:
439
+ assert self.memory_efficient_attention is not None, "xformers is not loaded"
440
+ assert output_attentions is False, "memory_efficient_attention do not output attentions"
441
+ assert head_mask is None, "Not support yet"
442
+ attention_probs = None
443
+ if torch.is_tensor(attention_bias):
444
+ attention_bias = attention_bias.to(dtype)
445
+ context_layer = self.memory_efficient_attention(
446
+ query_states,
447
+ key_states,
448
+ value_states,
449
+ attn_bias=attention_bias,
450
+ p=self.dropout.p
451
+ )
452
+ else:
453
+ if output_attentions and isinstance(self, VietnameseSdpaAttention):
454
+ raise RuntimeError("SDPA do not output attentions")
455
+ context_layer, attention_probs = self._attention(
456
+ query_states, key_states, value_states, attention_bias, head_mask
457
+ )
458
+
459
+ if padding_inputs is not None:
460
+ context_layer = unpad_input(context_layer, indices=padding_inputs[0])
461
+
462
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
463
+ context_layer = context_layer.view(new_context_layer_shape)
464
+
465
+ attn_output = self.o_proj(context_layer)
466
+
467
+ outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
468
+ return outputs
469
+
470
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
471
+ query_states = query_states.transpose(1, 2)
472
+ key_states = key_states.transpose(1, 2)
473
+ value_states = value_states.transpose(1, 2)
474
+ attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
475
+
476
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
477
+ if attention_bias is not None:
478
+ attention_scores = attention_scores + attention_bias
479
+
480
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
481
+
482
+ if self.dropout.p > 0:
483
+ attention_probs = self.dropout(attention_probs)
484
+
485
+ if head_mask is not None:
486
+ attention_probs = attention_probs * head_mask
487
+
488
+ context_layer = torch.matmul(attention_probs, value_states)
489
+
490
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
491
+ return context_layer, attention_probs
492
+
493
+
494
+ class VietnameseSdpaAttention(VietnameseAttention):
495
+ """
496
+ Vietnamese attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
497
+ `VietnameseAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
498
+ SDPA API.
499
+ """
500
+ def __init__(self, config: VietnameseConfig, **kwargs):
501
+ super().__init__(config, **kwargs)
502
+
503
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
504
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
505
+ query_states.transpose(1, 2),
506
+ key_states.transpose(1, 2),
507
+ value_states.transpose(1, 2),
508
+ attn_mask=attention_bias,
509
+ dropout_p=self.dropout.p if self.training else 0.0,
510
+ )
511
+ attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
512
+ return attn_output, None
513
+
514
+
515
+ Vietnamese_ATTENTION_CLASSES = {
516
+ "eager": VietnameseAttention,
517
+ "sdpa": VietnameseSdpaAttention,
518
+ }
519
+
520
+
521
+ class VietnameseGatedMLP(nn.Module):
522
+ """
523
+ GLU Variants Improve Transformer.
524
+ """
525
+
526
+ def __init__(self, config: VietnameseConfig):
527
+ super().__init__()
528
+ self.intermediate_size = config.intermediate_size
529
+ self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
530
+ self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
531
+ self.act_fn = ACT2FN[config.hidden_act]
532
+ if config.hidden_dropout_prob > 0:
533
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
534
+ else:
535
+ self.hidden_dropout = None
536
+
537
+ def forward(self, hidden_states):
538
+ up_gate = self.up_gate_proj(hidden_states)
539
+ up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
540
+ gate = self.act_fn(gate)
541
+ gated_states = gate * up_states
542
+ if self.hidden_dropout is not None:
543
+ gated_states = self.hidden_dropout(gated_states)
544
+ down_states = self.down_proj(gated_states)
545
+ return down_states
546
+
547
+
548
+ class VietnameseLayer(nn.Module):
549
+ def __init__(
550
+ self,
551
+ config: VietnameseConfig,
552
+ pack_qkv=None,
553
+ use_memory_efficient_attention=None,
554
+ attn_implementation=None
555
+ ):
556
+ super().__init__()
557
+ if attn_implementation is None:
558
+ attn_implementation = config._attn_implementation
559
+ if use_memory_efficient_attention is None:
560
+ use_memory_efficient_attention = config.use_memory_efficient_attention
561
+ if use_memory_efficient_attention:
562
+ if attn_implementation != 'eager':
563
+ logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
564
+ attn_implementation = 'eager'
565
+ self.attention = Vietnamese_ATTENTION_CLASSES[attn_implementation](
566
+ config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
567
+ )
568
+ self.mlp = VietnameseGatedMLP(config)
569
+
570
+ ln_class = LAYER_NORM[config.layer_norm_type]
571
+ self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
572
+ self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
573
+
574
+ if config.hidden_dropout_prob > 0:
575
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
576
+ else:
577
+ self.hidden_dropout = None
578
+
579
+ def forward(
580
+ self,
581
+ hidden_states: torch.Tensor,
582
+ attention_bias: torch.FloatTensor,
583
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
584
+ padding_inputs: Optional[Tuple] = None,
585
+ attention_scale: Optional[torch.FloatTensor] = None,
586
+ subset_indices: Optional[torch.LongTensor] = None,
587
+ head_mask: Optional[torch.FloatTensor] = None,
588
+ output_attentions: Optional[bool] = False,
589
+ qkv_inputs: Optional[Tuple] = None,
590
+ ) -> Tuple[torch.Tensor, ...]:
591
+ residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
592
+ attention_outputs = self.attention(
593
+ hidden_states,
594
+ attention_bias,
595
+ rope_embeds,
596
+ padding_inputs,
597
+ attention_scale,
598
+ head_mask,
599
+ output_attentions=output_attentions,
600
+ qkv_inputs=qkv_inputs,
601
+ )
602
+ hidden_states = attention_outputs[0]
603
+ if self.hidden_dropout is not None:
604
+ hidden_states = self.hidden_dropout(hidden_states)
605
+ hidden_states = residual + hidden_states
606
+
607
+ if subset_indices is not None:
608
+ hidden_states = hidden_states[subset_indices]
609
+
610
+ hidden_states = self.attn_ln(hidden_states)
611
+
612
+ residual = hidden_states
613
+ hidden_states = self.mlp(hidden_states)
614
+ if self.hidden_dropout is not None:
615
+ hidden_states = self.hidden_dropout(hidden_states)
616
+ hidden_states = residual + hidden_states
617
+ hidden_states = self.mlp_ln(hidden_states)
618
+
619
+ outputs = (hidden_states,) + attention_outputs[1:]
620
+ return outputs
621
+
622
+
623
+ class VietnameseEncoder(nn.Module):
624
+ def __init__(self, config):
625
+ super().__init__()
626
+ self.config = config
627
+ self.layer = nn.ModuleList([VietnameseLayer(config) for _ in range(config.num_hidden_layers)])
628
+ self.gradient_checkpointing = False
629
+
630
+ def forward(
631
+ self,
632
+ hidden_states: torch.Tensor,
633
+ attention_bias: Optional[torch.FloatTensor] = None,
634
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
635
+ padding_inputs: Optional[Tuple] = None,
636
+ attention_scale: Optional[torch.FloatTensor] = None,
637
+ subset_indices: Optional[torch.LongTensor] = None,
638
+ head_mask: Optional[torch.FloatTensor] = None,
639
+ output_attentions: Optional[bool] = False,
640
+ output_hidden_states: Optional[bool] = False,
641
+ return_dict: Optional[bool] = True,
642
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
643
+ all_hidden_states = () if output_hidden_states else None
644
+ all_self_attentions = () if output_attentions else None
645
+
646
+ for i, layer_module in enumerate(self.layer):
647
+ if output_hidden_states:
648
+ all_hidden_states = all_hidden_states + (hidden_states,)
649
+
650
+ if i >= len(self.layer) - 1:
651
+ layer_subset_indices = subset_indices
652
+ else:
653
+ layer_subset_indices = None
654
+
655
+ layer_head_mask = head_mask[i] if head_mask is not None else None
656
+
657
+ if self.gradient_checkpointing and self.training:
658
+ layer_outputs = self._gradient_checkpointing_func(
659
+ layer_module.__call__,
660
+ hidden_states,
661
+ attention_bias,
662
+ rope_embeds,
663
+ padding_inputs,
664
+ attention_scale,
665
+ layer_subset_indices,
666
+ layer_head_mask,
667
+ )
668
+ else:
669
+ layer_outputs = layer_module(
670
+ hidden_states,
671
+ attention_bias,
672
+ rope_embeds,
673
+ padding_inputs,
674
+ attention_scale,
675
+ layer_subset_indices,
676
+ layer_head_mask,
677
+ output_attentions,
678
+ )
679
+
680
+ hidden_states = layer_outputs[0]
681
+ if output_attentions:
682
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
683
+
684
+ if output_hidden_states:
685
+ all_hidden_states = all_hidden_states + (hidden_states,)
686
+
687
+ if not return_dict:
688
+ return tuple(
689
+ v
690
+ for v in [
691
+ hidden_states,
692
+ all_hidden_states,
693
+ all_self_attentions,
694
+ ]
695
+ if v is not None
696
+ )
697
+ return BaseModelOutput(
698
+ last_hidden_state=hidden_states,
699
+ hidden_states=all_hidden_states,
700
+ attentions=all_self_attentions,
701
+ )
702
+
703
+
704
+ class VietnamesePooler(nn.Module):
705
+ def __init__(self, config):
706
+ super().__init__()
707
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
708
+ self.activation = nn.Tanh()
709
+
710
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
711
+ first_token_tensor = hidden_states[:, 0]
712
+ pooled_output = self.dense(first_token_tensor)
713
+ pooled_output = self.activation(pooled_output)
714
+ return pooled_output
715
+
716
+
717
+ class VietnamesePreTrainedModel(PreTrainedModel):
718
+ """
719
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
720
+ models.
721
+ """
722
+
723
+ config_class = VietnameseConfig
724
+ base_model_prefix = "Vietnamese"
725
+ supports_gradient_checkpointing = True
726
+ _supports_sdpa = True
727
+
728
+ def _init_weights(self, module):
729
+ """Initialize the weights"""
730
+ if isinstance(module, nn.Linear):
731
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
732
+ if module.bias is not None:
733
+ module.bias.data.zero_()
734
+ elif isinstance(module, nn.Embedding):
735
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
736
+ if module.padding_idx is not None:
737
+ module.weight.data[module.padding_idx].zero_()
738
+ elif isinstance(module, nn.LayerNorm):
739
+ module.bias.data.zero_()
740
+ module.weight.data.fill_(1.0)
741
+
742
+
743
+ class VietnameseModel(VietnamesePreTrainedModel):
744
+ """
745
+ The bare Vietnamese Model transformer outputting raw hidden-states without any specific head on top.
746
+ """
747
+
748
+ def __init__(self, config: VietnameseConfig, add_pooling_layer=False):
749
+ super().__init__(config)
750
+ self.config = config
751
+
752
+ self.embeddings = VietnameseEmbeddings(config)
753
+ self.encoder = VietnameseEncoder(config)
754
+
755
+ self.pooler = VietnamesePooler(config) if add_pooling_layer else None
756
+
757
+ self.post_init()
758
+
759
+ def get_input_embeddings(self):
760
+ return self.embeddings.word_embeddings
761
+
762
+ def set_input_embeddings(self, value):
763
+ self.embeddings.word_embeddings = value
764
+
765
+ def forward(
766
+ self,
767
+ input_ids: Optional[torch.Tensor] = None,
768
+ attention_mask: Optional[torch.Tensor] = None,
769
+ length: Optional[List[int]] = None,
770
+ subset_indices: Optional[torch.LongTensor] = None,
771
+ token_type_ids: Optional[torch.Tensor] = None,
772
+ position_ids: Optional[torch.Tensor] = None,
773
+ head_mask: Optional[torch.Tensor] = None,
774
+ inputs_embeds: Optional[torch.Tensor] = None,
775
+ output_attentions: Optional[bool] = None,
776
+ output_hidden_states: Optional[bool] = None,
777
+ return_dict: Optional[bool] = None,
778
+ unpad_inputs: Optional[bool] = None,
779
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
780
+ r"""
781
+ length (`list` of length `batch_size`, *optional*):
782
+ If `None`, return the padded `last_hidden_state`.
783
+ subset_indices (`torch.LongTensor`, *optional*):
784
+ Indices used to select a subset of token positions from the last encoder layer's hidden states (e.g. only the labelled positions when computing a masked-LM loss).
785
+ unpad_inputs (`bool`, *optional*):
786
+ Whether to unpad (pack) the inputs before running the encoder; defaults to `config.unpad_inputs`.
787
+ """
788
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
789
+ output_hidden_states = (
790
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
791
+ )
792
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
793
+ unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
794
+ output_padded = length is None
795
+
796
+ if input_ids is not None and inputs_embeds is not None:
797
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
798
+ elif input_ids is not None:
799
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
800
+ input_shape = input_ids.size()
801
+ elif inputs_embeds is not None:
802
+ input_shape = inputs_embeds.size()[:-1]
803
+ else:
804
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
805
+
806
+ (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
807
+ unpad_inputs,
808
+ input_ids=input_ids,
809
+ attention_mask=attention_mask,
810
+ length=length,
811
+ token_type_ids=token_type_ids,
812
+ position_ids=position_ids,
813
+ inputs_embeds=inputs_embeds
814
+ )
815
+
816
+ batch_size, seq_length = input_shape
817
+ if unpad_inputs and self.config.use_memory_efficient_attention:
818
+ attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
819
+ else:
820
+ attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
821
+ if self.config.use_memory_efficient_attention:
822
+ attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
823
+
824
+ padding_inputs = None
825
+ if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
826
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
827
+ if not self.config.use_memory_efficient_attention:
828
+ padding_inputs = (indices, *input_shape)
829
+
830
+ attention_scale = None
831
+ if self.config.logn_attention_scale:
832
+ logger.warning_once("TODO: logn_attention_scale")
833
+
834
+ encoder_outputs = self.encoder(
835
+ embedding_output,
836
+ attention_bias=attention_bias,
837
+ rope_embeds=rope_embeds,
838
+ padding_inputs=padding_inputs,
839
+ attention_scale=attention_scale,
840
+ subset_indices=subset_indices,
841
+ head_mask=head_mask,
842
+ output_attentions=output_attentions,
843
+ output_hidden_states=output_hidden_states,
844
+ return_dict=return_dict,
845
+ )
846
+ sequence_output = encoder_outputs[0]
847
+ if unpad_inputs and output_padded:
848
+ sequence_output = pad_input(
849
+ sequence_output.squeeze(), indices, batch_size, seq_length
850
+ )
851
+
852
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
853
+
854
+ if not return_dict:
855
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
856
+
857
+ return BaseModelOutputWithPooling(
858
+ last_hidden_state=sequence_output,
859
+ pooler_output=pooled_output,
860
+ hidden_states=encoder_outputs.hidden_states,
861
+ attentions=encoder_outputs.attentions,
862
+ )
863
+
864
+
865
+ class VietnameseLMPredictionHead(nn.Module):
866
+ def __init__(self, config):
867
+ super().__init__()
868
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
869
+ self.transform_act_fn = ACT2FN[config.hidden_act]
870
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
871
+
872
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
873
+
874
+ def forward(self, hidden_states):
875
+ hidden_states = self.dense(hidden_states)
876
+ hidden_states = self.transform_act_fn(hidden_states)
877
+ hidden_states = self.norm(hidden_states)
878
+ hidden_states = self.decoder(hidden_states)
879
+ return hidden_states
880
+
881
+
882
+ class VietnameseForMaskedLM(VietnamesePreTrainedModel):
883
+ _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
884
+
885
+ def __init__(self, config: VietnameseConfig):
886
+ super().__init__(config)
887
+ self.Vietnamese = VietnameseModel(config, add_pooling_layer=False)
888
+ self.lm_head = VietnameseLMPredictionHead(config)
889
+ self.loss_fct = nn.CrossEntropyLoss()
890
+
891
+ self.post_init()
892
+
893
+ def get_output_embeddings(self):
894
+ return self.lm_head.decoder
895
+
896
+ def set_output_embeddings(self, new_embeddings):
897
+ self.lm_head.decoder = new_embeddings
898
+
899
+ def forward(
900
+ self,
901
+ input_ids: Optional[torch.Tensor] = None,
902
+ attention_mask: Optional[torch.Tensor] = None,
903
+ token_type_ids: Optional[torch.Tensor] = None,
904
+ position_ids: Optional[torch.Tensor] = None,
905
+ head_mask: Optional[torch.Tensor] = None,
906
+ inputs_embeds: Optional[torch.Tensor] = None,
907
+ labels: Optional[torch.Tensor] = None,
908
+ output_attentions: Optional[bool] = None,
909
+ output_hidden_states: Optional[bool] = None,
910
+ return_dict: Optional[bool] = None,
911
+ unpad_inputs: Optional[bool] = None,
912
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
913
+ r"""
914
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
915
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
916
+ config.vocab_size]` (see the `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the
917
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
918
+ """
919
+
920
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
921
+
922
+ if labels is None or not self.Vietnamese.config.unpad_inputs:
923
+ length = None
924
+ subset_indices = None
925
+ else:
926
+ length = attention_mask.sum(-1).tolist()
927
+ labels = labels[attention_mask.bool()].unsqueeze(0)
928
+ subset_indices = labels > -100
929
+
930
+ outputs = self.Vietnamese(
931
+ input_ids,
932
+ attention_mask=attention_mask,
933
+ length=length,
934
+ subset_indices=subset_indices,
935
+ token_type_ids=token_type_ids,
936
+ position_ids=position_ids,
937
+ head_mask=head_mask,
938
+ inputs_embeds=inputs_embeds,
939
+ output_attentions=output_attentions,
940
+ output_hidden_states=output_hidden_states,
941
+ return_dict=return_dict,
942
+ unpad_inputs=unpad_inputs,
943
+ )
944
+
945
+ sequence_output = outputs[0]
946
+ prediction_scores = self.lm_head(sequence_output)
947
+
948
+ masked_lm_loss = None
949
+ if labels is not None:
950
+ if subset_indices is None:
951
+ mask = attention_mask.bool()
952
+ prediction_scores = prediction_scores[mask]
953
+ labels = labels[mask]
954
+ else:
955
+ labels = labels[subset_indices]
956
+ masked_lm_loss = self.loss_fct(prediction_scores, labels)
957
+
958
+ if not return_dict:
959
+ output = (prediction_scores,) + outputs[2:]
960
+ return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
961
+
962
+ return MaskedLMOutput(
963
+ loss=masked_lm_loss,
964
+ logits=prediction_scores,
965
+ hidden_states=outputs.hidden_states,
966
+ attentions=outputs.attentions,
967
+ )
968
+
969
+
970
+ class VietnameseForSequenceClassification(VietnamesePreTrainedModel):
971
+ def __init__(self, config):
972
+ super().__init__(config)
973
+ self.num_labels = config.num_labels
974
+ self.config = config
975
+
976
+ self.Vietnamese = VietnameseModel(config, add_pooling_layer=True)
977
+ classifier_dropout = (
978
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
979
+ )
980
+ self.dropout = nn.Dropout(classifier_dropout)
981
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
982
+
983
+ self.post_init()
984
+
985
+ def forward(
986
+ self,
987
+ input_ids: Optional[torch.Tensor] = None,
988
+ attention_mask: Optional[torch.Tensor] = None,
989
+ token_type_ids: Optional[torch.Tensor] = None,
990
+ position_ids: Optional[torch.Tensor] = None,
991
+ head_mask: Optional[torch.Tensor] = None,
992
+ inputs_embeds: Optional[torch.Tensor] = None,
993
+ labels: Optional[torch.Tensor] = None,
994
+ output_attentions: Optional[bool] = None,
995
+ output_hidden_states: Optional[bool] = None,
996
+ return_dict: Optional[bool] = None,
997
+ unpad_inputs: Optional[bool] = None,
998
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
999
+ r"""
1000
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1001
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1002
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
1003
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1004
+ """
1005
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1006
+
1007
+ outputs = self.Vietnamese(
1008
+ input_ids,
1009
+ attention_mask=attention_mask,
1010
+ token_type_ids=token_type_ids,
1011
+ position_ids=position_ids,
1012
+ head_mask=head_mask,
1013
+ inputs_embeds=inputs_embeds,
1014
+ output_attentions=output_attentions,
1015
+ output_hidden_states=output_hidden_states,
1016
+ return_dict=return_dict,
1017
+ unpad_inputs=unpad_inputs,
1018
+ )
1019
+
1020
+ pooled_output = outputs[1]
1021
+
1022
+ pooled_output = self.dropout(pooled_output)
1023
+ logits = self.classifier(pooled_output)
1024
+
1025
+ loss = None
1026
+ if labels is not None:
1027
+ if self.config.problem_type is None:
1028
+ if self.num_labels == 1:
1029
+ self.config.problem_type = "regression"
1030
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1031
+ self.config.problem_type = "single_label_classification"
1032
+ else:
1033
+ self.config.problem_type = "multi_label_classification"
1034
+
1035
+ if self.config.problem_type == "regression":
1036
+ loss_fct = nn.MSELoss()
1037
+ if self.num_labels == 1:
1038
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1039
+ else:
1040
+ loss = loss_fct(logits, labels)
1041
+ elif self.config.problem_type == "single_label_classification":
1042
+ loss_fct = nn.CrossEntropyLoss()
1043
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1044
+ elif self.config.problem_type == "multi_label_classification":
1045
+ loss_fct = nn.BCEWithLogitsLoss()
1046
+ loss = loss_fct(logits, labels)
1047
+
1048
+ if not return_dict:
1049
+ output = (logits,) + outputs[2:]
1050
+ return ((loss,) + output) if loss is not None else output
1051
+
1052
+ return SequenceClassifierOutput(
1053
+ loss=loss,
1054
+ logits=logits,
1055
+ hidden_states=outputs.hidden_states,
1056
+ attentions=outputs.attentions,
1057
+ )
1058
+
1059
+
1060
+ class VietnameseForMultipleChoice(VietnamesePreTrainedModel):
1061
+ def __init__(self, config):
1062
+ super().__init__(config)
1063
+
1064
+ self.Vietnamese = VietnameseModel(config, add_pooling_layer=True)
1065
+ classifier_dropout = (
1066
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1067
+ )
1068
+ self.dropout = nn.Dropout(classifier_dropout)
1069
+ self.classifier = nn.Linear(config.hidden_size, 1)
1070
+
1071
+ self.post_init()
1072
+
1073
+ def forward(
1074
+ self,
1075
+ input_ids: Optional[torch.Tensor] = None,
1076
+ attention_mask: Optional[torch.Tensor] = None,
1077
+ token_type_ids: Optional[torch.Tensor] = None,
1078
+ position_ids: Optional[torch.Tensor] = None,
1079
+ head_mask: Optional[torch.Tensor] = None,
1080
+ inputs_embeds: Optional[torch.Tensor] = None,
1081
+ labels: Optional[torch.Tensor] = None,
1082
+ output_attentions: Optional[bool] = None,
1083
+ output_hidden_states: Optional[bool] = None,
1084
+ return_dict: Optional[bool] = None,
1085
+ unpad_inputs: Optional[bool] = None,
1086
+ ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
1087
+ r"""
1088
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1089
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1090
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1091
+ `input_ids` above)
1092
+ """
1093
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1094
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1095
+
1096
+ input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1097
+ attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1098
+ token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1099
+ position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1100
+ inputs_embeds = (
1101
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1102
+ if inputs_embeds is not None
1103
+ else None
1104
+ )
1105
+
1106
+ outputs = self.Vietnamese(
1107
+ input_ids,
1108
+ attention_mask=attention_mask,
1109
+ token_type_ids=token_type_ids,
1110
+ position_ids=position_ids,
1111
+ head_mask=head_mask,
1112
+ inputs_embeds=inputs_embeds,
1113
+ output_attentions=output_attentions,
1114
+ output_hidden_states=output_hidden_states,
1115
+ return_dict=return_dict,
1116
+ unpad_inputs=unpad_inputs,
1117
+ )
1118
+
1119
+ pooled_output = outputs[1]
1120
+
1121
+ pooled_output = self.dropout(pooled_output)
1122
+ logits = self.classifier(pooled_output)
1123
+ reshaped_logits = logits.view(-1, num_choices)
1124
+
1125
+ loss = None
1126
+ if labels is not None:
1127
+ loss_fct = nn.CrossEntropyLoss()
1128
+ loss = loss_fct(reshaped_logits, labels)
1129
+
1130
+ if not return_dict:
1131
+ output = (reshaped_logits,) + outputs[2:]
1132
+ return ((loss,) + output) if loss is not None else output
1133
+
1134
+ return MultipleChoiceModelOutput(
1135
+ loss=loss,
1136
+ logits=reshaped_logits,
1137
+ hidden_states=outputs.hidden_states,
1138
+ attentions=outputs.attentions,
1139
+ )
1140
+
1141
+
1142
+ @dataclass
1143
+ class VietnameseTokenClassifierOutput(ModelOutput):
1144
+ loss: Optional[torch.FloatTensor] = None
1145
+ logits: torch.FloatTensor = None
1146
+ last_hidden_state: torch.FloatTensor = None
1147
+ hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
1148
+ attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
1149
+
1150
+
1151
+ class VietnameseForTokenClassification(VietnamesePreTrainedModel):
1152
+ def __init__(self, config):
1153
+ super().__init__(config)
1154
+ self.num_labels = config.num_labels
1155
+
1156
+ self.Vietnamese = VietnameseModel(config, add_pooling_layer=False)
1157
+ classifier_dropout = (
1158
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1159
+ )
1160
+ self.dropout = nn.Dropout(classifier_dropout)
1161
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1162
+
1163
+ self.post_init()
1164
+
1165
+ def forward(
1166
+ self,
1167
+ input_ids: Optional[torch.Tensor] = None,
1168
+ attention_mask: Optional[torch.Tensor] = None,
1169
+ token_type_ids: Optional[torch.Tensor] = None,
1170
+ position_ids: Optional[torch.Tensor] = None,
1171
+ head_mask: Optional[torch.Tensor] = None,
1172
+ inputs_embeds: Optional[torch.Tensor] = None,
1173
+ labels: Optional[torch.Tensor] = None,
1174
+ output_attentions: Optional[bool] = None,
1175
+ output_hidden_states: Optional[bool] = None,
1176
+ return_dict: Optional[bool] = None,
1177
+ unpad_inputs: Optional[bool] = None,
1178
+ ) -> Union[Tuple[torch.Tensor], VietnameseTokenClassifierOutput]:
1179
+ r"""
1180
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1181
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1182
+ """
1183
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1184
+
1185
+ outputs = self.Vietnamese(
1186
+ input_ids,
1187
+ attention_mask=attention_mask,
1188
+ token_type_ids=token_type_ids,
1189
+ position_ids=position_ids,
1190
+ head_mask=head_mask,
1191
+ inputs_embeds=inputs_embeds,
1192
+ output_attentions=output_attentions,
1193
+ output_hidden_states=output_hidden_states,
1194
+ return_dict=return_dict,
1195
+ unpad_inputs=unpad_inputs,
1196
+ )
1197
+
1198
+ sequence_output = outputs[0]
1199
+
1200
+ sequence_output = self.dropout(sequence_output)
1201
+ logits = self.classifier(sequence_output)
1202
+
1203
+ loss = None
1204
+ if labels is not None:
1205
+ loss_fct = nn.CrossEntropyLoss()
1206
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1207
+
1208
+ if not return_dict:
1209
+ output = (logits,) + outputs[2:]
1210
+ return ((loss,) + output) if loss is not None else output
1211
+
1212
+ return VietnameseTokenClassifierOutput(
1213
+ loss=loss,
1214
+ logits=logits,
1215
+ last_hidden_state=sequence_output,
1216
+ hidden_states=outputs.hidden_states,
1217
+ attentions=outputs.attentions,
1218
+ )
1219
+
1220
+
1221
+ class VietnameseForQuestionAnswering(VietnamesePreTrainedModel):
1222
+ def __init__(self, config):
1223
+ super().__init__(config)
1224
+ self.num_labels = config.num_labels
1225
+
1226
+ self.Vietnamese = VietnameseModel(config, add_pooling_layer=False)
1227
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1228
+
1229
+ self.post_init()
1230
+
1231
+ def forward(
1232
+ self,
1233
+ input_ids: Optional[torch.Tensor] = None,
1234
+ attention_mask: Optional[torch.Tensor] = None,
1235
+ token_type_ids: Optional[torch.Tensor] = None,
1236
+ position_ids: Optional[torch.Tensor] = None,
1237
+ head_mask: Optional[torch.Tensor] = None,
1238
+ inputs_embeds: Optional[torch.Tensor] = None,
1239
+ start_positions: Optional[torch.Tensor] = None,
1240
+ end_positions: Optional[torch.Tensor] = None,
1241
+ output_attentions: Optional[bool] = None,
1242
+ output_hidden_states: Optional[bool] = None,
1243
+ return_dict: Optional[bool] = None,
1244
+ unpad_inputs: Optional[bool] = None,
1245
+ ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
1246
+ r"""
1247
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1248
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1249
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1250
+ are not taken into account for computing the loss.
1251
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1252
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1253
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1254
+ are not taken into account for computing the loss.
1255
+ """
1256
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1257
+
1258
+ outputs = self.Vietnamese(
1259
+ input_ids,
1260
+ attention_mask=attention_mask,
1261
+ token_type_ids=token_type_ids,
1262
+ position_ids=position_ids,
1263
+ head_mask=head_mask,
1264
+ inputs_embeds=inputs_embeds,
1265
+ output_attentions=output_attentions,
1266
+ output_hidden_states=output_hidden_states,
1267
+ return_dict=return_dict,
1268
+ unpad_inputs=unpad_inputs,
1269
+ )
1270
+
1271
+ sequence_output = outputs[0]
1272
+
1273
+ logits = self.qa_outputs(sequence_output)
1274
+ start_logits, end_logits = logits.split(1, dim=-1)
1275
+ start_logits = start_logits.squeeze(-1).contiguous()
1276
+ end_logits = end_logits.squeeze(-1).contiguous()
1277
+
1278
+ total_loss = None
1279
+ if start_positions is not None and end_positions is not None:
1280
+ if len(start_positions.size()) > 1:
1281
+ start_positions = start_positions.squeeze(-1)
1282
+ if len(end_positions.size()) > 1:
1283
+ end_positions = end_positions.squeeze(-1)
1284
+ ignored_index = start_logits.size(1)
1285
+ start_positions = start_positions.clamp(0, ignored_index)
1286
+ end_positions = end_positions.clamp(0, ignored_index)
1287
+
1288
+ loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
1289
+ start_loss = loss_fct(start_logits, start_positions)
1290
+ end_loss = loss_fct(end_logits, end_positions)
1291
+ total_loss = (start_loss + end_loss) / 2
1292
+
1293
+ if not return_dict:
1294
+ output = (start_logits, end_logits) + outputs[2:]
1295
+ return ((total_loss,) + output) if total_loss is not None else output
1296
+
1297
+ return QuestionAnsweringModelOutput(
1298
+ loss=total_loss,
1299
+ start_logits=start_logits,
1300
+ end_logits=end_logits,
1301
+ hidden_states=outputs.hidden_states,
1302
+ attentions=outputs.attentions,
1303
+ )
1304
+
1305
+
1306
+
1307
+
1308
+ def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
1309
+ """
1310
+ Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
1311
+ are ignored. This is modified from fairseq's `utils.make_positions`.
1312
+ Args:
1313
+ input_ids (`torch.Tensor`), padding_idx (`int`), past_key_values_length (`int`, defaults to 0)
1314
+ Returns: torch.Tensor
1315
+ """
1316
+ # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
1317
+ mask = input_ids.ne(padding_idx).int()
1318
+ incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
1319
+ return incremental_indices.long() + padding_idx
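
The position-id helper above follows the RoBERTa convention: non-padding tokens are numbered starting at padding_idx + 1, while padding tokens keep padding_idx as their position id. A minimal worked sketch (the token ids and padding_idx below are illustrative assumptions, not values taken from this model's tokenizer):

import torch

input_ids = torch.tensor([[5, 7, 2, 1, 1]])   # hypothetical batch; the last two tokens are padding
padding_idx = 1

mask = input_ids.ne(padding_idx).int()                         # [[1, 1, 1, 0, 0]]
incremental = torch.cumsum(mask, dim=1).type_as(mask) * mask   # [[1, 2, 3, 0, 0]] (past_key_values_length = 0)
position_ids = incremental.long() + padding_idx                # [[2, 3, 4, 1, 1]]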
modules.json ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
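
modules.json wires the checkpoint into a three-stage sentence-transformers pipeline: a Transformer module that produces token embeddings, a Pooling module loaded from 1_Pooling that collapses them into a single sentence vector, and a Normalize module that L2-normalises the result. A minimal sketch of the equivalent computation in plain PyTorch, assuming CLS-token pooling for the middle stage (the pooling strategy is an assumption here, not something stated in this file):

import torch
import torch.nn.functional as F

def pool_and_normalize(token_embeddings: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, hidden) output of the Transformer module
    sentence_embeddings = token_embeddings[:, 0]          # Pooling: take the first ([CLS]) token (assumed strategy)
    return F.normalize(sentence_embeddings, p=2, dim=-1)  # Normalize: scale each vector to unit L2 norm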
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "max_seq_length": 8192,
3
+ "do_lower_case": false
4
+ }
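
sentence_bert_config.json caps the sequence length at 8192 tokens and keeps the original casing. A minimal usage sketch with the sentence-transformers API; the local path and the example sentences are placeholders, and trust_remote_code=True is assumed to be required because the checkpoint ships custom modeling code:

from sentence_transformers import SentenceTransformer, util

# Hypothetical local checkpoint directory; replace with the actual model path or hub id.
model = SentenceTransformer("./vietnamese-embedding-checkpoint", trust_remote_code=True)
print(model.max_seq_length)  # 8192, taken from sentence_bert_config.json

sentences = [
    "Hà Nội là thủ đô của Việt Nam.",          # placeholder sentences
    "Thủ đô của Việt Nam là thành phố nào?",
]
embeddings = model.encode(sentences)
print(util.cos_sim(embeddings[0], embeddings[1]))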