Sentence Similarity
sentence-transformers
Safetensors
Vietnamese
feature-extraction
dense
Generated from Trainer
dataset_size:6765
loss:TripletLoss
custom_code
How to use TTHDZ/finetuned_vietnamese-document-embedding with sentence-transformers:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("TTHDZ/finetuned_vietnamese-document-embedding", trust_remote_code=True) sentences = [ "Có bao nhiêu nhóm chỉ tiêu cấp điện chính được quy định", "Public_521\n|<image_4>|Ví dụ:\n\n_Hình 1.12: Các loại nét vẽ_", "Public_024\nCÁC CHỈ TIÊU CHẤT LƯỢNG CỦA THỰC PHẨM\nChất lượng thực phẩm được đánh giá dựa trên nhiều chỉ tiêu. Có thể phân loại theo hai cách:", "Public_049\nKiểu và cấu trúc dữ liệu\nKiểu dữ liệu\n**Kiểu dữ liệu ( _a data type_ )** là một tên hay từ khóa dùng để chỉ tập các đối tượng dữ liệu cùng các phép toán trên nó. Ví dụ trong C++, từ khóa _int_ dùng để chỉ tập các số nguyên có độ lớn biểu diễn bằng 2 byte (tùy thuộc vào các compiler) cùng với các phép toán số học, các phép toán so sánh, các phép toán cấp bít, các phép toán dịch chuyển bit. Từ khóa _float_ dùng để chỉ tập các số thực có độ chính xác đơn có độ lớn được biểu diễn bằng 4 byte (tùy thuộc vào các compiler) cùng với các phép toán số học, các phép toán so sánh. Không có phép lấy phần dư, các phép toán thao tác cấp bít với kiểu dữ liệu _float_. Kiểu dữ liệu được chia thành hai loại kiểu dữ liệu cơ bản hay còn gọi là kiểu dữ liệu nguyên thủy và các kiểu dữ liệu do người dùng định nghĩa.\n**Kiểu dữ liệu nguyên thủy** ( _primitive data types_ ) các kiểu dữ liệu được định nghĩa bởi hệ thống ( _system defined data type_ ) được gọi là các kiểu dữ liệu nguyên thủy. Thông thường, các ngôn ngữ lập trình cung cấp ba kiểu dữ liệu nguyên thủy đó là ký tự ( _character_ ), số ( _numeric_ ), và kiểu logic ( _bool_ ). Kiểu dữ liệu ký tự được chia thành hai loại ký tự ASCII ( _char_ ) và ký tự unicode ( _wchar_t_ ). Kiểu dữ liệu số cũng được chia thành hai loại: số kiểu số nguyên ( _integer_ ) và kiểu số thực ( _real_ ). Kiểu số nguyên được chia thành ba loại: số nguyên nhỏ ( _int_ ), số nguyên lớn ( _long_ ), số nguyên rất lớn ( _long long_ ). 
Kiểu số thực được chia làm hai loại: số thực có độ chính xác đơn ( _float_ ) và số thực có độ chính xác kép ( _double_ ). Dữ liệu kiểu bool chỉ định nghĩa bộ hai giá trị đúng ( _true_ ) và sai ( _false_ ).\nĐương nhiên, hai từ khóa khác nhau đại diện cho hai kiểu dữ liệu khác nhau. Quan sát này chỉ mang tính hình thức vì ta có thể quan sát được bằng mắt. Sự khác biệt bên trong giữa các kiểu dữ liệu là không gian bộ nhớ dùng để biểu diễn kiểu và các phép toán dành cho mỗi biến thuộc kiểu. Không gian nhớ dành cho kiểu phụ thuộc vào compiler của ngôn ngữ lập trình và hệ thống máy tính ta đang sử dụng. Chẳng hạn, kiểu dữ liệu _int_ một số compiler dùng 2 byte biểu diễn, một số compiler dùng 4 byte để biểu diễn. Các phép toán lấy phần dư ( _modulo_ ), dịch chuyển bít ( _bit operations_ ) định nghĩa cho các số _int_ , _long_ nhưng không định nghĩa cho các số _float_ và _double_. Để xác định độ lớn của kiểu ta có thể sử dụng hàm _sizeof_ ( _tên kiểu_ ). Ví dụ dưới đây dùng để xác định không gian nhớ dành cho kiểu.\n<table>\n<colgroup>\n<col/>\n</colgroup>\n<tbody>\n<tr>\n<td><p>//<strong>Ví dụ 1.1.</strong> Xác định kích cỡ bộ nhớ biểu diễn\nkiểu</p>\n<p>#include <iostream> using namespace std; int main(void){</p>\n<p>cout<<\"KÍCH CỠ KIỂU CƠ BẢN\"<<endl; cout<<\"Kích cỡ\nkiểu bool:\"<<sizeof(bool)<<endl;</p>\n<p>cout<<\" Kích cỡ kiểu\nchar:\"<<sizeof(char)<<endl;</p>\n<p>cout<<\" Kích cỡ kiểu\nwchar_t:\"<<sizeof(wchar_t)<<endl; cout<<\" Kích cỡ kiểu\nint:\"<<sizeof(int)<<endl; cout<<\" Kích cỡ kiểu\nlong:\"<<sizeof(long)<<endl; cout<<\" Kích cỡ kiểu long\nlong:\"<<sizeof(long long)<<endl;</p>\n<p>cout<<\" Kích cỡ kiểu float:\"<<sizeof(float)<<endl;\ncout<<\" Kích cỡ kiểu\ndouble:\"<<sizeof(double)<<endl;</p>\n<p>}</p></td>\n</tr>\n</tbody>\n</table> \n**Kiểu dữ liệu do người dùng định nghĩa** ( _user defined data types_ ) là các kiểu dữ liệu được do người dùng xây dựng bằng cách tổ hợp các kiểu dữ liệu nguyên thủy theo một nguyên tắc nào đó. 
Chẳng hạn, kiểu mảng ( _array_ ) là dãy có thứ tự các phần tử ( _các biến_ ) có cùng chung một kiểu dữ liệu được tổ chức liên tục nhau trong bộ nhớ. Kiếu xâu ký tự (string) là một mảng mỗi phần tử là một ký tự và có ký tự kết thúc là ‘\\0’. Như vậy, các kiểu dữ liệu không thuộc các kiểu dữ liệu nguyên thủy như mảng, cấu trúc, file đều được xem là các kiểu dữ liệu do người dùng định nghĩa.\n**Cấu trúc dữ liệu ( _data structure_ )** là phương pháp biểu diễn các đối tượng ở thế giới thực thành một đối tượng dữ liệu được tổ chức và lưu trữ trong máy tính để có thể sử lý một cách hiệu quả. Theo nghĩa này, mảng ( _array_ ), danh sách liên kết ( _linked list_ ), ngăn xếp ( _stack_ ), hàng đợi ( _queue_ ), cây ( _tree_ ), đồ thị ( _graph_ )… đều được gọi là các cấu trúc dữ liệu. Dựa vào biểu diễn của các cấu trúc dữ liệu, khoa học máy tính chia các cấu trúc dữ liệu thành hai loại: các cấu trúc dữ liệu tuyến tính ( _linear data structures_ ) và các cấu trúc dữ liệu không tuyến tính ( _non-linear data structures_ ). Một cấu trúc dữ liệu được gọi là tuyến tính nếu việc truy cập các phần tử được thực hiện tuần tự nhưng không nhất thiết được tổ chức liên tục. Điều này có nghĩa, các cấu trúc dữ liệu mảng, danh sách liên kết đơn, danh sách liên kết kép đều là các cấu trúc dữ liệu tuyến tính. Một cấu trúc dữ liệu được gọi là không tuyến tính nếu các phần tử của nó được tổ chức và truy cập không tuần tự. Theo nghĩa này, các cấu trúc dữ liệu cây, graph đều là các cấu trúc dữ liệu không tuyến tính.\n**Cấu trúc dữ liệu trừu tượng ( _Abstract Data types: ADTs_ )** là phương pháp kết hợp giữa cấu trúc dữ liệu cùng với các phép toán trên dữ liệu cụ thể của cấu trúc dữ liệu. 
Như vậy, mỗi kiểu dữ liệu ADTs bao gồm hai thành phần:\n * _Biểu diễn cấu trúc dữ liệu_.\n * _Xây dựng các phép toán trên dữ liệu_ _cụ thể_ _của cấu trúc dữ liệu._\nTheo nghĩa này các cấu trúc dữ liệu danh sách liên kết ( _linked list_ ), ngăn xếp ( _stack_ ), hàng đợi ( _queue_ ), hàng đợi ưu tiên ( _priority queue_ ), cây nhị phân ( _binary tree_ ), đồ thị ( _graph_ ) đều là _ADTs_. Mỗi cấu trúc dữ liệu cụ thể cùng các thao tác trên nó sẽ được trình bày trong những chương tiếp theo của tài liệu.\nĐối với mỗi cấu trúc dữ liệu trừu tượng, ta cần quan tâm và nắm bắt được những vấn đề sau:\n * **Định nghĩa** : nhằm xác định rõ cấu trúc dữ liệu ADTs ta đang quan tâm đến là gì.\n * **Biểu diễn** : nhằm định hình nên cấu trúc dữ liệu ADTs.\n * **Thao tác (phép toán)** : những thao tác và phép toán nào được cài đặt trên cấu trúc dữ liệu ADTs.\n * **Ứng dụng** : sử dụng cấu trúc dữ liệu ADTs để giải quyết lớp những bài toán nào trong khoa học máy tính."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [4, 4]
Add new SentenceTransformer model
- .gitattributes +1 -0
- 1_Pooling/config.json +10 -0
- README.md +0 -0
- config.json +49 -0
- config_sentence_transformers.json +14 -0
- configuration.py +114 -0
- model.safetensors +3 -0
- modeling.py +1319 -0
- modules.json +20 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +51 -0
- tokenizer.json +3 -0
- tokenizer_config.json +62 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": true,
+  "pooling_mode_mean_tokens": false,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
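The pooling settings in this file enable only `pooling_mode_cls_token`, which suggests the sentence embedding is taken from the first token's hidden state rather than from an average over tokens. The following is a minimal pure-Python sketch of that difference. The helper names `cls_pool` and `mean_pool` and the toy 3-dimensional vectors are illustrative assumptions, not part of this repository (the configured vectors are 768-dimensional).

```python
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[Vector]) -> Vector:
    """Return the first token's vector (what pooling_mode_cls_token selects)."""
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of non-padding tokens (disabled in this config)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: a leading [CLS]-style token, two word tokens, one padding position.
tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0], [5.0, 2.0, 1.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pool(tokens)
mean_vec = mean_pool(tokens, mask)
print(cls_vec)
print(mean_vec)
```

With CLS pooling the padding mask plays no role in the pooling step itself, because only position 0 is read.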
README.md
ADDED
The diff for this file is too large to render.
config.json
ADDED
@@ -0,0 +1,49 @@
+{
+  "architectures": [
+    "VietnameseModel"
+  ],
+  "attention_probs_dropout_prob": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration.VietnameseConfig",
+    "AutoModel": "modeling.VietnameseModel",
+    "AutoModelForMaskedLM": "dangvantuan/Vietnamese_impl--modeling.VietnameseForMaskedLM",
+    "AutoModelForMultipleChoice": "dangvantuan/Vietnamese_impl--modeling.VietnameseForMultipleChoice",
+    "AutoModelForQuestionAnswering": "dangvantuan/Vietnamese_impl--modeling.VietnameseForQuestionAnswering",
+    "AutoModelForSequenceClassification": "dangvantuan/Vietnamese_impl--modeling.VietnameseForSequenceClassification",
+    "AutoModelForTokenClassification": "dangvantuan/Vietnamese_impl--modeling.VietnameseForTokenClassification"
+  },
+  "classifier_dropout": 0.0,
+  "dtype": "float32",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "layer_norm_type": "layer_norm",
+  "logn_attention_clip1": false,
+  "logn_attention_scale": false,
+  "max_position_embeddings": 8192,
+  "model_type": "Vietnamese",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pack_qkv": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "rope",
+  "rope_scaling": {
+    "factor": 8.0,
+    "type": "ntk"
+  },
+  "rope_theta": 20000,
+  "transformers_version": "4.57.1",
+  "type_vocab_size": 1,
+  "unpad_inputs": false,
+  "use_memory_efficient_attention": false,
+  "vocab_size": 250048
+}
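A few quantities follow directly from the values in this config.json. The sketch below derives the per-head dimension, the number of rotary frequency pairs, and the size of the token embedding table. The dictionary copies fields from the diff; the variable names are illustrative, and the sketch assumes rotary embeddings are applied per attention head. How `rope_scaling` (`"ntk"`, factor 8.0) is applied is decided in modeling.py and is not modeled here.

```python
# Values copied from the config.json diff.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 8192,
    "rope_theta": 20000,
    "vocab_size": 250048,
}

# Each attention head works on an equal slice of the hidden state.
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Rotary embeddings rotate dimensions in pairs, so there is one frequency per pair
# (assumption: the rotary dimension equals the per-head dimension).
rotary_pairs = head_dim // 2

# Token embedding table: one hidden_size-wide row per vocabulary entry.
embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim, rotary_pairs, embedding_params)
```

The large vocabulary means the embedding table alone accounts for a substantial share of the weights in a 12-layer encoder of this width.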
config_sentence_transformers.json
ADDED
@@ -0,0 +1,14 @@
+{
+  "__version__": {
+    "sentence_transformers": "5.1.0",
+    "transformers": "4.57.1",
+    "pytorch": "2.7.0+cu126"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "model_type": "SentenceTransformer",
+  "similarity_fn_name": "cosine"
+}
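The `similarity_fn_name` field selects cosine similarity for comparing embeddings. As a rough illustration of what that score measures, here is a pure-Python sketch; the function name `cosine_similarity` and the toy vectors are assumptions for this example and do not reproduce the library's batched implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 2.0, 2.0]
same_direction = [2.0, 4.0, 4.0]   # query scaled by 2
orthogonal = [2.0, -1.0, 0.0]      # dot product with query is 0

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)
print(sim_same, sim_orth)
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.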
configuration.py
ADDED
@@ -0,0 +1,114 @@
+# limitations under the License.
+""" Vietnamese model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+
+class VietnameseConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`VietnameseModel`] or a [`TFVietnameseModel`]. It is used to
+    instantiate a Vietnamese model according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the Vietnamese
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 30522):
+            Vocabulary size of the Vietnamese model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`VietnameseModel`] or [`TFVietnameseModel`].
+        hidden_size (`int`, *optional*, defaults to 768):
+            Dimensionality of the encoder layers and the pooler layer.
+        num_hidden_layers (`int`, *optional*, defaults to 12):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        intermediate_size (`int`, *optional*, defaults to 3072):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+            `"relu"`, `"silu"` and `"gelu_Vietnamese"` are supported.
+        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the attention probabilities.
+        max_position_embeddings (`int`, *optional*, defaults to 512):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        type_vocab_size (`int`, *optional*, defaults to 2):
+            The vocabulary size of the `token_type_ids` passed when calling [`VietnameseModel`] or [`TFVietnameseModel`].
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+            The epsilon used by the layer normalization layers.
+        position_embedding_type (`str`, *optional*, defaults to `"rope"`):
+            Type of position embedding. Choose one of `"absolute"`, `"rope"`.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+            these scaling strategies behave:
+            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+            experimental feature, subject to breaking API changes in future versions.
+        classifier_dropout (`float`, *optional*):
+            The dropout ratio for the classification head.
+    Examples:
+    """
+
+    model_type = "Vietnamese"
+
+    def __init__(
+        self,
+        vocab_size=30528,
+        hidden_size=768,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        intermediate_size=3072,
+        hidden_act="gelu",
+        hidden_dropout_prob=0.1,
+        attention_probs_dropout_prob=0.0,
+        max_position_embeddings=2048,
+        type_vocab_size=1,
+        initializer_range=0.02,
+        layer_norm_type='layer_norm',
+        layer_norm_eps=1e-12,
+        # pad_token_id=0,
+        position_embedding_type="rope",
+        rope_theta=10000.0,
+        rope_scaling=None,
+        classifier_dropout=None,
+        pack_qkv=True,
+        unpad_inputs=False,
+        use_memory_efficient_attention=False,
+        logn_attention_scale=False,
+        logn_attention_clip1=False,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.hidden_act = hidden_act
+        self.intermediate_size = intermediate_size
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.initializer_range = initializer_range
+        self.layer_norm_type = layer_norm_type
+        self.layer_norm_eps = layer_norm_eps
+        self.position_embedding_type = position_embedding_type
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.classifier_dropout = classifier_dropout
+
+        self.pack_qkv = pack_qkv
+        self.unpad_inputs = unpad_inputs
+        self.use_memory_efficient_attention = use_memory_efficient_attention
+        self.logn_attention_scale = logn_attention_scale
+        self.logn_attention_clip1 = logn_attention_clip1
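The constructor defaults in configuration.py differ from several values stored in this repository's config.json, and when a configuration is loaded from disk the stored values are expected to win. The sketch below models that precedence with plain dictionaries instead of the real `VietnameseConfig` class; the dictionary names are illustrative, and only a subset of fields is compared.

```python
# Constructor defaults from VietnameseConfig.__init__ (subset).
constructor_defaults = {
    "vocab_size": 30528,
    "max_position_embeddings": 2048,
    "rope_theta": 10000.0,
    "rope_scaling": None,
    "classifier_dropout": None,
    "hidden_size": 768,
}

# Values stored in this repository's config.json (same subset).
saved_config = {
    "vocab_size": 250048,
    "max_position_embeddings": 8192,
    "rope_theta": 20000,
    "rope_scaling": {"factor": 8.0, "type": "ntk"},
    "classifier_dropout": 0.0,
    "hidden_size": 768,
}

# Saved values take precedence over defaults, as explicit keyword arguments would.
effective = {**constructor_defaults, **saved_config}

# Fields whose effective value differs from the constructor default.
overridden = sorted(
    key for key in constructor_defaults
    if effective[key] != constructor_defaults[key]
)
print(overridden)
```

Reading the defaults alone would therefore understate the vocabulary size and context length this checkpoint is configured for.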
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:961b791b1b5766c399dbba5395713647d6859f636f8498f473f7ab28ed6392d0
+size 1221487872
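The three added lines are a Git LFS pointer rather than the weights themselves: they record the spec version, a SHA-256 object ID, and the size in bytes. A small sketch of reading such a pointer follows. The helper `parse_lfs_pointer` is hypothetical, and the parameter estimate assumes every byte stores float32 weights (the `dtype` given in config.json), so it is only a rough upper bound.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:961b791b1b5766c399dbba5395713647d6859f636f8498f473f7ab28ed6392d0
size 1221487872
"""

pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])
hash_algorithm, _, digest = pointer["oid"].partition(":")

# Rough upper bound: assumes 4 bytes per parameter (float32) and ignores the
# safetensors header, so the true parameter count is slightly lower.
approx_params = size_bytes // 4

print(hash_algorithm, len(digest), size_bytes, approx_params)
```

Under that assumption the file corresponds to roughly 305 million parameters.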
modeling.py
ADDED
@@ -0,0 +1,1319 @@
+"""PyTorch Vietnamese model."""
+import math
+from dataclasses import dataclass
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+
+from transformers.activations import ACT2FN
+from transformers.modeling_outputs import (
+    BaseModelOutput,
+    BaseModelOutputWithPooling,
+    MaskedLMOutput,
+    MultipleChoiceModelOutput,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutput,
+    ModelOutput,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import logging
+
+try:
+    import xformers.ops as xops
+except ImportError as e:
+    xops = None
+
+from .configuration import VietnameseConfig
+
+
+logger = logging.get_logger(__name__)
+
+
+# Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
+# Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
+class IndexFirstAxis(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, input, indices):
+        ctx.save_for_backward(indices)
+        assert input.ndim >= 2
+        ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
+        second_dim = other_shape.numel()
+        return torch.gather(
+            input.view(ctx.first_axis_dim, second_dim),
+            0,
+            indices.unsqueeze(-1).expand(indices.size(0), second_dim)
+        ).reshape(-1, *other_shape)
+
+    @staticmethod
+    def backward(ctx, grad_output):
+        (indices,) = ctx.saved_tensors
+        assert grad_output.ndim >= 2
+        other_shape = grad_output.shape[1:]
+        grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
+        grad_input = torch.zeros(
+            [ctx.first_axis_dim, grad_output.shape[1]],
+            device=grad_output.device,
+            dtype=grad_output.dtype,
+        )
+        grad_input.scatter_(
+            0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
+        )
+        return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
+
+
+index_first_axis = IndexFirstAxis.apply
+
+
+def unpad_input(hidden_states, attention_mask=None, indices=None):
+    """
+    Arguments:
+        hidden_states: (batch, seqlen, ...)
+        attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
+        indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
+    Return:
+        hidden_states: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
+    """
+    if indices is None:
+        assert attention_mask is not None
+        indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+
+    hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
+    return index_first_axis(hidden_states, indices)
+
+
+class IndexPutFirstAxis(torch.autograd.Function):
+    @staticmethod
+    def forward(
+        ctx,
+        values: torch.Tensor,
+        indices: torch.Tensor,
+        first_axis_dim
+    ) -> torch.Tensor:
+        ctx.save_for_backward(indices)
+        assert indices.ndim == 1
+        assert values.ndim >= 2
+        output = torch.zeros(
+            first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
+        )
+        output[indices] = values
+        return output
+
+    @staticmethod
+    def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
+        indices, = ctx.saved_tensors
+        grad_values = grad_output[indices]
+        return grad_values, None, None
+
+
+index_put_first_axis = IndexPutFirstAxis.apply
+
+
+def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
+    """Add padding to sequences.
+    Arguments:
+        inputs: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
+        indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
+        batch: int batch_size
+        seqlen: int max sequence length
+    Returns:
+        inputs: (batch, seqlen, ...)
+    """
+    output = index_put_first_axis(inputs, indices, batch * seqlen)
+    return output.view(batch, seqlen, *inputs.shape[1:])
+
+
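`unpad_input` and `pad_input` form a round trip: the first flattens the batch and keeps only rows marked valid by the attention mask, the second scatters those rows back into a zero-filled padded layout. The sketch below mimics that behaviour with nested Python lists instead of tensors. The names `unpad` and `pad` are stand-ins for this illustration, and the autograd handling of the real functions is not modeled.

```python
from typing import List, Tuple

Row = List[float]


def unpad(hidden: List[List[Row]], mask: List[List[int]]) -> Tuple[List[Row], List[int]]:
    """Flatten (batch, seqlen, dim) and keep only rows whose mask entry is 1."""
    flat_rows = [row for seq in hidden for row in seq]
    flat_mask = [m for seq in mask for m in seq]
    indices = [i for i, m in enumerate(flat_mask) if m]
    return [flat_rows[i] for i in indices], indices


def pad(rows: List[Row], indices: List[int], batch: int, seqlen: int) -> List[List[Row]]:
    """Scatter kept rows back into a zero-filled (batch, seqlen, dim) layout."""
    dim = len(rows[0])
    flat = [[0.0] * dim for _ in range(batch * seqlen)]
    for row, i in zip(rows, indices):
        flat[i] = list(row)
    return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch)]


hidden = [
    [[1.0, 1.0], [2.0, 2.0], [7.0, 7.0]],   # last position is padding
    [[3.0, 3.0], [8.0, 8.0], [9.0, 9.0]],   # last two positions are padding
]
mask = [[1, 1, 0], [1, 0, 0]]

packed, indices = unpad(hidden, mask)
restored = pad(packed, indices, batch=2, seqlen=3)
print(indices)
print(restored)
```

Padding positions come back as zeros rather than their original values, which is harmless as long as downstream code respects the attention mask.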
| 127 |
+
def rotate_half(x):
|
| 128 |
+
"""Rotates half the hidden dims of the input."""
|
| 129 |
+
x1 = x[..., : x.shape[-1] // 2]
|
| 130 |
+
x2 = x[..., x.shape[-1] // 2 :]
|
| 131 |
+
return torch.cat((-x2, x1), dim=-1)
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
def apply_rotary_pos_emb(q, k, cos, sin):
|
| 135 |
+
"""Applies Rotary Position Embedding to the query and key tensors.
|
| 136 |
+
Args:
|
| 137 |
+
q (`torch.Tensor`): The query tensor.
|
| 138 |
+
k (`torch.Tensor`): The key tensor.
|
| 139 |
+
cos (`torch.Tensor`): The cosine part of the rotary embedding.
|
| 140 |
+
sin (`torch.Tensor`): The sine part of the rotary embedding.
|
| 141 |
+
Returns:
|
| 142 |
+
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
|
| 143 |
+
"""
|
| 144 |
+
cos, sin = cos.to(q.dtype), sin.to(q.dtype)
|
| 145 |
+
q_embed = (q * cos) + (rotate_half(q) * sin)
|
| 146 |
+
k_embed = (k * cos) + (rotate_half(k) * sin)
|
| 147 |
+
return q_embed, k_embed
|
| 148 |
+
|
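The rotation in `rotate_half` and `apply_rotary_pos_emb` can be checked on small vectors. The sketch below re-implements the same arithmetic with plain Python lists, assuming the `cat((freqs, freqs))` angle layout used by `RotaryEmbedding` in this file. The helper names `rope_angles`, `apply_rope`, and `dot` and the sample vectors are hypothetical. It illustrates that the query-key score depends only on the relative offset between positions.

```python
import math


def rotate_half(x):
    """Return (-x2, x1), where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_angles(position, dim, base=10000.0):
    """Angles for one position, laid out like torch.cat((freqs, freqs), dim=-1)."""
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    freqs = [position * f for f in inv_freq]
    return freqs + freqs


def apply_rope(x, position, base=10000.0):
    """x * cos + rotate_half(x) * sin, elementwise."""
    angles = rope_angles(position, len(x), base)
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both pairs of positions are two steps apart, so the scores should agree.
score_a = dot(apply_rope(q, 3), apply_rope(k, 1))
score_b = dot(apply_rope(q, 12), apply_rope(k, 10))
```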
| 149 |
+
|
| 150 |
+
class RotaryEmbedding(torch.nn.Module):
|
| 151 |
+
def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
|
| 152 |
+
super().__init__()
|
| 153 |
+
|
| 154 |
+
self.dim = dim
|
| 155 |
+
self.max_position_embeddings = max_position_embeddings
|
| 156 |
+
self.base = base
|
| 157 |
+
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
|
| 158 |
+
self.register_buffer("inv_freq", inv_freq, persistent=False)
|
| 159 |
+
|
| 160 |
+
self._set_cos_sin_cache(
|
| 161 |
+
seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
|
| 162 |
+
)
|
| 163 |
+
|
| 164 |
+
def _set_cos_sin_cache(self, seq_len, device, dtype):
|
| 165 |
+
self.max_seq_len_cached = seq_len
|
| 166 |
+
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
|
| 167 |
+
|
| 168 |
+
freqs = torch.einsum("i,j->ij", t, self.inv_freq)
|
| 169 |
+
emb = torch.cat((freqs, freqs), dim=-1)
|
| 170 |
+
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
|
| 171 |
+
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
|
| 172 |
+
|
| 173 |
+
def forward(self, x, seq_len=None):
|
| 174 |
+
if seq_len is not None and seq_len > self.max_seq_len_cached:
|
| 175 |
+
self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
|
| 176 |
+
|
| 177 |
+
return (
|
| 178 |
+
self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
|
| 179 |
+
self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
|
| 180 |
+
)
|
| 181 |
+
|
| 182 |
+
|
| 183 |
+
class NTKScalingRotaryEmbedding(RotaryEmbedding):
|
| 184 |
+
"""RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
|
| 185 |
+
|
| 186 |
+
def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
|
| 187 |
+
self.scaling_factor = scaling_factor
|
| 188 |
+
self.mixed_b = mixed_b
|
| 189 |
+
super().__init__(dim, max_position_embeddings, base, device)
|
| 190 |
+
max_position_embeddings = max_position_embeddings * self.scaling_factor
|
| 191 |
+
self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
|
| 192 |
+
|
| 193 |
+
def _set_cos_sin_cache(self, seq_len, device, dtype):
|
| 194 |
+
self.max_seq_len_cached = seq_len
|
| 195 |
+
|
| 196 |
+
if seq_len > self.max_position_embeddings:
|
| 197 |
+
base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
|
| 198 |
+
inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
|
| 199 |
+
|
| 200 |
+
if self.mixed_b is None:
|
| 201 |
+
inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim)
|
| 202 |
+
else:
|
| 203 |
+
a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b
|
| 204 |
+
lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp()
|
| 205 |
+
inv_freq = inv_freq / lambda_1_m
|
| 206 |
+
|
| 207 |
+
self.register_buffer("inv_freq", inv_freq, persistent=False)
|
| 208 |
+
|
| 209 |
+
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
|
| 210 |
+
|
| 211 |
+
freqs = torch.einsum("i,j->ij", t, self.inv_freq)
|
| 212 |
+
emb = torch.cat((freqs, freqs), dim=-1)
|
| 213 |
+
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
|
| 214 |
+
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
|
| 215 |
+
|
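To see what the two branches of `NTKScalingRotaryEmbedding._set_cos_sin_cache` do to the inverse frequencies, here is a small sketch in plain Python that mirrors the arithmetic above. The function names `base_inv_freq` and `ntk_inv_freq` and the sample values (`dim=8`, `scaling_factor=4.0`, `mixed_b=1.5`) are assumptions chosen for illustration. In the class itself the rescaling is applied only when `seq_len` exceeds `max_position_embeddings`.

```python
import math


def base_inv_freq(dim, base=10000.0):
    """Unscaled RoPE inverse frequencies, one per pair of dimensions."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def ntk_inv_freq(dim, base=10000.0, scaling_factor=1.0, mixed_b=None):
    """Inverse frequencies after fixed (mixed_b is None) or mixed NTK scaling."""
    if mixed_b is None:
        # Fixed NTK: enlarge the base, then divide by scaling_factor ** (2 / dim).
        scaled = base_inv_freq(dim, base * scaling_factor)
        return [f / scaling_factor ** (2 / dim) for f in scaled]
    # Mixed NTK: per-frequency divisor lambda_m = exp(a * m ** b), m = 1..dim/2.
    a = math.log(scaling_factor) / (dim / 2) ** mixed_b
    lambdas = [math.exp(a * m ** mixed_b) for m in range(1, dim // 2 + 1)]
    return [f / lam for f, lam in zip(base_inv_freq(dim, base), lambdas)]


plain = base_inv_freq(8)
fixed = ntk_inv_freq(8, scaling_factor=4.0)
mixed = ntk_inv_freq(8, scaling_factor=4.0, mixed_b=1.5)
```

In both variants the lowest frequency ends up divided by exactly `scaling_factor`, while higher frequencies are reduced by less.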
| 216 |
+
|
| 217 |
+
class RMSNorm(nn.Module):
|
| 218 |
+
def __init__(self, hidden_size, eps=1e-6):
|
| 219 |
+
"""
|
| 220 |
+
RMSNorm is equivalent to T5LayerNorm
|
| 221 |
+
"""
|
| 222 |
+
super().__init__()
|
| 223 |
+
self.weight = nn.Parameter(torch.ones(hidden_size))
|
| 224 |
+
self.variance_epsilon = eps
|
| 225 |
+
|
| 226 |
+
def forward(self, hidden_states):
|
| 227 |
+
input_dtype = hidden_states.dtype
|
| 228 |
+
hidden_states = hidden_states.to(torch.float32)
|
| 229 |
+
variance = hidden_states.pow(2).mean(-1, keepdim=True)
|
| 230 |
+
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
|
| 231 |
+
return self.weight * hidden_states.to(input_dtype)
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
LAYER_NORM = {
|
| 235 |
+
'layer_norm': nn.LayerNorm,
|
| 236 |
+
'rms_norm': RMSNorm
|
| 237 |
+
}
|
| 238 |
+
|
| 239 |
+
|
| 240 |
+
class VietnameseEmbeddings(nn.Module):
|
| 241 |
+
"""
|
| 242 |
+
Embedding and Unpadding.
|
| 243 |
+
"""
|
| 244 |
+
|
| 245 |
+
def __init__(self, config: VietnameseConfig):
|
| 246 |
+
super().__init__()
|
| 247 |
+
self.padding_idx = config.pad_token_id
|
| 248 |
+
self.word_embeddings = nn.Embedding(
|
| 249 |
+
config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
|
| 250 |
+
)
|
| 251 |
+
|
| 252 |
+
self.position_embedding_type = config.position_embedding_type
|
| 253 |
+
if self.position_embedding_type == 'absolute':
|
| 254 |
+
self.position_embeddings = nn.Embedding(
|
| 255 |
+
config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
|
| 256 |
+
)
|
| 257 |
+
elif self.position_embedding_type == 'rope':
|
| 258 |
+
self._init_rope(config)
|
| 259 |
+
else:
|
| 260 |
+
raise ValueError(f"Unknown position embedding type {self.position_embedding_type}")
|
| 261 |
+
|
| 262 |
+
self.type_vocab_size = config.type_vocab_size
|
| 263 |
+
if self.type_vocab_size > 0:
|
| 264 |
+
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
|
| 265 |
+
|
| 266 |
+
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
| 267 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
| 268 |
+
self.register_buffer(
|
| 269 |
+
"position_ids", torch.arange(config.max_position_embeddings), persistent=False
|
| 270 |
+
)
|
| 271 |
+
|
| 272 |
+
def _init_rope(self, config):
|
| 273 |
+
kwargs = dict(
|
| 274 |
+
dim=int(config.hidden_size / config.num_attention_heads),
|
| 275 |
+
max_position_embeddings=config.max_position_embeddings,
|
| 276 |
+
base=config.rope_theta
|
| 277 |
+
)
|
| 278 |
+
if config.rope_scaling is None:
|
| 279 |
+
self.rotary_emb = RotaryEmbedding(**kwargs)
|
| 280 |
+
else:
|
| 281 |
+
kwargs.update(scaling_factor=config.rope_scaling["factor"])
|
| 282 |
+
scaling_type = config.rope_scaling["type"]
|
| 283 |
+
if scaling_type == 'ntk':
|
| 284 |
+
kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
|
| 285 |
+
self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
|
| 286 |
+
else:
|
| 287 |
+
raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
|
| 288 |
+
|
| 289 |
+
def forward(
|
| 290 |
+
self,
|
| 291 |
+
unpad_inputs: bool,
|
| 292 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 293 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 294 |
+
length: Optional[List[int]] = None,
|
| 295 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 296 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 297 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 298 |
+
) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
|
| 299 |
+
if inputs_embeds is None:
|
| 300 |
+
device, input_shape = input_ids.device, input_ids.shape
|
| 301 |
+
else:
|
| 302 |
+
device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
|
| 303 |
+
batch_size, seq_length = input_shape
|
| 304 |
+
|
| 305 |
+
if attention_mask is None:
|
| 306 |
+
attention_mask = torch.ones(input_shape, device=device)
|
| 307 |
+
if length is not None:
|
| 308 |
+
for i, l in enumerate(length):
|
| 309 |
+
attention_mask[i, l:] = 0
|
| 310 |
+
|
| 311 |
+
if unpad_inputs:
|
| 312 |
+
attention_mask_bool = attention_mask.bool()
|
| 313 |
+
if length is None:
|
| 314 |
+
length = attention_mask.sum(-1).tolist()
|
| 315 |
+
|
| 316 |
+
if inputs_embeds is None:
|
| 317 |
+
if unpad_inputs:
|
| 318 |
+
input_ids = input_ids[attention_mask_bool].unsqueeze(0)
|
| 319 |
+
inputs_embeds = self.word_embeddings(input_ids)
|
| 320 |
+
else:
|
| 321 |
+
if unpad_inputs:
|
| 322 |
+
inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
|
| 323 |
+
embeddings = inputs_embeds
|
| 324 |
+
|
| 325 |
+
if position_ids is None:
|
| 326 |
+
if seq_length > self.position_ids.size(0):
|
| 327 |
+
self.register_buffer(
|
| 328 |
+
"position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
|
| 329 |
+
)
|
| 330 |
+
if unpad_inputs:
|
| 331 |
+
position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
|
| 332 |
+
else:
|
| 333 |
+
position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
|
| 334 |
+
elif unpad_inputs:
|
| 335 |
+
position_ids = position_ids[attention_mask_bool].unsqueeze(0)
|
| 336 |
+
|
| 337 |
+
if self.position_embedding_type == 'rope':
|
| 338 |
+
rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
|
| 339 |
+
rope_cos = rope_cos[position_ids].unsqueeze(2)
|
| 340 |
+
rope_sin = rope_sin[position_ids].unsqueeze(2)
|
| 341 |
+
rope_embeds = rope_cos, rope_sin
|
| 342 |
+
else:
|
| 343 |
+
rope_embeds = None
|
| 344 |
+
|
| 345 |
+
if self.type_vocab_size > 0:
|
| 346 |
+
if token_type_ids is None:
|
| 347 |
+
token_type_ids = position_ids.mul(0)
|
| 348 |
+
else:
|
| 349 |
+
if self.type_vocab_size < 2:
|
| 350 |
+
token_type_ids.mul_(0)
|
| 351 |
+
if unpad_inputs:
|
| 352 |
+
token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
|
| 353 |
+
|
| 354 |
+
token_type_embeddings = self.token_type_embeddings(token_type_ids)
|
| 355 |
+
embeddings = embeddings + token_type_embeddings
|
| 356 |
+
|
| 357 |
+
if self.position_embedding_type == "absolute":
|
| 358 |
+
position_embeddings = self.position_embeddings(position_ids)
|
| 359 |
+
embeddings = embeddings + position_embeddings
|
| 360 |
+
|
| 361 |
+
embeddings = self.LayerNorm(embeddings)
|
| 362 |
+
embeddings = self.dropout(embeddings)
|
| 363 |
+
|
| 364 |
+
return embeddings, attention_mask, rope_embeds, length
|
| 365 |
+
|
| 366 |
+
|
| 367 |
+
class VietnameseAttention(nn.Module):
|
| 368 |
+
def __init__(self, config: VietnameseConfig, pack_qkv=None, use_memory_efficient_attention=None):
|
| 369 |
+
super().__init__()
|
| 370 |
+
self.config = config
|
| 371 |
+
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
|
| 372 |
+
raise ValueError(
|
| 373 |
+
f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
|
| 374 |
+
f"heads ({config.num_attention_heads})"
|
| 375 |
+
)
|
| 376 |
+
|
| 377 |
+
self.hidden_size = config.hidden_size
|
| 378 |
+
self.num_attention_heads = config.num_attention_heads
|
| 379 |
+
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
|
| 380 |
+
self.all_head_size = self.num_attention_heads * self.attention_head_size
|
| 381 |
+
|
| 382 |
+
if pack_qkv is None:
|
| 383 |
+
pack_qkv = config.pack_qkv
|
| 384 |
+
self.pack_qkv = pack_qkv
|
| 385 |
+
|
| 386 |
+
if self.pack_qkv:
|
| 387 |
+
self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
|
| 388 |
+
else:
|
| 389 |
+
self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
|
| 390 |
+
self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
|
| 391 |
+
self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
|
| 392 |
+
|
| 393 |
+
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
|
| 394 |
+
self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
|
| 395 |
+
|
| 396 |
+
if use_memory_efficient_attention is None:
|
| 397 |
+
use_memory_efficient_attention = self.config.use_memory_efficient_attention
|
| 398 |
+
self.use_memory_efficient_attention = use_memory_efficient_attention
|
| 399 |
+
self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
|
| 400 |
+
if self.use_memory_efficient_attention:
|
| 401 |
+
assert self.memory_efficient_attention is not None, 'please install xformers'
|
| 402 |
+
|
| 403 |
+
def forward(
|
| 404 |
+
self,
|
| 405 |
+
hidden_states: torch.Tensor,
|
| 406 |
+
attention_bias: torch.FloatTensor,
|
| 407 |
+
rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
|
| 408 |
+
padding_inputs: Optional[Tuple] = None,
|
| 409 |
+
attention_scale: Optional[torch.FloatTensor] = None,
|
| 410 |
+
head_mask: Optional[torch.FloatTensor] = None,
|
| 411 |
+
output_attentions: Optional[bool] = False,
|
| 412 |
+
qkv_inputs: Optional[Tuple] = None,
|
| 413 |
+
) -> Tuple[torch.Tensor, ...]:
|
| 414 |
+
shape_hd = (self.num_attention_heads, self.attention_head_size)
|
| 415 |
+
if self.pack_qkv and qkv_inputs is None:
|
| 416 |
+
qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
|
| 417 |
+
else:
|
| 418 |
+
if qkv_inputs is None:
|
| 419 |
+
qkv_inputs = (hidden_states, hidden_states, hidden_states)
|
| 420 |
+
qkv_pack = [
|
| 421 |
+
getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
|
| 422 |
+
]
|
| 423 |
+
query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
|
| 424 |
+
|
| 425 |
+
if self.config.position_embedding_type == 'rope':
|
| 426 |
+
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
|
| 427 |
+
|
| 428 |
+
dtype = query_states.dtype
|
| 429 |
+
|
| 430 |
+
if self.config.logn_attention_scale and attention_scale is not None:
|
| 431 |
+
query_states = query_states * attention_scale.to(dtype)
|
| 432 |
+
|
| 433 |
+
if padding_inputs is not None:
|
| 434 |
+
query_states = pad_input(query_states.squeeze(), *padding_inputs)
|
| 435 |
+
key_states = pad_input(key_states.squeeze(), *padding_inputs)
|
| 436 |
+
value_states = pad_input(value_states.squeeze(), *padding_inputs)
|
| 437 |
+
|
| 438 |
+
if self.use_memory_efficient_attention:
|
| 439 |
+
assert self.memory_efficient_attention is not None, "xformers is not loaded"
|
| 440 |
+
assert output_attentions is False, "memory_efficient_attention does not output attentions"
|
| 441 |
+
assert head_mask is None, "head_mask is not supported yet"
|
| 442 |
+
attention_probs = None
|
| 443 |
+
if torch.is_tensor(attention_bias):
|
| 444 |
+
attention_bias = attention_bias.to(dtype)
|
| 445 |
+
context_layer = self.memory_efficient_attention(
|
| 446 |
+
query_states,
|
| 447 |
+
key_states,
|
| 448 |
+
value_states,
|
| 449 |
+
attn_bias=attention_bias,
|
| 450 |
+
p=self.dropout.p
|
| 451 |
+
)
|
| 452 |
+
else:
|
| 453 |
+
if output_attentions and isinstance(self, VietnameseSdpaAttention):
|
| 454 |
+
raise RuntimeError("SDPA does not output attentions")
|
| 455 |
+
context_layer, attention_probs = self._attention(
|
| 456 |
+
query_states, key_states, value_states, attention_bias, head_mask
|
| 457 |
+
)
|
| 458 |
+
|
| 459 |
+
if padding_inputs is not None:
|
| 460 |
+
context_layer = unpad_input(context_layer, indices=padding_inputs[0])
|
| 461 |
+
|
| 462 |
+
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
|
| 463 |
+
context_layer = context_layer.view(new_context_layer_shape)
|
| 464 |
+
|
| 465 |
+
attn_output = self.o_proj(context_layer)
|
| 466 |
+
|
| 467 |
+
outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
|
| 468 |
+
return outputs
|
| 469 |
+
|
| 470 |
+
def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
|
| 471 |
+
query_states = query_states.transpose(1, 2)
|
| 472 |
+
key_states = key_states.transpose(1, 2)
|
| 473 |
+
value_states = value_states.transpose(1, 2)
|
| 474 |
+
attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
|
| 475 |
+
|
| 476 |
+
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
|
| 477 |
+
if attention_bias is not None:
|
| 478 |
+
attention_scores = attention_scores + attention_bias
|
| 479 |
+
|
| 480 |
+
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
|
| 481 |
+
|
| 482 |
+
if self.dropout.p > 0:
|
| 483 |
+
attention_probs = self.dropout(attention_probs)
|
| 484 |
+
|
| 485 |
+
if head_mask is not None:
|
| 486 |
+
attention_probs = attention_probs * head_mask
|
| 487 |
+
|
| 488 |
+
context_layer = torch.matmul(attention_probs, value_states)
|
| 489 |
+
|
| 490 |
+
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
|
| 491 |
+
return context_layer, attention_probs
|
| 492 |
+
|
| 493 |
+
|
| 494 |
+
class VietnameseSdpaAttention(VietnameseAttention):
|
| 495 |
+
"""
|
| 496 |
+
Vietnamese attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
|
| 497 |
+
`VietnameseAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
|
| 498 |
+
SDPA API.
|
| 499 |
+
"""
|
| 500 |
+
def __init__(self, config: VietnameseConfig, **kwargs):
|
| 501 |
+
super().__init__(config, **kwargs)
|
| 502 |
+
|
| 503 |
+
def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
|
| 504 |
+
attn_output = torch.nn.functional.scaled_dot_product_attention(
|
| 505 |
+
query_states.transpose(1, 2),
|
| 506 |
+
key_states.transpose(1, 2),
|
| 507 |
+
value_states.transpose(1, 2),
|
| 508 |
+
attn_mask=attention_bias,
|
| 509 |
+
dropout_p=self.dropout.p if self.training else 0.0,
|
| 510 |
+
)
|
| 511 |
+
attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
|
| 512 |
+
return attn_output, None
|
| 513 |
+
|
| 514 |
+
|
| 515 |
+
Vietnamese_ATTENTION_CLASSES = {
|
| 516 |
+
"eager": VietnameseAttention,
|
| 517 |
+
"sdpa": VietnameseSdpaAttention,
|
| 518 |
+
}
|
| 519 |
+
|
| 520 |
+
|
| 521 |
+
class VietnameseGatedMLP(nn.Module):
|
| 522 |
+
"""
|
| 523 |
+
GLU Variants Improve Transformer.
|
| 524 |
+
"""
|
| 525 |
+
|
| 526 |
+
def __init__(self, config: VietnameseConfig):
|
| 527 |
+
super().__init__()
|
| 528 |
+
self.intermediate_size = config.intermediate_size
|
| 529 |
+
self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
|
| 530 |
+
self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
|
| 531 |
+
self.act_fn = ACT2FN[config.hidden_act]
|
| 532 |
+
if config.hidden_dropout_prob > 0:
|
| 533 |
+
self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
|
| 534 |
+
else:
|
| 535 |
+
self.hidden_dropout = None
|
| 536 |
+
|
| 537 |
+
def forward(self, hidden_states):
|
| 538 |
+
up_gate = self.up_gate_proj(hidden_states)
|
| 539 |
+
up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
|
| 540 |
+
gate = self.act_fn(gate)
|
| 541 |
+
gated_states = gate * up_states
|
| 542 |
+
if self.hidden_dropout is not None:
|
| 543 |
+
gated_states = self.hidden_dropout(gated_states)
|
| 544 |
+
down_states = self.down_proj(gated_states)
|
| 545 |
+
return down_states
|
| 546 |
+
|
| 547 |
+
|
| 548 |
+
class VietnameseLayer(nn.Module):
|
| 549 |
+
def __init__(
|
| 550 |
+
self,
|
| 551 |
+
config: VietnameseConfig,
|
| 552 |
+
pack_qkv=None,
|
| 553 |
+
use_memory_efficient_attention=None,
|
| 554 |
+
attn_implementation=None
|
| 555 |
+
):
|
| 556 |
+
super().__init__()
|
| 557 |
+
if attn_implementation is None:
|
| 558 |
+
attn_implementation = config._attn_implementation
|
| 559 |
+
if use_memory_efficient_attention is None:
|
| 560 |
+
use_memory_efficient_attention = config.use_memory_efficient_attention
|
| 561 |
+
if use_memory_efficient_attention:
|
| 562 |
+
if attn_implementation != 'eager':
|
| 563 |
+
logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
|
| 564 |
+
attn_implementation = 'eager'
|
| 565 |
+
self.attention = Vietnamese_ATTENTION_CLASSES[attn_implementation](
|
| 566 |
+
config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
|
| 567 |
+
)
|
| 568 |
+
self.mlp = VietnameseGatedMLP(config)
|
| 569 |
+
|
| 570 |
+
ln_class = LAYER_NORM[config.layer_norm_type]
|
| 571 |
+
self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
|
| 572 |
+
self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
|
| 573 |
+
|
| 574 |
+
if config.hidden_dropout_prob > 0:
|
| 575 |
+
self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
|
| 576 |
+
else:
|
| 577 |
+
self.hidden_dropout = None
|
| 578 |
+
|
| 579 |
+
def forward(
|
| 580 |
+
self,
|
| 581 |
+
hidden_states: torch.Tensor,
|
| 582 |
+
attention_bias: torch.FloatTensor,
|
| 583 |
+
rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
|
| 584 |
+
padding_inputs: Optional[Tuple] = None,
|
| 585 |
+
attention_scale: Optional[torch.FloatTensor] = None,
|
| 586 |
+
subset_indices: Optional[torch.LongTensor] = None,
|
| 587 |
+
head_mask: Optional[torch.FloatTensor] = None,
|
| 588 |
+
output_attentions: Optional[bool] = False,
|
| 589 |
+
qkv_inputs: Optional[Tuple] = None,
|
| 590 |
+
) -> Tuple[torch.Tensor, ...]:
|
| 591 |
+
residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
|
| 592 |
+
attention_outputs = self.attention(
|
| 593 |
+
hidden_states,
|
| 594 |
+
attention_bias,
|
| 595 |
+
rope_embeds,
|
| 596 |
+
padding_inputs,
|
| 597 |
+
attention_scale,
|
| 598 |
+
head_mask,
|
| 599 |
+
output_attentions=output_attentions,
|
| 600 |
+
qkv_inputs=qkv_inputs,
|
| 601 |
+
)
|
| 602 |
+
hidden_states = attention_outputs[0]
|
| 603 |
+
if self.hidden_dropout is not None:
|
| 604 |
+
hidden_states = self.hidden_dropout(hidden_states)
|
| 605 |
+
hidden_states = residual + hidden_states
|
| 606 |
+
|
| 607 |
+
if subset_indices is not None:
|
| 608 |
+
hidden_states = hidden_states[subset_indices]
|
| 609 |
+
|
| 610 |
+
hidden_states = self.attn_ln(hidden_states)
|
| 611 |
+
|
| 612 |
+
residual = hidden_states
|
| 613 |
+
hidden_states = self.mlp(hidden_states)
|
| 614 |
+
if self.hidden_dropout is not None:
|
| 615 |
+
hidden_states = self.hidden_dropout(hidden_states)
|
| 616 |
+
hidden_states = residual + hidden_states
|
| 617 |
+
hidden_states = self.mlp_ln(hidden_states)
|
| 618 |
+
|
| 619 |
+
outputs = (hidden_states,) + attention_outputs[1:]
|
| 620 |
+
return outputs
|
| 621 |
+
|
| 622 |
+
|
| 623 |
+
class VietnameseEncoder(nn.Module):
|
| 624 |
+
def __init__(self, config):
|
| 625 |
+
super().__init__()
|
| 626 |
+
self.config = config
|
| 627 |
+
self.layer = nn.ModuleList([VietnameseLayer(config) for _ in range(config.num_hidden_layers)])
|
| 628 |
+
self.gradient_checkpointing = False
|
| 629 |
+
|
| 630 |
+
def forward(
|
| 631 |
+
self,
|
| 632 |
+
hidden_states: torch.Tensor,
|
| 633 |
+
attention_bias: Optional[torch.FloatTensor] = None,
|
| 634 |
+
rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
|
| 635 |
+
padding_inputs: Optional[Tuple] = None,
|
| 636 |
+
attention_scale: Optional[torch.FloatTensor] = None,
|
| 637 |
+
subset_indices: Optional[torch.LongTensor] = None,
|
| 638 |
+
head_mask: Optional[torch.FloatTensor] = None,
|
| 639 |
+
output_attentions: Optional[bool] = False,
|
| 640 |
+
output_hidden_states: Optional[bool] = False,
|
| 641 |
+
return_dict: Optional[bool] = True,
|
| 642 |
+
) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
|
| 643 |
+
all_hidden_states = () if output_hidden_states else None
|
| 644 |
+
all_self_attentions = () if output_attentions else None
|
| 645 |
+
|
| 646 |
+
for i, layer_module in enumerate(self.layer):
|
| 647 |
+
if output_hidden_states:
|
| 648 |
+
all_hidden_states = all_hidden_states + (hidden_states,)
|
| 649 |
+
|
| 650 |
+
if i >= len(self.layer) - 1:
|
| 651 |
+
layer_subset_indices = subset_indices
|
| 652 |
+
else:
|
| 653 |
+
layer_subset_indices = None
|
| 654 |
+
|
| 655 |
+
layer_head_mask = head_mask[i] if head_mask is not None else None
|
| 656 |
+
|
| 657 |
+
if self.gradient_checkpointing and self.training:
|
| 658 |
+
layer_outputs = self._gradient_checkpointing_func(
|
| 659 |
+
layer_module.__call__,
|
| 660 |
+
hidden_states,
|
| 661 |
+
attention_bias,
|
| 662 |
+
rope_embeds,
|
| 663 |
+
padding_inputs,
|
| 664 |
+
attention_scale,
|
| 665 |
+
layer_subset_indices,
|
| 666 |
+
layer_head_mask,
|
| 667 |
+
)
|
| 668 |
+
else:
|
| 669 |
+
layer_outputs = layer_module(
|
| 670 |
+
hidden_states,
|
| 671 |
+
attention_bias,
|
| 672 |
+
rope_embeds,
|
| 673 |
+
padding_inputs,
|
| 674 |
+
attention_scale,
|
| 675 |
+
layer_subset_indices,
|
| 676 |
+
layer_head_mask,
|
| 677 |
+
output_attentions,
|
| 678 |
+
)
|
| 679 |
+
|
| 680 |
+
hidden_states = layer_outputs[0]
|
| 681 |
+
if output_attentions:
|
| 682 |
+
all_self_attentions = all_self_attentions + (layer_outputs[1],)
|
| 683 |
+
|
| 684 |
+
if output_hidden_states:
|
| 685 |
+
all_hidden_states = all_hidden_states + (hidden_states,)
|
| 686 |
+
|
| 687 |
+
if not return_dict:
|
| 688 |
+
return tuple(
|
| 689 |
+
v
|
| 690 |
+
for v in [
|
| 691 |
+
hidden_states,
|
| 692 |
+
all_hidden_states,
|
| 693 |
+
all_self_attentions,
|
| 694 |
+
]
|
| 695 |
+
if v is not None
|
| 696 |
+
)
|
| 697 |
+
return BaseModelOutput(
|
| 698 |
+
last_hidden_state=hidden_states,
|
| 699 |
+
hidden_states=all_hidden_states,
|
| 700 |
+
attentions=all_self_attentions,
|
| 701 |
+
)
|
| 702 |
+
|
| 703 |
+
|
| 704 |
+
class VietnamesePooler(nn.Module):
|
| 705 |
+
def __init__(self, config):
|
| 706 |
+
super().__init__()
|
| 707 |
+
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
| 708 |
+
self.activation = nn.Tanh()
|
| 709 |
+
|
| 710 |
+
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
| 711 |
+
first_token_tensor = hidden_states[:, 0]
|
| 712 |
+
pooled_output = self.dense(first_token_tensor)
|
| 713 |
+
pooled_output = self.activation(pooled_output)
|
| 714 |
+
return pooled_output
|
| 715 |
+
|
| 716 |
+
|
| 717 |
+
class VietnamesePreTrainedModel(PreTrainedModel):
|
| 718 |
+
"""
|
| 719 |
+
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
|
| 720 |
+
models.
|
| 721 |
+
"""
|
| 722 |
+
|
| 723 |
+
config_class = VietnameseConfig
|
| 724 |
+
base_model_prefix = "Vietnamese"
|
| 725 |
+
supports_gradient_checkpointing = True
|
| 726 |
+
_supports_sdpa = True
|
| 727 |
+
|
| 728 |
+
def _init_weights(self, module):
|
| 729 |
+
"""Initialize the weights"""
|
| 730 |
+
if isinstance(module, nn.Linear):
|
| 731 |
+
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
| 732 |
+
if module.bias is not None:
|
| 733 |
+
module.bias.data.zero_()
|
| 734 |
+
elif isinstance(module, nn.Embedding):
|
| 735 |
+
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
| 736 |
+
if module.padding_idx is not None:
|
| 737 |
+
module.weight.data[module.padding_idx].zero_()
|
| 738 |
+
elif isinstance(module, nn.LayerNorm):
|
| 739 |
+
module.bias.data.zero_()
|
| 740 |
+
module.weight.data.fill_(1.0)
|
| 741 |
+
|
| 742 |
+
|
| 743 |
+
class VietnameseModel(VietnamesePreTrainedModel):
|
| 744 |
+
"""
|
| 745 |
+
The bare Vietnamese Model transformer outputting raw hidden-states without any specific head on top.
|
| 746 |
+
"""
|
| 747 |
+
|
| 748 |
+
def __init__(self, config: VietnameseConfig, add_pooling_layer=False):
|
| 749 |
+
super().__init__(config)
|
| 750 |
+
self.config = config
|
| 751 |
+
|
| 752 |
+
self.embeddings = VietnameseEmbeddings(config)
|
| 753 |
+
self.encoder = VietnameseEncoder(config)
|
| 754 |
+
|
| 755 |
+
self.pooler = VietnamesePooler(config) if add_pooling_layer else None
|
| 756 |
+
|
| 757 |
+
self.post_init()
|
| 758 |
+
|
| 759 |
+
def get_input_embeddings(self):
|
| 760 |
+
return self.embeddings.word_embeddings
|
| 761 |
+
|
| 762 |
+
def set_input_embeddings(self, value):
|
| 763 |
+
self.embeddings.word_embeddings = value
|
| 764 |
+
|
| 765 |
+
def forward(
|
| 766 |
+
self,
|
| 767 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 768 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 769 |
+
length: Optional[List[int]] = None,
|
| 770 |
+
subset_indices: Optional[torch.LongTensor] = None,
|
| 771 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 772 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 773 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 774 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 775 |
+
output_attentions: Optional[bool] = None,
|
| 776 |
+
output_hidden_states: Optional[bool] = None,
|
| 777 |
+
return_dict: Optional[bool] = None,
|
| 778 |
+
unpad_inputs: Optional[bool] = None,
|
| 779 |
+
) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
|
| 780 |
+
r"""
|
| 781 |
+
length (`list` of length `batch_size`, *optional*):
|
| 782 |
+
If `None`, return the padded `last_hidden_state`.
|
| 783 |
+
subset_indices (`torch.LongTensor`, *optional*):
|
| 784 |
+
Indices used to select a subset of token positions after attention in the last encoder layer.
|
| 785 |
+
unpad_inputs (`bool`, *optional*):
|
| 786 |
+
Whether to strip padding tokens before the encoder; defaults to `config.unpad_inputs`.
|
| 787 |
+
"""
|
| 788 |
+
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
| 789 |
+
output_hidden_states = (
|
| 790 |
+
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
| 791 |
+
)
|
| 792 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 793 |
+
unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
|
| 794 |
+
output_padded = length is None
|
| 795 |
+
|
| 796 |
+
if input_ids is not None and inputs_embeds is not None:
|
| 797 |
+
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
|
| 798 |
+
elif input_ids is not None:
|
| 799 |
+
self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
|
| 800 |
+
input_shape = input_ids.size()
|
| 801 |
+
elif inputs_embeds is not None:
|
| 802 |
+
input_shape = inputs_embeds.size()[:-1]
|
| 803 |
+
else:
|
| 804 |
+
raise ValueError("You have to specify either input_ids or inputs_embeds")
|
| 805 |
+
|
| 806 |
+
(embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
|
| 807 |
+
unpad_inputs,
|
| 808 |
+
input_ids=input_ids,
|
| 809 |
+
attention_mask=attention_mask,
|
| 810 |
+
length=length,
|
| 811 |
+
token_type_ids=token_type_ids,
|
| 812 |
+
position_ids=position_ids,
|
| 813 |
+
inputs_embeds=inputs_embeds
|
| 814 |
+
)
|
| 815 |
+
|
| 816 |
+
batch_size, seq_length = input_shape
|
| 817 |
+
if unpad_inputs and self.config.use_memory_efficient_attention:
|
| 818 |
+
attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
|
| 819 |
+
else:
|
| 820 |
+
attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
|
| 821 |
+
if self.config.use_memory_efficient_attention:
|
| 822 |
+
attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
|
| 823 |
+
|
| 824 |
+
padding_inputs = None
|
| 825 |
+
if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
|
| 826 |
+
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
|
| 827 |
+
if not self.config.use_memory_efficient_attention:
|
| 828 |
+
padding_inputs = (indices, *input_shape)
|
| 829 |
+
|
| 830 |
+
attention_scale = None
|
| 831 |
+
if self.config.logn_attention_scale:
|
| 832 |
+
logger.warning_once("TODO: logn_attention_scale")
|
| 833 |
+
|
| 834 |
+
encoder_outputs = self.encoder(
|
| 835 |
+
embedding_output,
|
| 836 |
+
attention_bias=attention_bias,
|
| 837 |
+
rope_embeds=rope_embeds,
|
| 838 |
+
padding_inputs=padding_inputs,
|
| 839 |
+
attention_scale=attention_scale,
|
| 840 |
+
subset_indices=subset_indices,
|
| 841 |
+
head_mask=head_mask,
|
| 842 |
+
output_attentions=output_attentions,
|
| 843 |
+
output_hidden_states=output_hidden_states,
|
| 844 |
+
return_dict=return_dict,
|
| 845 |
+
)
|
| 846 |
+
sequence_output = encoder_outputs[0]
|
| 847 |
+
if unpad_inputs and output_padded:
|
| 848 |
+
sequence_output = pad_input(
|
| 849 |
+
sequence_output.squeeze(), indices, batch_size, seq_length
|
| 850 |
+
)
|
| 851 |
+
|
| 852 |
+
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
|
| 853 |
+
|
| 854 |
+
if not return_dict:
|
| 855 |
+
return (sequence_output, pooled_output) + encoder_outputs[1:]
|
| 856 |
+
|
| 857 |
+
return BaseModelOutputWithPooling(
|
| 858 |
+
last_hidden_state=sequence_output,
|
| 859 |
+
pooler_output=pooled_output,
|
| 860 |
+
hidden_states=encoder_outputs.hidden_states,
|
| 861 |
+
attentions=encoder_outputs.attentions,
|
| 862 |
+
)
|
| 863 |
+
|
| 864 |
+
|
| 865 |
+
class VietnameseLMPredictionHead(nn.Module):
|
| 866 |
+
def __init__(self, config):
|
| 867 |
+
super().__init__()
|
| 868 |
+
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
| 869 |
+
self.transform_act_fn = ACT2FN[config.hidden_act]
|
| 870 |
+
self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
| 871 |
+
|
| 872 |
+
self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
|
| 873 |
+
|
| 874 |
+
def forward(self, hidden_states):
|
| 875 |
+
hidden_states = self.dense(hidden_states)
|
| 876 |
+
hidden_states = self.transform_act_fn(hidden_states)
|
| 877 |
+
hidden_states = self.norm(hidden_states)
|
| 878 |
+
hidden_states = self.decoder(hidden_states)
|
| 879 |
+
return hidden_states
|
| 880 |
+
|
| 881 |
+
|
| 882 |
+
class VietnameseForMaskedLM(VietnamesePreTrainedModel):
|
| 883 |
+
_tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
|
| 884 |
+
|
| 885 |
+
def __init__(self, config: VietnameseConfig):
|
| 886 |
+
super().__init__(config)
|
| 887 |
+
self.Vietnamese = VietnameseModel(config, add_pooling_layer=False)
|
| 888 |
+
self.lm_head = VietnameseLMPredictionHead(config)
|
| 889 |
+
self.loss_fct = nn.CrossEntropyLoss()
|
| 890 |
+
|
| 891 |
+
self.post_init()
|
| 892 |
+
|
| 893 |
+
def get_output_embeddings(self):
|
| 894 |
+
return self.lm_head.decoder
|
| 895 |
+
|
| 896 |
+
def set_output_embeddings(self, new_embeddings):
|
| 897 |
+
self.lm_head.decoder = new_embeddings
|
| 898 |
+
|
| 899 |
+
def forward(
|
| 900 |
+
self,
|
| 901 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 902 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 903 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 904 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 905 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 906 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 907 |
+
labels: Optional[torch.Tensor] = None,
|
| 908 |
+
output_attentions: Optional[bool] = None,
|
| 909 |
+
output_hidden_states: Optional[bool] = None,
|
| 910 |
+
return_dict: Optional[bool] = None,
|
| 911 |
+
unpad_inputs: Optional[bool] = None,
|
| 912 |
+
) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
|
| 913 |
+
r"""
|
| 914 |
+
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
| 915 |
+
Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
|
| 916 |
+
config.vocab_size - 1]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the
|
| 917 |
+
loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
|
| 918 |
+
"""
|
| 919 |
+
|
| 920 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 921 |
+
|
| 922 |
+
if labels is None or not self.Vietnamese.config.unpad_inputs:
|
| 923 |
+
length = None
|
| 924 |
+
subset_indices = None
|
| 925 |
+
else:
|
| 926 |
+
length = attention_mask.sum(-1).tolist()
|
| 927 |
+
labels = labels[attention_mask.bool()].unsqueeze(0)
|
| 928 |
+
subset_indices = labels > -100
|
| 929 |
+
|
| 930 |
+
outputs = self.Vietnamese(
|
| 931 |
+
input_ids,
|
| 932 |
+
attention_mask=attention_mask,
|
| 933 |
+
length=length,
|
| 934 |
+
subset_indices=subset_indices,
|
| 935 |
+
token_type_ids=token_type_ids,
|
| 936 |
+
position_ids=position_ids,
|
| 937 |
+
head_mask=head_mask,
|
| 938 |
+
inputs_embeds=inputs_embeds,
|
| 939 |
+
output_attentions=output_attentions,
|
| 940 |
+
output_hidden_states=output_hidden_states,
|
| 941 |
+
return_dict=return_dict,
|
| 942 |
+
unpad_inputs=unpad_inputs,
|
| 943 |
+
)
|
| 944 |
+
|
| 945 |
+
sequence_output = outputs[0]
|
| 946 |
+
prediction_scores = self.lm_head(sequence_output)
|
| 947 |
+
|
| 948 |
+
masked_lm_loss = None
|
| 949 |
+
if labels is not None:
|
| 950 |
+
if subset_indices is None:
|
| 951 |
+
mask = attention_mask.bool()
|
| 952 |
+
prediction_scores = prediction_scores[mask]
|
| 953 |
+
labels = labels[mask]
|
| 954 |
+
else:
|
| 955 |
+
labels = labels[subset_indices]
|
| 956 |
+
masked_lm_loss = self.loss_fct(prediction_scores, labels)
|
| 957 |
+
|
| 958 |
+
if not return_dict:
|
| 959 |
+
output = (prediction_scores,) + outputs[2:]
|
| 960 |
+
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
|
| 961 |
+
|
| 962 |
+
return MaskedLMOutput(
|
| 963 |
+
loss=masked_lm_loss,
|
| 964 |
+
logits=prediction_scores,
|
| 965 |
+
hidden_states=outputs.hidden_states,
|
| 966 |
+
attentions=outputs.attentions,
|
| 967 |
+
)
|
| 968 |
+
|
| 969 |
+
|
| 970 |
+
class VietnameseForSequenceClassification(VietnamesePreTrainedModel):
|
| 971 |
+
def __init__(self, config):
|
| 972 |
+
super().__init__(config)
|
| 973 |
+
self.num_labels = config.num_labels
|
| 974 |
+
self.config = config
|
| 975 |
+
|
| 976 |
+
self.Vietnamese = VietnameseModel(config, add_pooling_layer=True)
|
| 977 |
+
classifier_dropout = (
|
| 978 |
+
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
|
| 979 |
+
)
|
| 980 |
+
self.dropout = nn.Dropout(classifier_dropout)
|
| 981 |
+
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
|
| 982 |
+
|
| 983 |
+
self.post_init()
|
| 984 |
+
|
| 985 |
+
def forward(
|
| 986 |
+
self,
|
| 987 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 988 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 989 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 990 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 991 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 992 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 993 |
+
labels: Optional[torch.Tensor] = None,
|
| 994 |
+
output_attentions: Optional[bool] = None,
|
| 995 |
+
output_hidden_states: Optional[bool] = None,
|
| 996 |
+
return_dict: Optional[bool] = None,
|
| 997 |
+
unpad_inputs: Optional[bool] = None,
|
| 998 |
+
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
|
| 999 |
+
r"""
|
| 1000 |
+
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1001 |
+
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
|
| 1002 |
+
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
|
| 1003 |
+
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
|
| 1004 |
+
"""
|
| 1005 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1006 |
+
|
| 1007 |
+
outputs = self.Vietnamese(
|
| 1008 |
+
input_ids,
|
| 1009 |
+
attention_mask=attention_mask,
|
| 1010 |
+
token_type_ids=token_type_ids,
|
| 1011 |
+
position_ids=position_ids,
|
| 1012 |
+
head_mask=head_mask,
|
| 1013 |
+
inputs_embeds=inputs_embeds,
|
| 1014 |
+
output_attentions=output_attentions,
|
| 1015 |
+
output_hidden_states=output_hidden_states,
|
| 1016 |
+
return_dict=return_dict,
|
| 1017 |
+
unpad_inputs=unpad_inputs,
|
| 1018 |
+
)
|
| 1019 |
+
|
| 1020 |
+
pooled_output = outputs[1]
|
| 1021 |
+
|
| 1022 |
+
pooled_output = self.dropout(pooled_output)
|
| 1023 |
+
logits = self.classifier(pooled_output)
|
| 1024 |
+
|
| 1025 |
+
loss = None
|
| 1026 |
+
if labels is not None:
|
| 1027 |
+
if self.config.problem_type is None:
|
| 1028 |
+
if self.num_labels == 1:
|
| 1029 |
+
self.config.problem_type = "regression"
|
| 1030 |
+
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
|
| 1031 |
+
self.config.problem_type = "single_label_classification"
|
| 1032 |
+
else:
|
| 1033 |
+
self.config.problem_type = "multi_label_classification"
|
| 1034 |
+
|
| 1035 |
+
if self.config.problem_type == "regression":
|
| 1036 |
+
loss_fct = nn.MSELoss()
|
| 1037 |
+
if self.num_labels == 1:
|
| 1038 |
+
loss = loss_fct(logits.squeeze(), labels.squeeze())
|
| 1039 |
+
else:
|
| 1040 |
+
loss = loss_fct(logits, labels)
|
| 1041 |
+
elif self.config.problem_type == "single_label_classification":
|
| 1042 |
+
loss_fct = nn.CrossEntropyLoss()
|
| 1043 |
+
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
| 1044 |
+
elif self.config.problem_type == "multi_label_classification":
|
| 1045 |
+
loss_fct = nn.BCEWithLogitsLoss()
|
| 1046 |
+
loss = loss_fct(logits, labels)
|
| 1047 |
+
|
| 1048 |
+
if not return_dict:
|
| 1049 |
+
output = (logits,) + outputs[2:]
|
| 1050 |
+
return ((loss,) + output) if loss is not None else output
|
| 1051 |
+
|
| 1052 |
+
return SequenceClassifierOutput(
|
| 1053 |
+
loss=loss,
|
| 1054 |
+
logits=logits,
|
| 1055 |
+
hidden_states=outputs.hidden_states,
|
| 1056 |
+
attentions=outputs.attentions,
|
| 1057 |
+
)
|
| 1058 |
+
|
| 1059 |
+
|
| 1060 |
+
class VietnameseForMultipleChoice(VietnamesePreTrainedModel):
|
| 1061 |
+
def __init__(self, config):
|
| 1062 |
+
super().__init__(config)
|
| 1063 |
+
|
| 1064 |
+
self.Vietnamese = VietnameseModel(config, add_pooling_layer=True)
|
| 1065 |
+
classifier_dropout = (
|
| 1066 |
+
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
|
| 1067 |
+
)
|
| 1068 |
+
self.dropout = nn.Dropout(classifier_dropout)
|
| 1069 |
+
self.classifier = nn.Linear(config.hidden_size, 1)
|
| 1070 |
+
|
| 1071 |
+
self.post_init()
|
| 1072 |
+
|
| 1073 |
+
def forward(
|
| 1074 |
+
self,
|
| 1075 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1076 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1077 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1078 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1079 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1080 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1081 |
+
labels: Optional[torch.Tensor] = None,
|
| 1082 |
+
output_attentions: Optional[bool] = None,
|
| 1083 |
+
output_hidden_states: Optional[bool] = None,
|
| 1084 |
+
return_dict: Optional[bool] = None,
|
| 1085 |
+
unpad_inputs: Optional[bool] = None,
|
| 1086 |
+
) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
|
| 1087 |
+
r"""
|
| 1088 |
+
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1089 |
+
Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
|
| 1090 |
+
num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
|
| 1091 |
+
`input_ids` above)
|
| 1092 |
+
"""
|
| 1093 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1094 |
+
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
|
| 1095 |
+
|
| 1096 |
+
input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
|
| 1097 |
+
attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
|
| 1098 |
+
token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
|
| 1099 |
+
position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
|
| 1100 |
+
inputs_embeds = (
|
| 1101 |
+
inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
|
| 1102 |
+
if inputs_embeds is not None
|
| 1103 |
+
else None
|
| 1104 |
+
)
|
| 1105 |
+
|
| 1106 |
+
outputs = self.Vietnamese(
|
| 1107 |
+
input_ids,
|
| 1108 |
+
attention_mask=attention_mask,
|
| 1109 |
+
token_type_ids=token_type_ids,
|
| 1110 |
+
position_ids=position_ids,
|
| 1111 |
+
head_mask=head_mask,
|
| 1112 |
+
inputs_embeds=inputs_embeds,
|
| 1113 |
+
output_attentions=output_attentions,
|
| 1114 |
+
output_hidden_states=output_hidden_states,
|
| 1115 |
+
return_dict=return_dict,
|
| 1116 |
+
unpad_inputs=unpad_inputs,
|
| 1117 |
+
)
|
| 1118 |
+
|
| 1119 |
+
pooled_output = outputs[1]
|
| 1120 |
+
|
| 1121 |
+
pooled_output = self.dropout(pooled_output)
|
| 1122 |
+
logits = self.classifier(pooled_output)
|
| 1123 |
+
reshaped_logits = logits.view(-1, num_choices)
|
| 1124 |
+
|
| 1125 |
+
loss = None
|
| 1126 |
+
if labels is not None:
|
| 1127 |
+
loss_fct = nn.CrossEntropyLoss()
|
| 1128 |
+
loss = loss_fct(reshaped_logits, labels)
|
| 1129 |
+
|
| 1130 |
+
if not return_dict:
|
| 1131 |
+
output = (reshaped_logits,) + outputs[2:]
|
| 1132 |
+
return ((loss,) + output) if loss is not None else output
|
| 1133 |
+
|
| 1134 |
+
return MultipleChoiceModelOutput(
|
| 1135 |
+
loss=loss,
|
| 1136 |
+
logits=reshaped_logits,
|
| 1137 |
+
hidden_states=outputs.hidden_states,
|
| 1138 |
+
attentions=outputs.attentions,
|
| 1139 |
+
)
|
| 1140 |
+
|
| 1141 |
+
|
| 1142 |
+
@dataclass
|
| 1143 |
+
class VietnameseTokenClassifierOutput(ModelOutput):
|
| 1144 |
+
loss: Optional[torch.FloatTensor] = None
|
| 1145 |
+
logits: torch.FloatTensor = None
|
| 1146 |
+
last_hidden_state: torch.FloatTensor = None
|
| 1147 |
+
hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
|
| 1148 |
+
attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
|
| 1149 |
+
|
| 1150 |
+
|
| 1151 |
+
class VietnameseForTokenClassification(VietnamesePreTrainedModel):
|
| 1152 |
+
def __init__(self, config):
|
| 1153 |
+
super().__init__(config)
|
| 1154 |
+
self.num_labels = config.num_labels
|
| 1155 |
+
|
| 1156 |
+
self.Vietnamese = VietnameseModel(config, add_pooling_layer=False)
|
| 1157 |
+
classifier_dropout = (
|
| 1158 |
+
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
|
| 1159 |
+
)
|
| 1160 |
+
self.dropout = nn.Dropout(classifier_dropout)
|
| 1161 |
+
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
|
| 1162 |
+
|
| 1163 |
+
self.post_init()
|
| 1164 |
+
|
| 1165 |
+
def forward(
|
| 1166 |
+
self,
|
| 1167 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1168 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1169 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1170 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1171 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1172 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1173 |
+
labels: Optional[torch.Tensor] = None,
|
| 1174 |
+
output_attentions: Optional[bool] = None,
|
| 1175 |
+
output_hidden_states: Optional[bool] = None,
|
| 1176 |
+
return_dict: Optional[bool] = None,
|
| 1177 |
+
unpad_inputs: Optional[bool] = None,
|
| 1178 |
+
) -> Union[Tuple[torch.Tensor], VietnameseTokenClassifierOutput]:
|
| 1179 |
+
r"""
|
| 1180 |
+
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
| 1181 |
+
Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
|
| 1182 |
+
"""
|
| 1183 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1184 |
+
|
| 1185 |
+
outputs = self.Vietnamese(
|
| 1186 |
+
input_ids,
|
| 1187 |
+
attention_mask=attention_mask,
|
| 1188 |
+
token_type_ids=token_type_ids,
|
| 1189 |
+
position_ids=position_ids,
|
| 1190 |
+
head_mask=head_mask,
|
| 1191 |
+
inputs_embeds=inputs_embeds,
|
| 1192 |
+
output_attentions=output_attentions,
|
| 1193 |
+
output_hidden_states=output_hidden_states,
|
| 1194 |
+
return_dict=return_dict,
|
| 1195 |
+
unpad_inputs=unpad_inputs,
|
| 1196 |
+
)
|
| 1197 |
+
|
| 1198 |
+
sequence_output = outputs[0]
|
| 1199 |
+
|
| 1200 |
+
sequence_output = self.dropout(sequence_output)
|
| 1201 |
+
logits = self.classifier(sequence_output)
|
| 1202 |
+
|
| 1203 |
+
loss = None
|
| 1204 |
+
if labels is not None:
|
| 1205 |
+
loss_fct = nn.CrossEntropyLoss()
|
| 1206 |
+
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
|
| 1207 |
+
|
| 1208 |
+
if not return_dict:
|
| 1209 |
+
output = (logits,) + outputs[2:]
|
| 1210 |
+
return ((loss,) + output) if loss is not None else output
|
| 1211 |
+
|
| 1212 |
+
return VietnameseTokenClassifierOutput(
|
| 1213 |
+
loss=loss,
|
| 1214 |
+
logits=logits,
|
| 1215 |
+
last_hidden_state=sequence_output,
|
| 1216 |
+
hidden_states=outputs.hidden_states,
|
| 1217 |
+
attentions=outputs.attentions,
|
| 1218 |
+
)
|
| 1219 |
+
|
| 1220 |
+
|
| 1221 |
+
class VietnameseForQuestionAnswering(VietnamesePreTrainedModel):
|
| 1222 |
+
def __init__(self, config):
|
| 1223 |
+
super().__init__(config)
|
| 1224 |
+
self.num_labels = config.num_labels
|
| 1225 |
+
|
| 1226 |
+
self.Vietnamese = VietnameseModel(config, add_pooling_layer=False)
|
| 1227 |
+
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
|
| 1228 |
+
|
| 1229 |
+
self.post_init()
|
| 1230 |
+
|
| 1231 |
+
def forward(
|
| 1232 |
+
self,
|
| 1233 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 1234 |
+
attention_mask: Optional[torch.Tensor] = None,
|
| 1235 |
+
token_type_ids: Optional[torch.Tensor] = None,
|
| 1236 |
+
position_ids: Optional[torch.Tensor] = None,
|
| 1237 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 1238 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
| 1239 |
+
start_positions: Optional[torch.Tensor] = None,
|
| 1240 |
+
end_positions: Optional[torch.Tensor] = None,
|
| 1241 |
+
output_attentions: Optional[bool] = None,
|
| 1242 |
+
output_hidden_states: Optional[bool] = None,
|
| 1243 |
+
return_dict: Optional[bool] = None,
|
| 1244 |
+
unpad_inputs: Optional[bool] = None,
|
| 1245 |
+
) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
|
| 1246 |
+
r"""
|
| 1247 |
+
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1248 |
+
Labels for position (index) of the start of the labelled span for computing the token classification loss.
|
| 1249 |
+
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
|
| 1250 |
+
are not taken into account for computing the loss.
|
| 1251 |
+
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
|
| 1252 |
+
Labels for position (index) of the end of the labelled span for computing the token classification loss.
|
| 1253 |
+
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
|
| 1254 |
+
are not taken into account for computing the loss.
|
| 1255 |
+
"""
|
| 1256 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
| 1257 |
+
|
| 1258 |
+
outputs = self.Vietnamese(
|
| 1259 |
+
input_ids,
|
| 1260 |
+
attention_mask=attention_mask,
|
| 1261 |
+
token_type_ids=token_type_ids,
|
| 1262 |
+
position_ids=position_ids,
|
| 1263 |
+
head_mask=head_mask,
|
| 1264 |
+
inputs_embeds=inputs_embeds,
|
| 1265 |
+
output_attentions=output_attentions,
|
| 1266 |
+
output_hidden_states=output_hidden_states,
|
| 1267 |
+
return_dict=return_dict,
|
| 1268 |
+
unpad_inputs=unpad_inputs,
|
| 1269 |
+
)
|
| 1270 |
+
|
| 1271 |
+
sequence_output = outputs[0]
|
| 1272 |
+
|
| 1273 |
+
logits = self.qa_outputs(sequence_output)
|
| 1274 |
+
start_logits, end_logits = logits.split(1, dim=-1)
|
| 1275 |
+
start_logits = start_logits.squeeze(-1).contiguous()
|
| 1276 |
+
end_logits = end_logits.squeeze(-1).contiguous()
|
| 1277 |
+
|
| 1278 |
+
total_loss = None
|
| 1279 |
+
if start_positions is not None and end_positions is not None:
|
| 1280 |
+
if len(start_positions.size()) > 1:
|
| 1281 |
+
start_positions = start_positions.squeeze(-1)
|
| 1282 |
+
if len(end_positions.size()) > 1:
|
| 1283 |
+
end_positions = end_positions.squeeze(-1)
|
| 1284 |
+
ignored_index = start_logits.size(1)
|
| 1285 |
+
start_positions = start_positions.clamp(0, ignored_index)
|
| 1286 |
+
end_positions = end_positions.clamp(0, ignored_index)
|
| 1287 |
+
|
| 1288 |
+
loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
|
| 1289 |
+
start_loss = loss_fct(start_logits, start_positions)
|
| 1290 |
+
end_loss = loss_fct(end_logits, end_positions)
|
| 1291 |
+
total_loss = (start_loss + end_loss) / 2
|
| 1292 |
+
|
| 1293 |
+
if not return_dict:
|
| 1294 |
+
output = (start_logits, end_logits) + outputs[2:]
|
| 1295 |
+
return ((total_loss,) + output) if total_loss is not None else output
|
| 1296 |
+
|
| 1297 |
+
return QuestionAnsweringModelOutput(
|
| 1298 |
+
loss=total_loss,
|
| 1299 |
+
start_logits=start_logits,
|
| 1300 |
+
end_logits=end_logits,
|
| 1301 |
+
hidden_states=outputs.hidden_states,
|
| 1302 |
+
attentions=outputs.attentions,
|
| 1303 |
+
)
|
| 1304 |
+
|
| 1305 |
+
|
| 1306 |
+
|
| 1307 |
+
|
| 1308 |
+
def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
|
| 1309 |
+
"""
|
| 1310 |
+
Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
|
| 1311 |
+
are ignored. This is modified from fairseq's `utils.make_positions`.
|
| 1312 |
+
Args:
|
| 1313 |
+
input_ids: torch.Tensor of token ids
|
| 1314 |
+
Returns: torch.Tensor
|
| 1315 |
+
"""
|
| 1316 |
+
# The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
|
| 1317 |
+
mask = input_ids.ne(padding_idx).int()
|
| 1318 |
+
incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
|
| 1319 |
+
return incremental_indices.long() + padding_idx
|
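The position-id computation in `create_position_ids_from_input_ids` can be traced by hand. The following is a minimal, dependency-free sketch that mirrors the same arithmetic on plain Python lists instead of tensors. The function name `position_ids_from_input_ids` and the example token ids 57 and 92 are hypothetical; only the formula `(cumsum(mask) + past_key_values_length) * mask + padding_idx` is taken from the code above.

```python
def position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    """List-based mirror of the tensor version: one output row per input row."""
    batch = []
    for row in input_ids:
        running = 0  # running count of non-padding tokens (the cumsum of the mask)
        out = []
        for tok in row:
            is_real = int(tok != padding_idx)  # mask entry: 1 for real tokens, 0 for padding
            running += is_real
            # Real tokens get positions starting at padding_idx + 1;
            # padding tokens collapse to padding_idx itself.
            out.append((running + past_key_values_length) * is_real + padding_idx)
        batch.append(out)
    return batch


# Hypothetical batch: <s>=0, two ordinary tokens, </s>=2, then two <pad>=1 tokens.
example = [[0, 57, 92, 2, 1, 1]]
position_ids = position_ids_from_input_ids(example, padding_idx=1)
print(position_ids)
```

With `padding_idx=1`, the four real tokens receive positions 2 through 5 and both padding slots keep the value 1, which is why the docstring says position numbers begin at `padding_idx+1`.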
modules.json
ADDED
|
@@ -0,0 +1,20 @@
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"idx": 0,
|
| 4 |
+
"name": "0",
|
| 5 |
+
"path": "",
|
| 6 |
+
"type": "sentence_transformers.models.Transformer"
|
| 7 |
+
},
|
| 8 |
+
{
|
| 9 |
+
"idx": 1,
|
| 10 |
+
"name": "1",
|
| 11 |
+
"path": "1_Pooling",
|
| 12 |
+
"type": "sentence_transformers.models.Pooling"
|
| 13 |
+
},
|
| 14 |
+
{
|
| 15 |
+
"idx": 2,
|
| 16 |
+
"name": "2",
|
| 17 |
+
"path": "2_Normalize",
|
| 18 |
+
"type": "sentence_transformers.models.Normalize"
|
| 19 |
+
}
|
| 20 |
+
]
|
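modules.json describes the order in which sentence-transformers chains its modules. As a rough sketch of how to read it, the snippet below parses the same JSON, sorts by `idx`, and prints the resulting pipeline. It also shows the L2 normalization that the final `Normalize` stage applies to each embedding. The helper `l2_normalize` and the sample vector are illustrative only and are not part of the repository.

```python
import json
import math

# Contents of modules.json as added in this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
"""

# Modules run in ascending idx order.
modules = sorted(json.loads(MODULES_JSON), key=lambda m: m["idx"])
pipeline = [m["type"].rsplit(".", 1)[-1] for m in modules]
print(" -> ".join(pipeline))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (illustrative helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


unit = l2_normalize([3.0, 4.0])
print(unit)
```

Because the last stage produces unit-length vectors, the dot product of two embeddings equals their cosine similarity.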
sentence_bert_config.json
ADDED
|
@@ -0,0 +1,4 @@
| 1 |
+
{
|
| 2 |
+
"max_seq_length": 8192,
|
| 3 |
+
"do_lower_case": false
|
| 4 |
+
}
|
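The two settings in sentence_bert_config.json control input preprocessing. The sketch below illustrates their effect under simplifying assumptions: `preprocess` and `truncate` are hypothetical helpers, and the truncation here ignores special tokens, which a real tokenizer would account for.

```python
import json

# Contents of sentence_bert_config.json as added in this commit.
CONFIG = json.loads('{"max_seq_length": 8192, "do_lower_case": false}')


def preprocess(text, config):
    """Lowercase the text only when do_lower_case is enabled."""
    return text.lower() if config["do_lower_case"] else text


def truncate(token_ids, config):
    """Keep at most max_seq_length token ids (simplified: no special-token handling)."""
    return token_ids[: config["max_seq_length"]]


text = preprocess("Hà Nội", CONFIG)
kept = truncate(list(range(10000)), CONFIG)
print(text, len(kept))
```

With `do_lower_case` set to false, the case of Vietnamese input is preserved, and inputs longer than 8192 tokens are cut off.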
special_tokens_map.json
ADDED
|
@@ -0,0 +1,51 @@
| 1 |
+
{
|
| 2 |
+
"bos_token": {
|
| 3 |
+
"content": "<s>",
|
| 4 |
+
"lstrip": false,
|
| 5 |
+
"normalized": false,
|
| 6 |
+
"rstrip": false,
|
| 7 |
+
"single_word": false
|
| 8 |
+
},
|
| 9 |
+
"cls_token": {
|
| 10 |
+
"content": "<s>",
|
| 11 |
+
"lstrip": false,
|
| 12 |
+
"normalized": false,
|
| 13 |
+
"rstrip": false,
|
| 14 |
+
"single_word": false
|
| 15 |
+
},
|
| 16 |
+
"eos_token": {
|
| 17 |
+
"content": "</s>",
|
| 18 |
+
"lstrip": false,
|
| 19 |
+
"normalized": false,
|
| 20 |
+
"rstrip": false,
|
| 21 |
+
"single_word": false
|
| 22 |
+
},
|
| 23 |
+
"mask_token": {
|
| 24 |
+
"content": "<mask>",
|
| 25 |
+
"lstrip": true,
|
| 26 |
+
"normalized": false,
|
| 27 |
+
"rstrip": false,
|
| 28 |
+
"single_word": false
|
| 29 |
+
},
|
| 30 |
+
"pad_token": {
|
| 31 |
+
"content": "<pad>",
|
| 32 |
+
"lstrip": false,
|
| 33 |
+
"normalized": false,
|
| 34 |
+
"rstrip": false,
|
| 35 |
+
"single_word": false
|
| 36 |
+
},
|
| 37 |
+
"sep_token": {
|
| 38 |
+
"content": "</s>",
|
| 39 |
+
"lstrip": false,
|
| 40 |
+
"normalized": false,
|
| 41 |
+
"rstrip": false,
|
| 42 |
+
"single_word": false
|
| 43 |
+
},
|
| 44 |
+
"unk_token": {
|
| 45 |
+
"content": "<unk>",
|
| 46 |
+
"lstrip": false,
|
| 47 |
+
"normalized": false,
|
| 48 |
+
"rstrip": false,
|
| 49 |
+
"single_word": false
|
| 50 |
+
}
|
| 51 |
+
}
|
tokenizer.json
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
|
| 3 |
+
size 17082988
|
tokenizer_config.json
ADDED
|
@@ -0,0 +1,62 @@
| 1 |
+
{
|
| 2 |
+
"added_tokens_decoder": {
|
| 3 |
+
"0": {
|
| 4 |
+
"content": "<s>",
|
| 5 |
+
"lstrip": false,
|
| 6 |
+
"normalized": false,
|
| 7 |
+
"rstrip": false,
|
| 8 |
+
"single_word": false,
|
| 9 |
+
"special": true
|
| 10 |
+
},
|
| 11 |
+
"1": {
|
| 12 |
+
"content": "<pad>",
|
| 13 |
+
"lstrip": false,
|
| 14 |
+
"normalized": false,
|
| 15 |
+
"rstrip": false,
|
| 16 |
+
"single_word": false,
|
| 17 |
+
"special": true
|
| 18 |
+
},
|
| 19 |
+
"2": {
|
| 20 |
+
"content": "</s>",
|
| 21 |
+
"lstrip": false,
|
| 22 |
+
"normalized": false,
|
| 23 |
+
"rstrip": false,
|
| 24 |
+
"single_word": false,
|
| 25 |
+
"special": true
|
| 26 |
+
},
|
| 27 |
+
"3": {
|
| 28 |
+
"content": "<unk>",
|
| 29 |
+
"lstrip": false,
|
| 30 |
+
"normalized": false,
|
| 31 |
+
"rstrip": false,
|
| 32 |
+
"single_word": false,
|
| 33 |
+
"special": true
|
| 34 |
+
},
|
| 35 |
+
"250001": {
|
| 36 |
+
"content": "<mask>",
|
| 37 |
+
"lstrip": true,
|
| 38 |
+
"normalized": false,
|
| 39 |
+
"rstrip": false,
|
| 40 |
+
"single_word": false,
|
| 41 |
+
"special": true
|
| 42 |
+
}
|
| 43 |
+
},
|
| 44 |
+
"bos_token": "<s>",
|
| 45 |
+
"clean_up_tokenization_spaces": true,
|
| 46 |
+
"cls_token": "<s>",
|
| 47 |
+
"eos_token": "</s>",
|
| 48 |
+
"extra_special_tokens": {},
|
| 49 |
+
"mask_token": "<mask>",
|
| 50 |
+
"max_length": 8192,
|
| 51 |
+
"model_max_length": 8192,
|
| 52 |
+
"pad_to_multiple_of": null,
|
| 53 |
+
"pad_token": "<pad>",
|
| 54 |
+
"pad_token_type_id": 0,
|
| 55 |
+
"padding_side": "right",
|
| 56 |
+
"sep_token": "</s>",
|
| 57 |
+
"stride": 0,
|
| 58 |
+
"tokenizer_class": "XLMRobertaTokenizerFast",
|
| 59 |
+
"truncation_side": "right",
|
| 60 |
+
"truncation_strategy": "longest_first",
|
| 61 |
+
"unk_token": "<unk>"
|
| 62 |
+
}
|
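The special-token ids in `added_tokens_decoder` and the `"padding_side": "right"` setting together determine what a padded input looks like. The following sketch builds one such input by hand. It assumes the usual XLM-RoBERTa single-sequence template `<s> X </s>`, which is inferred from `tokenizer_class` rather than shown in this commit. The helper `build_inputs` and the token ids 57 and 92 are hypothetical.

```python
# Special-token ids as listed in added_tokens_decoder above.
SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 250001}


def build_inputs(token_ids, pad_to):
    """Wrap ids as <s> X </s>, then right-pad with <pad> up to pad_to entries."""
    ids = [SPECIAL_IDS["<s>"]] + list(token_ids) + [SPECIAL_IDS["</s>"]]
    mask = [1] * len(ids)  # 1 marks real tokens
    n_pad = max(0, pad_to - len(ids))
    # padding_side is "right", so padding is appended after the sequence.
    return ids + [SPECIAL_IDS["<pad>"]] * n_pad, mask + [0] * n_pad


input_ids, attention_mask = build_inputs([57, 92], pad_to=6)
print(input_ids)
print(attention_mask)
```

The attention mask produced this way is what the modeling code uses to locate non-padding tokens, for example in `labels[attention_mask.bool()]`.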