---
language:
- vi
license: apache-2.0
library_name: transformers
tags:
- transformers
- embedding
pipeline_tag: sentence-similarity
widget:
- text: tỉnh nào có diện tích lớn nhất việt nam
  output:
  - label: tỉnh nào có diện tích rộng nhất Việt Nam
    score: 0.9861876964569092
  - label: tỉnh nào có diện tích nhỏ nhất Việt Nam
    score: 0.0560965985059738
base_model:
- FacebookAI/xlm-roberta-large
---

# Table of contents

* [Introduction](#introduction)
* [Usage](#usage)
* [Performance](#performance)
* [Contact](#contact)
* [Support The Project](#support-the-project)
* [Citation](#citation)

## Introduction

ViDense is a Vietnamese embedding model. Fine-tuned and enhanced with tailored methods, ViDense incorporates advanced techniques to optimize performance for text embeddings across a variety of applications.

Model Configuration and Methods:

* **Base Model**: FacebookAI/xlm-roberta-large
* Trained for 10 epochs with a train batch size of 2048.
* Uses a three-phase training approach, where the best checkpoint from each phase serves as the base model for the next.
* **Position Encoding**: Rotary Position Encoding
* **Attention**: [Blockwise Parallel Transformer](https://arxiv.org/abs/2305.19370)
* **Pooling**: Mean Pooling
* **[Momentum Encoder](https://arxiv.org/abs/1911.05722)**: Incorporates MoCo (Momentum Contrast) to enhance in-batch negative sampling.
* **Rank Encoder**: Introduces a Rank Encoder to account for transitive positive relationships. By treating positives of positives as relevant to the anchor, it reranks the corpus using the Spearman metric and integrates Spearman weights into the loss calculation for improved ranking.
* **Loss Function**: Cross Entropy Loss

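The MoCo-style objective — cross entropy over one positive and many negatives drawn from a momentum queue — can be sketched as follows. This is a minimal illustration only, not the actual training code: the function name, temperature, and queue size are assumptions, not the model's real settings.

```python
import torch
import torch.nn.functional as F


def moco_contrastive_loss(q, k, queue, temperature=0.05):
    """Cross entropy over one positive and a queue of momentum negatives.

    q:     (B, D) anchor embeddings from the query encoder
    k:     (B, D) positive embeddings from the momentum encoder
    queue: (N, D) cached negatives from previous batches
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    queue = F.normalize(queue, dim=-1)

    # One positive logit per anchor, plus negative logits against the queue.
    l_pos = (q * k).sum(dim=-1, keepdim=True)  # (B, 1)
    l_neg = q @ queue.t()                      # (B, N)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature

    # The positive is always at index 0 of each row.
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)


# Toy usage with random stand-in embeddings.
loss = moco_contrastive_loss(torch.randn(4, 8), torch.randn(4, 8), torch.randn(16, 8))
print(loss.item())
```

In the actual MoCo recipe the queue is updated each step with the momentum encoder's outputs, and the momentum encoder's weights track the query encoder via an exponential moving average.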
## Usage

```bash
pip install -U transformers
```

```python
import torch
from transformers import AutoModel, AutoTokenizer


def avg_pooling(attention_mask, outputs):
    # Mask out padding tokens, then average the remaining token embeddings.
    last_hidden = outputs.last_hidden_state
    return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)


tokenizer = AutoTokenizer.from_pretrained('namdp-ptit/ViDense')
model = AutoModel.from_pretrained('namdp-ptit/ViDense')

sentences = [
    'Tỉnh nào có diện tích lớn nhất Việt Nam',
    'Tỉnh nào có diện tích nhỏ nhất Việt Nam',
    'Tỉnh nào có diện tích rộng nhất Việt Nam'
]

inputs = tokenizer(sentences, return_tensors='pt', padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    outputs = avg_pooling(inputs['attention_mask'], outputs)

cosine_sim_1 = torch.nn.functional.cosine_similarity(
    outputs[0].unsqueeze(0),
    outputs[1].unsqueeze(0)
)
cosine_sim_2 = torch.nn.functional.cosine_similarity(
    outputs[0].unsqueeze(0),
    outputs[2].unsqueeze(0)
)

print(cosine_sim_1.item())  # 0.056096598505973816
print(cosine_sim_2.item())  # 0.9861876964569092
```
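For retrieval over a larger corpus, computing cosine similarity pair by pair is wasteful; after L2 normalization, all pairwise cosine similarities reduce to a single matrix multiplication. A minimal sketch, using random stand-in tensors in place of the model's pooled embeddings (the shapes, not the values, are what matters here):

```python
import torch
import torch.nn.functional as F

# Stand-ins for pooled sentence embeddings: 3 queries, 5 corpus documents.
queries = torch.randn(3, 1024)
corpus = torch.randn(5, 1024)

# After L2 normalization, a dot product equals cosine similarity.
queries = F.normalize(queries, dim=-1)
corpus = F.normalize(corpus, dim=-1)
scores = queries @ corpus.t()  # (3, 5) cosine similarity matrix

# Index of the best-matching document for each query.
best = scores.argmax(dim=-1)
print(scores.shape, best)
```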

## Performance

Below is a comparison of the results I achieved against some other embedding models on three
benchmarks: [ZAC](https://huggingface.co/datasets/GreenNode/zalo-ai-legal-text-retrieval-vn/viewer/default?views%5B%5D=default_train), [WebFaq](https://huggingface.co/datasets/PaDaS-Lab/webfaq-retrieval), and [OwiFaq](https://huggingface.co/datasets/PaDaS-Lab/owi-faq-retrieval),
using **Recall@3** as the metric.

| Model Name                                                                                                          | ZAC       | WebFaq    | OwiFaq    |
|---------------------------------------------------------------------------------------------------------------------|:----------|:----------|:----------|
| [namdp-ptit/ViDense](https://huggingface.co/namdp-ptit/ViDense)                                                     | **54.72** | 82.26     | 85.62     |
| [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 53.64     | 81.52     | 85.02     |
| [keepitreal/vietnamese-sbert](https://huggingface.co/keepitreal/vietnamese-sbert)                                   | 50.45     | 80.54     | 78.58     |
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)                                                                   | 46.12     | **83.45** | **86.08** |

Here is some information about these three benchmarks:

* ZAC: train and test splits merged into a single benchmark; ~3,200 queries and ~330K corpus documents
* WebFaq and OwiFaq: train and test splits merged into a single benchmark; ~124K queries and ~124K corpus documents

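Recall@3 counts a query as a hit when its gold document appears among the top three retrieved results, then averages over all queries. A minimal sketch (the function name and toy data are illustrative, not the actual evaluation code):

```python
def recall_at_k(ranked_ids, gold_id, k=3):
    """1.0 if the gold document is among the top-k retrieved ids, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


# Two toy queries: the first hits within the top 3, the second misses.
results = [
    (['d7', 'd2', 'd9', 'd1'], 'd2'),
    (['d4', 'd8', 'd3', 'd5'], 'd5'),
]
scores = [recall_at_k(ranked, gold, k=3) for ranked, gold in results]
print(sum(scores) / len(scores))  # 0.5
```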
## Contact

**Email**: phuongnamdpn2k2@gmail.com

**LinkedIn**: [Dang Phuong Nam](https://www.linkedin.com/in/dang-phuong-nam-157912288/)

**Facebook**: [Phương Nam](https://www.facebook.com/phuong.namdang.7146557)

## Support The Project

If you find this project helpful and wish to support its ongoing development, here are some ways you can contribute:

1. **Star the Repository**: Show your appreciation by starring the repository. Your support motivates further development and enhancements.
2. **Contribute**: I welcome your contributions! You can help by reporting bugs, submitting pull requests, or suggesting new features.
3. **Donate**: If you’d like to provide financial support, consider making a donation through:
   - Vietcombank: 9912692172 - DANG PHUONG NAM

Thank you for your support!

## Citation

Please cite as:

```bibtex
@misc{ViDense,
  title={ViDense: An Embedding Model for Vietnamese Long Context},
  author={Nam Dang Phuong},
  year={2025},
  publisher={Hugging Face},
}
```