- transformers
---

## Model Description
[**vietnamese-embedding**](https://huggingface.co/dangvantuan/vietnamese-embedding) is an embedding model for the Vietnamese language. It is a specialized sentence-embedding model trained specifically for Vietnamese, leveraging the robust capabilities of PhoBERT, a pre-trained language model based on the RoBERTa architecture.

The model uses PhoBERT to encode Vietnamese sentences into a 768-dimensional vector space, facilitating a wide range of applications from semantic search to text clustering. The embeddings capture the nuanced meanings of Vietnamese sentences, reflecting both the lexical and contextual layers of the language.

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
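
The `Pooling` module above produces the sentence embedding by mean-pooling token embeddings under the attention mask. For reference, here is a minimal sketch of the equivalent computation with plain `transformers` (adapted from the standard sentence-transformers recipe; it assumes the Hub repository exposes the underlying Roberta weights, and inputs should still be word-segmented with pyvi):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from pyvi.ViTokenizer import tokenize

# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens do not contribute to the sentence vector.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element: all token embeddings
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

sentences = [tokenize("Hà Nội là thủ đô của Việt Nam")]

tokenizer = AutoTokenizer.from_pretrained('dangvantuan/vietnamese-embedding')
model = AutoModel.from_pretrained('dangvantuan/vietnamese-embedding')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
```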

## Training and Fine-tuning process
The model underwent a rigorous four-stage training and fine-tuning process, each stage tailored to enhance its ability to generate precise and contextually relevant sentence embeddings for the Vietnamese language. The stages are outlined below; illustrative sketches of Stages 2 and 4 follow the list.

#### Stage 1: Initial Training
- Dataset: [ViNLI-SimCSE-supervised](https://huggingface.co/datasets/anti-ai/ViNLI-SimCSE-supervised)
- Method: Trained using the [SimCSE approach](https://arxiv.org/abs/2104.08821), which employs a supervised contrastive learning framework. The model was optimized using [Triplet Loss](https://www.sbert.net/docs/package_reference/losses.html#tripletloss) to learn effectively from high-quality annotated sentence pairs.

#### Stage 2: Continued Fine-tuning
- Dataset: [XNLI-vn](https://huggingface.co/datasets/xnli/viewer/vi)
- Method: Continued fine-tuning using [Multiple Negatives Ranking Loss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss). This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics.

#### Stage 3: Continued Fine-tuning for Semantic Textual Similarity on the STS Benchmark
- Dataset: [STSB-vn](https://huggingface.co/datasets/doanhieung/vi-stsbenchmark)
- Method: Fine-tuned specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the `sentence-transformers` library. This stage honed the model's precision in capturing semantic similarity across various types of Vietnamese text.

#### Stage 4: Advanced Augmentation Fine-tuning
- Dataset: STSB-vn augmented with [silver samples generated from the gold samples](https://www.sbert.net/examples/training/data_augmentation/README.html)
- Method: Employed an advanced strategy using [Augmented SBERT](https://arxiv.org/abs/2010.08240) with pair sampling strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by dynamically enriching the training data, enhancing the model's robustness and accuracy in understanding and processing complex Vietnamese language constructs.
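
As a first illustration, here is a minimal sketch of Stage 2 style fine-tuning with `MultipleNegativesRankingLoss`. The sentence pairs, backbone checkpoint, and hyperparameters are placeholder assumptions, not the actual training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from pyvi.ViTokenizer import tokenize

# Hypothetical (anchor, positive) pairs standing in for the XNLI-vn entailment data.
train_examples = [
    InputExample(texts=[tokenize("Hà Nội là thủ đô của Việt Nam"),
                        tokenize("Việt Nam có thủ đô là Hà Nội")]),
    InputExample(texts=[tokenize("Trời hôm nay nắng đẹp"),
                        tokenize("Thời tiết hôm nay rất tốt")]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# With this loss, every other positive in the batch acts as a negative,
# teaching the model to rank the true pair above all alternatives.
model = SentenceTransformer("vinai/phobert-base")
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```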
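
And a second sketch of the Stage 4 Augmented SBERT workflow: a Cross-Encoder scores sampled sentence pairs to create silver labels that augment the gold data for the Bi-Encoder. Checkpoints, pairs, and settings are again placeholders, and the Cross-Encoder's own fine-tuning on the gold STSB-vn pairs is omitted for brevity:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

# Hypothetical sampled pairs without gold similarity labels.
unlabeled_pairs = [("Trời đang mưa", "Hôm nay trời mưa to")]

# 1. A Cross-Encoder (ideally fine-tuned on the gold pairs first) predicts
#    a similarity score for each sampled pair: the "silver" label.
cross_encoder = CrossEncoder("vinai/phobert-base", num_labels=1)
silver_scores = cross_encoder.predict(unlabeled_pairs)

silver_examples = [
    InputExample(texts=list(pair), label=float(score))
    for pair, score in zip(unlabeled_pairs, silver_scores)
]

# 2. The Bi-Encoder is then fine-tuned on gold + silver data together;
#    only the silver portion is shown here.
bi_encoder = SentenceTransformer("vinai/phobert-base")
loader = DataLoader(silver_examples, shuffle=True, batch_size=16)
bi_encoder.fit(
    train_objectives=[(loader, losses.CosineSimilarityLoss(bi_encoder))],
    epochs=1,
)
```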

## Usage

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed. Because the model is PhoBERT-based, input sentences must be word-segmented first, e.g. with [pyvi](https://pypi.org/project/pyvi/):

```
pip install -U sentence-transformers
pip install -q pyvi
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from pyvi.ViTokenizer import tokenize

sentences = ["Hà Nội thủ đô của Việt Nam", "Đà Nẵng là thành phố du lịch"]
tokenized_sentences = [tokenize(sent) for sent in sentences]

model = SentenceTransformer('dangvantuan/vietnamese-embedding')
embeddings = model.encode(tokenized_sentences)
print(embeddings)
```
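
The embeddings can be compared directly for the semantic-search use case mentioned above. A minimal sketch continuing from the previous snippet (the query sentence is an arbitrary example):

```python
from sentence_transformers import util

# Rank the example sentences against a query by cosine similarity.
query_embedding = model.encode(tokenize("Thủ đô của Việt Nam là thành phố nào?"))
scores = util.cos_sim(query_embedding, embeddings)
print(scores)  # higher score = semantically closer to the query
```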

## Evaluation
The model can be evaluated on the Vietnamese STS Benchmark data as follows:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset
from pyvi.ViTokenizer import tokenize

def convert_dataset(dataset):
    dataset_samples = []
    for df in dataset:
        score = float(df['score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[tokenize(df['sentence1']),
                                          tokenize(df['sentence2'])], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

model = SentenceTransformer('dangvantuan/vietnamese-embedding')

# Loading the dataset for evaluation
vi_sts = load_dataset("doanhieung/vi-stsbenchmark")["train"]
df_dev = vi_sts.filter(lambda example: example['split'] == 'dev')
df_test = vi_sts.filter(lambda example: example['split'] == 'test')

# Convert the dataset for evaluation

# For the dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For the test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```
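
When called, each evaluator computes the Pearson and Spearman correlations between the gold scores and the cosine similarity of the embedded sentence pairs, and writes the results to a CSV file in `output_path`.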

### Test Results
The performance is measured using Pearson and Spearman correlation on the dev set:

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- | ------------- |
| [dangvantuan/vietnamese-embedding](https://huggingface.co/dangvantuan/vietnamese-embedding) | 88.33 | 88.2 | 135M |
| [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 84.65 | 84.59 | 135M |
| [keepitreal/vietnamese-sbert](https://huggingface.co/keepitreal/vietnamese-sbert) | 84.51 | 84.44 | 135M |
| [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) | 78.05 | 77.94 | 135M |

### Metrics on all datasets of the [Semantic Textual Similarity Benchmark](https://huggingface.co/datasets/doanhieung/vi-stsbenchmark)

**Pearson score**

| Model | STSB-vn | STS12-vn | STS13-vn | STS14-vn | STS15-vn | STS16-vn | SICK-vn | Mean |
|-------|---------|----------|----------|----------|----------|----------|---------|------|
| [dangvantuan/vietnamese-embedding](https://huggingface.co/dangvantuan/vietnamese-embedding) | 84.87 | 87.23 | 85.39 | 82.94 | 86.91 | 79.39 | 82.77 | 84.21 |
| [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 81.52 | 85.02 | 78.22 | 75.94 | 81.53 | 75.39 | 77.75 | 79.33 |
| [keepitreal/vietnamese-sbert](https://huggingface.co/keepitreal/vietnamese-sbert) | 80.54 | 78.58 | 80.75 | 76.98 | 82.57 | 73.21 | 80.16 | 78.97 |
| [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) | 73.30 | 67.84 | 71.69 | 69.80 | 78.40 | 74.29 | 76.01 | 73.04 |

**Spearman score**

| Model | STSB-vn | STS12-vn | STS13-vn | STS14-vn | STS15-vn | STS16-vn | SICK-vn | Mean |
|-------|---------|----------|----------|----------|----------|----------|---------|------|
| [dangvantuan/vietnamese-embedding](https://huggingface.co/dangvantuan/vietnamese-embedding) | 84.84 | 79.04 | 85.30 | 81.38 | 87.06 | 79.95 | 79.58 | 82.45 |
| [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 81.43 | 76.51 | 79.19 | 74.91 | 81.72 | 76.57 | 76.45 | 78.11 |
| [keepitreal/vietnamese-sbert](https://huggingface.co/keepitreal/vietnamese-sbert) | 80.16 | 69.08 | 80.99 | 73.67 | 82.81 | 74.30 | 73.40 | 76.34 |
| [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) | 72.16 | 63.86 | 71.82 | 66.20 | 78.62 | 74.24 | 70.87 | 71.11 |

## Citation

```bibtex
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}
}
```