Update README.md
README.md
CHANGED
@@ -29,7 +29,8 @@ base_model:
 
 ## Introduce
 
-**ViDense** is a **VietNamese Embedding Model**. Fine-tuned and enhanced with tailored methods, ViDense incorporates
+**ViDense** is a **VietNamese Embedding Model**. Fine-tuned and enhanced with tailored methods, ViDense incorporates
+advanced
 techniques to optimize performance for text embeddings in various applications.
 
 Model Configuration and Methods:
@@ -94,20 +95,22 @@ print(cosine_sim_2.item()) # 0.9861876964569092
 ## Performance
 
 Below is a comparision table of the results I achieved compared to some other embedding models on three
-benchmarks: [ZAC](https://huggingface.co/datasets/GreenNode/zalo-ai-legal-text-retrieval-vn/viewer/default?views%5B%5D=default_train), [WebFaq](https://huggingface.co/datasets/PaDaS-Lab/webfaq-retrieval), [OwiFaq](https://huggingface.co/datasets/PaDaS-Lab/owi-faq-retrieval)
+benchmarks: [ZAC](https://huggingface.co/datasets/GreenNode/zalo-ai-legal-text-retrieval-vn/viewer/default?views%5B%5D=default_train), [WebFaq](https://huggingface.co/datasets/PaDaS-Lab/webfaq-retrieval), [OwiFaq](https://huggingface.co/datasets/PaDaS-Lab/owi-faq-retrieval), [ViQuAD2.0](https://huggingface.co/datasets/taidng/UIT-ViQuAD2.0), [Vietnamese-Legal](https://huggingface.co/datasets/CATI-AI/vietnamese-legal-retrieval-with-negatives)
 with metric **Recall@3**
 
-| Model Name | ZAC | WebFaq | OwiFaq |
-
-| [namdp-ptit/ViDense](https://huggingface.co/namdp-ptit/ViDense) | **54.72** | 82.26 | 85.62 |
-| [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 53.64 | 81.52 | 85.02 |
-| [keepitreal/vietnamese-sbert](https://huggingface.co/keepitreal/vietnamese-sbert) | 50.45 | 80.54 | 78.58 |
-| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 46.12 | **83.45** | **86.08** |
+| Model Name | ZAC | WebFaq | OwiFaq | ViQuAD2.0 | Vietnamese-Legal |
+|---------------------------------------------------------------------------------------------------------------------|:----------|:----------|:----------|:----------|:-----------------|
+| [namdp-ptit/ViDense](https://huggingface.co/namdp-ptit/ViDense) | **54.72** | 82.26 | 85.62 | **61.28** | **58.42** |
+| [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 53.64 | 81.52 | 85.02 | 59.12 | 55.70 |
+| [keepitreal/vietnamese-sbert](https://huggingface.co/keepitreal/vietnamese-sbert) | 50.45 | 80.54 | 78.58 | 52.67 | 51.86 |
+| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 46.12 | **83.45** | **86.08** | 58.27 | 49.02 |
 
 Here are the information of these 3 benchmarks:
 
-* ZAC: merge train and test into a new benchmark, ~ 3200 queries, ~ 330K documents in corpus
-* WebFAQ and OwiFaq: merge train and test into a new benchmark, ~ 124K queries, ~ 124K documents in corpus
+* ZAC: merge train and test into a new benchmark, ~ 3200 queries, ~ 330K documents in corpus.
+* WebFAQ and OwiFaq: merge train and test into a new benchmark, ~ 124K queries, ~ 124K documents in corpus.
+* ViQuAD2.0: merge train, validation and test into a new benchmark, ~ 39.6K queries, ~ 39.6K documents in corpus.
+* Vietnamese-Legal: ~ 144K queries, ~ 144K documents in corpus.
 
 ## Contact
 
@@ -142,4 +145,9 @@ Please cite as
 year={2025},
 publisher={Huggingface},
 }
-```
+```
+
+Beta
+0 / 0
+used queries
+1
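Review note: the README reports **Recall@3** but does not say how it is computed. The sketch below shows one common definition, assuming each query has a set of relevant document IDs and a ranked list of retrieved IDs. The function name `recall_at_k` and the toy data are hypothetical and are not taken from the ViDense evaluation code. The table values appear to be percentages, while this function returns a fraction in [0, 1].

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 3,
) -> float:
    """Mean, over queries, of the fraction of relevant docs found in the top-k results.

    Assumption: this is one common definition of Recall@k; the README does not
    state which variant was used for the reported numbers.
    """
    scores = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries that have no labelled relevant document
        top_k = set(ranked.get(query_id, [])[:k])
        scores.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): q1 finds its relevant document in the top 3, q2 does not.
ranked = {
    "q1": ["d3", "d1", "d7", "d2"],
    "q2": ["d9", "d8", "d4", "d5"],
}
relevant = {"q1": {"d1"}, "q2": {"d5"}}

print(recall_at_k(ranked, relevant, k=3))  # 0.5
```

When every query has exactly one relevant document, as in the toy data, this reduces to the share of queries whose relevant document appears in the top 3.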