Add new SentenceTransformer model with an onnx backend

Files changed (12)

1_Pooling/config.json +10 -0
README.md +588 -0
added_tokens.json +3 -0
bpe.codes +0 -0
config.json +27 -0
config_sentence_transformers.json +10 -0
modules.json +14 -0
onnx/model.onnx +3 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +51 -0
tokenizer_config.json +55 -0
vocab.txt +0 -0

1_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
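
Note on the pooling config above: with `pooling_mode_mean_tokens` set to `true`, the sentence vector is the average of the token vectors, with padding positions excluded. A minimal pure-Python sketch of that idea follows. The function name `mean_pool` and the toy values are illustrative assumptions, not the library's implementation.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding sequence
    return [s / count for s in summed]


# Toy sequence: three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

In the real model each token vector has 768 entries (`word_embedding_dimension`), so the pooled output is a single 768-dimensional vector per input text.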

README.md ADDED

	@@ -0,0 +1,588 @@

+---
+language:
+- vi
+license: apache-2.0
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- dataset_size:100000
+- loss:MatryoshkaLoss
+- loss:MultipleNegativesRankingLoss
+base_model: bkai-foundation-models/vietnamese-bi-encoder
+widget:
+- source_sentence: 'Điều 2 Quyết định 185/QĐ-UB năm 1998 Bảng giá đất tỉnh Bến Tre
+    có nội dung như sau:
+    Điều 2. Giá đất trên được áp dụng cho những trường hợp: Tính thuế chuyển quyền
+    sử dụng cho những trường hợp: Tính thuế chuyển quyền sử dụng đất, thu lệ phí trước
+    bạ, thu tiền sử dụng đất khi giao đất, cho thuê đất, tính giá trị tài sản khi
+    giao đất, bồi thường thiệt hại về đất khi Nhà nước thu hồi.
+    Trường hợp giao đất theo hình thức đấu giá, thì giá đất sẽ do Uỷ ban nhân dân
+    tỉnh cho trường hợp cụ thể.
+    Giá cho thuê đất đối với các tổ chức, cá nhân nước ngoài hoặc xí nghiệp có vốn
+    đầu tư nước ngoài được áp dụng theo quy định của Chính phủ.'
+  sentences:
+  - Điều 2 Quyết định 55/2012/QĐ-UBND dự toán ngân sách phân bổ dự toán ngân sách
+    2013 Bình Dương
+  - Điều 2 Quyết định 185/QĐ-UB năm 1998 Bảng giá đất tỉnh Bến Tre
+  - Điều 3 Quyết định 79/2019/QĐ-UBND mức thu học phí quản lý và sử dụng học phí giáo
+    dục mầm non Huế
+- source_sentence: 'Điều 3 Quyết định 94/QĐ-UBND 2018 kế hoạch hoạt động kiểm soát
+    thủ tục hành chính Lâm Đồng có nội dung như sau:
+    Điều 3. Chánh Văn phòng UBND tỉnh; Thủ trưởng các sở, ban, ngành; Chủ tịch UBND
+    các huyện, thành phố; Chủ tịch UBND các xã, phường, thị trấn trên địa bàn tỉnh
+    chịu trách nhiệm thi hành Quyết định này'
+  sentences:
+  - Điều 3 Quyết định 94/QĐ-UBND 2018 kế hoạch hoạt động kiểm soát thủ tục hành chính
+    Lâm Đồng
+  - Cơ quan nhà nước có thẩm quyền có trách nhiệm gì trong việc giải quyết tranh chấp
+    lao động khi sa thải người lao động?
+  - 'Thăng hạng giáo viên: Điều kiện về thời gian giữ hạng thấp hơn liền kề'
+- source_sentence: 'Điều 8 Thông tư 63/2013/TT-BGTVT hướng dẫn Bản ghi nhớ vận tải
+    đường bộ giữa Campuchia Lào Việt Nam có nội dung như sau:
+    Điều 8. Hồ sơ cấp Giấy phép liên vận CLV
+    1. Đối với xe thương mại:
+    a) Đơn đề nghị cấp Giấy phép liên vận CLV cho phương tiện thương mại quy định
+    tại Phụ lục VI của Thông tư này;
+    b) Giấy phép kinh doanh vận tải bằng xe ô tô hoặc Giấy chứng nhận đăng ký kinh
+    doanh đối với đơn vị kinh doanh vận tải bằng xe ô tô không thuộc đối tượng phải
+    cấp giấy phép kinh doanh vận tải bằng xe ô tô (bản sao có chứng thực hoặc bản
+    sao kèm theo bản chính để đối chiếu);
+    c) Giấy đăng ký phương tiện (bản sao có chứng thực hoặc bản sao kèm theo bản chính
+    để đối chiếu);
+    d) Văn bản chấp thuận khai thác tuyến (đối với phương tiện kinh doanh vận tải
+    hành khách theo tuyến cố định);
+    đ) Trường hợp phương tiện không thuộc sở hữu của đơn vị kinh doanh vận tải thì
+    phải xuất trình thêm tài liệu chứng minh quyền sử dụng hợp pháp của đơn vị kinh
+    doanh vận tải với phương tiện đó (bản sao có chứng thực hoặc bản sao kèm theo
+    bản chính để đối chiếu).
+    2. Đối với xe phi thương mại:
+    a) Đơn đề nghị cấp Giấy phép liên vận CLV cho phương tiện phi thương mại quy định
+    Phụ lục VII của Thông tư này;
+    b) Giấy đăng ký phương tiện (bản sao có chứng thực hoặc bản sao kèm theo bản chính
+    để đối chiếu). Trường hợp phương tiện không thuộc sở hữu của tổ chức, cá nhân
+    thì phải kèm theo tài liệu chứng minh quyền sử dụng hợp pháp của tổ chức, các
+    nhân với phương tiện đó (bản sao có chứng thực hoặc bản sao kèm theo bản chính
+    để đối chiếu);
+    c) Đối với doanh nghiệp, hợp tác xã thực hiện công trình, dự án hoặc hoạt động
+    kinh doanh trên lãnh thổ Lào hoặc Campuchia thì kèm theo Hợp đồng hoặc tài liệu
+    chứng minh đơn vị đang thực hiện công trình, dự án hoặc hoạt động kinh doanh,
+    trên lãnh thổ Lào, Campuchia (bản sao có chứng thực).'
+  sentences:
+  - Bộ Xây dựng ghi nhận các kiến nghị về quy hoạch đô thị và nông thôn
+  - Điều 3 Quyết định 2106/QĐ-BYT 2020 Kế hoạch triển khai chiến dịch tiêm bổ sung
+    vắc xin Sởi Rubella
+  - Điều 8 Thông tư 63/2013/TT-BGTVT hướng dẫn Bản ghi nhớ vận tải đường bộ giữa Campuchia
+    Lào Việt Nam
+- source_sentence: 'Điều 2 Quyết định 16/2010/QĐ-UBND phân vùng môi trường tiếp nhận
+    nước thải khí thải công nghiệp trên địa bàn tỉnh Đồng Nai có nội dung như sau:
+    Điều 2. Xác định và tính toán lưu lượng các nguồn xả nước thải, khí thải công
+    nghiệp
+    1. Các tổ chức, cá nhân là chủ cơ sở sản xuất, kinh doanh, dịch vụ có trách nhiệm
+    quan trắc, thống kê, kiểm toán chất thải để tính toán, xác định lưu lượng nước
+    thải, khí thải công nghiệp để áp dụng hệ số lưu lượng nguồn thải.
+    2. Các tổ chức, cá nhân có trách nhiệm cung cấp đúng, đầy đủ, chính xác và trung
+    thực các thông tin về lưu lượng nước thải, khí thải công nghiệp cho cơ quan quản
+    lý Nhà nước về môi trường. Trong trường hợp số liệu của các tổ chức, cá nhân cung
+    cấp chưa đủ tin cậy, cơ quan quản lý Nhà nước về môi trường sẽ tính toán, xác
+    định hoặc trưng cầu giám định theo quy định pháp luật.
+    3. Trong một số trường hợp đặc thù tùy thuộc vào quy mô, tính chất dự án, cơ sở
+    sản xuất, kinh doanh, dịch vụ, điều kiện cụ thể về môi trường tiếp nhận nước thải
+    và khí thải, địa điểm thực dự án và quy hoạch phát triển kinh tế - xã hội địa
+    phương, Ủy ban nhân dân tỉnh Đồng Nai có những quy định riêng.'
+  sentences:
+  - Điều 2 Quyết định 16/2010/QĐ-UBND phân vùng môi trường tiếp nhận nước thải khí
+    thải công nghiệp trên địa bàn tỉnh Đồng Nai
+  - Điều 16 Thông tư 14/2010/TT-BKHCN hướng dẫn tiêu chuẩn, quy trình thủ tục xét
+    tặng
+  - Người lao động có quyền đơn phương chấm dứt hợp đồng lao động khi được bổ nhiệm
+    giữ chức vụ gì?
+- source_sentence: Điều 29 Nghị định 46/2015 NĐ-CP quy định về thí nghiệm đối chứng,
+    kiểm định chất lượng, thí nghiệm khả năng chịu lực của kết cấu công trình trong
+    quá trình thi công xây dựng. Tôi xin hỏi, trong dự toán công trình giao thông
+    có chi phí kiểm định tạm tính, chủ đầu tư có quyền lập đề cương, dự toán rồi giao
+    cho phòng thẩm định kết quả có giá trị, sau đó thực hiện thuê đơn vị tư vấn có
+    chức năng thực hiện công tác kiểm định được không?Bộ Xây dựng trả lời vấn đề này
+    như sau:Trường hợp kiểm định theo quy định tại Điểm a, Điểm b, Điểm c, Khoản 2,
+    Điều 29 (thí nghiệm đối chứng, kiểm định chất lượng, thí nghiệm khả năng chịu
+    lực của kết cấu công trình trong quá trình thi công xây dựng) Nghị định46/2015/NĐ-CPngày
+    12/5/2015 của Chính phủ về quản lý chất lượng và bảo trì công trình xây dựng thì
+    việc lập đề cương, dự toán kiểm định do tổ chức đáp ứng điều kiện năng lực theo
+    quy định của pháp luật thực hiện.Đối với trường hợp kiểm định theo quy định tại
+    Điểm đ, Khoản 2, Điều 29 Nghị định46/2015/NĐ-CPthì thực hiện theo quy định tại
+    Điều 18 Thông tư26/2016/TT-BXDngày 26/10/2016 của Bộ Xây dựng quy định chi tiết
+    một số nội dung về quản lý chất lượng và bảo trì công trình xây dựng.
+  sentences:
+  - Quy định về trợ cấp với cán bộ xã già yếu nghỉ việc
+  - Có thể thuê kiểm định chất lượng công trình?
+  - Điều kiện doanh nghiệp được hoạt động tư vấn giám sát
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+metrics:
+- cosine_accuracy@1
+- cosine_accuracy@3
+- cosine_accuracy@5
+- cosine_accuracy@10
+- cosine_precision@1
+- cosine_precision@3
+- cosine_precision@5
+- cosine_precision@10
+- cosine_recall@1
+- cosine_recall@3
+- cosine_recall@5
+- cosine_recall@10
+- cosine_ndcg@10
+- cosine_mrr@10
+- cosine_map@100
+model-index:
+- name: bkai-fine-tuned-legal
+  results:
+  - task:
+      type: information-retrieval
+      name: Information Retrieval
+    dataset:
+      name: dim 768
+      type: dim_768
+    metrics:
+    - type: cosine_accuracy@1
+      value: 0.5855925639039504
+      name: Cosine Accuracy@1
+    - type: cosine_accuracy@3
+      value: 0.7033307513555384
+      name: Cosine Accuracy@3
+    - type: cosine_accuracy@5
+      value: 0.7500645494448748
+      name: Cosine Accuracy@5
+    - type: cosine_accuracy@10
+      value: 0.8109992254066615
+      name: Cosine Accuracy@10
+    - type: cosine_precision@1
+      value: 0.5855925639039504
+      name: Cosine Precision@1
+    - type: cosine_precision@3
+      value: 0.23444358378517946
+      name: Cosine Precision@3
+    - type: cosine_precision@5
+      value: 0.15001290988897495
+      name: Cosine Precision@5
+    - type: cosine_precision@10
+      value: 0.08109992254066614
+      name: Cosine Precision@10
+    - type: cosine_recall@1
+      value: 0.5855925639039504
+      name: Cosine Recall@1
+    - type: cosine_recall@3
+      value: 0.7033307513555384
+      name: Cosine Recall@3
+    - type: cosine_recall@5
+      value: 0.7500645494448748
+      name: Cosine Recall@5
+    - type: cosine_recall@10
+      value: 0.8109992254066615
+      name: Cosine Recall@10
+    - type: cosine_ndcg@10
+      value: 0.6937880818561333
+      name: Cosine Ndcg@10
+    - type: cosine_mrr@10
+      value: 0.6568145771089225
+      name: Cosine Mrr@10
+    - type: cosine_map@100
+      value: 0.6626061839086153
+      name: Cosine Map@100
+  - task:
+      type: information-retrieval
+      name: Information Retrieval
+    dataset:
+      name: dim 512
+      type: dim_512
+    metrics:
+    - type: cosine_accuracy@1
+      value: 0.5848179705654531
+      name: Cosine Accuracy@1
+    - type: cosine_accuracy@3
+      value: 0.7002323780015491
+      name: Cosine Accuracy@3
+    - type: cosine_accuracy@5
+      value: 0.7490317583268784
+      name: Cosine Accuracy@5
+    - type: cosine_accuracy@10
+      value: 0.8073844564936742
+      name: Cosine Accuracy@10
+    - type: cosine_precision@1
+      value: 0.5848179705654531
+      name: Cosine Precision@1
+    - type: cosine_precision@3
+      value: 0.23341079266718306
+      name: Cosine Precision@3
+    - type: cosine_precision@5
+      value: 0.1498063516653757
+      name: Cosine Precision@5
+    - type: cosine_precision@10
+      value: 0.0807384456493674
+      name: Cosine Precision@10
+    - type: cosine_recall@1
+      value: 0.5848179705654531
+      name: Cosine Recall@1
+    - type: cosine_recall@3
+      value: 0.7002323780015491
+      name: Cosine Recall@3
+    - type: cosine_recall@5
+      value: 0.7490317583268784
+      name: Cosine Recall@5
+    - type: cosine_recall@10
+      value: 0.8073844564936742
+      name: Cosine Recall@10
+    - type: cosine_ndcg@10
+      value: 0.6917119064236622
+      name: Cosine Ndcg@10
+    - type: cosine_mrr@10
+      value: 0.6551604719691482
+      name: Cosine Mrr@10
+    - type: cosine_map@100
+      value: 0.6611599622252305
+      name: Cosine Map@100
+  - task:
+      type: information-retrieval
+      name: Information Retrieval
+    dataset:
+      name: dim 256
+      type: dim_256
+    metrics:
+    - type: cosine_accuracy@1
+      value: 0.5814613994319648
+      name: Cosine Accuracy@1
+    - type: cosine_accuracy@3
+      value: 0.6935192357345726
+      name: Cosine Accuracy@3
+    - type: cosine_accuracy@5
+      value: 0.7428350116189001
+      name: Cosine Accuracy@5
+    - type: cosine_accuracy@10
+      value: 0.8022205009036922
+      name: Cosine Accuracy@10
+    - type: cosine_precision@1
+      value: 0.5814613994319648
+      name: Cosine Precision@1
+    - type: cosine_precision@3
+      value: 0.2311730785781909
+      name: Cosine Precision@3
+    - type: cosine_precision@5
+      value: 0.14856700232378
+      name: Cosine Precision@5
+    - type: cosine_precision@10
+      value: 0.08022205009036923
+      name: Cosine Precision@10
+    - type: cosine_recall@1
+      value: 0.5814613994319648
+      name: Cosine Recall@1
+    - type: cosine_recall@3
+      value: 0.6935192357345726
+      name: Cosine Recall@3
+    - type: cosine_recall@5
+      value: 0.7428350116189001
+      name: Cosine Recall@5
+    - type: cosine_recall@10
+      value: 0.8022205009036922
+      name: Cosine Recall@10
+    - type: cosine_ndcg@10
+      value: 0.6871061609559359
+      name: Cosine Ndcg@10
+    - type: cosine_mrr@10
+      value: 0.6508078926552976
+      name: Cosine Mrr@10
+    - type: cosine_map@100
+      value: 0.6566099087487134
+      name: Cosine Map@100
+  - task:
+      type: information-retrieval
+      name: Information Retrieval
+    dataset:
+      name: dim 128
+      type: dim_128
+    metrics:
+    - type: cosine_accuracy@1
+      value: 0.5695843015750065
+      name: Cosine Accuracy@1
+    - type: cosine_accuracy@3
+      value: 0.6785437645236251
+      name: Cosine Accuracy@3
+    - type: cosine_accuracy@5
+      value: 0.7273431448489543
+      name: Cosine Accuracy@5
+    - type: cosine_accuracy@10
+      value: 0.7936999741802221
+      name: Cosine Accuracy@10
+    - type: cosine_precision@1
+      value: 0.5695843015750065
+      name: Cosine Precision@1
+    - type: cosine_precision@3
+      value: 0.22618125484120832
+      name: Cosine Precision@3
+    - type: cosine_precision@5
+      value: 0.14546862896979085
+      name: Cosine Precision@5
+    - type: cosine_precision@10
+      value: 0.0793699974180222
+      name: Cosine Precision@10
+    - type: cosine_recall@1
+      value: 0.5695843015750065
+      name: Cosine Recall@1
+    - type: cosine_recall@3
+      value: 0.6785437645236251
+      name: Cosine Recall@3
+    - type: cosine_recall@5
+      value: 0.7273431448489543
+      name: Cosine Recall@5
+    - type: cosine_recall@10
+      value: 0.7936999741802221
+      name: Cosine Recall@10
+    - type: cosine_ndcg@10
+      value: 0.6754615621699942
+      name: Cosine Ndcg@10
+    - type: cosine_mrr@10
+      value: 0.6384098910241435
+      name: Cosine Mrr@10
+    - type: cosine_map@100
+      value: 0.6443976474654151
+      name: Cosine Map@100
+  - task:
+      type: information-retrieval
+      name: Information Retrieval
+    dataset:
+      name: dim 64
+      type: dim_64
+    metrics:
+    - type: cosine_accuracy@1
+      value: 0.5543506325845597
+      name: Cosine Accuracy@1
+    - type: cosine_accuracy@3
+      value: 0.6609863155176865
+      name: Cosine Accuracy@3
+    - type: cosine_accuracy@5
+      value: 0.7061709269300284
+      name: Cosine Accuracy@5
+    - type: cosine_accuracy@10
+      value: 0.7717531629227988
+      name: Cosine Accuracy@10
+    - type: cosine_precision@1
+      value: 0.5543506325845597
+      name: Cosine Precision@1
+    - type: cosine_precision@3
+      value: 0.22032877183922883
+      name: Cosine Precision@3
+    - type: cosine_precision@5
+      value: 0.14123418538600568
+      name: Cosine Precision@5
+    - type: cosine_precision@10
+      value: 0.07717531629227987
+      name: Cosine Precision@10
+    - type: cosine_recall@1
+      value: 0.5543506325845597
+      name: Cosine Recall@1
+    - type: cosine_recall@3
+      value: 0.6609863155176865
+      name: Cosine Recall@3
+    - type: cosine_recall@5
+      value: 0.7061709269300284
+      name: Cosine Recall@5
+    - type: cosine_recall@10
+      value: 0.7717531629227988
+      name: Cosine Recall@10
+    - type: cosine_ndcg@10
+      value: 0.6571206813679893
+      name: Cosine Ndcg@10
+    - type: cosine_mrr@10
+      value: 0.6212180172869554
+      name: Cosine Mrr@10
+    - type: cosine_map@100
+      value: 0.6275272633144896
+      name: Cosine Map@100
+---
+# DEk21_hcmute_embedding
+DEk21_hcmute_embedding is a Vietnamese text embedding model focused on RAG and production efficiency:
+📚 **Training Dataset**:
+The model was trained on an in-house dataset consisting of approximately **100,000 examples** of legal questions and their related contexts.
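+⚙️ **Efficiency**:
+Trained with **Matryoshka loss**, so embeddings can be truncated to smaller dimensions (768, 512, 256, 128, or 64, as evaluated in the metadata above) with a small performance loss: NDCG@10 drops from about 0.694 at 768 dimensions to about 0.657 at 64. Smaller embeddings are faster to compare, which makes the model efficient for production use.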
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Maximum Sequence Length:** 256 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+- **Language:** Vietnamese
+- **License:** apache-2.0
+### Model Sources
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library and `pyvi`, which the example below uses for Vietnamese word segmentation:
+```bash
+pip install -U sentence-transformers pyvi
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+import torch
+from pyvi import ViTokenizer
+# Download from the 🤗 Hub
+model = SentenceTransformer("huyydangg/DEk21_hcmute_embedding")
+# Define the query (a legal question) and the docs (legal provisions)
+query = "Điều kiện để kết hôn hợp pháp là gì?"
+docs = [
+    "Điều 8 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của công dân trong quan hệ gia đình.",
+    "Điều 18 Luật Hôn nhân và gia đình 2014 quy định về độ tuổi kết hôn của nam và nữ.",
+    "Điều 14 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của cá nhân khi tham gia hợp đồng.",
+    "Điều 27 Luật Hôn nhân và gia đình 2014 quy định về các trường hợp không được kết hôn.",
+    "Điều 51 Luật Hôn nhân và gia đình 2014 quy định về việc kết hôn giữa công dân Việt Nam và người nước ngoài."
+]
+# Word-segment the query
+segmented_query = ViTokenizer.tokenize(query)
+# Word-segment each document
+segmented_docs = [ViTokenizer.tokenize(doc) for doc in docs]
+# Encode query and documents
+query_embedding = model.encode([segmented_query])
+doc_embeddings = model.encode(segmented_docs)
+similarities = torch.nn.functional.cosine_similarity(
+    torch.tensor(query_embedding), torch.tensor(doc_embeddings)
+).flatten()
+# Sort documents by cosine similarity
+sorted_indices = torch.argsort(similarities, descending=True)
+sorted_docs = [docs[idx] for idx in sorted_indices]
+sorted_scores = [similarities[idx].item() for idx in sorted_indices]
+# Print sorted documents with their cosine scores
+for doc, score in zip(sorted_docs, sorted_scores):
+    print(f"Document: {doc} - Cosine Similarity: {score:.4f}")
+```
+## Evaluation
+### Metrics
+#### Information Retrieval
+* Datasets: [another-symato/VMTEB-Zalo-legel-retrieval-wseg](https://huggingface.co/datasets/another-symato/VMTEB-Zalo-legel-retrieval-wseg)
+* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+| model                                        | type   |   ndcg@3 |   ndcg@5 |   ndcg@10 |    mrr@3 |    mrr@5 |   mrr@10 |
+|:---------------------------------------------|:-------|---------:|---------:|----------:|---------:|---------:|---------:|
+| huyydangg/DEk21_hcmute_embedding_wseg        | dense  | 0.908405 | 0.914792 |  0.917742 | 0.889583 | 0.893099 | 0.894266 |
+| AITeamVN/Vietnamese_Embedding                | dense  | 0.842687 | 0.854993 |  0.865006 | 0.822135 | 0.82901  | 0.833389 |
+| bkai-foundation-models/vietnamese-bi-encoder | hybrid | 0.827247 | 0.844781 |  0.846937 | 0.799219 | 0.809505 | 0.806771 |
+| bkai-foundation-models/vietnamese-bi-encoder | dense  | 0.814116 | 0.82965  |  0.839567 | 0.796615 | 0.805286 | 0.809572 |
+| AITeamVN/Vietnamese_Embedding                | hybrid | 0.788724 | 0.810062 |  0.820797 | 0.758333 | 0.77224  | 0.776461 |
+| BAAI/bge-m3                                  | dense  | 0.784056 | 0.80665  |  0.817016 | 0.763281 | 0.775859 | 0.780293 |
+| BAAI/bge-m3                                  | hybrid | 0.775239 | 0.797382 |  0.811962 | 0.747656 | 0.763333 | 0.77128  |
+| huyydangg/DEk21_hcmute_embedding             | dense  | 0.752173 | 0.769259 |  0.785101 | 0.72474  | 0.734427 | 0.741076 |
+| hiieu/halong_embedding                       | hybrid | 0.73627  | 0.757183 |  0.779169 | 0.710417 | 0.721901 | 0.731976 |
+| bm25                                         | bm25   | 0.728122 | 0.74974  |  0.761612 | 0.699479 | 0.711198 | 0.715738 |
+| dangvantuan/vietnamese-embedding             | dense  | 0.718971 | 0.746521 |  0.763416 | 0.696354 | 0.711953 | 0.718854 |
+| dangvantuan/vietnamese-embedding             | hybrid | 0.71711  | 0.743537 |  0.758315 | 0.690104 | 0.704792 | 0.712261 |
+| VoVanPhuc/sup-SimCSE-VietNamese-phobert-base | hybrid | 0.688483 | 0.713829 |  0.733894 | 0.660156 | 0.671198 | 0.676961 |
+| hiieu/halong_embedding                       | dense  | 0.656377 | 0.675881 |  0.701368 | 0.630469 | 0.641406 | 0.652057 |
+| VoVanPhuc/sup-SimCSE-VietNamese-phobert-base | dense  | 0.558852 | 0.584799 |  0.611329 | 0.536979 | 0.55112  | 0.562218 |
+## Citation
+You can cite our work as below:
+```bibtex
+@misc{DEk21_hcmute_embedding,
+  title={DEk21_hcmute_embedding: A Vietnamese Text Embedding},
+  author={QUANG HUY},
+  year={2025},
+  publisher={Hugging Face},
+}
+```
+### BibTeX
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+#### MatryoshkaLoss
+```bibtex
+@misc{kusupati2024matryoshka,
+    title={Matryoshka Representation Learning},
+    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+    year={2024},
+    eprint={2205.13147},
+    archivePrefix={arXiv},
+    primaryClass={cs.LG}
+}
+```
+#### MultipleNegativesRankingLoss
+```bibtex
+@misc{henderson2017efficient,
+    title={Efficient Natural Language Response Suggestion for Smart Reply},
+    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+    year={2017},
+    eprint={1705.00652},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
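
Note on the README above: it states that Matryoshka training lets embeddings be truncated. In practice that usually means keeping the first `dim` components and re-normalizing before computing cosine similarity. The sketch below shows this in plain Python; the helper names and the toy 4-dimensional vectors are assumptions for illustration, and real embeddings from this model have 768 components.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # a zero vector cannot be normalized
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query and a document embedding.
query = [0.6, 0.8, 0.1, -0.2]
doc = [0.5, 0.9, -0.3, 0.4]
for dim in (4, 2):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be a more convenient way to get truncated embeddings directly; check the installed version's documentation before relying on it.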

added_tokens.json ADDED

	@@ -0,0 +1,3 @@

+{
+  "<mask>": 64000
+}

bpe.codes ADDED

The diff for this file is too large to render. See raw diff

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "architectures": [
+    "RobertaModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 258,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "tokenizer_class": "PhobertTokenizer",
+  "torch_dtype": "float32",
+  "transformers_version": "4.52.4",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 64001
+}

config_sentence_transformers.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "__version__": {
+    "sentence_transformers": "4.1.0",
+    "transformers": "4.52.4",
+    "pytorch": "2.6.0+cu124"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]

onnx/model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:197140746f98508307a8fe2d70def43599963d5104f68859964bd7695af3ca9e
+size 537974349
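
Note on the Git LFS pointer above: the three lines record the spec version, the SHA-256 digest, and the byte size of the real `model.onnx`. A downloaded copy can be checked against the pointer with the standard library alone. This is a rough sketch; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any Git LFS tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Pointer content as committed (without the leading "+" diff markers).
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:197140746f98508307a8fe2d70def43599963d5104f68859964bd7695af3ca9e
size 537974349
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer file into a dict entry."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and SHA-256 digest match the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algorithm, _, expected = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algorithm}")
    if path.stat().st_size != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so a ~538 MB model does not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


print(parse_lfs_pointer(POINTER)["size"])  # 537974349
```

Calling `matches_pointer(Path("onnx/model.onnx"), POINTER)` on a local checkout would then report whether the download is complete and uncorrupted; the path is an assumption about where the file was saved.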

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 256,
+  "do_lower_case": false
+}
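
Note on `max_seq_length`: the value 256 here lines up with `max_position_embeddings: 258` and `pad_token_id: 1` in `config.json`. Assuming the position-id convention of the Hugging Face RoBERTa implementation, where position ids for real tokens start at `pad_token_id + 1`, the usable sequence length (special tokens included) works out as below. The function name is an illustrative assumption.

```python
def max_usable_tokens(max_position_embeddings: int, pad_token_id: int) -> int:
    """Longest sequence that fits the position-embedding table.

    Positions 0..pad_token_id are reserved, so real tokens occupy
    indices pad_token_id + 1 .. max_position_embeddings - 1.
    """
    return max_position_embeddings - pad_token_id - 1


# Values taken from config.json in this commit.
print(max_usable_tokens(max_position_embeddings=258, pad_token_id=1))  # 256
```

Inputs longer than this are truncated by the tokenizer (`model_max_length` is also 256), so long legal passages may need to be split into chunks before encoding.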

special_tokens_map.json ADDED

	@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "64000": {
+      "content": "<mask>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 256,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "PhobertTokenizer",
+  "unk_token": "<unk>"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff