| { |
| "model_id": "AITeamVN/Vietnamese_Embedding", |
| "downloads": 13362, |
| "tags": [ |
| "sentence-transformers", |
| "safetensors", |
| "xlm-roberta", |
| "Embedding", |
| "sentence-similarity", |
| "vi", |
| "base_model:BAAI/bge-m3", |
| "base_model:finetune:BAAI/bge-m3", |
| "license:apache-2.0", |
| "autotrain_compatible", |
| "text-embeddings-inference", |
| "endpoints_compatible", |
| "region:us" |
| ], |
| "description": "--- license: apache-2.0 language: - vi base_model: - BAAI/bge-m3 pipeline_tag: sentence-similarity library_name: sentence-transformers tags: - Embedding --- ## Model Card: Vietnamese_Embedding Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model to enhance retrieval capabilities for Vietnamese. * The model was trained on approximately 300,000 triplets of queries, positive documents, and negative documents for Vietnamese. * The model was trained with a maximum sequence length of 2048. ## Model Details ### Model Description - **Model Type:** Sentence Transformer - **Base model:** BAAI/bge-m3 - **Maximum Sequence Length:** 2048 tokens - **Output Dimensionality:** 1024 dimensions - **Similarity Function:** Dot product similarity - **Language:** Vietnamese - **License:** Apache 2.0 ## Usage ### Evaluation: - Dataset: Entire training dataset of Legal Zalo 2021. Our model was not trained on this dataset. | Model | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 | MRR@10 | |----------------------|------------|------------|------------|-------------|--------------| | Vietnamese_Reranker | 0.7944 | 0.9324 | 0.9537 | 0.9740 | 0.8672 | | Vietnamese_Embedding_v2 | 0.7262 | 0.8927 | 0.9268 | 0.9578 | 0.8149 | | Vietnamese_Embedding (public) | 0.7274 | 0.8992 | 0.9305 | 0.9568 | 0.8181 | | Vietnamese-bi-encoder (BKAI) | 0.7109 | 0.8680 | 0.9014 | 0.9299 | 0.7951 | | BGE-M3 | 0.5682 | 0.7728 | 0.8382 | 0.8921 | 0.6822 | Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets. Although Vietnamese_Embedding_v2 scores slightly lower on the legal domain, the training data for this phase is much larger, so the model performs very well on other domains. Both models are available via these links: Vietnamese_Embedding_v2, Vietnamese_Reranker. You can reproduce the evaluation results by running `python evaluation_model.py` (data downloaded from Kaggle). 
## Contact Email: nguyennhotrung3004@gmail.com **Developer** Member: Nguyễn Nho Trung, Nguyễn Nhật Quang", |
| "model_explanation_gemini": "\"Fine-tuned from BGE-M3 to generate Vietnamese sentence embeddings for improved retrieval tasks, with a 1024-dimensional output and 2048-token sequence length.\"\n\n### Model Features: \n- **Base Model:** BAAI/bge-m3 \n- **Task:** Sentence similarity (dot product) \n- **Language:** Vietnamese \n- **Sequence Length:** 2048 tokens \n- **Output Dimension:** 1024 \n- **Training Data:** ~300K Vietnamese query-d", |
| "release_year": null, |
| "parameter_count": null, |
| "is_fine_tuned": true, |
| "category": "Embedding", |
| "model_family": "BERT", |
| "api_enhanced": true |
| } |