# 🧠 ViMedEmbedding
ViMedEmbedding is a Vietnamese medical sentence embedding model fine-tuned from PhoBERT using contrastive learning. It aims to encode medical questions, answers, and contexts into a shared semantic vector space for semantic similarity, retrieval, and Q&A matching tasks.
## 📖 Model Details
- Base model: `vinai/phobert-base`
- Architecture: PhoBERT + linear projection (768 → 768)
- Objective: contrastive loss (InfoNCE)
- Embedding size: 768
- Normalization: L2 normalization
This model is trained on ViMedAQA, a Vietnamese medical Q&A dataset, where each sample consists of:
- A question (`question`)
- A correct answer (`answer`)
- A related context (`context`)
## 🧩 Intended Uses
This model is designed for:
- Semantic similarity between Vietnamese medical texts
- Q&A retrieval and ranking
- Contextual search in healthcare chatbots
- Domain-specific embedding generation
## 🧠 How It Works
Each input (question, answer, or context) is encoded with PhoBERT.
The `[CLS]` embedding is projected through a linear layer and L2-normalized.
Contrastive learning pulls semantically related pairs close together in the embedding space and pushes unrelated pairs apart.
```mermaid
graph TD
    A[Question + Context] -->|Tokenizer| B[PhoBERT Encoder]
    C[Answer + Context] -->|Tokenizer| B2[PhoBERT Encoder]
    B --> D[CLS Embedding]
    B2 --> D2[CLS Embedding]
    D --> E[Linear Projection]
    D2 --> E2[Linear Projection]
    E --> F[L2 Normalize]
    E2 --> F2[L2 Normalize]
    F --> G[Contrastive Loss]
    F2 --> G
```
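For reference, here is a minimal sketch of an encoder matching the description above. The actual `ViMedEmbeddingModel` class shipped with this repo may differ in details (dropout, pooling, extra layers); this is an illustration, not the exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class ViMedEmbeddingModel(nn.Module):
    """PhoBERT encoder + linear projection + L2 normalization (sketch only)."""
    def __init__(self, base_model="vinai/phobert-base", dim=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        self.proj = nn.Linear(dim, dim)  # 768 -> 768 projection

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] token embedding
        return F.normalize(self.proj(cls), dim=-1)  # unit-length embedding
```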
## 📚 Training Data
| Field | Description |
|---|---|
| `question` | Medical question |
| `answer` | Correct answer |
| `context` | Related context, such as symptoms or drug effects |
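The dataset can be pulled from the Hugging Face Hub for inspection. The split (and any required configuration name) below are assumptions; check the `tmnam20/ViMedAQA` dataset card for the exact names.

```python
from datasets import load_dataset

# Split/config names are assumptions; see the tmnam20/ViMedAQA dataset card.
ds = load_dataset("tmnam20/ViMedAQA", split="train")
print(ds[0]["question"])
print(ds[0]["answer"])
print(ds[0]["context"])
```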
## ⚙️ Training Procedure
- Loss: contrastive loss based on cosine similarity (see the sketch below)
- Batch size: 16
- Epochs: 4
- Optimizer: AdamW (lr = 2e-5)
- Evaluation metric: average cosine similarity between anchor–positive pairs

Training and validation data were split using the original ViMedAQA splits.
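The training code itself is not included in this card. The following is a minimal sketch of an in-batch InfoNCE loss over cosine similarities, consistent with the setup above; the function name and temperature value are assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.05):
    """In-batch InfoNCE: the i-th positive is the target for the i-th anchor.
    Both inputs are L2-normalized (batch, dim) tensors, so the matrix product
    below is a matrix of cosine similarities. Temperature is an assumption."""
    logits = anchor_emb @ positive_emb.T / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```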
## 📈 Evaluation Results
| Metric | Validation Value |
|---|---|
| Average Cosine Similarity | ~0.97 |
Embeddings from semantically related pairs (question–answer) have high cosine similarity, while unrelated pairs have low similarity.
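As a rough illustration, the metric can be reproduced with a helper like the hypothetical one below; `embed_fn` stands in for an embedding function such as the `get_embedding` defined in the usage example that follows.

```python
import torch.nn.functional as F

def avg_pair_similarity(pairs, embed_fn):
    """Mean cosine similarity over (anchor, positive) text pairs.
    embed_fn maps a string to a (1, 768) L2-normalized embedding,
    e.g. get_embedding from the usage example below."""
    sims = [F.cosine_similarity(embed_fn(q), embed_fn(a)).item() for q, a in pairs]
    return sum(sims) / len(sims)
```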
## 🚀 Example Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

from model import ViMedEmbeddingModel  # model definition shipped with this repo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned checkpoint.
model = ViMedEmbeddingModel()
model.load_state_dict(torch.load("checkpoints_with_anchor/best_model.pt", map_location=device))
model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base", use_fast=True)

def get_embedding(text):
    """Tokenize one text and return its L2-normalized embedding."""
    inputs = tokenizer(text, padding="max_length", truncation=True,
                       max_length=256, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model(inputs["input_ids"], inputs["attention_mask"])
    return emb

# "What blood-pressure-related side effects can Buscopan cause?"
question = "Thuốc Buscopan có thể gây ra tác dụng phụ nào liên quan đến huyết áp?"
# "Buscopan can cause low blood pressure and dizziness."
answer = "Thuốc Buscopan có thể gây hạ huyết áp và chóng mặt."

emb_q = get_embedding(question)
emb_a = get_embedding(answer)
similarity = F.cosine_similarity(emb_q, emb_a).item()
print(f"🔹 Cosine Similarity: {similarity:.4f}")
```
## 💡 Applications
- Medical question–answer retrieval
- Semantic clustering of clinical texts
- Healthcare chatbot response ranking
- Information retrieval in biomedical research
## 👨‍💻 Author
- Developed by: Mouth Ji
- Base model: `vinai/phobert-base`
- Dataset: `tmnam20/ViMedAQA`
## 📜 License
This model is released under the MIT License — free to use, modify, and distribute for research and educational purposes.
## ⭐ Citation
If you use this model, please cite:

```bibtex
@misc{ViMedEmbedding2025,
  title={ViMedEmbedding: Vietnamese Medical Sentence Embedding},
  author={Mouth Ji},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MouhJI/vi-heath-embedding}}
}
```