🧠 ViMedEmbedding

ViMedEmbedding is a Vietnamese medical sentence embedding model fine-tuned from PhoBERT using contrastive learning. It aims to encode medical questions, answers, and contexts into a shared semantic vector space for semantic similarity, retrieval, and Q&A matching tasks.


📖 Model Details

  • Base model: vinai/phobert-base
  • Architecture: PhoBERT + Linear projection (768 → 768)
  • Objective: Contrastive loss (InfoNCE)
  • Embedding size: 768
  • Normalization: L2 normalization

This model is trained on ViMedAQA, a Vietnamese dataset for medical Q&A, where each sample consists of:

  • A question (question)
  • A correct answer (answer)
  • A related context (context)

🧩 Intended Uses

This model is designed for:

  • Semantic similarity between Vietnamese medical texts
  • Q&A retrieval and ranking
  • Contextual search in healthcare chatbots
  • Domain-specific embedding generation

🧠 How It Works

Each input (question, answer, context) is encoded using PhoBERT. The [CLS] embedding is projected through a linear layer and L2-normalized. Contrastive learning ensures semantically related pairs are close in embedding space.

```mermaid
graph TD
    A[Question + Context] -->|Tokenizer| B[PhoBERT Encoder]
    C[Answer + Context] -->|Tokenizer| B2[PhoBERT Encoder]
    B --> D[CLS Embedding]
    B2 --> D2[CLS Embedding]
    D --> E[Linear Projection]
    D2 --> E2[Linear Projection]
    E --> F[L2 Normalize]
    E2 --> F2[L2 Normalize]
    F --> G[Contrastive Loss]
    F2 --> G
```
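The forward pass described above can be sketched as a small PyTorch module. This is an assumed reconstruction, not the released source: the class name `ViMedEmbeddingModel` matches the usage example below, but the `encoder` argument, attribute names, and hidden size default are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViMedEmbeddingModel(nn.Module):
    """Sketch of the architecture: PhoBERT encoder, [CLS] pooling,
    768 -> 768 linear projection, then L2 normalization."""

    def __init__(self, encoder=None, hidden_size=768):
        super().__init__()
        if encoder is None:
            # Loads the base encoder; requires the `transformers` package.
            from transformers import AutoModel
            encoder = AutoModel.from_pretrained("vinai/phobert-base")
        self.encoder = encoder
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] token embedding
        return F.normalize(self.proj(cls), p=2, dim=-1)  # unit-length vectors
```

Because the output is L2-normalized, the dot product of two embeddings equals their cosine similarity, which simplifies both the contrastive loss and retrieval.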

📚 Training Data

| Field | Description |
|---|---|
| question | Medical question |
| answer | Correct answer |
| context | Related context such as symptoms or drug effects |

⚙️ Training Procedure

  • Loss: Contrastive loss (InfoNCE, based on cosine similarity)
  • Batch size: 16
  • Epochs: 4
  • Optimizer: AdamW (lr = 2e-5)
  • Evaluation metric: average cosine similarity between anchor–positive pairs

Training and validation data were split using the original ViMedAQA splits.
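The InfoNCE objective over in-batch negatives could look like the sketch below. The symmetric (two-direction) formulation and the temperature value of 0.07 are assumptions for illustration, not documented hyperparameters of this model.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE with in-batch negatives.

    anchors, positives: L2-normalized [batch, dim] tensors where row i of
    each forms a matched (e.g. question, answer) pair; every other row in
    the batch serves as a negative. The temperature is an assumed value.
    """
    # Dot products of unit vectors are cosine similarities.
    logits = anchors @ positives.T / temperature
    labels = torch.arange(anchors.size(0), device=anchors.device)
    # Cross-entropy pulls the diagonal (matched pairs) above the rest,
    # applied in both anchor->positive and positive->anchor directions.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```

When anchor and positive embeddings coincide, the diagonal dominates and the loss approaches zero; for unrelated embeddings it stays near `log(batch_size)`.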


📈 Evaluation Results

| Metric | Validation Value |
|---|---|
| Average cosine similarity | ~0.97 |

Embeddings from semantically related pairs (question–answer) have high cosine similarity, while unrelated pairs have low similarity.
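The reported metric can be reproduced with a few lines of PyTorch. This is a sketch of the metric's definition (mean cosine similarity over row-aligned pairs), demonstrated on synthetic embeddings rather than model outputs:

```python
import torch
import torch.nn.functional as F


def avg_pair_cosine(anchor_embs, positive_embs):
    """Mean cosine similarity of row-aligned (anchor, positive) pairs."""
    return F.cosine_similarity(anchor_embs, positive_embs, dim=-1).mean().item()


# Sanity check with synthetic unit vectors: identical pairs score 1.0,
# while independently sampled high-dimensional vectors score near 0.
matched = F.normalize(torch.randn(32, 768), dim=-1)
unrelated = F.normalize(torch.randn(32, 768), dim=-1)
print(avg_pair_cosine(matched, matched))    # 1.0
print(avg_pair_cosine(matched, unrelated))  # near 0
```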


🚀 Example Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from model import ViMedEmbeddingModel  # local module shipped with the checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViMedEmbeddingModel()
model.load_state_dict(torch.load("checkpoints_with_anchor/best_model.pt", map_location=device))
model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base", use_fast=True)

def get_embedding(text):
    inputs = tokenizer(text, padding="max_length", truncation=True,
                       max_length=256, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = model(inputs["input_ids"], inputs["attention_mask"])
    return emb

# "What blood-pressure-related side effects can Buscopan cause?"
question = "Thuốc Buscopan có thể gây ra tác dụng phụ nào liên quan đến huyết áp?"
# "Buscopan can cause low blood pressure and dizziness."
answer = "Thuốc Buscopan có thể gây hạ huyết áp và chóng mặt."

emb_q = get_embedding(question)
emb_a = get_embedding(answer)

similarity = F.cosine_similarity(emb_q, emb_a).item()
print(f"🔹 Cosine Similarity: {similarity:.4f}")
```

💡 Applications

  • Medical question–answer retrieval
  • Semantic clustering of clinical texts
  • Healthcare chatbot response ranking
  • Information retrieval in biomedical research
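For retrieval and ranking use cases like those above, candidates can be ordered by cosine similarity against the question embedding. A minimal sketch, assuming embeddings are already L2-normalized as the model outputs (the helper name and toy vectors are illustrative):

```python
import torch


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted by descending cosine similarity.

    Assumes all embeddings are L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = candidate_embs @ query_emb             # [num_candidates]
    order = torch.argsort(sims, descending=True)
    return order.tolist(), sims[order].tolist()


# Toy example with hand-built unit vectors: candidate 1 points the same
# way as the query, so it ranks first.
q = torch.tensor([1.0, 0.0])
cands = torch.tensor([[0.0, 1.0], [1.0, 0.0], [0.7071, 0.7071]])
order, scores = rank_candidates(q, cands)
print(order)  # [1, 2, 0]
```

In practice the candidate embeddings would be precomputed with `get_embedding` and the top-ranked answers returned to the chatbot or search layer.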


👨‍💻 Author

Developed by Mouth Ji

  • Base model: vinai/phobert-base
  • Dataset: tmnam20/ViMedAQA


📜 License

This model is released under the MIT License — free to use, modify, and distribute for research and educational purposes.


⭐ Citation

If you use this model, please cite:



```bibtex
@misc{ViMedEmbedding2025,
  title={ViMedEmbedding: Vietnamese Medical Sentence Embedding},
  author={Mouth Ji},
  year={2025},
  publisher={Hugging Face},
  howpublished={https://huggingface.co/MouhJI/vi-heath-embedding}
}
```