+# 🇻🇳 Vietnamese Image Captioning Model
+> **EfficientNet-B0 × BARTPho** | *Trained on UIT-ViIC dataset*
+## 📌 Introduction
+A Vietnamese image captioning model trained on the UIT-ViIC dataset. It is designed to generate natural, accurate Vietnamese descriptions of images.
+### Applications:
+* 🔍 **Image search** using natural-language queries
+* 🦯 **Assisting visually impaired users** in accessing image content
+* 🤖 **Integration** into multimodal AI systems
+## 🧠 Model architecture
+| Component | Description |
+|------------|-------|
+| **Encoder** | EfficientNet-B0 (pretrained, loaded from NVIDIA TorchHub) → extracts image features as an embedding vector |
+| **Decoder** | BARTPho-Syllable → generates the caption from the image features |
+### Pipeline:
+```
+Image → EncoderCNN (EfficientNet-B0) → feature vector (embed size = 768)
+      → Linear projection → BARTPho encoder
+      → BARTPho decoder → Vietnamese caption
+```
+## ⚙️ Training configuration
+| Parameter | Value |
+|---------|---------|
+| **Dataset** | UIT-ViIC (train/val/test) |
+| **Loss** | CrossEntropyLoss (ignore pad tokens) |
+| **Optimizer** | Adam (lr = 5e-5) |
+| **Batch size** | 32 |
+| **Epochs** | 30 |
+| **Gradient clipping** | 1.0 |
+| **Mixed Precision** | torch.cuda.amp |
+| **Image augmentation** | Resize(256) → RandomCrop(224) → Normalize(ImageNet mean/std) |
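The loss row above says padding tokens are ignored. As a rough, framework-free illustration of what that means (not the repository's training code), the sketch below averages the negative log-likelihood over non-padding target positions only, which is how `torch.nn.CrossEntropyLoss` behaves with `ignore_index` and the default mean reduction. The `PAD_ID` value, the toy logits, and the function name `masked_cross_entropy` are assumptions made for the example.

```python
import math

PAD_ID = 1  # assumed padding token id, for illustration only


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(logits_per_step, targets, pad_id=PAD_ID):
    """Mean negative log-likelihood over the non-padding target positions."""
    total, count = 0.0, 0
    for logits, target in zip(logits_per_step, targets):
        if target == pad_id:
            continue  # padding positions contribute nothing to the loss
        total -= log_softmax(logits)[target]
        count += 1
    # This sketch returns 0.0 when every position is padding.
    return total / count if count else 0.0


# Toy sequence: vocabulary of 3 tokens, last position is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [5.0, 5.0, 5.0]]
targets = [0, 2, PAD_ID]
loss = masked_cross_entropy(logits, targets)
```

Changing the logits at the padded position leaves `loss` unchanged, which is the point of ignoring pad tokens: short captions in a batch are not penalized for whatever the decoder emits after the caption ends.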
+## 📊 Supported metrics
+- **BLEU**
+- **ROUGE-L**
+- **METEOR**
+- **CIDEr**
+- **Mean token-level F1**
+- **Mean token-level recall**
+> *Actual scores depend on the checkpoint that is loaded.*
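The README does not spell out how the token-level F1 and recall are computed. One common reading, shown here purely as an assumption, is bag-of-tokens overlap between a generated caption and a single reference caption, split on whitespace. The repository's own evaluation code may differ, for example in tokenization or in how multiple references are handled. `token_prf` and `mean_scores` are hypothetical helper names.

```python
from collections import Counter


def token_prf(candidate, reference):
    """Return (precision, recall, f1) from whitespace-token overlap."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each token counts at most as often as it
    # appears in both captions.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_scores(pairs):
    """Average precision, recall and F1 over (candidate, reference) pairs."""
    scores = [token_prf(c, r) for c, r in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


pairs = [
    ("một cậu bé đang đá bóng", "một cậu bé đang chơi bóng đá"),
    ("hai người chơi tennis", "hai người chơi tennis"),
]
mean_p, mean_r, mean_f1 = mean_scores(pairs)
```

In the first pair every generated token appears in the reference, so precision is 1 while recall is 6/7 because the reference token "chơi" is missing from the candidate. Word order is ignored by this measure, which is one reason it is usually reported alongside BLEU or CIDEr rather than instead of them.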
+## 🚀 Usage
+```python
+import torch
+from PIL import Image
+from torchvision import transforms
+from image_caption import ImageCaptioningModel, Vocabulary
+from huggingface_hub import hf_hub_download
+DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model_name = "vinai/bartpho-syllable"
+# Load vocab & model
+vocab = Vocabulary(model_name=model_name)
+model = ImageCaptioningModel(embed_size=768, bartpho_model_name=model_name,
+                     train_CNN=False, freeze_bartpho=False).to(DEVICE)
+# Download the checkpoint from Hugging Face
+# (the repo_id below looks like a placeholder; substitute the actual repository ID)
+ckpt_path = hf_hub_download(repo_id="username/vietnamese-image-captioning",
+                    filename="best_image_captioning_model_vietnamese.pth.tar")
+model.load_state_dict(torch.load(ckpt_path, map_location=DEVICE)["state_dict"])
+model.eval()
+# Image preprocessing
+tfm = transforms.Compose([
+    transforms.Resize((256, 256)),
+    transforms.CenterCrop((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=[0.485, 0.456, 0.406],
+                 std=[0.229, 0.224, 0.225]),
+])
+img = Image.open("your_image.jpg").convert("RGB")
+img = tfm(img).to(DEVICE)
+with torch.no_grad():
+    caption = model.predict(img, vocab, max_length=50)
+print("Caption:", caption)
+```
+## 📜 License
+- **Model**: Subject to the licenses of BARTPho and EfficientNet
+- **Dataset**: UIT-ViIC (for research and educational use only)
+## 👤 Author
+**Nguyễn Thành Đạt**