---

language: vi
license: mit
tags:
- image-captioning
- vision
- nlp
- multimodal
datasets:
- uit-viic
metrics:
- bleu
- rouge
- meteor
- cider
pipeline_tag: image-to-text
---

# 🇻🇳 Vietnamese Image Captioning Model

> **EfficientNet-B0 × BARTPho** | *Trained on UIT-ViIC dataset*

## 📌 Introduction

A Vietnamese image captioning model trained on the UIT-ViIC dataset. Given an image, it is intended to produce a natural, accurate description in Vietnamese.

### Applications:

* 🔍 **Image search** with natural-language queries
* 🦯 **Assistance for visually impaired users** in accessing image content
* 🤖 **Integration** into multimodal AI systems

## 🧠 Model architecture

| Component | Description |
|-----------|-------------|
| **Encoder** | EfficientNet-B0 (pretrained, from NVIDIA TorchHub) → extracts image features as an embedding vector |
| **Decoder** | BARTPho-Syllable → generates the caption from the image features |

### Pipeline:

```
Image → EncoderCNN (EfficientNet-B0) → feature vector (embed size = 768)
    → Linear projection → BARTPho encoder
    → BARTPho decoder → Vietnamese caption
```
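The shapes in this pipeline can be sketched as follows. This is a minimal illustration, not the code shipped with the model: `ImageToBartBridge` is a hypothetical name, the backbone is a tiny stand-in for EfficientNet-B0 (which ends in a 1280-dimensional pooled feature), and `bart_hidden_size=1024` is an assumed value that should be read from the loaded BARTPho config in practice. Treating the image as a single-token sequence for the BARTPho encoder is also an assumption.

```python
import torch
import torch.nn as nn


class ImageToBartBridge(nn.Module):
    """Hypothetical sketch: image -> 768-d embedding -> BART-sized token."""

    def __init__(self, cnn_feature_dim: int = 1280, embed_size: int = 768,
                 bart_hidden_size: int = 1024):  # 1024 is an assumed value
        super().__init__()
        # Stand-in for the EfficientNet-B0 backbone: one conv layer followed
        # by global average pooling, giving one feature vector per image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, cnn_feature_dim, kernel_size=3, stride=32, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Replaces the classification head: 1280 -> embed size 768.
        self.to_embed = nn.Linear(cnn_feature_dim, embed_size)
        # Linear projection into the width expected by the BART encoder.
        self.project = nn.Linear(embed_size, bart_hidden_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.backbone(images)       # (batch, 1280)
        embedded = self.to_embed(features)     # (batch, 768)
        projected = self.project(embedded)     # (batch, bart_hidden_size)
        return projected.unsqueeze(1)          # (batch, 1, bart_hidden_size)


bridge = ImageToBartBridge()
encoder_input = bridge(torch.randn(2, 3, 224, 224))
print(encoder_input.shape)
```

A tensor of this shape could be passed to a Hugging Face BART model through its `inputs_embeds` argument, with the decoder then generating the caption token by token.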

## ⚙️ Training configuration

| Parameter | Value |
|---------|---------|
| **Dataset** | UIT-ViIC (train/val/test) |
| **Loss** | CrossEntropyLoss (ignore pad tokens) |
| **Optimizer** | Adam (lr = 5e-5) |
| **Batch size** | 32 |
| **Epochs** | 30 |
| **Gradient clipping** | 1.0 |
| **Mixed Precision** | torch.cuda.amp |
| **Image augmentation** | Resize(256) → RandomCrop(224) → Normalize(ImageNet mean/std) |
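
The settings in the table fit together roughly as in the training step below. It is a sketch under stated assumptions, not the original training script: the model is a toy stand-in, and `PAD_ID`, `VOCAB_SIZE` and `train_step` are hypothetical names. Mixed precision is enabled only when a GPU is available, so the same code also runs on CPU.

```python
import torch
import torch.nn as nn

PAD_ID = 1        # hypothetical pad id; use the tokenizer's real pad token id
VOCAB_SIZE = 100  # toy vocabulary size for this sketch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision only on GPU

# Toy stand-in for the captioning model: token ids -> per-token logits.
model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, 32),
    nn.Linear(32, VOCAB_SIZE),
).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # pad tokens add no loss
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """Run one optimisation step and return the loss value."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(inputs)  # (batch, seq_len, vocab)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first so clipping sees true norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


# One batch of 32 random "captions" whose last three positions are padding.
tokens = torch.randint(2, VOCAB_SIZE, (32, 12), device=device)
tokens[:, -3:] = PAD_ID
loss_value = train_step(tokens[:, :-1], tokens[:, 1:])
print(f"loss: {loss_value:.4f}")
```

Calling `scaler.unscale_` before clipping matters: without it, the 1.0 threshold would be applied to gradients that are still multiplied by the loss scale.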

## 📊 Supported metrics

- **BLEU**
- **ROUGE-L**
- **METEOR**
- **CIDEr**
- **Mean token-level F1**
- **Mean token-level recall**

> *Actual scores depend on the checkpoint that is loaded.*
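
The model card does not define the token-level metrics, so the sketch below shows one common reading: split captions on whitespace, count the tokens shared between prediction and reference, and average per-caption scores. The function names `token_overlap_scores` and `mean_scores`, the lowercasing, and the whitespace tokenization are all assumptions; the evaluation code used for this model may differ.

```python
from collections import Counter


def token_overlap_scores(prediction: str, reference: str) -> dict:
    """Precision, recall and F1 over whitespace tokens (multiset overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps each shared token min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def mean_scores(predictions: list, references: list) -> dict:
    """Average recall and F1 over paired captions."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    scores = [token_overlap_scores(p, r) for p, r in zip(predictions, references)]
    return {
        "recall": sum(s["recall"] for s in scores) / len(scores),
        "f1": sum(s["f1"] for s in scores) / len(scores),
    }


example = token_overlap_scores(
    "một cậu bé đang đá bóng",
    "một cậu bé đang chơi bóng trên sân",
)
print(example)
```

In this example five of the six predicted tokens appear among the eight reference tokens, so recall is 5/8 and F1 is 5/7.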

## 🚀 Usage

```python
import torch
from PIL import Image
from torchvision import transforms
from huggingface_hub import hf_hub_download

from image_caption import ImageCaptioningModel, Vocabulary

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "vinai/bartpho-syllable"

# Load the vocabulary and build the model
vocab = Vocabulary(model_name=model_name)
model = ImageCaptioningModel(
    embed_size=768,
    bartpho_model_name=model_name,
    train_CNN=False,
    freeze_bartpho=False,
).to(DEVICE)

# Download the checkpoint from Hugging Face
ckpt_path = hf_hub_download(
    repo_id="username/vietnamese-image-captioning",
    filename="best_image_captioning_model_vietnamese.pth.tar",
)
checkpoint = torch.load(ckpt_path, map_location=DEVICE)
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Image preprocessing: deterministic center crop for inference
tfm = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("your_image.jpg").convert("RGB")
# Shape (3, 224, 224); add .unsqueeze(0) if predict() expects a batch dimension
img = tfm(img).to(DEVICE)

# Generate the caption without tracking gradients
with torch.no_grad():
    caption = model.predict(img, vocab, max_length=50)

print("Caption:", caption)
```

## 📜 License

- **Model**: Subject to the licenses of BARTPho and EfficientNet
- **Dataset**: UIT-ViIC (research and educational use only)

## 👤 Author

**Nguyễn Thành Đạt**