---
language: vi
license: mit
tags:
- image-captioning
- vision
- nlp
- multimodal
datasets:
- uit-viic
metrics:
- bleu
- rouge
- meteor
- cider
pipeline_tag: image-to-text
---
# 🇻🇳 Vietnamese Image Captioning Model

> **EfficientNet-B0 × BARTPho** | *Trained on UIT-ViIC dataset*

## 📌 Introduction

A Vietnamese image captioning model trained on the UIT-ViIC dataset. Given an image, it generates a natural, accurate description in Vietnamese.

### Applications:

* 🔍 **Image search** using natural-language queries
* 🦯 **Assistance for visually impaired users** in accessing image content
* 🤖 **Integration** into multimodal AI systems

## 🧠 Model architecture

| Component | Description |
|-----------|-------------|
| **Encoder** | EfficientNet-B0 (pretrained, loaded from NVIDIA TorchHub) → extracts image features as an embedding vector |
| **Decoder** | BARTPho-Syllable → generates the caption from the image features |

### Pipeline:

```
Image → EncoderCNN (EfficientNet-B0) → feature vector (embed size = 768)
      → Linear projection → BARTPho encoder
      → BARTPho decoder → Vietnamese caption
```
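
The bridge between the CNN and BARTPho in the pipeline above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the class name `ImageToEncoderInput`, the two stacked linear layers, and the single-token sequence are assumptions. The value 1280 is the width of EfficientNet-B0's pooled feature. `ENCODER_HIDDEN_SIZE = 1024` is a placeholder that must match the hidden size of the BARTPho checkpoint in use (in practice, read it from the model config, e.g. `config.d_model`).

```python
import torch
import torch.nn as nn

CNN_FEATURE_DIM = 1280      # width of EfficientNet-B0's pooled feature vector
EMBED_SIZE = 768            # embed size stated in the pipeline
ENCODER_HIDDEN_SIZE = 1024  # assumption: must equal the BARTPho checkpoint's d_model


class ImageToEncoderInput(nn.Module):
    """Hypothetical bridge: pooled CNN feature -> one-token input for a text encoder."""

    def __init__(self, feature_dim=CNN_FEATURE_DIM, embed_size=EMBED_SIZE,
                 hidden_size=ENCODER_HIDDEN_SIZE):
        super().__init__()
        # Stage 1: map the CNN feature to the image embedding (embed size = 768).
        self.embed = nn.Linear(feature_dim, embed_size)
        # Stage 2: linear projection into the text encoder's hidden size.
        self.project = nn.Linear(embed_size, hidden_size)

    def forward(self, cnn_features):
        x = self.project(self.embed(cnn_features))  # (batch, hidden_size)
        return x.unsqueeze(1)                       # (batch, 1, hidden_size)


bridge = ImageToEncoderInput()
dummy_features = torch.randn(4, CNN_FEATURE_DIM)  # stand-in for real CNN output
encoder_inputs = bridge(dummy_features)
print(encoder_inputs.shape)
```

A tensor of this shape could be passed to a Hugging Face encoder through its `inputs_embeds` argument. Whether the real model does so, or uses a longer sequence of image tokens, is not stated in this card.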

## ⚙️ Training configuration

| Parameter | Value |
|-----------|-------|
| **Dataset** | UIT-ViIC (train/val/test) |
| **Loss** | CrossEntropyLoss (pad tokens ignored) |
| **Optimizer** | Adam (lr = 5e-5) |
| **Batch size** | 32 |
| **Epochs** | 30 |
| **Gradient clipping** | 1.0 |
| **Mixed precision** | torch.cuda.amp |
| **Image augmentation** | Resize(256) → RandomCrop(224) → Normalize (ImageNet mean/std) |
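
To show how these settings fit together, here is a minimal sketch of one optimization step. It assumes a toy linear layer in place of the captioning model, a toy vocabulary of 10 tokens, and `PAD_ID = 1` as a placeholder (in practice, use the tokenizer's `pad_token_id`). It also assumes the clipping value 1.0 is a gradient-norm limit. None of these names come from the original training script.

```python
import torch
import torch.nn as nn

PAD_ID = 1        # assumption: placeholder; use tokenizer.pad_token_id in practice
VOCAB_SIZE = 10   # toy vocabulary, for illustration only

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision only when a GPU is present

toy_model = nn.Linear(8, VOCAB_SIZE).to(device)       # stand-in for the real model
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # pad positions add no loss
optimizer = torch.optim.Adam(toy_model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(features, targets):
    """Run one forward/backward/update step and return the loss as a float."""
    optimizer.zero_grad()
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = toy_model(features)  # (batch, seq_len, vocab)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


features = torch.randn(32, 5, 8, device=device)  # batch size 32
targets = torch.randint(2, VOCAB_SIZE, (32, 5), device=device)
targets[:, -1] = PAD_ID                          # last position is padding
loss_value = train_step(features, targets)
print(f"loss: {loss_value:.4f}")
```

Calling `scaler.unscale_` before clipping matters under mixed precision: otherwise the norm limit would be applied to scaled gradients.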

## 📊 Supported metrics

- **BLEU**
- **ROUGE-L**
- **METEOR**
- **CIDEr**
- **Average token-level F1**
- **Average token-level recall**

> *Actual scores depend on the checkpoint that is loaded.*
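
The card does not define the token-level metrics precisely. One common reading, sketched below as an assumption, splits captions on whitespace, counts overlapping tokens as a multiset, and averages the per-caption scores over the evaluation set. The function names are illustrative and are not part of the `image_caption` package.

```python
from collections import Counter


def token_f1_recall(prediction, reference):
    """Return (F1, recall) from whitespace-token overlap of two captions."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0
    # Multiset intersection: each token counts at most as often as it
    # appears in both captions.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall


def average_scores(predictions, references):
    """Average F1 and recall over paired lists of captions."""
    pairs = [token_f1_recall(p, r) for p, r in zip(predictions, references)]
    n = len(pairs)
    return sum(f for f, _ in pairs) / n, sum(r for _, r in pairs) / n


f1, recall = token_f1_recall("một cậu bé đang chơi bóng", "cậu bé đang đá bóng")
print(f"F1 = {f1:.3f}, recall = {recall:.3f}")
```

In this example four tokens overlap, out of six predicted and five reference tokens, so recall is 4/5 and F1 is 8/11.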

## 🚀 Usage

```python
import torch
from PIL import Image
from torchvision import transforms
from image_caption import ImageCaptioningModel, Vocabulary
from huggingface_hub import hf_hub_download

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "vinai/bartpho-syllable"

# Load vocab & model
vocab = Vocabulary(model_name=model_name)
model = ImageCaptioningModel(embed_size=768, bartpho_model_name=model_name,
                     train_CNN=False, freeze_bartpho=False).to(DEVICE)

# Download the checkpoint from Hugging Face (replace repo_id with the actual repository)
ckpt_path = hf_hub_download(repo_id="username/vietnamese-image-captioning",
                    filename="best_image_captioning_model_vietnamese.pth.tar")
model.load_state_dict(torch.load(ckpt_path, map_location=DEVICE)["state_dict"])
model.eval()

# Image preprocessing (deterministic center crop for inference)
tfm = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),
])

img = Image.open("your_image.jpg").convert("RGB")
# Shape is (3, 224, 224); add .unsqueeze(0) if predict() expects a batch dimension
img = tfm(img).to(DEVICE)

with torch.no_grad():
    caption = model.predict(img, vocab, max_length=50)

print("Caption:", caption)
```

## 📜 License

- **Model**: subject to the licenses of BARTPho and EfficientNet
- **Dataset**: UIT-ViIC (research and educational use only)

## 👤 Author

**Nguyễn Thành Đạt**