---
language: vi
license: mit
tags:
- image-captioning
- vision
- nlp
- multimodal
datasets:
- uit-viic
metrics:
- bleu
- rouge
- meteor
- cider
pipeline_tag: image-to-text
---
# 🇻🇳 Vietnamese Image Captioning Model

> EfficientNet-B0 × BARTPho | Trained on the UIT-ViIC dataset

## 📌 Introduction

A Vietnamese image captioning model trained on the UIT-ViIC dataset. It generates natural, accurate image descriptions in Vietnamese.

### Applications:

* 🔍 Natural-language image search
* 🦯 Helping visually impaired users access image content
* 🤖 Integration into multimodal AI systems

## 🧠 Model architecture

| Component | Description |
|-----------|-------------|
| Encoder | EfficientNet-B0 (pretrained, from NVIDIA TorchHub) → extracts image features as an embedding vector |
| Decoder | BARTPho-Syllable → generates the caption from the image features |

### Pipeline:

```
Image → EncoderCNN (EfficientNet-B0) → feature vector (embed size = 768)
→ Linear projection → BARTPho encoder
→ BARTPho decoder → Vietnamese caption
```
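
The data flow above can be sketched at the level of tensor shapes. This is only an illustration, not the repository's `ImageCaptioningModel`: the class name `ImageToEncoderInput`, the tiny stand-in backbone, and the `d_model` value are assumptions. In the real model the backbone would be EfficientNet-B0, and the projection width would have to match the hidden size of the loaded BARTPho checkpoint.

```python
import torch
import torch.nn as nn


class ImageToEncoderInput(nn.Module):
    """Maps an image batch to a length-1 sequence of encoder input embeddings."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 embed_size: int = 768, d_model: int = 1024):
        super().__init__()
        self.backbone = backbone                        # CNN returning (B, feat_dim)
        self.embed = nn.Linear(feat_dim, embed_size)    # image feature vector
        self.project = nn.Linear(embed_size, d_model)   # match the text model width (assumed)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)           # (B, feat_dim)
        vec = self.embed(feats)                 # (B, embed_size)
        return self.project(vec).unsqueeze(1)   # (B, 1, d_model)


# Stand-in backbone so the sketch runs without downloading any weights.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                               # (B, 16)
)

bridge = ImageToEncoderInput(toy_backbone, feat_dim=16)
with torch.no_grad():
    encoder_inputs = bridge(torch.randn(2, 3, 224, 224))
print(encoder_inputs.shape)  # torch.Size([2, 1, 1024])
```

A tensor of this shape could be passed to a Hugging Face BART-style model through its `inputs_embeds` argument. Whether this repository wires the encoder in exactly that way is not stated in the card.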

## ⚙️ Training configuration

| Parameter | Value |
|-----------|-------|
| Dataset | UIT-ViIC (train/val/test) |
| Loss | CrossEntropyLoss (padding tokens ignored) |
| Optimizer | Adam (lr = 5e-5) |
| Batch size | 32 |
| Epochs | 30 |
| Gradient clipping | 1.0 |
| Mixed precision | torch.cuda.amp |
| Image augmentation | Resize(256) → RandomCrop(224) → Normalize(ImageNet) |
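
As a rough sketch of how these settings fit together in one optimization step, the snippet below uses a toy token model in place of the captioning model. The padding id, the toy model, and the choice of norm-based clipping (`clip_grad_norm_`) are assumptions; the card only gives the value 1.0. Recent PyTorch releases expose the same AMP functionality under `torch.amp`.

```python
import torch
import torch.nn as nn

PAD_ID = 1  # assumed padding id; use the tokenizer's real pad_token_id
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision only when a GPU is present

# Toy stand-in for the captioning model: token ids -> vocabulary logits.
vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size)).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(inputs)  # (B, T, V)
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # so clipping sees the true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


tokens = torch.randint(2, vocab_size, (32, 12), device=device)  # batch size 32
targets = tokens.clone()
targets[:, -3:] = PAD_ID  # padded positions do not contribute to the loss
loss_value = train_step(tokens, targets)
print(f"loss: {loss_value:.4f}")
```

Calling `scaler.unscale_` before clipping matters under mixed precision: otherwise the threshold of 1.0 would be applied to scaled gradients.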

## 📊 Supported metrics

- BLEU
- ROUGE-L
- METEOR
- CIDEr
- Mean token-level F1
- Mean token-level recall

> Actual scores depend on the checkpoint you load.
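
The card does not define the two token-level metrics, so the following is one plausible reading rather than the project's evaluation code: captions are split on whitespace, overlap is counted as a bag of tokens, and scores are averaged over caption pairs. The function names are hypothetical, and each prediction is compared with a single reference for simplicity.

```python
from collections import Counter


def token_prf(prediction: str, reference: str):
    """Bag-of-tokens precision, recall and F1 for one caption pair."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each token counts at most as often as in both captions.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_token_scores(predictions, references):
    """Average recall and F1 over aligned lists of predictions and references."""
    scores = [token_prf(p, r) for p, r in zip(predictions, references)]
    n = len(scores)
    return {"recall": sum(s[1] for s in scores) / n,
            "f1": sum(s[2] for s in scores) / n}


preds = ["một người đàn ông đang chơi bóng"]
refs = ["người đàn ông đang đá bóng"]
print(mean_token_scores(preds, refs))
```

In this example five tokens overlap, giving a recall of 5/6 and an F1 of 10/13.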

## 🚀 Usage

```python
import torch
from PIL import Image
from torchvision import transforms
from huggingface_hub import hf_hub_download

# image_caption is the module that defines this model; it must be on your Python path
from image_caption import ImageCaptioningModel, Vocabulary

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "vinai/bartpho-syllable"

# Load the vocabulary and build the model
vocab = Vocabulary(model_name=model_name)
model = ImageCaptioningModel(
    embed_size=768,
    bartpho_model_name=model_name,
    train_CNN=False,
    freeze_bartpho=False,
).to(DEVICE)

# Download the checkpoint from Hugging Face (replace the placeholder repo_id)
ckpt_path = hf_hub_download(
    repo_id="username/vietnamese-image-captioning",
    filename="best_image_captioning_model_vietnamese.pth.tar",
)
checkpoint = torch.load(ckpt_path, map_location=DEVICE)
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Image preprocessing: deterministic center crop at inference time
tfm = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("your_image.jpg").convert("RGB")
img = tfm(img).to(DEVICE)  # single image tensor of shape (3, 224, 224)

with torch.no_grad():
    caption = model.predict(img, vocab, max_length=50)

print("Caption:", caption)
```
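
The card does not say how `predict` decodes or what exactly `max_length=50` bounds. As a hedged illustration of what such a call typically does, here is a minimal greedy decoding loop in plain Python. The `greedy_decode` and `toy_scorer` functions and the token ids are hypothetical; the actual model may use beam search or another strategy.

```python
from typing import Callable, List, Sequence


def greedy_decode(next_token_scores: Callable[[List[int]], Sequence[float]],
                  bos_id: int, eos_id: int, max_length: int = 50) -> List[int]:
    """Pick the highest-scoring token at each step until EOS or max_length."""
    tokens = [bos_id]
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos_id:
            break
    return tokens


# Toy scorer over a 5-token vocabulary: emits 2, 3, 4 and then EOS (id 1).
def toy_scorer(prefix: List[int]) -> List[float]:
    script = [2, 3, 4, 1]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(toy_scorer, bos_id=0, eos_id=1))  # [0, 2, 3, 4, 1]
```

In a real captioning model the scorer would run the decoder on the image features plus the tokens generated so far, and the resulting ids would be converted back to text by the tokenizer.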

## 📜 License

- Model: subject to the licenses of BARTPho and EfficientNet
- Dataset: UIT-ViIC (research and educational use only)

## 👤 Author

Nguyễn Thành Đạt