# 🖼️ Image Captioning Model
## 📝 Introduction
BLIP (Bootstrapping Language-Image Pre-training) pairs a Vision Transformer (ViT) image encoder with a text model to understand and describe images, covering tasks such as image captioning, image-text retrieval, and visual question answering.
This base version is fine-tuned on the COCO dataset for caption generation and supports two modes:
- Conditional: supply a text prompt to steer the caption.
- Unconditional: describe the image from its content alone.

## 📌 Features
- Generates high-quality captions with or without a prompt.
- Needs only a ViT image encoder and a text decoder. Unlike BLIP-2, it uses neither a Q-Former nor a large language model, which keeps it lightweight yet capable.
- Runs on CPU or GPU; on GPU, half-precision (FP16) can be used to speed up inference.
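
The CPU/GPU and FP16 options above can be sketched as follows. This is a minimal sketch, not the author's setup: the helper names `pick_device_and_dtype`, `load_captioner`, and `caption` are hypothetical, and it assumes FP16 is only worth enabling when a CUDA GPU is available.

```python
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor


def pick_device_and_dtype():
    """Return ("cuda", float16) when a GPU is visible, else ("cpu", float32)."""
    if torch.cuda.is_available():
        return "cuda", torch.float16
    # Assumption: FP16 mainly pays off on GPU, so stay in FP32 on CPU.
    return "cpu", torch.float32


def load_captioner(model_id="zhaospei/Model_14"):
    """Load processor and model on the selected device (downloads weights)."""
    device, dtype = pick_device_and_dtype()
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype)
    return processor, model.to(device).eval(), device, dtype


def caption(image, processor, model, device, dtype, prompt=None):
    """Caption a PIL image; pass `prompt` for conditional captioning."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(device, dtype)  # casts only floating-point tensors
    with torch.no_grad():
        out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)
```

Typical use would be `processor, model, device, dtype = load_captioner()` followed by `caption(raw_image, processor, model, device, dtype, prompt="a photography of")`.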

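## 📥 Input
Image: RGB

Input size: any. `BlipProcessor` resizes each image to the fixed resolution stored in the checkpoint's preprocessor config (384×384 by default for `BlipImageProcessor`, with no center crop).

Prompt (optional): for example "a photo of", for guided image-to-text generation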
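
To see what the resizing step does to an arbitrary input, the sketch below runs a default `BlipImageProcessor` on a synthetic image, so nothing is downloaded. It assumes the library defaults; the preprocessor config shipped with this checkpoint may set a different resolution.

```python
from PIL import Image
from transformers import BlipImageProcessor

# Default BLIP image processor (assumption: library defaults, not
# necessarily the settings stored with this checkpoint).
image_processor = BlipImageProcessor()

# Synthetic 640x480 RGB image, so no network access is needed.
image = Image.new("RGB", (640, 480), color=(120, 180, 240))

# Resize, rescale to [0, 1], and normalize; returns a batch of one image.
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]
print(tuple(pixel_values.shape))  # (batch, channels, height, width)
```

Whatever the input size, the printed height and width should equal the resolution configured in the processor.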

## 📤 Output
Caption as a text string, decoded by the tokenizer

Per-token logits are also available if needed
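
One way to turn the per-token logits into log-probabilities is sketched below. The helper `generated_token_log_probs` is hypothetical, and the sketch assumes greedy or sampling decoding (`num_beams=1`), so that each row of the scores lines up with one output sequence.

```python
import torch


def generated_token_log_probs(sequences, scores):
    """Log-probability of each generated token.

    sequences: LongTensor (batch, seq_len) returned by generate()
    scores: tuple of FloatTensor (batch, vocab), one per generated step
    Assumes no beam search, so row i of every score tensor belongs
    to sequence i.
    """
    steps = len(scores)
    # Only the last `steps` tokens were generated; earlier ones are the prompt.
    generated = sequences[:, -steps:]
    log_probs = torch.stack(scores, dim=1).log_softmax(dim=-1)  # (batch, steps, vocab)
    return log_probs.gather(-1, generated.unsqueeze(-1)).squeeze(-1)


# Intended use with the model from this card (not run here):
# out = model.generate(**inputs, num_beams=1, output_scores=True,
#                      return_dict_in_generate=True)
# token_lp = generated_token_log_probs(out.sequences, out.scores)
```

Summing a row of the result gives the log-probability of the whole generated caption under the decoding settings used.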

## 🛠 Installation
```bash
pip install torch torchvision transformers pillow requests
```

## 🧪 Usage example

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and the model from the Hub
processor = BlipProcessor.from_pretrained("zhaospei/Model_14")
model = BlipForConditionalGeneration.from_pretrained("zhaospei/Model_14")

# Download the demo image and make sure it is RGB
url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# -- Conditional captioning --
inputs = processor(raw_image, "a photography of", return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# → "a photography of a woman and her dog"

# -- Unconditional captioning --
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# → "a woman sitting on the beach with her dog"
```

## 📊 Performance & Applications
Improves CIDEr on image captioning by about 2.8 points over previous baselines, as reported for BLIP.

The model also shows good zero-shot transfer to video (inference can run with the model frozen, with no video-specific fine-tuning).

Practical applications include accessibility for visually impaired users, product captions for e-commerce, social media metadata, and more.
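
This card does not say how video is fed to the model. One simple option, assumed here purely for illustration, is to caption a handful of uniformly sampled frames. The hypothetical helper below only picks the frame indices; decoding the video into images is left to whichever video library is in use.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly over a video.

    Each index is the midpoint of one of `num_samples` equal segments.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(uniform_frame_indices(100, 4))  # → [12, 37, 62, 87]
```

Each selected frame, converted to an RGB PIL image, can then be captioned exactly like a still image.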