Caplin43/multimodal-vision-language-mini


Caplin43 committed on Feb 27

Commit 713b2f5 · verified · 1 Parent(s): a7d6c86

Create README.md

Files changed (1)

README.md +53 -0

README.md ADDED

	@@ -0,0 +1,53 @@

+---
+license: mit
+language:
+- en
+pipeline_tag: image-to-text
+tags:
+- vision-language
+- multimodal
+- image-captioning
+- transformer
+---
+# 🖼️ Multimodal Vision Language Mini
+A lightweight multimodal transformer that takes an image and a text instruction and generates a structured description.
+---
+## 🧠 Model Details
+- Architecture: Vision Encoder + Text Decoder
+- Vision Backbone: ViT-base
+- Text Decoder: Transformer (12 layers)
+- Hidden Size: 768
+- Parameters: ~220M
+- Training Samples: 500k image-text pairs
+---
+## 📥 Input
+Image + Instruction
+Example:
+Instruction: "Describe the objects in the image."
+## 📤 Output
+"Two people sitting at a wooden table with laptops."
+---
+## 🎯 Intended Use
+- Image captioning
+- Visual question answering
+- Robotics perception modules
+---
+## ⚠️ Limitations
+- English only
+- Not optimized for high-resolution images
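
As a rough sanity check on the Model Details figures, the sketch below estimates the parameter count of a ViT-base encoder paired with a 12-layer, 768-wide decoder that uses cross-attention. The README does not state the feed-forward width, patch size, input resolution, vocabulary size, or maximum sequence length, so the values used here (3072, 16, 224, 30522, 512) are assumptions borrowed from common ViT-base and BERT-style defaults, not the author's configuration. Under those assumptions the total comes out in the same range as the stated ~220M.

```python
def attention_params(h):
    # Q, K, V and output projections, each a weight matrix plus a bias
    return 4 * (h * h + h)


def ffn_params(h, ffn):
    # Two linear layers: h -> ffn and ffn -> h, with biases
    return h * ffn + ffn + ffn * h + h


def layer_norm_params(h):
    # Scale and shift vectors
    return 2 * h


def encoder_params(h=768, layers=12, ffn=3072, patch=16, channels=3, image=224):
    # ViT-style encoder; patch, channels and image size are assumed defaults
    per_layer = attention_params(h) + ffn_params(h, ffn) + 2 * layer_norm_params(h)
    patch_embed = patch * patch * channels * h + h
    num_tokens = (image // patch) ** 2 + 1  # patches plus one class token
    pos_embed = num_tokens * h
    class_token = h
    final_norm = layer_norm_params(h)
    return layers * per_layer + patch_embed + pos_embed + class_token + final_norm


def decoder_params(h=768, layers=12, ffn=3072, vocab=30522, max_len=512):
    # Decoder layer = self-attention + cross-attention + FFN + three layer norms.
    # vocab and max_len are assumptions; output projection is assumed tied
    # to the token embedding, so it is not counted twice.
    per_layer = 2 * attention_params(h) + ffn_params(h, ffn) + 3 * layer_norm_params(h)
    return layers * per_layer + vocab * h + max_len * h


encoder_total = encoder_params()
decoder_total = decoder_params()
total = encoder_total + decoder_total

print(f"encoder: {encoder_total / 1e6:.1f}M")
print(f"decoder: {decoder_total / 1e6:.1f}M")
print(f"total:   {total / 1e6:.1f}M")
```

The estimate is sensitive to the vocabulary size in particular: the token embedding alone accounts for tens of millions of parameters at this hidden size, so a different tokenizer would shift the total noticeably.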