# 🖼️ Multimodal Vision Language Mini

A lightweight multimodal transformer that takes an image and a text instruction and generates a structured description.
## 🧠 Model Details
- Architecture: Vision Encoder + Text Decoder
- Vision Backbone: ViT-base
- Text Decoder: Transformer (12 layers)
- Hidden Size: 768
- Parameters: ~220M
- Training Samples: 500k image-text pairs
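The listed hyperparameters are roughly consistent with the ~220M total. A minimal back-of-the-envelope sketch (the ViT-base size, the BERT-style vocabulary, the 4x FFN expansion, and the presence of cross-attention in each decoder layer are assumptions, not stated in the card):

```python
HIDDEN = 768                   # hidden size from the card
DECODER_LAYERS = 12            # text decoder depth from the card
VIT_BASE_PARAMS = 86_000_000   # commonly cited ViT-base size (assumption)
VOCAB_SIZE = 30_522            # BERT-style vocabulary (assumption)

# Per decoder layer: self-attention (4 * h^2), cross-attention (4 * h^2),
# and a feed-forward block with 4x expansion (8 * h^2); biases and
# layer norms are omitted from this estimate.
per_layer = (4 + 4 + 8) * HIDDEN ** 2
decoder = DECODER_LAYERS * per_layer + VOCAB_SIZE * HIDDEN  # + token embeddings

total = VIT_BASE_PARAMS + decoder
print(f"~{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands near the card's ~220M figure.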
## 📥 Input

An image plus a text instruction.

Example instruction: "Describe the objects in the image."
## 📤 Output

Example: "Two people sitting at a wooden table with laptops."
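The input/output contract above can be sketched with simple request/response types. The class and field names here are purely illustrative, not part of any released API for this model:

```python
from dataclasses import dataclass

@dataclass
class VLMRequest:
    image_path: str    # path to the input image
    instruction: str   # natural-language instruction (English only)

@dataclass
class VLMResponse:
    description: str   # structured description produced by the text decoder

req = VLMRequest("table.jpg", "Describe the objects in the image.")
# A real call would run the vision encoder and text decoder on `req`;
# here we simply echo the example output from the card.
resp = VLMResponse("Two people sitting at a wooden table with laptops.")
print(resp.description)
```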
## 🎯 Intended Use
- Image captioning
- Visual question answering
- Robotics perception modules
## ⚠️ Limitations
- English only
- Not optimized for high-resolution images