# 🖼️ Multimodal Vision Language Mini

A lightweight multimodal transformer that takes an image and a text instruction and generates a structured description.
## 🧠 Model Details
- Architecture: Vision Encoder + Text Decoder
- Vision Backbone: ViT-base
- Text Decoder: Transformer (12 layers)
- Hidden Size: 768
- Parameters: ~220M
- Training Samples: 500k image-text pairs
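The listed hyperparameters are roughly consistent with the ~220M total. A minimal back-of-the-envelope sketch (the ViT-base size, the BERT-style vocabulary, the 4x FFN expansion, and the presence of cross-attention in each decoder layer are assumptions, not stated in the card):

```python
HIDDEN = 768                   # hidden size from the card
DECODER_LAYERS = 12            # text decoder depth from the card
VIT_BASE_PARAMS = 86_000_000   # commonly cited ViT-base size (assumption)
VOCAB_SIZE = 30_522            # BERT-style vocabulary (assumption)

# Per decoder layer: self-attention (4 * h^2), cross-attention (4 * h^2),
# and a feed-forward block with 4x expansion (8 * h^2); biases and
# layer norms are omitted from this estimate.
per_layer = (4 + 4 + 8) * HIDDEN ** 2
decoder = DECODER_LAYERS * per_layer + VOCAB_SIZE * HIDDEN  # + token embeddings

total = VIT_BASE_PARAMS + decoder
print(f"~{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands near the card's ~220M figure.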
## 📥 Input

An image plus a text instruction.

Example instruction: "Describe the objects in the image."
## 📤 Output

Example: "Two people sitting at a wooden table with laptops."
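The input/output contract above can be sketched with simple request/response types. The class and field names here are purely illustrative, not part of any released API for this model:

```python
from dataclasses import dataclass

@dataclass
class VLMRequest:
    image_path: str    # path to the input image
    instruction: str   # natural-language instruction (English only)

@dataclass
class VLMResponse:
    description: str   # structured description produced by the text decoder

req = VLMRequest("table.jpg", "Describe the objects in the image.")
# A real call would run the vision encoder and text decoder on `req`;
# here we simply echo the example output from the card.
resp = VLMResponse("Two people sitting at a wooden table with laptops.")
print(resp.description)
```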
## 🎯 Intended Use
- Image captioning
- Visual question answering
- Robotics perception modules
## ⚠️ Limitations
- English only
- Not optimized for high-resolution images