πŸ–ΌοΈ Multimodal Vision Language Mini

A lightweight multimodal transformer that takes an image and a text instruction and generates a structured description.


🧠 Model Details

  • Architecture: Vision Encoder + Text Decoder
  • Vision Backbone: ViT-base
  • Text Decoder: Transformer (12 layers)
  • Hidden Size: 768
  • Parameters: ~220M
  • Training Samples: 500k image-text pairs

πŸ“₯ Input

An image paired with a text instruction.
Example instruction: "Describe the objects in the image."

πŸ“€ Output

"Two people sitting at a wooden table with laptops."
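
Descriptions like the one above are typically produced by autoregressive decoding conditioned on the image features. This toy sketch shows the greedy variant of that loop; the step function, token IDs, and stub "model" are illustrative stand-ins, not this model's actual API.

```python
# Greedy decoding conditioned on image features: at each step the
# decoder returns logits for the next token given (image, tokens so
# far), and we append the argmax until an end-of-sequence token.

def greedy_decode(step_fn, image_features, bos_id, eos_id, max_len=16):
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(image_features, tokens)  # one decoder pass
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Stub decoder for demonstration: deterministically continues
# 0 -> 3 -> 4 -> 1 (EOS), ignoring the image features.
def stub_step(image_features, tokens):
    table = {0: 3, 3: 4, 4: 1}
    logits = [0.0] * 8
    logits[table.get(tokens[-1], 1)] = 1.0
    return logits

print(greedy_decode(stub_step, image_features=None, bos_id=0, eos_id=1))
# [0, 3, 4, 1]
```

A real caption is recovered by detokenizing the returned IDs; sampling or beam search can replace the argmax step for more varied outputs.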


🎯 Intended Use

  • Image captioning
  • Visual question answering
  • Robotics perception modules

⚠️ Limitations

  • English only
  • Not optimized for high-resolution images