TLV R&D VLMs for Image Retrieval and Visual Reasoning

Vision-Language Retrieval Models

| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
|---|---|---|---|---|---|---|
| ImiClip | CLIP | openai/clip-vit-base-patch32 | DM | Etzion | TLVLM/ImiClip | Vision Encoder |
| ImiClip_v2 | CLIP | openai/clip-vit-base-patch32 | DM + RSICD | Etzion | TLVLM/ImiClip_v2 | Vision Encoder |
| ImiClip_v3 | CLIP | openai/clip-vit-base-patch32 | DM + RSICD | Etzion | TLVLM/ImiClip_v3 | |
| ImiGlip | SigLIP | google/siglip-so400m-patch14-384 | DM | Etzion | TLVLM/ImiGlip | Vision Encoder |
| ImiGlip_V2 | SigLIP | google/siglip-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip_V2 | Vision Encoder |
| ImiGlip_V3 | SigLIP | google/siglip-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip_V3 | |
| ImiGlip2 | SigLIP2 | google/siglip2-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip2 | Both Encoders + Logits |
| ImiGlip2n | SigLIP2 | google/siglip2-so400m-patch16-naflex | DM + RSICD | Etzion | TLVLM/ImiGlip2n | Both Encoders + Logits |

Image Captioning & VQA Models

| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
|---|---|---|---|---|---|---|
| ImiBlip_V1 | BLIP | Salesforce/blip-image-captioning-base | DM + RSICD (Captions Only) | Etzion | TLVLM/ImiBlip_V1 | |
| ImiBlipVQA_V1 | BLIP | Salesforce/blip-vqa-base | DM (VQA Only) | Etzion | TLVLM/ImiBlipVQA_V1 | |
| ImiBlipVQA_V2 | BLIP | Salesforce/blip-vqa-base | DM + RSICD (VQA Only) | Etzion | TLVLM/ImiBlipVQA_V2 | |

Runtime

| Model Type | Base Model | Time per Single Text (s) | Time per Single Image (s) | Time per 10,000 Texts (s) | Time per 10,000 Images (s) |
|---|---|---|---|---|---|
| CLIP | openai/clip-vit-base-patch32 | 0.0129 | 0.0101 | 129.4 | 100.8 |
| SigLIP (1+2) | google/siglip-so400m-patch14-384 | 0.0578 | 0.0189 | 577.5 | 188.9 |
| SigLIP2n | google/siglip2-so400m-patch16-naflex | 0.0257 | 0.0189 | 257.0 | 188.6 |

Important notes:

  • All times are reported in seconds.
  • All measurements were run on an NVIDIA A40 GPU.
  • Average text length: 633 ± 93 characters.
  • Average image size: 536×536 pixels.
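
The per-item timings above can be used for a rough estimate of how long it takes to encode a larger corpus. The snippet below is a minimal sketch, not a benchmark: the helper name `estimate_seconds` is hypothetical, and it assumes encoding time scales linearly with the number of items, on comparable hardware and with inputs of similar length and size to those reported in the notes. The figures are copied from the Runtime table.

```python
# Per-item encoding times in seconds, copied from the Runtime table
# (measured on an NVIDIA A40 GPU).
PER_ITEM_SECONDS = {
    "CLIP": {"text": 0.0129, "image": 0.0101},
    "SigLIP (1+2)": {"text": 0.0578, "image": 0.0189},
    "SigLIP2n": {"text": 0.0257, "image": 0.0189},
}


def estimate_seconds(model_type, n_texts=0, n_images=0):
    """Estimate total encoding time in seconds.

    Assumption (not stated in the source): time grows linearly with the
    number of texts and images, with no batching or caching effects.
    """
    times = PER_ITEM_SECONDS[model_type]
    return n_texts * times["text"] + n_images * times["image"]


if __name__ == "__main__":
    # Example: a hypothetical corpus of 50,000 images queried with 1,000 texts.
    for model_type in PER_ITEM_SECONDS:
        total = estimate_seconds(model_type, n_texts=1_000, n_images=50_000)
        print(f"{model_type}: about {total / 60:.1f} minutes")
```

For 10,000 items the linear estimate lands within about one second of the measured totals in the table, which suggests the approximation is reasonable at that scale.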

Collections

Here you can find the model collections.