TLV R&D VLMs for Image Retrieval and Visual Reasoning
Vision-Language Retrieval Models
| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
|---|---|---|---|---|---|---|
| ImiClip | CLIP | openai/clip-vit-base-patch32 | DM | Etzion | TLVLM/ImiClip | Vision Encoder |
| ImiClip_v2 | CLIP | openai/clip-vit-base-patch32 | DM + RSICD | Etzion | TLVLM/ImiClip_v2 | Vision Encoder |
| ImiClip_v3 | CLIP | openai/clip-vit-base-patch32 | DM + RSICD | Etzion | TLVLM/ImiClip_v3 | ❌ |
| ImiGlip | SigLIP | google/siglip-so400m-patch14-384 | DM | Etzion | TLVLM/ImiGlip | Vision Encoder |
| ImiGlip_V2 | SigLIP | google/siglip-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip_V2 | Vision Encoder |
| ImiGlip_V3 | SigLIP | google/siglip-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip_V3 | ❌ |
| ImiGlip2 | SigLIP2 | google/siglip2-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip2 | Both Encoders + Logits |
| ImiGlip2n | SigLIP2 | google/siglip2-so400m-patch16-naflex | DM + RSICD | Etzion | TLVLM/ImiGlip2n | Both Encoders + Logits |
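
As a minimal sketch of how one of the retrieval checkpoints might be used, the snippet below ranks candidate images against a text query by cosine similarity. It assumes that `TLVLM/ImiClip` is published in the standard `transformers` CLIP format (weights plus processor config) and that `get_text_features` / `get_image_features` return plain embedding tensors, as they do in `transformers` 4.x. The helper names `top_k` and `rank_images` are hypothetical and not part of the models.

```python
from typing import List, Sequence

# Repo id taken from the table above; assumed loadable with the CLIP classes.
MODEL_ID = "TLVLM/ImiClip"


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rank_images(query: str, image_paths: Sequence[str], k: int = 5) -> List[str]:
    """Embed a text query and candidate images; return the k best-matching paths."""
    # Heavy dependencies are imported here so top_k stays usable without them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(MODEL_ID).eval()
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    images = [Image.open(p).convert("RGB") for p in image_paths]
    with torch.no_grad():
        # truncation=True matters: CLIP's text encoder has a short context window.
        text_inputs = processor(
            text=[query], return_tensors="pt", padding=True, truncation=True
        )
        image_inputs = processor(images=images, return_tensors="pt")
        text_emb = model.get_text_features(**text_inputs)
        image_embs = model.get_image_features(**image_inputs)

    # Cosine similarity: L2-normalise both sides, then take the dot product.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    scores = (image_embs @ text_emb.T).squeeze(-1).tolist()
    return [image_paths[i] for i in top_k(scores, k)]
```

A call such as `rank_images("a harbor with several ships", ["a.png", "b.png"], k=1)` would return the single best-matching path; the query and file names here are placeholders. The SigLIP-based checkpoints would likely need the corresponding SigLIP classes (or `AutoModel` / `AutoProcessor`) and `padding="max_length"` for text inputs.
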
Image Captioning & VQA Models
| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
|---|---|---|---|---|---|---|
| ImiBlip_V1 | BLIP | Salesforce/blip-image-captioning-base | DM + RSICD (Captions Only) | Etzion | TLVLM/ImiBlip_V1 | ❌ |
| ImiBlipVQA_V1 | BLIP | Salesforce/blip-vqa-base | DM (VQA Only) | Etzion | TLVLM/ImiBlipVQA_V1 | ❌ |
| ImiBlipVQA_V2 | BLIP | Salesforce/blip-vqa-base | DM + RSICD (VQA Only) | Etzion | TLVLM/ImiBlipVQA_V2 | ❌ |
Runtime
| Model Type | Base Model | Time per Single Text (s) | Time per Single Image (s) | Time per 10,000 Texts (s) | Time per 10,000 Images (s) |
|---|---|---|---|---|---|
| CLIP | openai/clip-vit-base-patch32 | 0.0129 | 0.0101 | 129.4 | 100.8 |
| SigLIP (1+2) | google/siglip-so400m-patch14-384 | 0.0578 | 0.0189 | 577.5 | 188.9 |
| SigLIP2n | google/siglip2-so400m-patch16-naflex | 0.0257 | 0.0189 | 257.0 | 188.6 |
Important notes:
- All times are reported in seconds.
- All measurements were run on an NVIDIA A40 GPU.
- Average text length: 633±93 characters.
- Average image size: 536×536 pixels.
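
The per-item latencies can be used for a rough estimate of how long it takes to embed a larger corpus. The sketch below assumes that cost scales linearly with the number of items (which is consistent with the 10,000-item columns) and that inputs resemble the ones measured here; `LATENCY_S` and `estimate_seconds` are hypothetical names used only for illustration.

```python
# Per-item latencies in seconds, copied from the Runtime table (NVIDIA A40).
LATENCY_S = {
    "CLIP": {"text": 0.0129, "image": 0.0101},
    "SigLIP": {"text": 0.0578, "image": 0.0189},
    "SigLIP2n": {"text": 0.0257, "image": 0.0189},
}


def estimate_seconds(model_type: str, n_texts: int = 0, n_images: int = 0) -> float:
    """Estimate total embedding time, assuming linear scaling with item count."""
    latency = LATENCY_S[model_type]
    return n_texts * latency["text"] + n_images * latency["image"]


if __name__ == "__main__":
    # Example: embedding one million images with each model family.
    for name in LATENCY_S:
        hours = estimate_seconds(name, n_images=1_000_000) / 3600
        print(f"{name}: about {hours:.2f} h for 1,000,000 images")
```

Because the single-item values in the table are rounded, estimates differ slightly from the measured 10,000-item totals, and batching or different hardware will change the numbers.
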
Collections
The models are grouped into the following collections:
- CLIP-based fine-tuned models: TLVLM/clips
- SigLIP-based fine-tuned models: TLVLM/siglips
- SigLIP 2-based fine-tuned models: TLVLM/siglips2
- BLIP-based fine-tuned models: TLVLM/blips