# TLV R&D VLMs for Image Retrieval and Visual Reasoning
## Vision-Language Retrieval Models
| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
| :--------- | :--------- | :---------- | :---------- | :---- | :--- | :----------------- |
| ImiClip | CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) | DM | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiClip](https://huggingface.co/TLVLM/ImiClip) | Vision Encoder |
| ImiClip_v2 | CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiClip_v2](https://huggingface.co/TLVLM/ImiClip_v2) | Vision Encoder |
| ImiClip_v3 | CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiClip_v3](https://huggingface.co/TLVLM/ImiClip_v3) | ❌ |
| ImiGlip | SigLIP | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | DM | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip](https://huggingface.co/TLVLM/ImiGlip) | Vision Encoder |
| ImiGlip_V2 | SigLIP | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip_V2](https://huggingface.co/TLVLM/ImiGlip_V2) | Vision Encoder |
| ImiGlip_V3 | SigLIP | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip_V3](https://huggingface.co/TLVLM/ImiGlip_V3) | ❌ |
| ImiGlip2 | SigLIP2 | [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip2](https://huggingface.co/TLVLM/ImiGlip2) | Both Encoders + Logits |
| ImiGlip2n | SigLIP2 | [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip2n](https://huggingface.co/TLVLM/ImiGlip2n) | Both Encoders + Logits |
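
As a rough usage illustration, the sketch below embeds texts and images with one of the checkpoints above and ranks candidates by cosine similarity. It assumes the checkpoints load through the `transformers` `AutoModel` / `AutoProcessor` classes (4.x API, where `get_text_features` and `get_image_features` return tensors), as their base models do; this has not been verified for the fine-tuned repositories. The helper names (`load_retriever`, `embed_texts`, `embed_images`, `top_k`) are hypothetical.

```python
import math


def load_retriever(repo_id="TLVLM/ImiClip"):
    """Load a model and processor (assumes the repo works with the Auto classes)."""
    # Imported lazily so the ranking helpers below work without transformers.
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(repo_id)
    processor = AutoProcessor.from_pretrained(repo_id)
    return model, processor


def embed_texts(model, processor, texts, max_length=64):
    """Return one embedding (list of floats) per text."""
    import torch

    # Assumption: 64 tokens follows the SigLIP model cards and fits within
    # CLIP's 77-token limit; longer texts are truncated.
    inputs = processor(
        text=texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features.tolist()


def embed_images(model, processor, images):
    """Return one embedding (list of floats) per PIL image."""
    import torch

    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features.tolist()


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_emb, candidate_embs, k=5):
    """Return up to k (index, score) pairs, best match first."""
    scored = [(cosine(query_emb, emb), idx) for idx, emb in enumerate(candidate_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

For text-to-image retrieval, one would embed the gallery once with `embed_images`, embed the query with `embed_texts`, and pass both to `top_k`. For large galleries a vectorized or indexed search would replace the pure-Python ranking shown here.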
## Image Captioning & VQA Models
| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
| :------------ | :--------- | :---------- | :---------- | :---- | :--- | :----------------- |
| ImiBlip_V1 | BLIP | [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base) | DM + RSICD (Captions Only) | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiBlip_V1](https://huggingface.co/TLVLM/ImiBlip_V1) | ❌ |
| ImiBlipVQA_V1 | BLIP | [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base) | DM (VQA Only) | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiBlipVQA_V1](https://huggingface.co/TLVLM/ImiBlipVQA_V1) | ❌ |
| ImiBlipVQA_V2 | BLIP | [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base) | DM + RSICD (VQA Only) | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiBlipVQA_V2](https://huggingface.co/TLVLM/ImiBlipVQA_V2) | ❌ |
## Runtime
| Model Type | Base Model | Time per **Single** Text | Time per **Single** Image | Time per **10,000** Texts | Time per **10,000** Images |
| :--------- | :--------- | :------------------- | :-------------------- | :------------------- | :-------------------- |
| CLIP | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) | 0.0129 | 0.0101 | 129.4 | 100.8 |
| SigLIP (1+2) | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | 0.0578 | 0.0189 | 577.5 | 188.9 |
| SigLIP2n | [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) | 0.0257 | 0.0189 | 257.0 | 188.6 |

Important notes:
- All times are reported in **seconds**.
- All measurements were run on an **NVIDIA A40 GPU**.
- Average text length: 633 ± 93 characters.
- Average image size: 536 × 536 pixels.
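
The 10,000-item columns are close to 10,000 times the per-item figures, so the per-item times can serve as a rough estimate for other corpus sizes. The sketch below does that arithmetic. It assumes linear scaling, the same GPU, and inputs similar in size to those measured; the names `SECONDS_PER_ITEM`, `estimate_seconds`, and `format_duration` are hypothetical.

```python
# Per-item latencies in seconds, copied from the Runtime table (NVIDIA A40).
SECONDS_PER_ITEM = {
    "CLIP": {"text": 0.0129, "image": 0.0101},
    "SigLIP (1+2)": {"text": 0.0578, "image": 0.0189},
    "SigLIP2n": {"text": 0.0257, "image": 0.0189},
}


def estimate_seconds(model_type, n_texts=0, n_images=0):
    """Estimate total encoding time, assuming time grows linearly with item count."""
    per_item = SECONDS_PER_ITEM[model_type]
    return n_texts * per_item["text"] + n_images * per_item["image"]


def format_duration(seconds):
    """Render a number of seconds as hours, minutes, and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"
```

For instance, `format_duration(estimate_seconds("SigLIP2n", n_images=1_000_000))` gives roughly five and a quarter hours for one million images under these assumptions.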
## Collections
The models are grouped into [collections](https://huggingface.co/TLVLM/collections):
- CLIP-based fine-tuned models: [TLVLM/clips](https://huggingface.co/collections/TLVLM/clips)
- SigLIP-based fine-tuned models: [TLVLM/siglips](https://huggingface.co/collections/TLVLM/siglips)
- SigLIP **2**-based fine-tuned models: [TLVLM/siglip2s](https://huggingface.co/collections/TLVLM/siglip2s)
- BLIP-based fine-tuned models: [TLVLM/blips](https://huggingface.co/collections/TLVLM/blips)