# TLV R&D VLMs for Image Retrieval and Visual Reasoning


## Vision-Language Retrieval Models
<details open>
<summary></summary>
  
| Model Name | Model Type |  Base Model | Training Set | Owner | Link | Frozen Parameters |
| :--------- | :--------- | :---------- | :---------- | :---- | :--- | :----------------- |
| ImiClip    | CLIP    | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)                 | DM         | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiClip](https://huggingface.co/TLVLM/ImiClip)       | Vision Encoder | 
| ImiClip_v2 | CLIP    | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)                 | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiClip_v2](https://huggingface.co/TLVLM/ImiClip_v2) | Vision Encoder |
| ImiClip_v3 | CLIP    | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)                 | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiClip_v3](https://huggingface.co/TLVLM/ImiClip_v3) | &#10060;       |
| ImiGlip    | SigLIP  | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)         | DM         | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip](https://huggingface.co/TLVLM/ImiGlip)       | Vision Encoder |
| ImiGlip_V2 | SigLIP  | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)         | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip_V2](https://huggingface.co/TLVLM/ImiGlip_V2) | Vision Encoder |
| ImiGlip_V3 | SigLIP  | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)         | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip_V3](https://huggingface.co/TLVLM/ImiGlip_V3) | &#10060;       |
| ImiGlip2   | SigLIP2 | [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384)       | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip2](https://huggingface.co/TLVLM/ImiGlip2)     | Both Encoders + Logits |
| ImiGlip2n  | SigLIP2 | [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) | DM + RSICD | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiGlip2n](https://huggingface.co/TLVLM/ImiGlip2n)   | Both Encoders + Logits |

</details>
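
The retrieval checkpoints above can be used to rank images against a text query. The following is a minimal sketch, not the maintainers' own pipeline. It assumes that `TLVLM/ImiClip` loads with the standard `CLIPModel` and `CLIPProcessor` classes from `transformers` and that the repository ships a processor config; if it does not, loading the processor from the base model is a reasonable fallback. The helper names `rank_images` and `score_images` and the example file names are hypothetical.

```python
from typing import List, Sequence


def rank_images(similarities: Sequence[float]) -> List[int]:
    """Return image indices sorted from most to least similar."""
    return sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)


def score_images(query: str, image_paths: Sequence[str],
                 repo_id: str = "TLVLM/ImiClip") -> List[float]:
    """Score every image against one text query.

    Assumption: the checkpoint loads with the standard CLIP classes.
    """
    # Imported lazily so the ranking helper works without torch installed.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(repo_id).eval()
    processor = CLIPProcessor.from_pretrained(repo_id)
    images = [Image.open(path).convert("RGB") for path in image_paths]

    # CLIP's text encoder has a short context window, so long texts are truncated.
    inputs = processor(text=[query], images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_text has shape (num_texts, num_images); take the single query row.
    return outputs.logits_per_text[0].tolist()


# Example usage (hypothetical file names):
# scores = score_images("an aerial view of a harbor", ["a.jpg", "b.jpg"])
# print(rank_images(scores))
```

The SigLIP-based checkpoints would need the corresponding SigLIP classes instead; that variant is not shown here.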

## Image Captioning & VQA Models
<details open>
<summary></summary>

| Model Name    | Model Type |  Base Model | Training Set | Owner | Link | Frozen Parameters |
| :------------ | :--------- | :---------- | :---------- | :---- | :--- | :----------------- |
| ImiBlip_V1    | BLIP       | [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)                 | DM + RSICD (Captions Only) | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiBlip_V1](https://huggingface.co/TLVLM/ImiBlip_V1) | &#10060;       |
| ImiBlipVQA_V1 | BLIP       | [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base)                 | DM (VQA Only) | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiBlipVQA_V1](https://huggingface.co/TLVLM/ImiBlipVQA_V1) | &#10060;       |
| ImiBlipVQA_V2 | BLIP       | [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base)                 | DM + RSICD (VQA Only) | [Etzion](https://huggingface.co/etzion) | [TLVLM/ImiBlipVQA_V2](https://huggingface.co/TLVLM/ImiBlipVQA_V2) | &#10060;       |


</details>

## Runtime
<details open>
<summary></summary>
  
| Model Type | Base Model | Time per **Single** Text (s) | Time per **Single** Image (s) | Time per **10,000** Texts (s) | Time per **10,000** Images (s) |
| :--------- | :--------- | :------------------- | :-------------------- | :------------------- | :-------------------- |
| CLIP       | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) | 0.0129 | 0.0101 | 129.4 | 100.8 |
| SigLIP (1+2) | [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | 0.0578 | 0.0189 | 577.5 | 188.9 |
| SigLIP2n   | [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) | 0.0257 | 0.0189 | 257.0 | 188.6 |

Important notes:
- All times are reported in **seconds**.
- All measurements were run on an **NVIDIA A40 GPU**.
- Average text length: 633±93 characters.
- Average image size: 536×536 pixels.

</details>
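
The per-item timings can be turned into rough throughput and total-time estimates for a larger corpus. The sketch below copies the single-item latencies from the runtime table and assumes that cost scales linearly with the number of items, which is consistent with the 10,000-item columns but is not guaranteed on other hardware or with batching. The names `LATENCY_S`, `throughput`, and `estimated_total_s` are illustrative.

```python
# Per-item latencies in seconds, copied from the runtime table (NVIDIA A40).
LATENCY_S = {
    "CLIP": {"text": 0.0129, "image": 0.0101},
    "SigLIP (1+2)": {"text": 0.0578, "image": 0.0189},
    "SigLIP2n": {"text": 0.0257, "image": 0.0189},
}


def throughput(latency_s: float) -> float:
    """Items processed per second for a given per-item latency."""
    return 1.0 / latency_s


def estimated_total_s(model_type: str, n_texts: int = 0, n_images: int = 0) -> float:
    """Estimated encoding time in seconds, assuming linear scaling."""
    latency = LATENCY_S[model_type]
    return n_texts * latency["text"] + n_images * latency["image"]


if __name__ == "__main__":
    for name in LATENCY_S:
        seconds = estimated_total_s(name, n_texts=10_000, n_images=10_000)
        print(f"{name}: about {seconds:.0f} s for 10,000 texts and 10,000 images")
```

Treat the results as order-of-magnitude guidance only; actual runtime depends on batch size, text length, and image resolution.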

## Collections
<details open>
<summary></summary>
  
The models are grouped into [Collections](https://huggingface.co/TLVLM/collections):
- CLIP-based fine-tuned models: [TLVLM/clips](https://huggingface.co/collections/TLVLM/clips)
- SigLIP-based fine-tuned models: [TLVLM/siglips](https://huggingface.co/collections/TLVLM/siglips)
- **SigLIP2**-based fine-tuned models: [TLVLM/siglip2s](https://huggingface.co/collections/TLVLM/siglip2s)
- BLIP-based fine-tuned models: [TLVLM/blips](https://huggingface.co/collections/TLVLM/blips)

</details>