Update README.md
FLAIR was introduced in the paper [FLAIR: VLM with Fine-grained Language-informed Image Representations](https://arxiv.org/abs/2412.03561). Built on the ViT-B-16 model from [OpenCLIP](https://github.com/mlfoundations/open_clip), FLAIR adds text-conditioned attention pooling at the end of its vision transformer. Pre-trained on MLLM-recaptioned datasets from [DreamLIP](https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions), FLAIR achieves strong performance on tasks such as zero-shot image-text retrieval and zero-shot segmentation.
**Usage**
Detailed usage instructions are available in our [GitHub repo](https://github.com/ExplainableML/flair). Example usage:
```python
with torch.no_grad(), torch.cuda.amp.autocast():
    # ... (model loading and encoding steps elided in this excerpt; see the GitHub repo for the full example)
    print("logits get using clip's way:", clip_logits)  # [12.4609, 15.6797, -3.8535, -0.2281]
```
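The text-conditioned attention pooling mentioned above can be pictured as a single cross-attention step in which text tokens query the local image tokens. Below is a minimal sketch with made-up shapes and a generic `nn.MultiheadAttention` layer, not FLAIR's actual module:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: B images, L local image tokens, T text queries, width D
B, L, T, D = 2, 196, 5, 512
local_image_tokens = torch.randn(B, L, D)
text_queries = torch.randn(B, T, D)

# One cross-attention layer: the text queries attend over the local image
# tokens, yielding one language-informed image representation per text query.
pool = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
pooled, attn_weights = pool(text_queries, local_image_tokens, local_image_tokens)

print(pooled.shape)        # torch.Size([2, 5, 512])
print(attn_weights.shape)  # torch.Size([2, 5, 196])
```

Each row of `attn_weights` shows how strongly a given text query attends to each local image token, which is what makes the pooled image representation language-informed.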
FLAIR's primary way of computing logits uses text-conditioned attention pooling to pool the local image tokens, producing language-informed image representations. The logits are then obtained by multiplying these with the text features:

```python
def get_logits(self, image, text):
    """
    FLAIR's way to get the logits. Only used as a minimal example;
    not used in training or inference at this stage.
    """
    global_image_token, local_image_tokens = self.encode_image(image)
    global_text_token, _ = self.encode_text(text)
    global_text_token = self.text_post(global_text_token)  # (B*K, D)
    global_image_token, local_image_tokens = self.image_post(global_image_token), self.image_post(
        local_image_tokens)  # (B, D), (B, L, D)
    batch_size = global_image_token.shape[0]

    # Broadcast the global text token to (B, B*K, D); this is too costly in
    # large-scale training, so we downsample to (B, B+K-1, D) during training.
    global_text_token = global_text_token.unsqueeze(0).expand(batch_size, -1, -1)

    local_image_features = self.visual_proj(global_text_token, local_image_tokens, local_image_tokens)  # (B, B*K, D)

    text_features, image_features = F.normalize(global_text_token, dim=-1), F.normalize(local_image_features, dim=-1)

    image_logits = self.logit_scale.exp() * torch.einsum('bij,bij->bi', image_features, text_features)  # (B, B*K)
    image_logits += self.logit_bias

    text_logits = image_logits.T

    return image_logits, text_logits
```
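To make the pooled-logit computation concrete, here is a toy check of the `einsum` pattern on random tensors (all shapes are hypothetical): entry `[b, i]` is the dot product between the pooled image feature and the text feature for candidate caption `i` of batch element `b`.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: batch B, N = B*K candidate captions, width D
B, N, D = 3, 6, 8
image_features = torch.randn(B, N, D)  # stand-in for pooled, text-conditioned image features
text_features = torch.randn(B, N, D)   # stand-in for broadcast text features

# Entry [b, i] = <image_features[b, i], text_features[b, i]>
logits = torch.einsum('bij,bij->bi', image_features, text_features)  # (B, N)

# Equivalent elementwise formulation
manual = (image_features * text_features).sum(dim=-1)
print(torch.allclose(logits, manual))  # True
```

Unlike CLIP's single `(B, B)` similarity matrix, each image here has its own set of text-conditioned features, so the result is one similarity per (image, caption) pair.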

Thanks to the global loss, FLAIR also enforces matching between global-level image and text features. Therefore, just as the original CLIP does, FLAIR can also produce logits using only the global image and text features.

```python
def get_logits_as_clip(self, image, text):
    """
    FLAIR can also generate the global-to-global logits as the original CLIP does.
    """
    global_image_token, _ = self.encode_image(image)
    global_text_token, _ = self.encode_text(text)

    global_image_token = self.image_post(global_image_token)  # (B, D)
    global_text_token = self.text_post(global_text_token)  # (B*K, D)

    image_features, text_features = F.normalize(global_image_token, dim=-1), F.normalize(global_text_token, dim=-1)

    image_logits = self.logit_scale.exp() * image_features @ text_features.t()
    text_logits = image_logits.T

    return image_logits, text_logits
```
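In this CLIP-style path the logits form a full image-by-text similarity matrix. A toy sketch with random unit-normalized features, using an arbitrary scale of 100.0 as a stand-in for `logit_scale.exp()`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: B_img images, B_txt captions, width D
B_img, B_txt, D = 2, 3, 4
image_features = F.normalize(torch.randn(B_img, D), dim=-1)
text_features = F.normalize(torch.randn(B_txt, D), dim=-1)

# Every image is scored against every caption in one matrix product.
image_logits = 100.0 * image_features @ text_features.t()  # (B_img, B_txt)
text_logits = image_logits.T                               # (B_txt, B_img)

print(image_logits.shape, text_logits.shape)
```

Because the image features are not conditioned on any caption here, one `(B_img, D)` matrix suffices, which is what makes this path cheaper than the text-conditioned one.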
**Citation**
If you find our work useful, please consider citing:
```bibtex
@article{xiao2024flair,
  title={FLAIR: VLM with Fine-grained Language-informed Image Representations},
  author={Xiao, Rui and Kim, Sanghwan and Georgescu, Mariana-Iuliana and Akata, Zeynep and Alaniz, Stephan},
  journal={arXiv preprint arXiv:2412.03561},
  year={2024}
}
```