# Searching Reaction GIFs with CLIP



Reaction GIFs are an integral part of today's communication. They convey complex, layered emotions in a short, compact format.

If a picture is worth a thousand words, then a GIF is worth even more.

We might even say that the level of complexity and expressiveness increases like:

`Emoji < Memes/Images < GIFs`
Which is needed to properly drive a generation model (like VQGAN).
Available CLIP models wouldn't be suitable for this use without finetuning, as explained in the challenges below.

## Challenges

Classic (image, text) tasks like image search and caption generation all focus on cases where the text is a description of the image.
This is mainly because the available large-scale datasets, like COCO and WIT, happen to be of that format.
So it is interesting to see whether models can also capture higher-level relations,
like a sentiment -> image mapping, where there is great variation on both sides.
We can think of reaction GIFs/images as sentiment-like; in fact, the dataset we use was also gathered for sentiment analysis.
There is no one correct reaction GIF, which also makes evaluation challenging.
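Because several reactions can be equally valid, one illustrative way to evaluate (an assumption on our part, not something the original project describes) is to count a retrieval as correct when any GIF from a set of acceptable reactions lands in the top-k:

```python
def recall_at_k(ranked_gifs, acceptable, k=5):
    """Count a query as a hit if any acceptable GIF appears in the top-k results."""
    return any(g in acceptable for g in ranked_gifs[:k])

# Toy example: candidate GIFs ranked by the model for one query.
# The filenames and acceptable set are hypothetical.
ranked = ["hug.gif", "wave.gif", "cry.gif"]
acceptable = {"cry.gif", "hug.gif"}  # several reactions are fine
print(recall_at_k(ranked, acceptable, k=1))  # True: "hug.gif" is acceptable
```

This sidesteps the "one gold answer" assumption that top-1 accuracy would impose.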

# Dataset
We also tried `vit-base-patch32-384` and `vit-base-patch16-384` for the vision model, but results were inconclusive.

### Training Logs

Training logs can be found [here](https://wandb.ai/cceyda/flax-clip?workspace=user-cceyda).
It was really easy to overfit since the dataset was tiny, so we used early stopping.
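The early stopping mentioned above can be sketched as a simple patience counter on the validation loss (illustrative only; the patience value and losses here are made up, not the ones used in training):

```python
def should_stop(val_losses, patience=3):
    """Stop when the validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])  # best loss before the patience window
    return all(loss >= best for loss in val_losses[-patience:])

# No improvement over 0.7 in the last 3 epochs -> stop
print(should_stop([1.0, 0.8, 0.7, 0.72, 0.75, 0.74]))  # True
```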
Other parameters:
```
--warmup_steps="150"
```
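The `--warmup_steps="150"` flag implies a learning-rate warmup phase. A minimal sketch of what a linear warmup-then-decay schedule computes (everything except `warmup_steps=150` is a hypothetical value, not taken from the training script):

```python
def lr_schedule(step, base_lr=5e-5, warmup_steps=150, total_steps=1000):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up proportionally during warmup
        return base_lr * step / warmup_steps
    # Decay linearly over the remaining steps
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / remaining)

print(lr_schedule(75))   # halfway through warmup
print(lr_schedule(150))  # peak learning rate
```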

# Future Potential

It is possible to generate a very large training set by scraping Twitter. (We couldn't do this during the event because of Twitter's rate limits.)

I will definitely be trying out training a similar model for emoji & meme data.

Training CLIP is just the first step; if we have a well-trained CLIP, generation is within reach.

# How to use

```py
import jax
import jax.numpy as jnp
from PIL import Image
from transformers import AutoTokenizer, CLIPProcessor

from model import FlaxHybridCLIP  # see demo

model = FlaxHybridCLIP.from_pretrained("ceyda/clip-reply")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor.tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base")

def query(image_paths, query_text):
    images = [Image.open(im).convert("RGB") for im in image_paths]
    inputs = processor(text=[query_text], images=images, return_tensors="jax", padding=True)
    # The Flax vision model expects channels-last pixel values
    inputs["pixel_values"] = jnp.transpose(inputs["pixel_values"], axes=[0, 2, 3, 1])
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image.reshape(-1)
    probs = jax.nn.softmax(logits_per_image)
    return probs
```
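`query` returns one probability per candidate image, which can then be used to rank the GIFs. A toy sketch of that last step, with made-up scores standing in for real model output:

```python
def rank_candidates(paths, probs):
    # Sort candidate images by descending probability
    return [p for p, _ in sorted(zip(paths, probs), key=lambda x: x[1], reverse=True)]

scores = [0.1, 0.7, 0.2]  # e.g. softmax output of query() for one tweet
print(rank_candidates(["a.gif", "b.gif", "c.gif"], scores))
# ['b.gif', 'c.gif', 'a.gif']
```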

# Created By

Ceyda Cinarel [@ceyda](https://huggingface.co/ceyda)

Made during the Flax community [event](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104/58)

# TL;DR: The task

Input: Some sentence (like a tweet)

Output: The most suitable reaction GIF image (Ranking)

Example:
- Input: I miss you
- Output: 

# Demo

https://huggingface.co/spaces/flax-community/clip-reply-demo