# Searching Reaction GIFs with CLIP



Reaction GIFs are an integral part of today's communication. They convey complex, layered emotions in a short, compact format.

If a picture is worth a thousand words, then a GIF is worth even more.

We might even say that the level of complexity and expressiveness increases like:

`Emoji < Memes/Images < GIFs`
Which is needed to properly drive a generation model (like VQGAN).
Available CLIP models wouldn't be suitable for this use without finetuning, as explained in the challenges below.

## Challenges

Classic (image, text) tasks like image search and caption generation all focus on cases where the text is a description of the image.
This is mainly because the available large-scale datasets, like COCO and WIT, happen to be of that format.
So it is interesting to see whether models can also capture higher-level relations,
like a sentiment -> image mapping, where there is great variation on both sides.
We can think of reaction GIFs/images as sentiment-like; in fact, the dataset we use was also gathered for sentiment analysis.
There is no one correct reaction GIF, which also makes evaluation challenging.
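Because several reactions can be equally valid, one illustrative way to evaluate (an assumption on our part, not something the original project describes) is to count a retrieval as correct when any GIF from a set of acceptable reactions lands in the top-k:

```python
def recall_at_k(ranked_gifs, acceptable, k=5):
    """Count a query as a hit if any acceptable GIF appears in the top-k results."""
    return any(g in acceptable for g in ranked_gifs[:k])

# Toy example: candidate GIFs ranked by the model for one query.
# The filenames and acceptable set are hypothetical.
ranked = ["hug.gif", "wave.gif", "cry.gif"]
acceptable = {"cry.gif", "hug.gif"}  # several reactions are fine
print(recall_at_k(ranked, acceptable, k=1))  # True: "hug.gif" is acceptable
```

This sidesteps the "one gold answer" assumption that top-1 accuracy would impose.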

# Dataset
We also tried `vit-base-patch32-384` and `vit-base-patch16-384` for the vision model, but results were inconclusive.

### Training Logs

Training logs can be found [here](https://wandb.ai/cceyda/flax-clip?workspace=user-cceyda).
It was really easy to overfit since the dataset was tiny, so we used early stopping.
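The early stopping mentioned above can be sketched as a simple patience counter on the validation loss (illustrative only; the patience value and losses here are made up, not the ones used in training):

```python
def should_stop(val_losses, patience=3):
    """Stop when the validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])  # best loss before the patience window
    return all(loss >= best for loss in val_losses[-patience:])

# No improvement over 0.7 in the last 3 epochs -> stop
print(should_stop([1.0, 0.8, 0.7, 0.72, 0.75, 0.74]))  # True
```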
Other parameters:
```
--warmup_steps="150"
```
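The `--warmup_steps="150"` flag implies a learning-rate warmup phase. A minimal sketch of what a linear warmup-then-decay schedule computes (everything except `warmup_steps=150` is a hypothetical value, not taken from the training script):

```python
def lr_schedule(step, base_lr=5e-5, warmup_steps=150, total_steps=1000):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up proportionally during warmup
        return base_lr * step / warmup_steps
    # Decay linearly over the remaining steps
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / remaining)

print(lr_schedule(75))   # halfway through warmup
print(lr_schedule(150))  # peak learning rate
```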

# Future Potential

It is possible to generate a very large training set by scraping Twitter. (We couldn't do this during the event because of Twitter's rate limits.)

I will definitely be trying out training a similar model for emoji & meme data.

Training CLIP is just the first step; if we have a well-trained CLIP, generation is within reach.

# How to use

```py
import jax
import jax.numpy as jnp
from PIL import Image
from transformers import AutoTokenizer, CLIPProcessor

from model import FlaxHybridCLIP  # see demo

model = FlaxHybridCLIP.from_pretrained("ceyda/clip-reply")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor.tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base")

def query(image_paths, query_text):
    images = [Image.open(im).convert("RGB") for im in image_paths]
    inputs = processor(text=[query_text], images=images, return_tensors="jax", padding=True)
    # The Flax vision model expects channels-last pixel values
    inputs["pixel_values"] = jnp.transpose(inputs["pixel_values"], axes=[0, 2, 3, 1])
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image.reshape(-1)
    probs = jax.nn.softmax(logits_per_image)
    return probs
```
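`query` returns one probability per candidate image, which can then be used to rank the GIFs. A toy sketch of that last step, with made-up scores standing in for real model output:

```python
def rank_candidates(paths, probs):
    # Sort candidate images by descending probability
    return [p for p, _ in sorted(zip(paths, probs), key=lambda x: x[1], reverse=True)]

scores = [0.1, 0.7, 0.2]  # e.g. softmax output of query() for one tweet
print(rank_candidates(["a.gif", "b.gif", "c.gif"], scores))
# ['b.gif', 'c.gif', 'a.gif']
```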

# Created By

Ceyda Cinarel [@ceyda](https://huggingface.co/ceyda)

Made during the Flax community [event](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104/58)

# TL;DR: The task

Input: Some sentence (like a tweet)

Output: The most suitable reaction GIF image (Ranking)

Example:
- Input: I miss you
- Output: 

# Demo

https://huggingface.co/spaces/flax-community/clip-reply-demo