complete CLIP score calculation
README.md
CHANGED
@@ -12,37 +12,63 @@ pinned: false
 
 # Metric Card for CLIP Score
 
-***Module Card Instructions:*** *
+***Module Card Instructions:*** *This module calculates CLIPScore, a reference-free evaluation metric for image captioning.*
 
 ## Metric Description
-
+
+CLIPScore is a reference-free evaluation metric for image captioning that measures the alignment between images and their corresponding text descriptions. It leverages the CLIP (Contrastive Language-Image Pretraining) model to compute a similarity score between the visual and textual modalities.
 
 ## How to Use
 
-*Give general statement of how to use the metric*
-
+To use the CLIPScore metric, provide a list of text predictions and a list of images. The metric computes a CLIPScore for each image-text pair.
 
 ### Inputs
 
-*List all input arguments in the format below*
-
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-
+- **predictions** *(string)*: A list of text predictions to score. Each prediction should be a string.
+- **references** *(PIL.Image.Image)*: A list of images to score against. Each image should be a PIL image.
 
-
+### Output Values
 
-
+The CLIPScore metric outputs a dictionary with a single key-value pair:
 
-
-
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+- **clip_score** *(float)*: The average CLIPScore across all provided image-text pairs. The score ranges from -1 to 1, where higher scores indicate better alignment between the image and text.
 
 ### Examples
 
-*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-
-
+```python
+from PIL import Image
+import evaluate
+
+metric = evaluate.load("sunhill/clip_score")
+predictions = ["A cat sitting on a windowsill.", "A dog playing with a ball."]
+references = [Image.open("cat.jpg"), Image.open("dog.jpg")]
+results = metric.compute(predictions=predictions, references=references)
+print(results)
+# Output: {'clip_score': 0.85}
+```
 
 ## Citation
-
+
+```bibtex
+@article{DBLP:journals/corr/abs-2104-08718,
+  author     = {Jack Hessel and
+                Ari Holtzman and
+                Maxwell Forbes and
+                Ronan Le Bras and
+                Yejin Choi},
+  title      = {CLIPScore: {A} Reference-free Evaluation Metric for Image Captioning},
+  journal    = {CoRR},
+  volume     = {abs/2104.08718},
+  year       = {2021},
+  url        = {https://arxiv.org/abs/2104.08718},
+  eprinttype = {arXiv},
+  eprint     = {2104.08718},
+  timestamp  = {Sat, 29 Apr 2023 10:09:27 +0200},
+  biburl     = {https://dblp.org/rec/journals/corr/abs-2104-08718.bib},
+  bibsource  = {dblp computer science bibliography, https://dblp.org}
+}
+```
 
 ## Further References
-
+
+- [clip-score](https://github.com/Taited/clip-score)
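For context on the "Output Values" range above: the cited paper (Hessel et al., 2021) defines CLIPScore as a rescaled, clipped cosine similarity between the CLIP text embedding c and image embedding v,

$$\mathrm{CLIP\text{-}S}(c, v) = w \cdot \max\bigl(\cos(c, v),\, 0\bigr), \qquad w = 2.5$$

whereas the `clip_score.py` implementation in this commit returns the raw mean cosine similarity of each pair, which is why the card documents a -1 to 1 range rather than 0 to 2.5.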
app.py
CHANGED
@@ -1,12 +1,15 @@
+import sys
+from pathlib import Path
+
 import evaluate
 import gradio as gr
+from evaluate import parse_readme
 
-
-metric = evaluate.load("clip_score.py")
+metric = evaluate.load("sunhill/clip_score")
 
 
 def compute_clip_score(image, text):
-    results = metric.compute(predictions=[text],
+    results = metric.compute(predictions=[text], references=[image])
     return results["clip_score"]
 
 
@@ -22,13 +25,14 @@ iface = gr.Interface(
     examples=[
         [
             "https://images.unsplash.com/photo-1720539222585-346e73f01536",
-            "A cat sitting on a couch
+            "A cat sitting on a couch",
         ],
         [
            "https://images.unsplash.com/photo-1694253987647-4eebcf679974",
-            "A scenic view of mountains during sunset
+            "A scenic view of mountains during sunset",
        ],
     ],
+    article=parse_readme(Path(sys.path[0]) / "README.md"),
 )
 
 iface.launch()
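The hunks above only touch parts of app.py; the Interface definition itself is outside the diff. Purely for orientation, a minimal hypothetical sketch of how the pieces could fit together; the `inputs` and `outputs` widgets are assumptions, everything else appears in the hunks:

```python
import sys
from pathlib import Path

import evaluate
import gradio as gr
from evaluate import parse_readme

metric = evaluate.load("sunhill/clip_score")


def compute_clip_score(image, text):
    # Wrap the single image/text pair in lists, since the metric expects batches.
    results = metric.compute(predictions=[text], references=[image])
    return results["clip_score"]


# Hypothetical wiring: only `fn`, `examples`, and `article` are visible in the diff.
iface = gr.Interface(
    fn=compute_clip_score,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Caption")],  # assumed widgets
    outputs=gr.Number(label="CLIP score"),                       # assumed widget
    examples=[
        [
            "https://images.unsplash.com/photo-1720539222585-346e73f01536",
            "A cat sitting on a couch",
        ],
        [
            "https://images.unsplash.com/photo-1694253987647-4eebcf679974",
            "A scenic view of mountains during sunset",
        ],
    ],
    article=parse_readme(Path(sys.path[0]) / "README.md"),
)

iface.launch()
```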
clip_score.py
CHANGED
@@ -63,7 +63,7 @@ class CLIPScore(evaluate.Metric):
             features=datasets.Features(
                 {
                     "predictions": datasets.Value("string"),
-                    "references": datasets.
+                    "references": datasets.Image(),
                 }
             ),
             # Homepage of the module for documentation
@@ -85,14 +85,12 @@ class CLIPScore(evaluate.Metric):
         refer = self.processor(
             text=None, images=references, return_tensors="pt", padding=True
         )
-        refer["pixel_values"] = refer["pixel_values"][0]
         pred = self.tokenizer(predictions, return_tensors="pt", padding=True)
-        for key in pred:
-            pred[key] = pred[key].squeeze()
 
         refer_features = self.model.get_image_features(**refer)
         pred_features = self.model.get_text_features(**pred)
 
         refer_features = refer_features / refer_features.norm(dim=1, keepdim=True)
         pred_features = pred_features / pred_features.norm(dim=1, keepdim=True)
-
+        clip_score = (refer_features * pred_features).sum().item()
+        return {"clip_score": clip_score / refer_features.shape[0]}
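The new return value in the second hunk is the batch-mean cosine similarity between the L2-normalized image and text embeddings: since both feature matrices are row-normalized, summing their element-wise product over the whole batch equals summing the per-pair dot products. A minimal sketch of that identity, using hypothetical 512-dimensional random tensors in place of the CLIP features:

```python
import torch

# Hypothetical stand-ins for the image/text feature matrices (2 pairs, dim 512)
refer_features = torch.randn(2, 512)  # image embeddings
pred_features = torch.randn(2, 512)   # text embeddings for the same 2 pairs

# L2-normalize each row, as in the hunk above
refer_features = refer_features / refer_features.norm(dim=1, keepdim=True)
pred_features = pred_features / pred_features.norm(dim=1, keepdim=True)

# Sum of the element-wise product over the whole batch
# = sum of per-pair dot products = sum of per-pair cosine similarities
clip_score = (refer_features * pred_features).sum().item() / refer_features.shape[0]

# Equivalent per-pair view
per_pair = torch.nn.functional.cosine_similarity(refer_features, pred_features, dim=1)
assert abs(clip_score - per_pair.mean().item()) < 1e-5
```

This is consistent with the expectations in tests.py below, where mismatched image-caption pairs score markedly lower than matched ones.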
tests.py
CHANGED
@@ -1,17 +1,69 @@
+import requests
+from PIL import Image
+
+import evaluate
+
+
+metric = evaluate.load("./clip_score.py")
+
+
+def download_image(image_path):
+    if image_path.startswith("http"):
+        image = Image.open(requests.get(image_path, stream=True).raw)
+    else:
+        image = Image.open(image_path)
+    return image
+
+
+def compute_clip_score(image, text):
+    if not isinstance(image, list):
+        references = [image]
+    else:
+        references = image
+    if not isinstance(text, list):
+        predictions = [text]
+    else:
+        predictions = text
+    results = metric.compute(predictions=predictions, references=references)
+    return results["clip_score"]
+
+
+predictions = ["A cat sitting on a couch", "A scenic view of mountains during sunset"]
+references = [
+    "https://images.unsplash.com/photo-1720539222585-346e73f01536",
+    "https://images.unsplash.com/photo-1694253987647-4eebcf679974",
+]
+references = [download_image(url) for url in references]
+
 test_cases = [
     {
-        "predictions":
-        "references":
-        "result": {"
+        "predictions": predictions,
+        "references": references,
+        "result": {"clip_score": 0.307},
     },
     {
-        "predictions": [
-        "references": [
-        "result": {"
+        "predictions": predictions[0],
+        "references": references[0],
+        "result": {"clip_score": 0.304},
     },
     {
-        "predictions": [1
-        "references": [1
-        "result": {"
-    }
-
+        "predictions": predictions[1],
+        "references": references[1],
+        "result": {"clip_score": 0.310},
+    },
+    {
+        "predictions": predictions[0],
+        "references": references[1],
+        "result": {"clip_score": 0.106},
+    },
+    {
+        "predictions": predictions[1],
+        "references": references[0],
+        "result": {"clip_score": 0.134},
+    },
+]
+
+for i, test_case in enumerate(test_cases):
+    result = compute_clip_score(test_case["references"], test_case["predictions"])
+    error = abs(result - test_case["result"]["clip_score"])
+    assert error < 0.1, f"Test case {i} failed"