sunhill committed
Commit 6f3e563 · 1 Parent(s): 8ef9909

complete CLIP score calculation

Files changed (4)
  1. README.md +42 -16
  2. app.py +9 -5
  3. clip_score.py +3 -5
  4. tests.py +63 -11
README.md CHANGED
@@ -12,37 +12,63 @@ pinned: false

# Metric Card for CLIP Score

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
+ ***Module Card Instructions:*** *This module calculates CLIPScore, a reference-free evaluation metric for image captioning.*

## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+
+ CLIPScore is a reference-free evaluation metric for image captioning that measures the alignment between images and their corresponding text descriptions. It leverages the CLIP (Contrastive Language-Image Pretraining) model to compute a similarity score between the visual and textual modalities.

## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*
+ To use the CLIPScore metric, you need to provide a list of text predictions and a list of images. The metric will compute the CLIPScore for each pair of image and text.

### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

- ### Output Values
+ - **predictions** *(string)*: A list of text predictions to score. Each prediction should be a string.
+ - **references** *(PIL.Image.Image)*: A list of images to score against. Each image should be a PIL image.

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
+ ### Output Values

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
+ The CLIPScore metric outputs a dictionary with a single key-value pair:

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+ - **clip_score** *(float)*: The average CLIPScore across all provided image-text pairs. The score ranges from -1 to 1, where higher scores indicate better alignment between the image and text.

### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

- ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
+ ```python
+ from PIL import Image
+ import evaluate
+
+ metric = evaluate.load("sunhill/clip_score")
+ predictions = ["A cat sitting on a windowsill.", "A dog playing with a ball."]
+ references = [Image.open("cat.jpg"), Image.open("dog.jpg")]
+ results = metric.compute(predictions=predictions, references=references)
+ print(results)
+ # Output: {'clip_score': 0.85}
+ ```

## Citation
- *Cite the source where this metric was introduced.*
+
+ ```bibtex
+ @article{DBLP:journals/corr/abs-2104-08718,
+   author    = {Jack Hessel and
+                Ari Holtzman and
+                Maxwell Forbes and
+                Ronan Le Bras and
+                Yejin Choi},
+   title     = {CLIPScore: {A} Reference-free Evaluation Metric for Image Captioning},
+   journal   = {CoRR},
+   volume    = {abs/2104.08718},
+   year      = {2021},
+   url       = {https://arxiv.org/abs/2104.08718},
+   eprinttype = {arXiv},
+   eprint    = {2104.08718},
+   timestamp = {Sat, 29 Apr 2023 10:09:27 +0200},
+   biburl    = {https://dblp.org/rec/journals/corr/abs-2104-08718.bib},
+   bibsource = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```

## Further References
- *Add any useful further references.*
+
+ - [clip-score](https://github.com/Taited/clip-score)
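
For context on what the score in this card measures: it is the batch-averaged cosine similarity between CLIP image and text embeddings, as implemented in clip_score.py below. A minimal standalone sketch of that computation using the transformers CLIP API; the `openai/clip-vit-base-patch32` checkpoint and the image file names are assumptions for illustration, not pinned anywhere in this diff:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is an assumption; any CLIP checkpoint with image and text towers works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open("cat.jpg"), Image.open("dog.jpg")]  # hypothetical files
texts = ["A cat sitting on a windowsill.", "A dog playing with a ball."]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    text_inputs = processor(text=texts, return_tensors="pt", padding=True)
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# Row-normalize, then average the per-pair dot products (cosine similarities).
image_features = image_features / image_features.norm(dim=1, keepdim=True)
text_features = text_features / text_features.norm(dim=1, keepdim=True)
print({"clip_score": (image_features * text_features).sum(dim=1).mean().item()})
```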
app.py CHANGED
@@ -1,12 +1,15 @@
+ import sys
+ from pathlib import Path
+
import evaluate
import gradio as gr
+ from evaluate import parse_readme

- metric = evaluate.load("clip_score.py")
+ metric = evaluate.load("sunhill/clip_score")


def compute_clip_score(image, text):
-     results = metric.compute(predictions=[text], images=[image])
+     results = metric.compute(predictions=[text], references=[image])
    return results["clip_score"]

@@ -22,13 +25,14 @@ iface = gr.Interface(
    examples=[
        [
            "https://images.unsplash.com/photo-1720539222585-346e73f01536",
-             "A cat sitting on a couch.",
+             "A cat sitting on a couch",
        ],
        [
            "https://images.unsplash.com/photo-1694253987647-4eebcf679974",
-             "A scenic view of mountains during sunset.",
+             "A scenic view of mountains during sunset",
        ],
    ],
+     article=parse_readme(Path(sys.path[0]) / "README.md"),
)

iface.launch()
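
Only the examples and article arguments of the gr.Interface call are visible in this hunk. A sketch of how the full interface plausibly wires compute_clip_score; the inputs/outputs components are assumptions, since those arguments sit outside the diff context:

```python
import evaluate
import gradio as gr

metric = evaluate.load("sunhill/clip_score")


def compute_clip_score(image, text):
    results = metric.compute(predictions=[text], references=[image])
    return results["clip_score"]


# The component choices below are assumptions, not shown in the diff.
iface = gr.Interface(
    fn=compute_clip_score,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Caption")],
    outputs=gr.Number(label="CLIP score"),
)
iface.launch()
```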
clip_score.py CHANGED
@@ -63,7 +63,7 @@ class CLIPScore(evaluate.Metric):
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
-                     "references": datasets.Value("float32"),
+                     "references": datasets.Image(),
                }
            ),
            # Homepage of the module for documentation
@@ -85,14 +85,12 @@ class CLIPScore(evaluate.Metric):
        refer = self.processor(
            text=None, images=references, return_tensors="pt", padding=True
        )
-         refer["pixel_values"] = refer["pixel_values"][0]
        pred = self.tokenizer(predictions, return_tensors="pt", padding=True)
-         for key in pred:
-             pred[key] = pred[key].squeeze()

        refer_features = self.model.get_image_features(**refer)
        pred_features = self.model.get_text_features(**pred)

        refer_features = refer_features / refer_features.norm(dim=1, keepdim=True)
        pred_features = pred_features / pred_features.norm(dim=1, keepdim=True)
-         return {"clip_score": (refer_features * pred_features).sum().item()}
+         clip_score = (refer_features * pred_features).sum().item()
+         return {"clip_score": clip_score / refer_features.shape[0]}
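
The last hunk replaces the raw sum with a batch average: because both feature matrices are row-normalized, summing the elementwise product over the whole batch adds up the per-pair cosine similarities, so dividing by refer_features.shape[0] yields their mean. A quick sketch checking that identity; the tensors and shapes are illustrative, not taken from the module:

```python
import torch
import torch.nn.functional as F

# Illustrative batch of 4 pairs of 512-dim features, row-normalized like the
# metric's refer_features and pred_features.
refer = F.normalize(torch.randn(4, 512), dim=1)
pred = F.normalize(torch.randn(4, 512), dim=1)

averaged_sum = (refer * pred).sum().item() / refer.shape[0]
mean_cosine = F.cosine_similarity(refer, pred, dim=1).mean().item()
assert abs(averaged_sum - mean_cosine) < 1e-5
```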
tests.py CHANGED
@@ -1,17 +1,69 @@
+ import requests
+ from PIL import Image
+
+ import evaluate
+
+
+ metric = evaluate.load("./clip_score.py")
+
+
+ def download_image(image_path):
+     if image_path.startswith("http"):
+         image = Image.open(requests.get(image_path, stream=True).raw)
+     else:
+         image = Image.open(image_path)
+     return image
+
+
+ def compute_clip_score(image, text):
+     if not isinstance(image, list):
+         references = [image]
+     else:
+         references = image
+     if not isinstance(text, list):
+         predictions = [text]
+     else:
+         predictions = text
+     results = metric.compute(predictions=predictions, references=references)
+     return results["clip_score"]
+
+
+ predictions = ["A cat sitting on a couch", "A scenic view of mountains during sunset"]
+ references = [
+     "https://images.unsplash.com/photo-1720539222585-346e73f01536",
+     "https://images.unsplash.com/photo-1694253987647-4eebcf679974",
+ ]
+ references = [download_image(url) for url in references]
+
test_cases = [
    {
-         "predictions": [0, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0}
+         "predictions": predictions,
+         "references": references,
+         "result": {"clip_score": 0.307},
    },
    {
-         "predictions": [1, 1],
-         "references": [1, 1],
-         "result": {"metric_score": 1}
+         "predictions": predictions[0],
+         "references": references[0],
+         "result": {"clip_score": 0.304},
    },
    {
-         "predictions": [1, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0.5}
-     }
- ]
+         "predictions": predictions[1],
+         "references": references[1],
+         "result": {"clip_score": 0.310},
+     },
+     {
+         "predictions": predictions[0],
+         "references": references[1],
+         "result": {"clip_score": 0.106},
+     },
+     {
+         "predictions": predictions[1],
+         "references": references[0],
+         "result": {"clip_score": 0.134},
+     },
+ ]
+
+ for i, test_case in enumerate(test_cases):
+     result = compute_clip_score(test_case["references"], test_case["predictions"])
+     error = abs(result - test_case["result"]["clip_score"])
+     assert error < 0.1, f"Test case {i} failed"