What's in here?

This directory contains code that replicates the experiments we ran to compute correlation with human judgments on the Flickr8K corpus. This setup has been used in prior work, but replicating the original results from the SPICE paper, which was the first to use this setup, requires a number of specific settings. More details are available in Appendix A of:

CLIPScore: A Reference-free Evaluation Metric for Image Captioning, by Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. https://arxiv.org/abs/2104.08718

How do I run the code?

There are two steps:

Run download.py, which downloads and preprocesses the Flickr8K corpus.
Run compute_metrics.py, which computes the appropriate evaluation metrics and reports their correlations with human judgments.
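
To give a sense of what "correlation with human judgment" means in the second step, here is a minimal sketch of one common rank correlation, Kendall's tau-b, written in plain Python. It is an illustration only: the function name `kendall_tau_b` and the toy scores below are made up for this example, and the exact correlation variant and implementation used by compute_metrics.py may differ (see Appendix A of the paper for the settings that matter).

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences of numbers.

    Hypothetical helper for illustration; not taken from compute_metrics.py.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair is tied on the metric score
        if dy == 0:
            ties_y += 1  # pair is tied on the human rating
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1  # both orderings agree
            else:
                discordant += 1  # orderings disagree

    n_pairs = len(xs) * (len(xs) - 1) / 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Toy data (made up): one metric score and one human rating per caption.
metric_scores = [0.81, 0.42, 0.67, 0.30, 0.55]
human_ratings = [4, 2, 3, 1, 3]

tau = kendall_tau_b(metric_scores, human_ratings)
print(f"Kendall tau-b: {tau:.3f}")
```

A value near 1 means the metric ranks captions in nearly the same order as the human raters; a value near -1 means it ranks them in the opposite order. This pairwise loop is quadratic in the number of captions, so it is only suitable for small illustrations.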