Question about the evaluation metrics for captioning benchmarks

#3
by ygyjrc - opened

Hi, thanks for releasing Marlin-2B and the evaluation results. I have a question regarding the metric used in the leaderboard figures.
In the captioning plots, the y-axis is labeled “VideoEvalV2 mean / 10” for benchmarks such as DREAM-1K and CaReBench. I noticed that the reported scores do not match the official leaderboard scores, which are based on Recall/Precision/F1.
Could you clarify what exactly “VideoEvalV2” is?
I’m very interested in video captioning tasks, so I’d really appreciate any clarification.
