hexuan21 committed on
Commit 7ef7cf6 · verified · 1 Parent(s): 0c1ac5e

Update README.md

Files changed (1)
  1. README.md +20 -18
README.md CHANGED
@@ -28,30 +28,32 @@ a large video evaluation dataset with multi-aspect human scores.
  - MantisScore also beats the best baselines on the other three benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.

  ## Performance
- ### Evaluation Results on 4 benchmarks.
+ ### Evaluation Results

  We test our video evaluation model MantisScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
  For the first two benchmarks, we take the Spearman correlation between the model's output and human ratings,
  averaged over all evaluation aspects, as the indicator.
  For GenAI-Bench and VBench, which include human preference data among two or more videos,
  we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
- | metric | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
- |------------------|----------------:|---------------:|------------:|------------:|-------:|
- | MantisScore | | | | | |
- | Gemini-1.5-Pro | 158.8 | 22.1 | 22.9 | 60.9 | 52.9 |
- | Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | 67.1 | 52.3 |
- | GPT-4o | 155.4 | 23.1 | 28.7 | 52.0 | 51.7 |
- | CLIP-sim | 126.8 | 8.9 | 36.2 | 34.2 | 47.4 |
- | DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 |
- | SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 |
- | CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 |
- | LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 |
- | LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 |
- | X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 |
- | PIQE | 78.3 | -10.1 | -1.2 | 34.5 | 55.1 |
- | BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 |
- | SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 |
- | MES-dyn | 36.7 | -12.9 | -26.4 | 31.4 | 44.5 |
+ | metric | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
+ |-------------------|----------------:|---------------:|------------:|------------:|-------:|
+ | MantisScore (reg) | 278.3 | 75.7 | 51.1 | 78.5 | 73.0 |
+ | MantisScore (gen) | 222.4 | 77.1 | 27.6 | 59.0 | 58.7 |
+ | Gemini-1.5-Pro | 158.8 | 22.1 | 22.9 | 60.9 | 52.9 |
+ | Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | 67.1 | 52.3 |
+ | GPT-4o | 155.4 | 23.1 | 28.7 | 52.0 | 51.7 |
+ | CLIP-sim | 126.8 | 8.9 | 36.2 | 34.2 | 47.4 |
+ | DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 |
+ | SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 |
+ | CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 |
+ | LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 |
+ | LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 |
+ | X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 |
+ | PIQE | 78.3 | -10.1 | -1.2 | 34.5 | 55.1 |
+ | BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 |
+ | Idefics2 | 73.0 | 6.5 | 0.3 | 34.6 | 31.7 |
+ | SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 |
+ | MES-dyn | 36.7 | -12.9 | -26.4 | 31.4 | 44.5 |


  ## Usage
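
The diff above reports two kinds of numbers: aspect-averaged Spearman correlation (VideoEval-test, EvalCrafter) and pairwise preference accuracy (GenAI-Bench, VBench). The sketch below is illustrative only, not the repository's evaluation code; the function names and array layouts (per-aspect score matrices, per-prompt score vectors) are assumptions made for this example.

```python
# Hypothetical sketch of the two metric types described in the README diff above.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def avg_spearman(model_scores: np.ndarray, human_scores: np.ndarray) -> float:
    """Spearman correlation per evaluation aspect, averaged over aspects.

    Both arrays are assumed to be shaped (num_videos, num_aspects), as for
    benchmarks with absolute human ratings (VideoEval-test, EvalCrafter).
    """
    corrs = []
    for aspect in range(model_scores.shape[1]):
        rho, _ = spearmanr(model_scores[:, aspect], human_scores[:, aspect])
        corrs.append(rho)
    return float(np.mean(corrs))


def pairwise_accuracy(model_scores: np.ndarray, human_prefs: np.ndarray) -> float:
    """Fraction of video pairs whose model-score ordering matches human preference.

    Both arrays are assumed to be 1-D over videos generated for the same prompt,
    with higher values meaning preferred, as for GenAI-Bench / VBench-style data.
    """
    pairs = list(combinations(range(len(model_scores)), 2))
    correct = sum(
        np.sign(model_scores[i] - model_scores[j]) == np.sign(human_prefs[i] - human_prefs[j])
        for i, j in pairs
    )
    return correct / len(pairs)
```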