hexuan21 committed on
Commit 7ef7cf6 · verified · 1 Parent(s): 0c1ac5e

Update README.md

Files changed (1)
  1. README.md +20 -18
README.md CHANGED
@@ -28,30 +28,32 @@ a large video evaluation dataset with multi-aspect human scores.
  - MantisScore also beats the best baselines on the other three benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluations.

  ## Performance
- ### Evaluation Results on 4 benchmarks.
+ ### Evaluation Results

  We test our video evaluation model MantisScore on VideoEval-test, EvalCrafter, GenAI-Bench and VBench.
  For the first two benchmarks, we take the Spearman correlation between the model's output and human ratings,
  averaged over all evaluation aspects, as the indicator.
  For GenAI-Bench and VBench, which include human preference data among two or more videos,
  we employ the model's output to predict preferences and use pairwise accuracy as the performance indicator.
- | metric | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
- |------------------|----------------:|---------------:|------------:|------------:|-------:|
- | MantisScore | | | | | |
- | Gemini-1.5-Pro | 158.8 | 22.1 | 22.9 | 60.9 | 52.9 |
- | Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | 67.1 | 52.3 |
- | GPT-4o | 155.4 | 23.1 | 28.7 | 52.0 | 51.7 |
- | CLIP-sim | 126.8 | 8.9 | 36.2 | 34.2 | 47.4 |
- | DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 |
- | SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 |
- | CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 |
- | LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 |
- | LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 |
- | X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 |
- | PIQE | 78.3 | -10.1 | -1.2 | 34.5 | 55.1 |
- | BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 |
- | SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 |
- | MES-dyn | 36.7 | -12.9 | -26.4 | 31.4 | 44.5 |
+ | metric | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
+ |-------------------|----------------:|---------------:|------------:|------------:|-------:|
+ | MantisScore (reg) | 278.3 | 75.7 | 51.1 | 78.5 | 73.0 |
+ | MantisScore (gen) | 222.4 | 77.1 | 27.6 | 59.0 | 58.7 |
+ | Gemini-1.5-Pro | 158.8 | 22.1 | 22.9 | 60.9 | 52.9 |
+ | Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | 67.1 | 52.3 |
+ | GPT-4o | 155.4 | 23.1 | 28.7 | 52.0 | 51.7 |
+ | CLIP-sim | 126.8 | 8.9 | 36.2 | 34.2 | 47.4 |
+ | DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 |
+ | SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 |
+ | CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 |
+ | LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 |
+ | LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 |
+ | X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 |
+ | PIQE | 78.3 | -10.1 | -1.2 | 34.5 | 55.1 |
+ | BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 |
+ | Idefics2 | 73.0 | 6.5 | 0.3 | 34.6 | 31.7 |
+ | SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 |
+ | MES-dyn | 36.7 | -12.9 | -26.4 | 31.4 | 44.5 |


  ## Usage
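
The diff above reports two kinds of numbers: aspect-averaged Spearman correlation (VideoEval-test, EvalCrafter) and pairwise preference accuracy (GenAI-Bench, VBench). The sketch below is illustrative only, not the repository's evaluation code; the function names and array layouts (per-aspect score matrices, per-prompt score vectors) are assumptions made for this example.

```python
# Hypothetical sketch of the two metric types described in the README diff above.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def avg_spearman(model_scores: np.ndarray, human_scores: np.ndarray) -> float:
    """Spearman correlation per evaluation aspect, averaged over aspects.

    Both arrays are assumed to be shaped (num_videos, num_aspects), as for
    benchmarks with absolute human ratings (VideoEval-test, EvalCrafter).
    """
    corrs = []
    for aspect in range(model_scores.shape[1]):
        rho, _ = spearmanr(model_scores[:, aspect], human_scores[:, aspect])
        corrs.append(rho)
    return float(np.mean(corrs))


def pairwise_accuracy(model_scores: np.ndarray, human_prefs: np.ndarray) -> float:
    """Fraction of video pairs whose model-score ordering matches human preference.

    Both arrays are assumed to be 1-D over videos generated for the same prompt,
    with higher values meaning preferred, as for GenAI-Bench / VBench-style data.
    """
    pairs = list(combinations(range(len(model_scores)), 2))
    correct = sum(
        np.sign(model_scores[i] - model_scores[j]) == np.sign(human_prefs[i] - human_prefs[j])
        for i, j in pairs
    )
    return correct / len(pairs)
```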