---
license: apache-2.0
datasets:
- TIGER-Lab/VideoFeedback
language:
- en
metrics:
- accuracy/spcc
library_name: transformers
pipeline_tag: visual-question-answering
---

[📃Paper] | [🌐Website](https://tiger-ai-lab.github.io/MantisScore/) | [💻Github](https://github.com/TIGER-AI-Lab/MantisScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/MantisScore) | [🤗Model-variant](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) | [🤗Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore)


## Introduction

- MantisScore is a video quality evaluation model that takes [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as its base model and is trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback), a large video evaluation dataset with multi-aspect human scores.

- MantisScore reaches a Spearman correlation above 75 with human ratings on VideoFeedback-test, surpassing all MLLM-prompting methods and feature-based metrics.

- MantisScore also beats the best baselines on the three other benchmarks EvalCrafter, GenAI-Bench, and VBench, showing high alignment with human evaluations.

## Evaluation Results

We test our video evaluation model MantisScore on VideoFeedback-test, EvalCrafter, GenAI-Bench, and VBench. For the first two benchmarks, the indicator is the Spearman correlation between the model's output and the human ratings, averaged over all evaluation aspects. For GenAI-Bench and VBench, which include human preference data over two or more videos, we use the model's output to predict the preferences and report pairwise accuracy as the indicator, as sketched below.
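
For reference, both indicators can be computed in a few lines. The following is a minimal sketch, not the benchmarks' official evaluation code; the `human` and `model` score arrays are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical scores for 4 videos; placeholders, not benchmark data
human = np.array([3.0, 1.5, 4.0, 2.0])  # human ratings
model = np.array([2.8, 1.9, 3.6, 2.4])  # model-predicted scores

# Spearman correlation (VideoFeedback-test, EvalCrafter)
rho, _ = spearmanr(human, model)

# pairwise accuracy (GenAI-Bench, VBench): for every pair of videos with a
# human preference, check whether the model scores order them the same way
pairs = [(i, j) for i in range(len(human)) for j in range(i + 1, len(human))
         if human[i] != human[j]]
correct = sum((model[i] - model[j]) * (human[i] - human[j]) > 0 for i, j in pairs)
pairwise_acc = correct / len(pairs)
print(rho, pairwise_acc)
```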

Moreover, we use [MantisScore](https://huggingface.co/TIGER-Lab/MantisScore), trained on the full VideoFeedback dataset, for the VideoFeedback-test set; for the other three benchmarks, we use the [MantisScore-anno-only](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) variant, trained on VideoFeedback with real videos excluded.

The evaluation results are shown below:

| metric | Final Avg Score | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench |
|:-----------------:|:---------------:|:------------------:|:-----------:|:-----------:|:------:|
| MantisScore (reg) | **69.6** | 75.7 | **51.1** | **78.5** | **73.0** |
| MantisScore (gen) | 55.6 | **77.1** | 27.6 | 59.0 | 58.7 |
| Gemini-1.5-Pro | <u>39.7</u> | 22.1 | 22.9 | 60.9 | 52.9 |
| Gemini-1.5-Flash | 39.4 | 20.8 | 17.3 | <u>67.1</u> | 52.3 |
| GPT-4o | 38.9 | <u>23.1</u> | 28.7 | 52.0 | 51.7 |
| CLIP-sim | 31.7 | 8.9 | <u>36.2</u> | 34.2 | 47.4 |
| DINO-sim | 30.3 | 7.5 | 32.1 | 38.5 | 43.3 |
| SSIM-sim | 29.5 | 13.4 | 26.9 | 34.1 | 43.5 |
| CLIP-Score | 28.6 | -7.2 | 21.7 | 45.0 | 54.9 |
| LLaVA-1.5-7B | 27.1 | 8.5 | 10.5 | 49.9 | 39.4 |
| LLaVA-1.6-7B | 23.3 | -3.1 | 13.2 | 44.5 | 38.7 |
| X-CLIP-Score | 23.2 | -1.9 | 13.3 | 41.4 | 40.1 |
| PIQE | 19.6 | -10.1 | -1.2 | 34.5 | <u>55.1</u> |
| BRISQUE | 19.0 | -20.3 | 3.9 | 38.5 | 53.7 |
| Idefics2 | 18.3 | 6.5 | 0.3 | 34.6 | 31.7 |
| MSE-dyn | 10.6 | -5.5 | -17.0 | 28.4 | 36.5 |
| SSIM-dyn | 9.2 | -12.9 | -26.4 | 31.4 | 44.5 |
<!-- | Fuyu | - | - | - | - | - |
| Kosmos-2 | - | - | - | - | - |
| CogVLM | - | - | - | - | - |
| OpenFlamingo | - | - | - | - | - | -->

The best result in the MantisScore series is in bold, and the best among the baselines is underlined.
<!-- "-" means the answer of the MLLM is meaningless or in the wrong format. -->

## Usage
### Installation
```bash
pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
```

### Inference
```shell
cd examples
```

```python
import av
import numpy as np
from typing import List
import torch
from PIL import Image
from transformers import AutoProcessor
from models.idefics2 import Idefics2ForSequenceClassification  # from the MantisScore repo

def _read_video_pyav(
    container,           # an opened PyAV container
    indices: List[int],  # indices of the frames to keep
):
    # decode the video stream and keep only the frames at `indices`
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

MAX_NUM_FRAMES = 16
ROUND_DIGIT = 4
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree, the degree of dynamic changes
(4) text-to-video alignment, the alignment between the text prompt and the video content
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge

for each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of video are as follows:
"""

video_path = "examples/video1.mp4"
video_prompt = "Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."

processor = AutoProcessor.from_pretrained("TIGER-Lab/MantisScore", torch_dtype=torch.bfloat16)
model = Idefics2ForSequenceClassification.from_pretrained("TIGER-Lab/MantisScore", torch_dtype=torch.bfloat16).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# sample at most MAX_NUM_FRAMES frames uniformly from the video
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)

frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
# append one <image> placeholder per sampled frame so the processor can
# interleave the frames with the text prompt
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    eval_prompt += "<image> " * (len(frames) - num_image_token)

# flatten (possibly nested) image lists and open any paths as PIL images
flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]
inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
num_aspects = logits.shape[-1]

# one regression score per evaluation aspect
aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)

"""
# model output on visual quality, temporal consistency, dynamic degree,
# text-to-video alignment, factual consistency, respectively
[2.2969, 2.4375, 2.8281, 2.5, 2.4688]
"""
```
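
The five logits correspond, in order, to the five dimensions listed in the prompt, so they can be paired with their names for readability. This is a small convenience snippet on top of the example above, not part of the official code:

```python
# pair each aspect name with its predicted score; the order follows
# the five dimensions in REGRESSION_QUERY_PROMPT above
ASPECT_NAMES = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]
for name, score in zip(ASPECT_NAMES, aspect_scores):
    print(f"{name}: {score}")
```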

### Training
See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/tree/main/training) for details.

### Evaluation
See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/tree/main/benchmark) for details.

## Citation