---
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
language:
- en
metrics:
- spearmanr
- rouge
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation
<img src="https://cdn-uploads.huggingface.co/production/uploads/6526e6845fe2335fdd327ace/C4udNbYMyA63IgFvJqQi5.png" height="100"/>
## Introduction
This repository contains the AIGVE-MACS model, a unified model for AI-Generated Video Evaluation (AIGVE), as presented in the paper [AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation](https://arxiv.org/abs/2507.01255).
**AIGVE-MACS** is a unified Vision-Language Model (VLM) for evaluating AI-generated videos. It produces both **numerical scores** (from 0 to 5) and **natural language justifications** across **9 human-aligned aspects** of video quality:
| Metric | Description |
|----------------------------|-------------|
| **Technical Quality** | Assesses the technical aspects of the video, including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts. |
| **Dynamic** | Measures the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons. |
| **Consistency** | Evaluates whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes. |
| **Physics** | Determines if the scene adheres to physical laws, ensuring that object behaviors and interactions are realistic and aligned with real-world physics. |
| **Element Presence** | Checks if all objects mentioned in the instructions are present in the video. The score is based on the proportion of objects that are correctly included. |
| **Element Quality** | Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances. |
| **Action/Interaction Presence** | Evaluates whether all actions and interactions described in the instructions are accurately represented in the video. |
| **Action/Interaction Quality** | Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene. |
| **Overall** | Reflects the comprehensive quality of the video based on all metrics, allowing raters to incorporate their subjective preferences into the evaluation. |
---
## Quickstart
### Installation
```bash
pip install transformers accelerate
pip install qwen-vl-utils
```
### Example Usage
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xiaoliux/AIGVE-MACS",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda:0")
processor = AutoProcessor.from_pretrained("xiaoliux/AIGVE-MACS", use_fast=True)

# Compose input message
def get_user_message(video_frames, prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "resized_height": 480,
                    "resized_width": 854,
                    "fps": 1,
                    "video": video_frames,
                },
                {
                    "type": "text",
                    "text": "You are an expert in evaluating AI-Generated Videos, you evaluate videos in the following 9 aspects: "
                            "1. technical_quality: including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts. "
                            "2. dynamic: the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons. "
                            "3. consistency: whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes. "
                            "4. physics: Determines if the scene adheres to physical laws. "
                            "5. element_presentence: Checks if all objects mentioned in the instructions are present in the video. "
                            "6. element_quality: Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances. "
                            "7. action_presentence: Evaluates whether all actions and interactions described in the instructions are accurately represented in the video. "
                            "8. action_quality: Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene. "
                            "9. overall: Reflects the comprehensive quality of the video based on all metrics. "
                            "The score can be chosen from [0, 5] with whole numbers. You should also include the comment for each score. "
                            "Please output as a JSON. "
                            f"The video instruction is: {prompt}"
                },
            ],
        },
    ]
    return messages

# Example inputs: a list of pre-extracted frame paths and the generation prompt
video_frames = ["/path/to/frame1.png", "/path/to/frame2.png", ...]
prompt = "A tiger runs across a snowy field while snowflakes fall."

messages = get_user_message(video_frames, prompt)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
).to("cuda:0")

# Inference: generate, then strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=1500)
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print("Evaluation Result:\n", output_text)
```
## Output Format
```json
{
"technical_quality": {"score": 5, "comment": "..."},
"dynamic": {"score": 4, "comment": "..."},
"consistency": {"score": 5, "comment": "..."},
...
"overall": {"score": 4, "comment": "..."}
}
```
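The generated text can be post-processed with the standard `json` module. A minimal sketch (the `parse_evaluation` helper and the fence-stripping step are assumptions for illustration; real generations may need additional cleanup):

```python
import json

def parse_evaluation(output_text: str) -> dict:
    """Extract {aspect: score} from the model's JSON evaluation."""
    # Generations sometimes wrap the JSON in a markdown fence; strip it first.
    cleaned = output_text.strip().removeprefix("```json").removesuffix("```").strip()
    evaluation = json.loads(cleaned)
    return {aspect: entry["score"] for aspect, entry in evaluation.items()}

sample = '{"technical_quality": {"score": 5, "comment": "Sharp."}, "overall": {"score": 4, "comment": "Good."}}'
scores = parse_evaluation(sample)
print(scores)                              # {'technical_quality': 5, 'overall': 4}
print(sum(scores.values()) / len(scores))  # 4.5
```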
---
## Main Results on AIGVE-BENCH 2
### Score Correlation (Spearman's ρ ↑)
| Method | TQ | Dy | CS | Phy | EP | EQ | AP | AQ | OR | AVG |
|----------------|------|------|------|------|------|------|------|------|------|--------|
| GPT-4o | 34.71| 7.05 |18.12 |20.28 |23.10 |30.47 |36.57 |31.58 |38.57 | 26.72 |
| GPT-4.1 | 36.49| 5.81 |26.68 |19.87 |27.22 |28.77 |32.75 |20.22 |29.98 | 25.31 |
| Qwen2.5-VL | 8.77| 4.00 | 1.24 |-6.01 | 9.19 |10.19 |18.74 | 0.72 | 9.59 | 6.27 |
| VideoLLaMA3 | 15.94|19.44 |11.70 |13.21 |-3.13 |12.27 |13.61 |-0.69 |11.58 | 10.44 |
| **AIGVE-MACS** |**40.60**|**57.31**|**61.49**|**64.36**|**40.32**|**40.81**|**44.31**|**60.71**|**59.88**|**52.20**|
> TQ: Technical Quality, Dy: Dynamics, CS: Consistency, Phy: Physics
> EP/EQ: Element Presence/Quality, AP/AQ: Action Presence/Quality, OR: Overall
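For reference, Spearman's ρ is the Pearson correlation of the rank values of the two score lists. A self-contained pure-Python sketch with tie-aware average ranks (in practice `scipy.stats.spearmanr` is the usual choice; inputs are assumed non-constant):

```python
def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation on the rank values."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human = [5, 4, 4, 2, 1]
model = [4.5, 4.0, 3.5, 2.0, 0.5]
print(round(spearman_rho(human, model), 3))
```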
---
### Comment Generation Quality
| Method | ROUGE-1 ↑ | ROUGE-L ↑ | BERTScore ↑ | UniEval-Fact ↑ | G-Eval ↑ |
|----------------|-----------|------------|--------------|----------------|-----------|
| GPT-4o | 18.30 | 15.86 | 74.90 | 40.84 | 2.10 |
| GPT-4.1 | 15.80 | 12.94 | 73.99 | 43.99 | 2.10 |
| Qwen2.5-VL | 17.95 | 15.31 | 74.31 | 42.32 | 2.37 |
| VideoLLaMA3 | 19.99 | 17.67 | 75.35 | 40.21 | 2.18 |
| **AIGVE-MACS** | **49.50** | **38.00** | **85.87** | **57.04** | **3.42** |
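The table's ROUGE numbers come from the standard toolkit; as an illustration only, a unigram-overlap ROUGE-1 F1 can be sketched in a few lines (no stemming or toolkit-specific tokenization, so values will not match the table):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a reference comment."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Clipped unigram overlap: each reference token counts at most once per occurrence.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if not overlap:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the tiger runs fast", "the tiger runs across snow"), 3))  # 0.667
```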
---
## Citation
```bibtex
@article{liu2025aigvemacs,
title={AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation},
author={Xiao Liu and Jiawei Zhang},
journal={arXiv preprint arXiv:2507.01255},
year={2025}
}
```
---
## Additional Resources
- [Paper on arXiv](https://arxiv.org/abs/2507.01255)
- [Hugging Face Model](https://huggingface.co/xiaoliux/AIGVE-MACS)
- AIGVE-BENCH 2 Dataset: *Coming soon*