---
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
language:
- en
metrics:
- spearmanr
- rouge
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation
<img src="https://cdn-uploads.huggingface.co/production/uploads/6526e6845fe2335fdd327ace/C4udNbYMyA63IgFvJqQi5.png" height="100"/>
## Introduction
This repository contains the AIGVE-MACS model, a unified model for AI-Generated Video Evaluation (AIGVE), as presented in the paper [AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation](https://arxiv.org/abs/2507.01255).
**AIGVE-MACS** is a unified Vision-Language Model (VLM) for evaluating AI-generated videos. It produces both **numerical scores** (from 0 to 5) and **natural language justifications** across **9 human-aligned aspects** of video quality:
| Metric | Description |
|----------------------------|-------------|
| **Technical Quality** | Assesses the technical aspects of the video, including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts. |
| **Dynamic** | Measures the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons. |
| **Consistency** | Evaluates whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes. |
| **Physics** | Determines if the scene adheres to physical laws, ensuring that object behaviors and interactions are realistic and aligned with real-world physics. |
| **Element Presence** | Checks if all objects mentioned in the instructions are present in the video. The score is based on the proportion of objects that are correctly included. |
| **Element Quality** | Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances. |
| **Action/Interaction Presence** | Evaluates whether all actions and interactions described in the instructions are accurately represented in the video. |
| **Action/Interaction Quality** | Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene. |
| **Overall** | Reflects the comprehensive quality of the video based on all metrics, allowing raters to incorporate their subjective preferences into the evaluation. |
---
## Quickstart
### Installation
```bash
pip install transformers accelerate
pip install qwen-vl-utils
```
### Example Usage
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "xiaoliux/AIGVE-MACS",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda:0")
processor = AutoProcessor.from_pretrained("xiaoliux/AIGVE-MACS", use_fast=True)

# Compose input message
def get_user_message(video_frames, prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "resized_height": 480,
                    "resized_width": 854,
                    "fps": 1,
                    "video": video_frames,
                },
                {
                    "type": "text",
                    "text": "You are an expert in evaluating AI-Generated Videos, you evaluate videos in the following 9 aspects: "
                            "1. technical_quality: including whether the resolution is sufficient for object recognition, whether the colors are natural, and whether there is an absence of noise or artifacts. "
                            "2. dynamic: the extent of pixel changes throughout the video, focusing on significant object or camera movements and changes in environmental factors such as daylight, weather, or seasons. "
                            "3. consistency: whether objects in the video maintain consistent properties, avoiding glitches, flickering, or unexpected changes. "
                            "4. physics: Determines if the scene adheres to physical laws. "
                            "5. element_presentence: Checks if all objects mentioned in the instructions are present in the video. "
                            "6. element_quality: Assesses the realism and fidelity of objects in the video, awarding higher scores for detailed, natural, and visually appealing appearances. "
                            "7. action_presentence: Evaluates whether all actions and interactions described in the instructions are accurately represented in the video. "
                            "8. action_quality: Measures the naturalness and smoothness of actions and interactions, with higher scores for those that are realistic, lifelike, and seamlessly integrated into the scene. "
                            "9. overall: Reflects the comprehensive quality of the video based on all metrics. "
                            "The score can be chosen from [0, 5] with whole numbers. You should also include the comment for each score. "
                            "Please output as a JSON. "
                            f"The video instruction is: {prompt}"
                },
            ],
        },
    ]
    return messages

# Example inputs: a list of pre-extracted frame paths and the generation prompt
video_frames = ["/path/to/frame1.png", "/path/to/frame2.png", ...]
prompt = "A tiger runs across a snowy field while snowflakes fall."

messages = get_user_message(video_frames, prompt)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
).to("cuda:0")

# Inference: generate, then strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=1500)
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print("Evaluation Result:\n", output_text)
```
## Output Format
```json
{
"technical_quality": {"score": 5, "comment": "..."},
"dynamic": {"score": 4, "comment": "..."},
"consistency": {"score": 5, "comment": "..."},
...
"overall": {"score": 4, "comment": "..."}
}
```
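The generated text can be post-processed with the standard `json` module. A minimal sketch (the `parse_evaluation` helper and the fence-stripping step are assumptions for illustration; real generations may need additional cleanup):

```python
import json

def parse_evaluation(output_text: str) -> dict:
    """Extract {aspect: score} from the model's JSON evaluation."""
    # Generations sometimes wrap the JSON in a markdown fence; strip it first.
    cleaned = output_text.strip().removeprefix("```json").removesuffix("```").strip()
    evaluation = json.loads(cleaned)
    return {aspect: entry["score"] for aspect, entry in evaluation.items()}

sample = '{"technical_quality": {"score": 5, "comment": "Sharp."}, "overall": {"score": 4, "comment": "Good."}}'
scores = parse_evaluation(sample)
print(scores)                              # {'technical_quality': 5, 'overall': 4}
print(sum(scores.values()) / len(scores))  # 4.5
```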
---
## Main Results on AIGVE-BENCH 2
### Score Correlation (Spearman's ρ ↑)
| Method | TQ | Dy | CS | Phy | EP | EQ | AP | AQ | OR | AVG |
|----------------|------|------|------|------|------|------|------|------|------|--------|
| GPT-4o | 34.71| 7.05 |18.12 |20.28 |23.10 |30.47 |36.57 |31.58 |38.57 | 26.72 |
| GPT-4.1 | 36.49| 5.81 |26.68 |19.87 |27.22 |28.77 |32.75 |20.22 |29.98 | 25.31 |
| Qwen2.5-VL | 8.77| 4.00 | 1.24 |-6.01 | 9.19 |10.19 |18.74 | 0.72 | 9.59 | 6.27 |
| VideoLLaMA3 | 15.94|19.44 |11.70 |13.21 |-3.13 |12.27 |13.61 |-0.69 |11.58 | 10.44 |
| **AIGVE-MACS** |**40.60**|**57.31**|**61.49**|**64.36**|**40.32**|**40.81**|**44.31**|**60.71**|**59.88**|**52.20**|
> TQ: Technical Quality, Dy: Dynamics, CS: Consistency, Phy: Physics
> EP/EQ: Element Presence/Quality, AP/AQ: Action Presence/Quality, OR: Overall
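For reference, Spearman's ρ is the Pearson correlation of the rank values of the two score lists. A self-contained pure-Python sketch with tie-aware average ranks (in practice `scipy.stats.spearmanr` is the usual choice; inputs are assumed non-constant):

```python
def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation on the rank values."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human = [5, 4, 4, 2, 1]
model = [4.5, 4.0, 3.5, 2.0, 0.5]
print(round(spearman_rho(human, model), 3))
```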
---
### Comment Generation Quality
| Method | ROUGE-1 ↑ | ROUGE-L ↑ | BERTScore ↑ | UniEval-Fact ↑ | G-Eval ↑ |
|----------------|-----------|------------|--------------|----------------|-----------|
| GPT-4o | 18.30 | 15.86 | 74.90 | 40.84 | 2.10 |
| GPT-4.1 | 15.80 | 12.94 | 73.99 | 43.99 | 2.10 |
| Qwen2.5-VL | 17.95 | 15.31 | 74.31 | 42.32 | 2.37 |
| VideoLLaMA3 | 19.99 | 17.67 | 75.35 | 40.21 | 2.18 |
| **AIGVE-MACS** | **49.50** | **38.00** | **85.87** | **57.04** | **3.42** |
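The table's ROUGE numbers come from the standard toolkit; as an illustration only, a unigram-overlap ROUGE-1 F1 can be sketched in a few lines (no stemming or toolkit-specific tokenization, so values will not match the table):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a reference comment."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Clipped unigram overlap: each reference token counts at most once per occurrence.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if not overlap:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the tiger runs fast", "the tiger runs across snow"), 3))  # 0.667
```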
---
## Citation
```bibtex
@article{liu2025aigvemacs,
title={AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation},
author={Xiao Liu and Jiawei Zhang},
journal={arXiv preprint arXiv:2507.01255},
year={2025}
}
```
---
## Additional Resources
- [Paper on arXiv](https://arxiv.org/abs/2507.01255)
- [Hugging Face Model](https://huggingface.co/xiaoliux/AIGVE-MACS)
- AIGVE-BENCH 2 Dataset: *Coming soon*