---
license: apache-2.0
base_model:
- MAGAer13/mplug-owl2-llama2-7b
---
This repo contains the official implementation of the **AAAI 2026** paper "Regression Over Classification: Assessing Image Aesthetics via Multimodal Large Language Models".

## Introduction
Image Aesthetics Assessment (IAA) evaluates visual quality through user-centered perceptual analysis and can guide various applications. Recent advances in Multimodal Large Language Models (MLLMs) have sparked interest in adapting them for IAA. However, two critical limitations persist in applying MLLMs to IAA:
1) the tokenization strategy leads to insensitivity to scores;
2) the classification-based decoding mechanism introduces score quantization errors.

Moreover, current MLLM-based IAA methods treat the task as coarse rating classification followed by probability-to-score mapping, which loses fine-grained information.

To address these challenges, we propose ROC4MLLM, offering complementary solutions from two perspectives:
1) Representation: We separate scores from the word token space to avoid tokenizing scores as text. An independent position token bridges these spaces, improving the model's sensitivity to score positions in text.
2) Computation: We apply distinct loss functions for text and score predictions to enhance the model's sensitivity to score gradients. Decoupling scores from text ensures effective supervision while preventing interference between scores and text in the loss computation.

Extensive experiments across five datasets demonstrate ROC4MLLM's state-of-the-art performance without requiring additional training data. Additionally, ROC4MLLM's plug-and-play design ensures seamless integration with existing MLLMs, boosting their IAA performance.

<img alt="method" src="https://github.com/user-attachments/assets/4e6e7509-6679-4fae-a05d-4191caff42a6" />

## Checkpoints
* Download the weights from [Baidu Netdisk](https://pan.baidu.com/s/1GwX6AEsJ3txDpxPeCdY6LA?pwd=bupt) or [HuggingFace](https://huggingface.co/Ricardo-M/ROC4MLLM).
* Or download the Docker environment from [Baidu Netdisk](https://pan.baidu.com/s/1IjRsT691hom9naxtjlIzKw?pwd=bupt).

## Usage

### Install
1. Clone this repository and navigate to the ROC4MLLM folder
```bash
git clone https://github.com/woshidandan/Assessing-Image-Aesthetics-via-Multimodal-Large-Language-Models.git
cd ROC4MLLM
```

2. Install the package
```bash
conda create -n roc4mllm python=3.10 -y
conda activate roc4mllm
pip install --upgrade pip
pip install -e .
```

3. Install additional packages for training
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

### Quick Start Code
```python
from PIL import Image

from mplug_owl2.assessor import Assessment

# Load the ROC4MLLM weights.
assessment = Assessment(pretrained="../ROC4MLLM_weights")

# Build a batch of input images.
images = ["test_images/1_-10.jpg", "test_images/1_-10.jpg"]
input_img = []
for image in images:
    img = Image.open(image).convert("RGB")
    input_img.append(img)

# Predict aesthetic scores for the batch.
answer = assessment(input_img, precision=4)
print(answer)
```

## Training
### Prepare Training Data
Please refer to [mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl) for data preparation.

**Notes**: We have added a `gt_score` field. If you intend to use CE loss or EMD loss, the `target` field is also required.

Below is an example of a data sample in AVA:
```json
{
  "image": "771257.jpg",
  "gt_score": 3.463414634146341,
  "conversations": [
    {"from": "human", "value": "<|image|>Could you evaluate the aesthetics of this image?"},
    {"from": "gpt", "value": "The aesthetic rate of the image is [SCORE]. "}
  ],
  "target": [0.15853658536585366, 0.10975609756097561, 0.2073170731707317, 0.2926829268292683, 0.16463414634146342, 0.04878048780487805, 0.0, 0.0, 0.006097560975609756, 0.012195121951219513]
}
```
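In this sample, `gt_score` is consistent with the mean of the AVA rating histogram stored in `target` (probabilities for ratings 1 through 10). A minimal sketch of that relationship:

```python
# Rating histogram from the sample above: probability of each rating 1..10.
target = [0.15853658536585366, 0.10975609756097561, 0.2073170731707317,
          0.2926829268292683, 0.16463414634146342, 0.04878048780487805,
          0.0, 0.0, 0.006097560975609756, 0.012195121951219513]

# gt_score is the histogram's mean rating: sum of rating * probability.
gt_score = sum(rating * p for rating, p in enumerate(target, start=1))
print(gt_score)  # matches the sample's gt_score of 3.463414634146341
```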
Place your data file path in `DATA_FILE` within `scripts/finetune.sh`. You also need to update `Image_root` in the same script to point to the directory where your original images are stored.

### Prepare model checkpoint
Download the pretrained model checkpoints and update `LOAD` in `scripts/finetune.sh` accordingly.
### Training scripts
Run the following command to start training:
```bash
bash scripts/finetune.sh
```
You can modify `min_score` and `max_score` to define the score range of your dataset. Use `l1_weight`, `ce_weight`, and `emd_weight` to configure the loss functions and their respective weights for the score loss.
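Conceptually, the weighted score loss controlled by these flags can be sketched as below. This is an illustration of the weighting scheme only, not the repository's implementation; the plain-Python EMD/CE formulas and helper names are assumptions.

```python
import math

def emd_loss(pred, target):
    # Earth Mover's Distance between two discrete distributions:
    # mean absolute difference of their cumulative sums.
    cum_p, cum_t, total = 0.0, 0.0, 0.0
    for p, t in zip(pred, target):
        cum_p += p
        cum_t += t
        total += abs(cum_p - cum_t)
    return total / len(pred)

def ce_loss(pred, target):
    # Cross-entropy of predicted probabilities against a target distribution.
    eps = 1e-12
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def score_loss(pred_score, gt_score, pred_dist, target_dist,
               l1_weight=1.0, ce_weight=0.0, emd_weight=0.0):
    # Weighted combination of L1, CE, and EMD terms for score supervision.
    loss = l1_weight * abs(pred_score - gt_score)
    if ce_weight:
        loss += ce_weight * ce_loss(pred_dist, target_dist)
    if emd_weight:
        loss += emd_weight * emd_loss(pred_dist, target_dist)
    return loss
```

With only `l1_weight` set, the loss reduces to the absolute error between the predicted and ground-truth scores; the CE and EMD terms additionally supervise the predicted rating distribution against the `target` field.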

**Important Note**: If you use CE or EMD loss, ensure that `num_tokens` matches the length of the `target` field in your training data.

## If you find our work useful, please cite our paper:
```bibtex
@inproceedings{MaRegression,
  title     = {Regression Over Classification: Assessing Image Aesthetics via Multimodal Large Language Models},
  author    = {Ma, Xingyuan and He, Shuai and Ming, Anlong and Zhong, Haobin and Ma, Huadong},
  booktitle = {Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI)},
  year      = {2026}
}
```