Improve model card: Update pipeline tag, add license, and enhance content with usage and results

#1 by nielsr HF Staff - opened

Files changed (1)
  1. README.md +195 -2
README.md CHANGED
@@ -3,16 +3,34 @@ datasets:
  - chaofengc/IQA-PyTorch-Datasets
  language:
  - en
- pipeline_tag: visual-question-answering
  library_name: transformers
  ---
  # Visual Prompt Checkpoints for NR-IQA
  🔬 **Paper**: [Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA](https://arxiv.org/abs/2509.03494)
  💻 **Code**: [GitHub Repository](https://github.com/yahya-ben/mplug2-vp-for-nriqa)

  ## Overview
  Pre-trained visual prompt checkpoints for **No-Reference Image Quality Assessment (NR-IQA)** using mPLUG-Owl2-7B. Achieves competitive performance with only **~600K parameters** vs 7B+ for full fine-tuning.

  ## Available Checkpoints
  **Download**: `visual_prompt_ckpt_trained_on_mplug2.zip`

@@ -22,4 +40,179 @@ Pre-trained visual prompt checkpoints for **No-Reference Image Quality Assessment
  | KonIQ-10k | 0.852 | `SGD_mplug2_exp_05_koniq_padding_30px_add/` |
  | AGIQA-3k | 0.810 | `SGD_mplug2_exp_06_agiqa_padding_30px_add/` |

- **📖 For detailed setup, training, and usage instructions, see the [GitHub repository](https://github.com/your-username/visual-prompt-nr-iqa).**

  - chaofengc/IQA-PyTorch-Datasets
  language:
  - en
  library_name: transformers
+ pipeline_tag: image-text-to-text
+ license: apache-2.0
  ---
+
+ ![Python](https://img.shields.io/badge/python-3.10-blue) ![HuggingFace](https://img.shields.io/badge/hub-checkpoints-orange) [![arXiv](https://img.shields.io/badge/arXiv-2509.03494-lightgrey)](https://arxiv.org/abs/2509.03494)
+
  # Visual Prompt Checkpoints for NR-IQA
  🔬 **Paper**: [Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA](https://arxiv.org/abs/2509.03494)
  💻 **Code**: [GitHub Repository](https://github.com/yahya-ben/mplug2-vp-for-nriqa)

+ ## Abstract
+ In this paper, we propose a novel parameter-efficient adaptation method for No-Reference Image Quality Assessment (NR-IQA) using visual prompts optimized in pixel space. Unlike full fine-tuning of Multimodal Large Language Models (MLLMs), our approach trains at most 600K parameters (< 0.01% of the base model) while keeping the underlying model fully frozen. During inference, these visual prompts are combined with images via addition and processed by mPLUG-Owl2 with the textual query "Rate the technical quality of the image." Evaluations across distortion types (synthetic, realistic, AI-generated) on KADID-10k, KonIQ-10k, and AGIQA-3k demonstrate competitive performance against fully fine-tuned methods and specialized NR-IQA models, achieving 0.93 SRCC on KADID-10k. To our knowledge, this is the first work to leverage pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation for low-level vision tasks. The source code is publicly available at https://github.com/yahya-ben/mplug2-vp-for-nriqa.
+
  ## Overview
  Pre-trained visual prompt checkpoints for **No-Reference Image Quality Assessment (NR-IQA)** using mPLUG-Owl2-7B. Achieves competitive performance with only **~600K parameters** vs 7B+ for full fine-tuning.
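
For intuition, the pixel-space prompting idea can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the repository's implementation: the class name, the 448x448 input resolution, and the masking details are assumptions, and in the actual pipeline the prompt is applied to the preprocessed image before it is passed to the frozen mPLUG-Owl2.

```python
import torch
import torch.nn as nn

class PaddingPrompt(nn.Module):
    """Learnable pixel border that is added to the input image (illustrative)."""

    def __init__(self, pad: int = 30, size: int = 448):
        super().__init__()
        # A full-size learnable tensor has 3 * 448 * 448 = 602,112 values,
        # i.e. roughly the "~600K parameters at most" figure; the mask below
        # restricts gradient flow to the border strip for the padding variant.
        self.prompt = nn.Parameter(torch.zeros(3, size, size))
        mask = torch.zeros(3, size, size)
        mask[:, :pad, :] = 1
        mask[:, -pad:, :] = 1
        mask[:, :, :pad] = 1
        mask[:, :, -pad:] = 1
        self.register_buffer("mask", mask)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # The "add" combination: prompted image = image + visual prompt,
        # while the MLLM itself stays completely frozen.
        return image + self.mask * self.prompt

prompt = PaddingPrompt(pad=30, size=448)
images = torch.rand(2, 3, 448, 448)   # stand-in for a preprocessed image batch
prompted = prompt(images)             # fed to mPLUG-Owl2 with the query
                                      # "Rate the technical quality of the image."
print(sum(p.numel() for p in prompt.parameters()))  # 602112
```

The Fixed Patch and Full Overlay variants reported below differ only in which pixels the prompt is allowed to modify.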

+ ## 🔥 Key Features
+
+ - **Parameter-Efficient**: Only ~600K trainable parameters vs 7B+ for full fine-tuning
+ - **Competitive Performance**: Achieves 0.93 SROCC on the KADID-10k dataset
+ - **Multiple Visual Prompt Types**: Padding, Fixed Patches (Center/Top-Left), Full Overlay
+ - **Multiple MLLM Support**: mPLUG-Owl2-7B
+ - **Comprehensive Evaluation**: Supports KADID-10k, KonIQ-10k, and AGIQA-3k datasets
+ - **Pre-trained Checkpoints**: Available on HuggingFace Hub for immediate use
+
+ ![Method Overview](https://github.com/yahya-ben/mplug2-vp-for-nriqa/raw/main/hero_figure.png)
+
  ## Available Checkpoints
  **Download**: `visual_prompt_ckpt_trained_on_mplug2.zip`

  | KonIQ-10k | 0.852 | `SGD_mplug2_exp_05_koniq_padding_30px_add/` |
  | AGIQA-3k | 0.810 | `SGD_mplug2_exp_06_agiqa_padding_30px_add/` |

+ ## 🏃 Usage
+ This section provides instructions for setting up the environment, preparing datasets, and running inference with the pre-trained visual prompt checkpoints. For detailed setup, training, and further usage instructions, please refer to the [GitHub repository](https://github.com/yahya-ben/mplug2-vp-for-nriqa).
+
+ ### Prerequisites
+
+ - Python 3.10+
+ - CUDA-capable GPU (tested on NVIDIA RTX A6000)
+ - PyTorch
+ - HuggingFace Transformers
+
+ ### Setup Environment
+
+ #### For mPLUG-Owl2:
+ ```bash
+ # Clone and setup mPLUG-Owl2
+ git clone https://github.com/X-PLUG/mPLUG-Owl.git
+ cd mPLUG-Owl/mPLUG-Owl2
+ conda create -n mplug_owl2 python=3.10 -y
+ conda activate mplug_owl2
+ pip install --upgrade pip
+ pip install -e .
+ pip install 'numpy<2'
+ pip install protobuf
+ ```
+
+ #### Additional Dependencies:
+ ```bash
+ pip install PyYAML scikit-learn tqdm
+ ```
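
After installation, a quick sanity check along these lines can confirm the environment is usable. The `mplug_owl2` module name comes from the editable install of the mPLUG-Owl2 repository above; the script itself is only a suggestion, not part of the repository.

```python
# env_check.py -- minimal sanity check for the environment set up above
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())

# `pip install -e .` inside mPLUG-Owl/mPLUG-Owl2 exposes the mplug_owl2 package.
import mplug_owl2  # noqa: F401
print("mPLUG-Owl2 package imported successfully")
```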
+
+ ### Dataset Setup
+
+ Download the required IQA datasets:
+
+ ```bash
+ # KonIQ-10k
+ wget https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets/resolve/main/koniq10k.tgz
+ tar -xzf koniq10k.tgz
+
+ # KADID-10k
+ wget https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets/resolve/main/kadid10k.tgz
+ tar -xzf kadid10k.tgz
+
+ # AGIQA-3K
+ wget https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets/resolve/main/AGIQA-3K.zip
+ unzip AGIQA-3K.zip
+ ```
+
+ After extraction, organize your datasets in the `data/` folder as follows:
+
+ ```
+ data/
+ ├── kadid10k/
+ │   ├── images/              # All KADID-10k images
+ │   └── split_kadid10k.csv
+ ├── koniq10k/
+ │   ├── 512x384/             # KonIQ-10k images (comes with own split)
+ │   └── koniq10k_*.csv       # Original split files
+ └── AGIQA-3K/
+     ├── images/              # All AGIQA-3k images
+     └── split_agiqa3k.csv
+ ```
+
+ **Important Notes:**
+ - **KADID-10k**: Move `split_kadid10k.csv` into the `kadid10k/` folder
+ - **KonIQ-10k**: Uses its own original split files, no need to move
+ - **AGIQA-3k**: Move `split_agiqa3k.csv` into the `AGIQA-3K/` folder; images are in the `images/` subfolder
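
A small check like the one below (paths taken directly from the layout above; the script name and structure are illustrative) can catch misplaced split files before running anything:

```python
# check_data_layout.py -- verify the dataset layout described above
from pathlib import Path

EXPECTED = {
    "kadid10k": ["images", "split_kadid10k.csv"],
    "koniq10k": ["512x384"],
    "AGIQA-3K": ["images", "split_agiqa3k.csv"],
}

root = Path("data")
for dataset, entries in EXPECTED.items():
    for entry in entries:
        path = root / dataset / entry
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:7s} {path}")
```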
+
+ ### Pre-trained Checkpoints
+
+ We provide pre-trained visual prompt checkpoints on **HuggingFace Hub** for immediate use:
+
+ 🔗 **[Download Checkpoints](https://huggingface.co/yahya007/mplug2-vp-for-nriqa/tree/main)**
+
+ The checkpoints are provided as `visual_prompt_ckpt_trained_on_mplug2.zip`, which contains training experiment folders with checkpoint directories (`checkpoint-xxxx`). Each experiment folder contains multiple epochs, and the best-performing checkpoint can be identified from the `best_model_checkpoint` info in the final checkpoint folder.
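
Since the experiment folders follow the Hugging Face `Trainer` checkpoint layout, the `best_model_checkpoint` entry can usually be read from `trainer_state.json` inside the last checkpoint directory. The snippet below is a sketch based on that convention (the experiment folder name is taken from the table above); if a folder stores this information differently, inspect it manually as in step 2 below.

```python
# find_best_checkpoint.py -- locate best_model_checkpoint (assumes the HF Trainer layout)
import json
from pathlib import Path

exp_dir = Path("SGD_mplug2_exp_04_kadid_padding_30px_add")

# Keep only numbered checkpoint folders and pick the latest one.
checkpoints = sorted(
    (p for p in exp_dir.glob("checkpoint-*") if p.name.split("-")[-1].isdigit()),
    key=lambda p: int(p.name.split("-")[-1]),
)
state = json.loads((checkpoints[-1] / "trainer_state.json").read_text())
print("Best checkpoint:", state.get("best_model_checkpoint"))
```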
+
+ To use the pre-trained checkpoints:
+
+ 1. **Download and extract the checkpoint archive**:
+ ```bash
+ # Download from HuggingFace Hub
+ wget https://huggingface.co/yahya007/mplug2-vp-for-nriqa/resolve/main/visual_prompt_ckpt_trained_on_mplug2.zip
+ unzip visual_prompt_ckpt_trained_on_mplug2.zip
+ ```
+
+ 2. **Navigate to the desired experiment folder**:
+ ```bash
+ cd SGD_mplug2_exp_04_kadid_padding_30px_add/
+ # Check the latest checkpoint folder (highest number)
+ ls -la checkpoint-*/
+ # Look for best_model_checkpoint info in the final checkpoint
+ ```
+
+ 3. **Update the configuration and checkpoint** in `src/tester.py`:
+ ```python
+ # Update config path
+ config_path = "configs/final_mplug_owl2_configs/SGD_mplug2_exp_04_kadid_padding_30px_add.yaml"
+
+ # Update checkpoint name - use "checkpoint-best" or a specific checkpoint number
+ checkpoint_best = "checkpoint-best"  # or "checkpoint-XXXX" for a specific epoch
+ ```
+
+ 4. **Run inference**:
+ ```bash
+ cd src
+ python tester.py
+ ```
+
+ The inference script outputs:
+ - SRCC (Spearman Rank Correlation Coefficient)
+ - PLCC (Pearson Linear Correlation Coefficient)
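
Both metrics compare the predicted quality scores against the ground-truth mean opinion scores and can be reproduced with SciPy (pulled in as a dependency of scikit-learn above). The score values here are made up purely for illustration:

```python
# srcc_plcc.py -- the two correlation metrics reported by the tester
from scipy.stats import spearmanr, pearsonr

predicted = [3.1, 4.2, 2.0, 4.8, 3.7]   # model outputs (illustrative)
mos       = [3.0, 4.5, 1.8, 4.9, 3.5]   # ground-truth mean opinion scores (illustrative)

srcc, _ = spearmanr(predicted, mos)     # rank correlation (SRCC)
plcc, _ = pearsonr(predicted, mos)      # linear correlation (PLCC)
print(f"SRCC: {srcc:.3f}  PLCC: {plcc:.3f}")
```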
+
+ ## 📈 Results
+
+ ### Best Performance (30px Padding + Addition)
+
+ | Dataset | SROCC | PLCC | Parameters |
+ |---------|-------|------|------------|
+ | KADID-10k | 0.932 | 0.929 | ~600K |
+ | KonIQ-10k | 0.852 | 0.874 | ~600K |
+ | AGIQA-3k | 0.810 | 0.860 | ~600K |
+
+ ### Performance Across Visual Prompt Types
+
+ | Prompt Type | Size | KADID-10k SROCC | KonIQ-10k SROCC | AGIQA-3k SROCC |
+ |-------------|------|-----------------|-----------------|----------------|
+ | Padding | 10px | 0.880 | 0.805 | 0.802 |
+ | Padding | 30px | **0.932** | **0.852** | **0.810** |
+ | Fixed Patch (Center) | 10px | 0.390 | 0.487 | 0.435 |
+ | Fixed Patch (Center) | 30px | 0.806 | 0.647 | 0.725 |
+ | Fixed Patch (Top-Left) | 10px | 0.465 | 0.551 | 0.564 |
+ | Fixed Patch (Top-Left) | 30px | 0.520 | 0.635 | 0.755 |
+ | Full Overlay | - | 0.887 | 0.693 | 0.624 |
+
+ ### Comparison with State-of-the-Art Methods
+
+ | Method | KADID-10k SROCC | KonIQ-10k SROCC | AGIQA-3k SROCC | Parameters |
+ |--------|-----------------|-----------------|----------------|------------|
+ | **Our Method** | **0.932** | **0.852** | **0.810** | ~600K |
+ | Q-Align | 0.919 | 0.940 | 0.727 | 7B |
+ | Q-Instruct | 0.706 | 0.911 | 0.772 | 7B |
+ | LIQE | 0.930 | 0.919 | - | - |
+ | MP-IQE | 0.941 | 0.898 | - | - |
+ | MCPF-IQA | - | 0.918 | 0.872 | - |
+ | Q-Adapt | 0.769 | 0.878 | 0.757 | - |
+
+ ### Comparison with Specialized NR-IQA Models
+
+ | Method | KADID-10k SROCC | KonIQ-10k SROCC |
+ |--------|-----------------|-----------------|
+ | **Our Method** | **0.932** | 0.852 |
+ | HyperIQA | 0.872 | 0.906 |
+ | TreS | 0.858 | 0.928 |
+ | UNIQUE | 0.878 | 0.896 |
+ | MUSIQ | - | 0.916 |
+ | DBCNN | 0.878 | 0.864 |
+
+ ## ✍️ Citation
+
+ If you use this work, please cite our paper:
+
+ ```bibtex
+ @article{benmahaneHassouni2025mplugvpiqa,
+   title   = {Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA},
+   author  = {Benmahane, Yahya and El Hassouni, Mohammed},
+   journal = {arXiv preprint arXiv:2509.03494},
+   year    = {2025},
+   url     = {https://arxiv.org/abs/2509.03494}
+ }
+ ```
+
+ ## 📚 Acknowledgments
+
+ - [mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl) for the base multimodal LLM
+ - HuggingFace Transformers for the training framework
+ - [Bahng et al. (2022)](https://arxiv.org/abs/2203.17274) for the original pixel-space visual prompting approach