Improve model card: Update pipeline tag, add license, and enhance content with usage and results

#1 by nielsr HF Staff - opened

Files changed (1)
  1. README.md +195 -2
README.md CHANGED
@@ -3,16 +3,34 @@ datasets:
  - chaofengc/IQA-PyTorch-Datasets
  language:
  - en
- pipeline_tag: visual-question-answering
  library_name: transformers
  ---
  # Visual Prompt Checkpoints for NR-IQA
  🔬 **Paper**: [Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA](https://arxiv.org/abs/2509.03494)
  💻 **Code**: [GitHub Repository](https://github.com/yahya-ben/mplug2-vp-for-nriqa)

  ## Overview
  Pre-trained visual prompt checkpoints for **No-Reference Image Quality Assessment (NR-IQA)** using mPLUG-Owl2-7B. Achieves competitive performance with only **~600K parameters** vs 7B+ for full fine-tuning.

  ## Available Checkpoints
  **Download**: `visual_prompt_ckpt_trained_on_mplug2.zip`

@@ -22,4 +40,179 @@ Pre-trained visual prompt checkpoints for **No-Reference Image Quality Assessment
  | KonIQ-10k | 0.852 | `SGD_mplug2_exp_05_koniq_padding_30px_add/` |
  | AGIQA-3k | 0.810 | `SGD_mplug2_exp_06_agiqa_padding_30px_add/` |

- **📖 For detailed setup, training, and usage instructions, see the [GitHub repository](https://github.com/your-username/visual-prompt-nr-iqa).**

  - chaofengc/IQA-PyTorch-Datasets
  language:
  - en
  library_name: transformers
+ pipeline_tag: image-text-to-text
+ license: apache-2.0
  ---
+
+ ![Python](https://img.shields.io/badge/python-3.10-blue) ![HuggingFace](https://img.shields.io/badge/hub-checkpoints-orange) [![arXiv](https://img.shields.io/badge/arXiv-2509.03494-lightgrey)](https://arxiv.org/abs/2509.03494)
+
  # Visual Prompt Checkpoints for NR-IQA
  🔬 **Paper**: [Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA](https://arxiv.org/abs/2509.03494)
  💻 **Code**: [GitHub Repository](https://github.com/yahya-ben/mplug2-vp-for-nriqa)

+ ## Abstract
+ In this paper, we propose a novel parameter-efficient adaptation method for No-Reference Image Quality Assessment (NR-IQA) using visual prompts optimized in pixel space. Unlike full fine-tuning of Multimodal Large Language Models (MLLMs), our approach trains at most 600K parameters (< 0.01% of the base model) while keeping the underlying model fully frozen. During inference, these visual prompts are combined with images via addition and processed by mPLUG-Owl2 with the textual query "Rate the technical quality of the image." Evaluations across distortion types (synthetic, realistic, AI-generated) on KADID-10k, KonIQ-10k, and AGIQA-3k demonstrate competitive performance against fully fine-tuned methods and specialized NR-IQA models, achieving 0.93 SRCC on KADID-10k. To our knowledge, this is the first work to leverage pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation for low-level vision tasks. The source code is publicly available at https://github.com/yahya-ben/mplug2-vp-for-nriqa.
+
  ## Overview
  Pre-trained visual prompt checkpoints for **No-Reference Image Quality Assessment (NR-IQA)** using mPLUG-Owl2-7B. Achieves competitive performance with only **~600K parameters** vs 7B+ for full fine-tuning.
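
For intuition, the pixel-space prompting idea can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the repository's implementation: the class name, the 448x448 input resolution, and the masking details are assumptions, and in the actual pipeline the prompt is applied to the preprocessed image before it is passed to the frozen mPLUG-Owl2.

```python
import torch
import torch.nn as nn

class PaddingPrompt(nn.Module):
    """Learnable pixel border that is added to the input image (illustrative)."""

    def __init__(self, pad: int = 30, size: int = 448):
        super().__init__()
        # A full-size learnable tensor has 3 * 448 * 448 = 602,112 values,
        # i.e. roughly the "~600K parameters at most" figure; the mask below
        # restricts gradient flow to the border strip for the padding variant.
        self.prompt = nn.Parameter(torch.zeros(3, size, size))
        mask = torch.zeros(3, size, size)
        mask[:, :pad, :] = 1
        mask[:, -pad:, :] = 1
        mask[:, :, :pad] = 1
        mask[:, :, -pad:] = 1
        self.register_buffer("mask", mask)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # The "add" combination: prompted image = image + visual prompt,
        # while the MLLM itself stays completely frozen.
        return image + self.mask * self.prompt

prompt = PaddingPrompt(pad=30, size=448)
images = torch.rand(2, 3, 448, 448)   # stand-in for a preprocessed image batch
prompted = prompt(images)             # fed to mPLUG-Owl2 with the query
                                      # "Rate the technical quality of the image."
print(sum(p.numel() for p in prompt.parameters()))  # 602112
```

The Fixed Patch and Full Overlay variants reported below differ only in which pixels the prompt is allowed to modify.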

+ ## 🔥 Key Features
+
+ - **Parameter-Efficient**: Only ~600K trainable parameters vs 7B+ for full fine-tuning
+ - **Competitive Performance**: Achieves 0.93 SROCC on the KADID-10k dataset
+ - **Multiple Visual Prompt Types**: Padding, Fixed Patches (Center/Top-Left), Full Overlay
+ - **Multiple MLLM Support**: mPLUG-Owl2-7B
+ - **Comprehensive Evaluation**: Supports KADID-10k, KonIQ-10k, and AGIQA-3k datasets
+ - **Pre-trained Checkpoints**: Available on HuggingFace Hub for immediate use
+
+ ![Method Overview](https://github.com/yahya-ben/mplug2-vp-for-nriqa/raw/main/hero_figure.png)
+
  ## Available Checkpoints
  **Download**: `visual_prompt_ckpt_trained_on_mplug2.zip`

  | KonIQ-10k | 0.852 | `SGD_mplug2_exp_05_koniq_padding_30px_add/` |
  | AGIQA-3k | 0.810 | `SGD_mplug2_exp_06_agiqa_padding_30px_add/` |

+ ## 🏃 Usage
+ This section provides instructions for setting up the environment, preparing datasets, and running inference with the pre-trained visual prompt checkpoints. For detailed setup, training, and further usage instructions, please refer to the [GitHub repository](https://github.com/yahya-ben/mplug2-vp-for-nriqa).
+
+ ### Prerequisites
+
+ - Python 3.10+
+ - CUDA-capable GPU (tested on NVIDIA RTX A6000)
+ - PyTorch
+ - HuggingFace Transformers
+
+ ### Setup Environment
+
+ #### For mPLUG-Owl2:
+ ```bash
+ # Clone and setup mPLUG-Owl2
+ git clone https://github.com/X-PLUG/mPLUG-Owl.git
+ cd mPLUG-Owl/mPLUG-Owl2
+ conda create -n mplug_owl2 python=3.10 -y
+ conda activate mplug_owl2
+ pip install --upgrade pip
+ pip install -e .
+ pip install 'numpy<2'
+ pip install protobuf
+ ```
+
+ #### Additional Dependencies:
+ ```bash
+ pip install PyYAML scikit-learn tqdm
+ ```
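
After installation, a quick sanity check along these lines can confirm the environment is usable. The `mplug_owl2` module name comes from the editable install of the mPLUG-Owl2 repository above; the script itself is only a suggestion, not part of the repository.

```python
# env_check.py -- minimal sanity check for the environment set up above
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())

# `pip install -e .` inside mPLUG-Owl/mPLUG-Owl2 exposes the mplug_owl2 package.
import mplug_owl2  # noqa: F401
print("mPLUG-Owl2 package imported successfully")
```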
+
+ ### Dataset Setup
+
+ Download the required IQA datasets:
+
+ ```bash
+ # KonIQ-10k
+ wget https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets/resolve/main/koniq10k.tgz
+ tar -xzf koniq10k.tgz
+
+ # KADID-10k
+ wget https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets/resolve/main/kadid10k.tgz
+ tar -xzf kadid10k.tgz
+
+ # AGIQA-3K
+ wget https://huggingface.co/datasets/chaofengc/IQA-PyTorch-Datasets/resolve/main/AGIQA-3K.zip
+ unzip AGIQA-3K.zip
+ ```
+
+ After extraction, organize your datasets in the `data/` folder as follows:
+
+ ```
+ data/
+ ├── kadid10k/
+ │   ├── images/              # All KADID-10k images
+ │   └── split_kadid10k.csv
+ ├── koniq10k/
+ │   ├── 512x384/             # KonIQ-10k images (comes with own split)
+ │   └── koniq10k_*.csv       # Original split files
+ └── AGIQA-3K/
+     ├── images/              # All AGIQA-3k images
+     └── split_agiqa3k.csv
+ ```
+
+ **Important Notes:**
+ - **KADID-10k**: Move `split_kadid10k.csv` into the `kadid10k/` folder
+ - **KonIQ-10k**: Uses its own original split files, no need to move
+ - **AGIQA-3k**: Move `split_agiqa3k.csv` into the `AGIQA-3K/` folder; images are in the `images/` subfolder
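
A small check like the one below (paths taken directly from the layout above; the script name and structure are illustrative) can catch misplaced split files before running anything:

```python
# check_data_layout.py -- verify the dataset layout described above
from pathlib import Path

EXPECTED = {
    "kadid10k": ["images", "split_kadid10k.csv"],
    "koniq10k": ["512x384"],
    "AGIQA-3K": ["images", "split_agiqa3k.csv"],
}

root = Path("data")
for dataset, entries in EXPECTED.items():
    for entry in entries:
        path = root / dataset / entry
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:7s} {path}")
```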
+
+ ### Pre-trained Checkpoints
+
+ We provide pre-trained visual prompt checkpoints on **HuggingFace Hub** for immediate use:
+
+ 🔗 **[Download Checkpoints](https://huggingface.co/yahya007/mplug2-vp-for-nriqa/tree/main)**
+
+ The checkpoints are provided as `visual_prompt_ckpt_trained_on_mplug2.zip`, which contains training experiment folders with checkpoint directories (`checkpoint-xxxx`). Each experiment folder contains multiple epochs, and the best-performing checkpoint can be identified from the `best_model_checkpoint` info in the final checkpoint folder.
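
Since the experiment folders follow the Hugging Face `Trainer` checkpoint layout, the `best_model_checkpoint` entry can usually be read from `trainer_state.json` inside the last checkpoint directory. The snippet below is a sketch based on that convention (the experiment folder name is taken from the table above); if a folder stores this information differently, inspect it manually as in step 2 below.

```python
# find_best_checkpoint.py -- locate best_model_checkpoint (assumes the HF Trainer layout)
import json
from pathlib import Path

exp_dir = Path("SGD_mplug2_exp_04_kadid_padding_30px_add")

# Keep only numbered checkpoint folders and pick the latest one.
checkpoints = sorted(
    (p for p in exp_dir.glob("checkpoint-*") if p.name.split("-")[-1].isdigit()),
    key=lambda p: int(p.name.split("-")[-1]),
)
state = json.loads((checkpoints[-1] / "trainer_state.json").read_text())
print("Best checkpoint:", state.get("best_model_checkpoint"))
```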
+
+ To use the pre-trained checkpoints:
+
+ 1. **Download and extract the checkpoint archive**:
+ ```bash
+ # Download from HuggingFace Hub
+ wget https://huggingface.co/yahya007/mplug2-vp-for-nriqa/resolve/main/visual_prompt_ckpt_trained_on_mplug2.zip
+ unzip visual_prompt_ckpt_trained_on_mplug2.zip
+ ```
+
+ 2. **Navigate to the desired experiment folder**:
+ ```bash
+ cd SGD_mplug2_exp_04_kadid_padding_30px_add/
+ # Check the latest checkpoint folder (highest number)
+ ls -la checkpoint-*/
+ # Look for best_model_checkpoint info in the final checkpoint
+ ```
+
+ 3. **Update the configuration and checkpoint** in `src/tester.py`:
+ ```python
+ # Update config path
+ config_path = "configs/final_mplug_owl2_configs/SGD_mplug2_exp_04_kadid_padding_30px_add.yaml"
+
+ # Update checkpoint name - use "checkpoint-best" or a specific checkpoint number
+ checkpoint_best = "checkpoint-best"  # or "checkpoint-XXXX" for a specific epoch
+ ```
+
+ 4. **Run inference**:
+ ```bash
+ cd src
+ python tester.py
+ ```
+
+ The inference script outputs:
+ - SRCC (Spearman Rank Correlation Coefficient)
+ - PLCC (Pearson Linear Correlation Coefficient)
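
Both metrics compare the predicted quality scores against the ground-truth mean opinion scores and can be reproduced with SciPy (pulled in as a dependency of scikit-learn above). The score values here are made up purely for illustration:

```python
# srcc_plcc.py -- the two correlation metrics reported by the tester
from scipy.stats import spearmanr, pearsonr

predicted = [3.1, 4.2, 2.0, 4.8, 3.7]   # model outputs (illustrative)
mos       = [3.0, 4.5, 1.8, 4.9, 3.5]   # ground-truth mean opinion scores (illustrative)

srcc, _ = spearmanr(predicted, mos)     # rank correlation (SRCC)
plcc, _ = pearsonr(predicted, mos)      # linear correlation (PLCC)
print(f"SRCC: {srcc:.3f}  PLCC: {plcc:.3f}")
```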
+
+ ## 📈 Results
+
+ ### Best Performance (30px Padding + Addition)
+
+ | Dataset | SROCC | PLCC | Parameters |
+ |---------|-------|------|------------|
+ | KADID-10k | 0.932 | 0.929 | ~600K |
+ | KonIQ-10k | 0.852 | 0.874 | ~600K |
+ | AGIQA-3k | 0.810 | 0.860 | ~600K |
+
+ ### Performance Across Visual Prompt Types
+
+ | Prompt Type | Size | KADID-10k SROCC | KonIQ-10k SROCC | AGIQA-3k SROCC |
+ |-------------|------|-----------------|-----------------|----------------|
+ | Padding | 10px | 0.880 | 0.805 | 0.802 |
+ | Padding | 30px | **0.932** | **0.852** | **0.810** |
+ | Fixed Patch (Center) | 10px | 0.390 | 0.487 | 0.435 |
+ | Fixed Patch (Center) | 30px | 0.806 | 0.647 | 0.725 |
+ | Fixed Patch (Top-Left) | 10px | 0.465 | 0.551 | 0.564 |
+ | Fixed Patch (Top-Left) | 30px | 0.520 | 0.635 | 0.755 |
+ | Full Overlay | - | 0.887 | 0.693 | 0.624 |
+
+ ### Comparison with State-of-the-Art Methods
+
+ | Method | KADID-10k SROCC | KonIQ-10k SROCC | AGIQA-3k SROCC | Parameters |
+ |--------|-----------------|-----------------|----------------|------------|
+ | **Our Method** | **0.932** | **0.852** | **0.810** | ~600K |
+ | Q-Align | 0.919 | 0.940 | 0.727 | 7B |
+ | Q-Instruct | 0.706 | 0.911 | 0.772 | 7B |
+ | LIQE | 0.930 | 0.919 | - | - |
+ | MP-IQE | 0.941 | 0.898 | - | - |
+ | MCPF-IQA | - | 0.918 | 0.872 | - |
+ | Q-Adapt | 0.769 | 0.878 | 0.757 | - |
+
+ ### Comparison with Specialized NR-IQA Models
+
+ | Method | KADID-10k SROCC | KonIQ-10k SROCC |
+ |--------|-----------------|-----------------|
+ | **Our Method** | **0.932** | 0.852 |
+ | HyperIQA | 0.872 | 0.906 |
+ | TreS | 0.858 | 0.928 |
+ | UNIQUE | 0.878 | 0.896 |
+ | MUSIQ | - | 0.916 |
+ | DBCNN | 0.878 | 0.864 |
+
+ ## ✍️ Citation
+
+ If you use this work, please cite our paper:
+
+ ```bibtex
+ @article{benmahaneHassouni2025mplugvpiqa,
+   title   = {Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA},
+   author  = {Benmahane, Yahya and El Hassouni, Mohammed},
+   journal = {arXiv preprint arXiv:2509.03494},
+   year    = {2025},
+   url     = {https://arxiv.org/abs/2509.03494}
+ }
+ ```
+
+ ## 📚 Acknowledgments
+
+ - [mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl) for the base multimodal LLM
+ - HuggingFace Transformers for the training framework
+ - [Bahng et al. (2022)](https://arxiv.org/abs/2203.17274) for the original pixel-space visual prompting approach