---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- evoquality
- image-quality-assessment
- vlm
- multimodal
---

# EvoQuality

## 1. Model Overview

- **Model Name**: EvoQuality (Self-Evolving VLM for Image Quality Assessment)
- **Task**: No-Reference Image Quality Assessment (NR-IQA), supporting both single-image quality scoring and pairwise quality comparison (ranking)
- **Core Idea**: EvoQuality uses no human-annotated quality scores or distortion-type labels. It generates pseudo-ranking labels via **pairwise majority voting**, converts them into a reward signal, and optimizes that reward with **GRPO** to iteratively improve its own quality perception
- **Paper**: [Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking](https://openreview.net/forum?id=INOi0YqI8p) (ICLR 2026, arXiv:2509.25787)

## 2. Model and Framework Details

- **Backbone Model (paper setting)**: `Qwen2.5-VL-7B` (used as the baseline policy)
- **Training Paradigm**: Two-stage cycle that supports multi-round iteration (`T=2` in the paper)
  - **Offline Stage (Pseudo-label)**: Run `K` comparisons on each randomly sampled image pair, then derive the pseudo-preference `p*(xi, xj)` via majority voting
  - **Online Stage (RL)**: Convert pseudo-preferences into a fidelity reward and update the policy via **Group Relative Policy Optimization (GRPO)** (full fine-tuning of the VLM)

## 3. Prompts

- **Offline Comparison** **`c_compare`**:
  - `You are performing an image quality assessment task. Compare the two images and decide which one has better perceptual quality. Answer strictly with the index of the better image: 0 if the first image is better, or 1 if the second image is better.`
- **Online Scoring** **`c_score`**:
  - `You are doing the image quality assessment task. Here is the question: What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality.`
- **Reasoning Suffix (for self-consistency sampling)**:
  - `You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in boxed{}.`

## 4. Training

- **Number of Iterations**: `T = 1` (the open-sourced model weights are the result of the first round of self-evolution)
- **Training Data**: No additional synthetic distortion data and no extra annotated labels were used to produce the released weights
- **Offline Stage**:
  - Sample `K=32` responses per pair, then derive pseudo-labels via majority voting
  - Randomly swap image order to mitigate positional bias
- **Online Stage (GRPO)**:
  - Sample `K=32` responses per sample (`c_score`)
  - Optimizer: AdamW, initial learning rate `3e-7`, with linear decay
  - KL coefficient: `beta = 0.05`
  - Resources (as reported in the paper): 8x NVIDIA A100, per-GPU batch size = 4, ~12 hours/epoch

## 5. Evaluation Metrics

- **Evaluation Setting**: zero-shot (no training on the target test sets)
- **Metrics**: PLCC and SRCC, which measure linear and rank correlation between predicted scores and human subjective quality ratings

## 6. Main Results

- **Improvement over the Backbone Model (Qwen2.5-VL-7B)**: weighted average (WAVG) over multiple benchmarks
  - PLCC: `0.615 -> 0.770` (+31.8%)
  - SRCC: `0.570 -> 0.726` (+33.7%)
- **Generalization**: Achieves significant improvements across diverse distortion types and AI-generated content, matching or surpassing several supervised VLM-IQA approaches on multiple benchmarks (see the paper for detailed tables)

## 7. Intended Use and Usage Guidelines

- **Recommended Use**
  - Research and evaluation: NR-IQA, cross-dataset generalization comparison, quality ranking/filtering, auxiliary signals for data cleaning
  - Pre-production assessment: as a perceptual quality proxy, combined with business data and manual spot-check validation
- **Not Recommended Use**
  - As the sole quality criterion for high-stakes decisions (content moderation, medical imaging diagnostic conclusions, legal evidence adjudication, etc.)
  - Treating model outputs as "absolute objective ground truth" (IQA is inherently subjective and correlated with population preferences)
- **Output Notes**
  - The paper's prompts require outputs in the form of `<think>...</think>` followed by `boxed{score}`. When integrating the model, parse only the value inside `boxed{}` and account for how temperature and sampling strategy affect consistency

## 8. Limitations and Known Risks

- **Self-supervised Pseudo-label Bias**: Pseudo-rankings are derived from the model's own votes, which may amplify the systematic preferences or blind spots of the backbone model
- **Domain Shift**: May fail on images from specialized domains (medical, remote sensing, industrial inspection)
- **Subjectivity and Population Differences**: Different cultural/aesthetic preferences and task objectives (aesthetics vs. clarity) can change the definition of "quality"
- **Prompt Sensitivity**: Variations in prompts, sampling count `K`, and decoding strategies can affect self-consistency voting and final performance
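
The offline stage described in Sections 2 and 4 (`K` comparisons per pair, random order swapping, majority vote) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: `compare_fn` is a hypothetical callable standing in for one sampled model response to `c_compare`, and discarding ties and unparseable answers is an assumption, since this card does not say how those cases are handled.

```python
import random
from collections import Counter
from typing import Callable, Optional


def majority_vote_preference(
    compare_fn: Callable[[str, str], int],  # hypothetical: one sampled c_compare answer
    image_i: str,
    image_j: str,
    k: int = 32,
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Return 0 if image_i wins the vote, 1 if image_j wins, None on a tie."""
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(k):
        # Randomly swap presentation order to reduce positional bias.
        swapped = rng.random() < 0.5
        first, second = (image_j, image_i) if swapped else (image_i, image_j)
        answer = compare_fn(first, second)  # 0: first shown is better, 1: second is better
        if answer not in (0, 1):
            continue  # assumption: unparseable responses are skipped
        # Map the answer back to the original (i, j) ordering.
        winner = 1 - answer if swapped else answer
        votes[winner] += 1
    if votes[0] == votes[1]:
        return None  # assumption: tied pairs yield no pseudo-label
    return 0 if votes[0] > votes[1] else 1
```

In practice `compare_fn` would wrap a sampled generation from the VLM. Passing a seeded `random.Random` makes the order swaps reproducible, which helps when studying how `K` affects vote stability.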

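The Output Notes in Section 7 recommend parsing only the value inside `boxed{}`. One possible way to do that is sketched below. The helper names `parse_boxed_score` and `mean_score` are hypothetical, and several choices are assumptions rather than documented behavior: accepting an optional leading backslash, taking the last `boxed{}` occurrence, rejecting values outside the 1 to 5 range stated in `c_score`, and averaging several sampled responses to smooth out decoding variance.

```python
import re
from statistics import mean
from typing import Iterable, Optional

# Matches "boxed{4.25}" and also "\boxed{4.25}" (the backslash is optional).
_BOXED = re.compile(r"\\?boxed\{\s*([-+]?\d+(?:\.\d+)?)\s*\}")


def parse_boxed_score(text: str, lo: float = 1.0, hi: float = 5.0) -> Optional[float]:
    """Extract the last boxed{} number; return None if absent or out of range."""
    matches = _BOXED.findall(text)
    if not matches:
        return None
    score = float(matches[-1])
    # Assumption: out-of-range values are treated as invalid instead of clamped.
    return score if lo <= score <= hi else None


def mean_score(responses: Iterable[str]) -> Optional[float]:
    """Average the valid scores across several sampled responses."""
    scores = [s for s in map(parse_boxed_score, responses) if s is not None]
    return mean(scores) if scores else None
```

Returning `None` for invalid outputs lets the caller decide whether to resample or skip the image.
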
## 9. Citation

```bibtex
@article{wen2025selfevolving,
  title={Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking},
  author={Wen, Wen and Zhi, Tianwu and Fan, Kanglong and Li, Yang and Peng, Xinge and Zhang, Yabin and Liao, Yiting and Li, Junlin and Zhang, Li},
  journal={arXiv preprint arXiv:2509.25787},
  year={2025}
}
```