---
pipeline_tag: any-to-any
---
# CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Official implementation of **CLEAR**, a unified multimodal model that leverages generative capabilities (image restoration) to improve visual understanding of degraded images.

[**Paper**](https://arxiv.org/abs/2604.04780) | [**Project Page**](https://haoxiangzhao12138.github.io/CLEAR/) | [**GitHub**](https://github.com/haoxiangzhao12138/CLEAR)

## Introduction
Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. CLEAR (Corruption-aware interleaved reasoning) is a framework that connects the understanding and generation pathways through three progressive stages:
1. **Stage 1 — SFT**: Corruption-aware supervised fine-tuning with interleaved `<think>` / `<image_restore>` / `<answer>` reasoning to establish the reasoning pattern.
2. **Stage 2 — Bridge Training**: A latent representation bridge that maps denoised VAE latents directly back into the LLM's token space, avoiding costly decode-reencode.
3. **Stage 3 — Interleaved GRPO**: A reinforcement learning method (Group Relative Policy Optimization) that jointly optimizes text reasoning and visual generation under rewards for accuracy, format, decision, and latent quality.
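The interleaved format from Stage 1 can be illustrated with a small parser. The tag names come from the description above; the exact output grammar of CLEAR (and the sample response) are assumptions made for this sketch.

```python
import re

# Matches one tagged segment of the interleaved reasoning format.
# Tag names follow the Stage 1 description; everything else here is illustrative.
TAG_PATTERN = re.compile(r"<(think|image_restore|answer)>(.*?)</\1>", re.DOTALL)

def parse_interleaved(output: str) -> list[tuple[str, str]]:
    """Split a model response into ordered (tag, content) segments."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(output)]

# Hypothetical model response for a blurred input image.
sample = (
    "<think>The image looks blurred; restore it before answering.</think>"
    "<image_restore>[restored latent tokens]</image_restore>"
    "<answer>A red bus on a rainy street.</answer>"
)
segments = parse_interleaved(sample)
```

A downstream loop can then dispatch on the tag: `think` and `answer` segments stay in the text stream, while `image_restore` segments route through the latent bridge of Stage 2.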
CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.
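The Stage 3 objective combines four reward terms and, following the generic GRPO recipe, normalizes each sampled rollout's reward within its group. A minimal sketch, assuming equal-ish weights and the standard group-relative normalization; the actual weights and reward definitions in CLEAR are not specified here and are our own placeholders.

```python
from statistics import mean, pstdev

def total_reward(acc, fmt, dec, latent, w=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of the four reward terms named in Stage 3.
    The weights are illustrative assumptions, not CLEAR's values."""
    return w[0] * acc + w[1] * fmt + w[2] * dec + w[3] * latent

def group_relative_advantages(rewards):
    """GRPO advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Three hypothetical rollouts for the same degraded-image query.
rollout_rewards = [
    total_reward(acc=1, fmt=1, dec=1, latent=0.8),  # correct, well-formed
    total_reward(acc=0, fmt=1, dec=0, latent=0.3),  # wrong answer
    total_reward(acc=1, fmt=0, dec=1, latent=0.6),  # correct but malformed
]
advantages = group_relative_advantages(rollout_rewards)
```

The group-relative normalization means only the *relative* quality of rollouts matters, which removes the need for a separate value model.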
## MMD-Bench
We introduce **MMD-Bench**, a comprehensive degradation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels.
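As an illustration of the corruption-type/severity structure, here is one corruption (Gaussian noise, a typical Capture-category degradation) applied at three severity levels. The sigma values per level are our own illustrative choices, not MMD-Bench's official parameters.

```python
import numpy as np

# Illustrative noise strength per severity level (1 = mild, 3 = severe).
# These values are assumptions, not the benchmark's settings.
SEVERITY_SIGMA = {1: 0.04, 2: 0.08, 3: 0.16}

def gaussian_noise(img: np.ndarray, severity: int, seed: int = 0) -> np.ndarray:
    """Apply additive Gaussian noise to a float image in [0, 1], shape (H, W, C)."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, SEVERITY_SIGMA[severity], img.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((64, 64, 3), 0.5)          # placeholder gray image
corrupted = gaussian_noise(clean, severity=3)
```

A full 16 × 3 benchmark grid would pair fifteen more such transforms (e.g. blur, JPEG compression, low light) with the same three-level severity scheme.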
## Citation
```bibtex
@misc{hao2026clearunlockinggenerativepotential,
  title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
  author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
  year={2026},
  eprint={2604.04780},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.04780},
}
```
## Acknowledgments

CLEAR is built upon [BAGEL](https://github.com/ByteDance-Seed/BAGEL) by ByteDance Seed. We thank the open-source community for [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [HuggingFace Transformers](https://github.com/huggingface/transformers), and [TRL](https://github.com/huggingface/trl).