---
pipeline_tag: any-to-any
---
# CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
Official implementation of **CLEAR**, a unified multimodal model that leverages generative capabilities (image restoration) to improve visual understanding of degraded images.
[**Paper**](https://arxiv.org/abs/2604.04780) | [**Project Page**](https://haoxiangzhao12138.github.io/CLEAR/) | [**GitHub**](https://github.com/haoxiangzhao12138/CLEAR)
## Introduction
Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. CLEAR (Corruption-aware interleaved reasoning) is a framework that connects the understanding and generation pathways through three progressive stages:
1. **Stage 1 — SFT**: Corruption-aware supervised fine-tuning with interleaved `<think>` / `<image_restore>` / `<answer>` reasoning to establish the reasoning pattern.
2. **Stage 2 — Bridge Training**: A latent representation bridge that maps denoised VAE latents directly back into the LLM's token space, avoiding costly decode-reencode.
3. **Stage 3 — Interleaved GRPO**: A reinforcement learning method (Group Relative Policy Optimization) that jointly optimizes text reasoning and visual generation under rewards for accuracy, format, decision, and latent quality.
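The Stage 3 reward combination and group-relative advantage computation can be sketched as follows. This is an illustrative sketch only: the reward weights, function names, and the plain weighted-sum formulation are assumptions, not taken from the paper.

```python
import statistics

# Hypothetical weights for the four reward terms named above
# (accuracy, format, decision, latent quality) -- illustrative values.
DEFAULT_WEIGHTS = (1.0, 0.5, 0.5, 0.5)

def combined_reward(accuracy, fmt, decision, latent_quality,
                    weights=DEFAULT_WEIGHTS):
    """Weighted sum of the four reward terms (assumed formulation)."""
    terms = (accuracy, fmt, decision, latent_quality)
    return sum(w * r for w, r in zip(weights, terms))

def group_advantages(rewards):
    """Group-relative advantages as in GRPO: rewards for a sampled
    group of rollouts are mean-centered and scaled by the group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

In GRPO, the normalized advantage of each rollout within its group replaces a learned value baseline, which is what makes the method practical for jointly scoring interleaved text and image outputs.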
CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.
## MMD-Bench
The authors propose **MMD-Bench**, a comprehensive degradation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels.
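To illustrate how a severity-graded corruption might look, the sketch below applies additive Gaussian noise at three levels. MMD-Bench's actual corruption implementations and severity parameters are not specified here; the noise standard deviations and function names are hypothetical.

```python
import random

# Hypothetical noise std per severity level (1 = mild, 3 = severe).
SEVERITY_SIGMA = {1: 8.0, 2: 16.0, 3: 32.0}

def add_gaussian_noise(pixels, severity=1, seed=0):
    """Add Gaussian noise to flat 8-bit pixel values, clamped to [0, 255].

    A stand-in for one "Capture"-category corruption; real benchmark
    corruptions would operate on full images, not flat pixel lists.
    """
    rng = random.Random(seed)  # seeded for reproducible benchmark samples
    sigma = SEVERITY_SIGMA[severity]
    noisy = []
    for p in pixels:
        q = p + rng.gauss(0.0, sigma)
        noisy.append(max(0, min(255, round(q))))
    return noisy
```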
## Citation
```bibtex
@misc{hao2026clearunlockinggenerativepotential,
  title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
  author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
  year={2026},
  eprint={2604.04780},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.04780},
}
```
## Acknowledgments
CLEAR is built upon [BAGEL](https://github.com/ByteDance-Seed/BAGEL) by ByteDance Seed. We thank the open-source community for [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [HuggingFace Transformers](https://github.com/huggingface/transformers), and [TRL](https://github.com/huggingface/trl).