---
pipeline_tag: any-to-any
---
# CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Official implementation of **CLEAR**, a unified multimodal model that leverages generative capabilities (image restoration) to improve visual understanding of degraded images.

[**Paper**](https://arxiv.org/abs/2604.04780) | [**Project Page**](https://haoxiangzhao12138.github.io/CLEAR/) | [**GitHub**](https://github.com/haoxiangzhao12138/CLEAR)

## Introduction
Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. CLEAR (Corruption-aware interleaved reasoning) is a framework that connects the understanding and generation pathways through three progressive stages:
1. **Stage 1 — SFT**: Corruption-aware supervised fine-tuning with interleaved `<think>` / `<image_restore>` / `<answer>` reasoning to establish the reasoning pattern.
2. **Stage 2 — Bridge Training**: A latent representation bridge that maps denoised VAE latents directly back into the LLM's token space, avoiding costly decode-reencode.
3. **Stage 3 — Interleaved GRPO**: A reinforcement learning method (Group Relative Policy Optimization) that jointly optimizes text reasoning and visual generation under rewards for accuracy, format, decision, and latent quality.
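The interleaved format from Stage 1 can be illustrated with a small parser. The tag names come from the description above; the exact output grammar of CLEAR (and the sample response) are assumptions made for this sketch.

```python
import re

# Matches one tagged segment of the interleaved reasoning format.
# Tag names follow the Stage 1 description; everything else here is illustrative.
TAG_PATTERN = re.compile(r"<(think|image_restore|answer)>(.*?)</\1>", re.DOTALL)

def parse_interleaved(output: str) -> list[tuple[str, str]]:
    """Split a model response into ordered (tag, content) segments."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(output)]

# Hypothetical model response for a blurred input image.
sample = (
    "<think>The image looks blurred; restore it before answering.</think>"
    "<image_restore>[restored latent tokens]</image_restore>"
    "<answer>A red bus on a rainy street.</answer>"
)
segments = parse_interleaved(sample)
```

A downstream loop can then dispatch on the tag: `think` and `answer` segments stay in the text stream, while `image_restore` segments route through the latent bridge of Stage 2.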
CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.
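The Stage 3 objective combines four reward terms and, following the generic GRPO recipe, normalizes each sampled rollout's reward within its group. A minimal sketch, assuming equal-ish weights and the standard group-relative normalization; the actual weights and reward definitions in CLEAR are not specified here and are our own placeholders.

```python
from statistics import mean, pstdev

def total_reward(acc, fmt, dec, latent, w=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of the four reward terms named in Stage 3.
    The weights are illustrative assumptions, not CLEAR's values."""
    return w[0] * acc + w[1] * fmt + w[2] * dec + w[3] * latent

def group_relative_advantages(rewards):
    """GRPO advantage: standardize each reward within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Three hypothetical rollouts for the same degraded-image query.
rollout_rewards = [
    total_reward(acc=1, fmt=1, dec=1, latent=0.8),  # correct, well-formed
    total_reward(acc=0, fmt=1, dec=0, latent=0.3),  # wrong answer
    total_reward(acc=1, fmt=0, dec=1, latent=0.6),  # correct but malformed
]
advantages = group_relative_advantages(rollout_rewards)
```

The group-relative normalization means only the *relative* quality of rollouts matters, which removes the need for a separate value model.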
## MMD-Bench
We introduce **MMD-Bench**, a comprehensive degradation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels.
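As an illustration of the corruption-type/severity structure, here is one corruption (Gaussian noise, a typical Capture-category degradation) applied at three severity levels. The sigma values per level are our own illustrative choices, not MMD-Bench's official parameters.

```python
import numpy as np

# Illustrative noise strength per severity level (1 = mild, 3 = severe).
# These values are assumptions, not the benchmark's settings.
SEVERITY_SIGMA = {1: 0.04, 2: 0.08, 3: 0.16}

def gaussian_noise(img: np.ndarray, severity: int, seed: int = 0) -> np.ndarray:
    """Apply additive Gaussian noise to a float image in [0, 1], shape (H, W, C)."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, SEVERITY_SIGMA[severity], img.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((64, 64, 3), 0.5)          # placeholder gray image
corrupted = gaussian_noise(clean, severity=3)
```

A full 16 × 3 benchmark grid would pair fifteen more such transforms (e.g. blur, JPEG compression, low light) with the same three-level severity scheme.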
## Citation
```bibtex
@misc{hao2026clearunlockinggenerativepotential,
  title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
  author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
  year={2026},
  eprint={2604.04780},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.04780},
}
```
## Acknowledgments

CLEAR is built upon [BAGEL](https://github.com/ByteDance-Seed/BAGEL) by ByteDance Seed. We thank the open-source community for [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [HuggingFace Transformers](https://github.com/huggingface/transformers), and [TRL](https://github.com/huggingface/trl).