Add model card and metadata for CLEAR

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +41 -0
README.md ADDED
@@ -0,0 +1,41 @@
+ ---
+ pipeline_tag: any-to-any
+ ---
+
+ # CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
+
+ Official implementation of **CLEAR**, a unified multimodal model that leverages generative capabilities (image restoration) to improve visual understanding of degraded images.
+
+ [**Paper**](https://arxiv.org/abs/2604.04780) | [**Project Page**](https://haoxiangzhao12138.github.io/CLEAR/) | [**GitHub**](https://github.com/haoxiangzhao12138/CLEAR)
+
+ ## Introduction
+
+ Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. CLEAR (Corruption-aware interleaved reasoning) is a framework that connects the understanding and generation pathways through three progressive stages:
+
+ 1. **Stage 1 (SFT)**: Corruption-aware supervised fine-tuning with interleaved `<think>` / `<image_restore>` / `<answer>` reasoning to establish the reasoning pattern.
+ 2. **Stage 2 (Bridge Training)**: A latent representation bridge that maps denoised VAE latents directly back into the LLM's token space, avoiding a costly decode and re-encode step.
+ 3. **Stage 3 (Interleaved GRPO)**: A reinforcement learning method (Group Relative Policy Optimization) that jointly optimizes text reasoning and visual generation under rewards for accuracy, format, decision, and latent quality.
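
As an illustration of the interleaved output format and of the group-relative scoring that GRPO relies on, the sketch below parses a response with `<think>`, `<image_restore>`, and `<answer>` tags, assigns a simple format reward, and normalizes rewards within one group of sampled responses. This is a minimal sketch, not CLEAR's implementation: the closing tags (`</think>` and so on), the optional restoration segment, the function names, the binary reward, and the example responses are all assumptions. Only the tag names and the reward categories come from the description above.

```python
import re
from statistics import mean, pstdev

# Assumed layout: <think>...</think>, an optional <image_restore>...</image_restore>
# segment, then <answer>...</answer>. The closing tags are an assumption.
TRACE = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*"
    r"(?:<image_restore>(?P<restore>.*?)</image_restore>\s*)?"
    r"<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def parse_trace(response):
    """Return the tagged segments as a dict, or None if the layout does not match."""
    match = TRACE.match(response)
    return match.groupdict() if match else None


def format_reward(response):
    """Hypothetical binary format reward: 1.0 for a well-formed trace, else 0.0."""
    return 1.0 if parse_trace(response) is not None else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: (reward - group mean) / (group std + eps).

    The population standard deviation is used here; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up responses sampled for a single prompt (one "group").
group = [
    "<think>The image is blurry.</think><image_restore>latents</image_restore><answer>A</answer>",
    "<think>The image looks clean.</think><answer>A</answer>",
    "The answer is A.",
]
rewards = [format_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the two well-formed responses receive a positive advantage and the untagged one a negative advantage. A full reward would also combine the accuracy, decision, and latent-quality terms, whose definitions and weights are not given here.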
+
+ CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.
+
+ ## MMD-Bench
+
+ The paper introduces **MMD-Bench**, a degradation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels.
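
To make the size of this grid concrete: 16 corruption types at 3 severity levels give 48 type-severity combinations. The sketch below enumerates such a grid. The even split of four types per category, the integer severity labels, and the placeholder type names are assumptions for illustration; the actual corruption names are defined by the benchmark itself.

```python
from itertools import product

CATEGORIES = ["Capture", "Transmission", "Environment", "Post-processing"]
TYPES_PER_CATEGORY = 4  # assumption: 16 types split evenly over 4 categories
SEVERITIES = [1, 2, 3]  # assumption: severity levels labeled 1 to 3

# Placeholder names such as "Capture/type_1"; not the benchmark's real type names.
corruption_types = [
    f"{category}/type_{i}"
    for category in CATEGORIES
    for i in range(1, TYPES_PER_CATEGORY + 1)
]

# Every (corruption type, severity) pair in the grid.
conditions = list(product(corruption_types, SEVERITIES))
print(len(corruption_types), len(conditions))  # 16 48
```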
+
+ ## Citation
+
+ ```bibtex
+ @misc{hao2026clearunlockinggenerativepotential,
+ title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
+ author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
+ year={2026},
+ eprint={2604.04780},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2604.04780},
+ }
+ ```
+
+ ## Acknowledgments
+
+ CLEAR is built upon [BAGEL](https://github.com/ByteDance-Seed/BAGEL) by ByteDance Seed. We thank the open-source community for [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [HuggingFace Transformers](https://github.com/huggingface/transformers), and [TRL](https://github.com/huggingface/trl).