Add model card and metadata for CLEAR

Hi! I'm Niels from the community science team at Hugging Face. I'm opening this PR to add a model card for CLEAR, a unified multimodal model that leverages generative capabilities (image restoration) to improve the visual understanding of degraded images.

This PR includes:
- Metadata with the `any-to-any` pipeline tag.
- Links to the [paper](https://arxiv.org/abs/2604.04780), project page, and GitHub repository.
- A summary of the three-stage training pipeline (SFT, Bridge Training, and Interleaved GRPO).
- Citation information.

README.md +41 -0

	@@ -0,0 +1,41 @@

+---
+pipeline_tag: any-to-any
+---
+# CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
+Official implementation of **CLEAR**, a unified multimodal model that leverages generative capabilities (image restoration) to improve visual understanding of degraded images.
+[**Paper**](https://arxiv.org/abs/2604.04780) | [**Project Page**](https://haoxiangzhao12138.github.io/CLEAR/) | [**GitHub**](https://github.com/haoxiangzhao12138/CLEAR)
+## Introduction
+Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. CLEAR (Corruption-aware interleaved reasoning) is a framework that connects the understanding and generation pathways through three progressive stages:
+1.  **Stage 1 — SFT**: Corruption-aware supervised fine-tuning with interleaved `<think>` / `<image_restore>` / `<answer>` reasoning to establish the reasoning pattern.
+2.  **Stage 2 — Bridge Training**: A latent representation bridge that maps denoised VAE latents directly back into the LLM's token space, avoiding a costly decode and re-encode round trip.
+3.  **Stage 3 — Interleaved GRPO**: A reinforcement learning method (Group Relative Policy Optimization) that jointly optimizes text reasoning and visual generation under rewards for accuracy, format, decision, and latent quality.
+CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.
+## MMD-Bench
+The authors propose **MMD-Bench**, a comprehensive degradation benchmark covering 16 corruption types across 4 categories (Capture, Transmission, Environment, Post-processing) at 3 severity levels.
+## Citation
+```bibtex
+@misc{hao2026clearunlockinggenerativepotential,
+      title={CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models},
+      author={Xiangzhao Hao and Zefeng Zhang and Zhenyu Zhang and Linhao Yu and Yao Chen and Yiqian Zhang and Haiyun Guo and Shuohuan Wang and Yu Sun},
+      year={2026},
+      eprint={2604.04780},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2604.04780},
+}
+```
+## Acknowledgments
+CLEAR is built upon [BAGEL](https://github.com/ByteDance-Seed/BAGEL) by ByteDance Seed. We thank the open-source community for [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [Hugging Face Transformers](https://github.com/huggingface/transformers), and [TRL](https://github.com/huggingface/trl).
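
For readers unfamiliar with the "group relative" part of Stage 3, here is a minimal sketch of how GRPO-style advantages are commonly computed: several responses are sampled for the same prompt, each gets a scalar reward, and each reward is normalized against its own group. This is not taken from the CLEAR repository. The function names, the equal reward weights, and the use of the population standard deviation are all assumptions made for illustration; the card only states that the rewards cover accuracy, format, decision, and latent quality.

```python
from statistics import mean, pstdev


def combine_rewards(accuracy, fmt, decision, latent_quality,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four reward terms named in the model card.

    The equal weights are a placeholder assumption, not CLEAR's values.
    """
    terms = (accuracy, fmt, decision, latent_quality)
    return sum(w * t for w, t in zip(weights, terms))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its group.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` keeps the division finite when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for one degraded-image question.
    group = [
        combine_rewards(1.0, 1.0, 1.0, 0.8),
        combine_rewards(0.0, 1.0, 0.0, 0.2),
        combine_rewards(1.0, 0.0, 1.0, 0.6),
        combine_rewards(0.0, 0.0, 0.0, 0.1),
    ]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to supply a baseline.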