wendell0218/Janus-FocusDiff-7B

Safetensors

multi_modality

wendell0218 committed on Jun 21, 2025

Commit

71aeaf6

1 Parent(s): f4df8cf

Create README.md

Files changed (1)

README.md +80 -0

README.md ADDED

	@@ -0,0 +1,80 @@

+---
+license: apache-2.0
+---
+<h2 align="center" style="line-height: 25px;">
+  FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL
+</h2>
+<p align="center">
+  <a href="https://arxiv.org/abs/2506.05501">
+    <img src="https://img.shields.io/badge/Paper-red?style=flat&logo=arxiv">
+  </a>
+  <a href="https://focusdiff.github.io/">
+    <img src="https://img.shields.io/badge/Project Page-white?style=flat&logo=google-docs">
+  </a>
+  <a href="https://github.com/wendell0218/FocusDiff">
+    <img src="https://img.shields.io/badge/Code-black?style=flat&logo=github">
+  </a>
+  <a href="https://huggingface.co/wendell0218/Janus-FocusDiff-7B">
+    <img src="https://img.shields.io/badge/-%F0%9F%A4%97%20Checkpoint-orange?style=flat"/>
+  </a>
+  <a href="">
+    <img src="https://img.shields.io/badge/-%F0%9F%A4%97%20Data-orange?style=flat"/>
+  </a>
+  <a href="">
+    <img src="https://img.shields.io/github/last-commit/wendell0218/FocusDiff?color=green">
+  </a>
+</p>
+<div align="center">
+Kaihang Pan<sup>1*</sup>, Wendong Bu<sup>1*</sup>, Yuruo Wu<sup>1*</sup>, Yang Wu<sup>2</sup>, Kai Shen<sup>1</sup>, Yunfei Li<sup>2</sup>,
+Hang Zhao<sup>2</sup>, Juncheng Li<sup>1&dagger;</sup>, Siliang Tang<sup>1</sup>, Yueting Zhuang<sup>1</sup>
+<sup>1</sup>Zhejiang University, <sup>2</sup>Ant Group
+\*Equal Contribution, <sup>&dagger;</sup>Corresponding Author
+</div>
+![alt text](https://raw.githubusercontent.com/wendell0218/FocusDiff/refs/heads/main/assets/case.png)
+## 🚀 Overview
+**FocusDiff** is a new method for improving fine-grained text-image alignment in autoregressive text-to-image models. By introducing the **FocusDiff-Data** dataset and a novel **Pair-GRPO** reinforcement learning framework, we help models learn subtle semantic differences between similar text-image pairs. Building on the paired data in FocusDiff-Data, we further introduce the **PairComp** benchmark, which targets subtle semantic differences between similar prompts.
+Key Contributions:
+1. **PairComp Benchmark**: A new benchmark focusing on fine-grained differences in text prompts.
+    <img src="https://raw.githubusercontent.com/wendell0218/FocusDiff/refs/heads/main/assets/benchmark.png" width="100%"/>
+2. **FocusDiff Approach**: A method using paired data and reinforcement learning to enhance fine-grained text-image alignment.
+   <div style="text-align: center;">
+       <img src="https://raw.githubusercontent.com/wendell0218/FocusDiff/refs/heads/main/assets/grpo.png" width="80%" />
+   </div>
+3. **SOTA Results**: Our model achieves top performance on multiple benchmarks, including **GenEval**, **T2I-CompBench**, **DPG-Bench**, and our newly proposed **PairComp** benchmark.
+## ✨️ Quickstart
+## 🤝 Acknowledgment
+Our project is developed based on the following repositories:
+- [Janus-Series](https://github.com/deepseek-ai/Janus): Unified Multimodal Understanding and Generation Models
+- [Open-R1](https://github.com/huggingface/open-r1): Fully open reproduction of DeepSeek-R1
+## 📜 Citation
+If you find this work useful for your research, please cite our paper and star our GitHub repository:
+```bibtex
+@article{pan2025focusdiff,
+  title={FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL},
+  author={Pan, Kaihang and Bu, Wendong and Wu, Yuruo and Wu, Yang and Shen, Kai and Li, Yunfei and Zhao, Hang and Li, Juncheng and Tang, Siliang and Zhuang, Yueting},
+  journal={arXiv preprint arXiv:2506.05501},
+  year={2025}
+}
+```