	---
	language: en
	license: apache-2.0
	tags:
	- image-generation
	- diffusion
	- grpo
	model_name: GRPO-Guard
	pipeline_tag: text-to-image
	---

	<h1 align="center"> GRPO-Guard:<br>Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping </h1>
	<div align="center">
	<a href='https://arxiv.org/abs/2510.22319'><img src='https://img.shields.io/badge/ArXiv-red?logo=arxiv'></a>
	<a href='https://jingw193.github.io/GRPO-Guard/'><img src='https://img.shields.io/badge/Visualization-green?logo=github'></a>
	<a href="https://github.com/yifan123/flow_grpo#%EF%B8%8F-over-optimization-grpo-guard-"><img src="https://img.shields.io/badge/Code-9E95B7?logo=github"></a>
	</div>


	## Over-optimization

	To mitigate implicit over-optimization in flow matching, we propose [GRPO-Guard](https://arxiv.org/abs/2510.22319) ([🔥Project Page](https://jingw193.github.io/GRPO-Guard/)).

	We first observe that the importance ratio exhibits an inherent bias:

	1. Its mean is consistently below 1, and the deviation becomes more pronounced at low-noise steps (e.g., step 8 in SD3.5-M).

	2. Its variance differs notably across denoising steps.
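The two observations above can be checked by computing per-step statistics of the importance ratio. The following is a minimal sketch, assuming the log-probabilities of each sampled transition under the current and old policy are available as arrays of shape `(num_samples, num_steps)`; the function name `ratio_stats_per_step` and the toy inputs are hypothetical and not taken from the released code.

```python
import numpy as np

def ratio_stats_per_step(logp_new, logp_old):
    """Per-step mean and variance of the importance ratio.

    Both inputs have shape (num_samples, num_steps) and hold the
    log-probability of each sampled transition under the current
    policy (logp_new) and the old policy (logp_old).
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return ratio.mean(axis=0), ratio.var(axis=0)

# Hand-made toy values (not real training data): 2 samples, 2 steps.
logp_old = np.zeros((2, 2))
logp_new = np.log(np.array([[1.0, 0.8],
                            [1.0, 0.6]]))
mean, var = ratio_stats_per_step(logp_new, logp_old)
print(mean, var)
```

In an unbiased setting the per-step means would all sit near 1 with similar variances; a step whose mean falls clearly below 1 shows the kind of bias described above.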

	Ideally, the importance ratio distribution should have a mean of 1 and stable variance. The clipping operation truncates overly confident positive or negative samples outside the region [1−ϵ,1+ϵ], ensuring stable gradient updates. However, the observed bias in the importance ratio disrupts this mechanism—gradients of positive samples are no longer properly constrained, leading the policy model into over-optimization. As a result, the proxy score continues to rise while the gold score declines, causing a severe degradation in image quality.
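To make the effect of the bias concrete, here is a minimal sketch of the standard PPO/GRPO-style clipped surrogate, under the assumption that the objective takes the usual `min(r * A, clip(r, 1 - eps, 1 + eps) * A)` form. The helper names, the value `eps=0.2`, and the ratio values are illustrative assumptions, not measurements from the paper.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped objective (to be maximized)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

def is_clipped(ratio, advantage, eps=0.2):
    """True where the clip is active, so the sample contributes no gradient."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    positive_clipped = (advantage > 0) & (ratio > 1.0 + eps)
    negative_clipped = (advantage < 0) & (ratio < 1.0 - eps)
    return positive_clipped | negative_clipped

# Hand-made ratios: one set centered at 1, the same set shifted below 1.
adv = np.array([1.0, 1.0, -1.0, -1.0])
centered = np.array([1.3, 0.9, 0.7, 1.1])
shifted = centered - 0.2

print(is_clipped(centered, adv))  # the positive sample at 1.3 and the negative one at 0.7 are clipped
print(is_clipped(shifted, adv))   # only a negative sample is clipped; no positive sample is
```

When the ratios are shifted below 1, positive samples rarely cross the upper bound `1 + eps`, so their gradients stay unconstrained while negative samples are still clipped. This is the imbalance described in the paragraph above.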


	The biased ratio distributions are summarized in the table below.

	| FlowGRPO | GRPO-Guard |
	| - | - |
	| ![flow_grpo ratio](assets/gif_1.gif) | ![grpo_guard ratio](assets/gif_2.gif) |
	| The clipping mechanism is imbalanced, failing to constrain overconfident positive samples. | With RatioNorm, the ratio distribution is re-centered, so clipping constrains overconfident positive and negative samples alike. |


	To address this issue, [GRPO-Guard](https://arxiv.org/abs/2510.22319) introduces two mechanisms that effectively alleviate over-optimization:

	- RatioNorm: Corrects the distributional bias of importance ratios and unifies their statistics across denoising steps.

	- Gradient Reweight: Further reweights the gradients of different denoising steps based on RatioNorm, balancing their contributions and preventing excessive optimization under specific noise levels.
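The exact definitions of RatioNorm and Gradient Reweight are given in the paper and the linked code. The sketch below only illustrates the general idea of correcting per-step ratio statistics: it standardizes the log-ratio at each denoising step so that every step has zero mean and a shared spread. This particular normalization, the function name `standardize_log_ratio`, and the synthetic inputs are assumptions made for illustration, and Gradient Reweight is not covered.

```python
import numpy as np

def standardize_log_ratio(log_ratio, target_std=None, tiny=1e-8):
    """Illustrative per-step normalization (an assumption, not the paper's formula).

    log_ratio has shape (num_samples, num_steps). Each step (column) is
    shifted to zero mean and rescaled to a common standard deviation, so
    exp(result) is centered near 1 with comparable spread at every step.
    """
    log_ratio = np.asarray(log_ratio, dtype=float)
    mean = log_ratio.mean(axis=0, keepdims=True)
    std = log_ratio.std(axis=0, keepdims=True)
    if target_std is None:
        target_std = std.mean()  # share the average spread across steps
    return (log_ratio - mean) / (std + tiny) * target_std

# Synthetic log-ratios (not real data): step 0 is mildly biased,
# step 1 has a larger negative shift and a much larger spread.
rng = np.random.default_rng(0)
log_ratio = rng.normal(loc=[-0.05, -0.3], scale=[0.02, 0.2], size=(500, 2))
normalized = standardize_log_ratio(log_ratio)

print(log_ratio.mean(axis=0), log_ratio.std(axis=0))
print(normalized.mean(axis=0), normalized.std(axis=0))
```

After normalization both steps have the same mean and spread, so a single clipping range `[1−ϵ,1+ϵ]` applied to `exp(normalized)` treats every step alike.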

	The following figure compares over-optimization between GRPO-Guard and FlowGRPO on text rendering tasks. GRPO-Guard maintains the same rising trend in proxy scores as FlowGRPO while preventing rapid declines in gold scores, thus preserving high image quality and diversity.

	<p align="center">
	<img src="assets/GRPO-Guard-figure1.png" alt="GRPO-Guard Illustration" width="1000"/>
	</p>


	## ⭐Citation

	If you find GRPO-Guard useful for your research or projects, we would greatly appreciate it if you could cite the following paper:
	```
	@misc{wang2025grpoguardmitigatingimplicitoveroptimization,
	title={GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping},
	author={Jing Wang and Jiajun Liang and Jie Liu and Henglin Liu and Gongye Liu and Jun Zheng and Wanyuan Pang and Ao Ma and Zhenyu Xie and Xintao Wang and Meng Wang and Pengfei Wan and Xiaodan Liang},
	year={2025},
	eprint={2510.22319},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2510.22319},
	}
	```