---
title: IP-GRM for Creative Writing
---

# IP-GRM: Unbiased Principles, Robust Rewards

IP-GRM (Independent Principle Generative Reward Model) is a decoupled reward-modeling framework for open-ended RLHF tasks. It explicitly separates principle generation from response judging, reducing response-conditioned bias (Principle Drift) and improving reward robustness in GRPO training.

## Resources

| Resource | Description |
|----------|-------------|
| [IP-GRM](https://huggingface.co/IP-GRM/IP-GRM) | 16B generative reward model with decoupled principle-judgment pipeline |
| [CreativeWriting-8B](https://huggingface.co/IP-GRM/CreativeWriting-8B) | 8B creative writing model trained via GRPO with IP-GRM rewards |
| [IP-rewarding-8K](https://huggingface.co/datasets/IP-GRM/IP-rewarding-8K) | 8K decoupled reward SFT dataset (principle + judgment pairs) |
| [Paper](https://arxiv.org/abs/) | arXiv preprint |
| [Code](https://github.com/ShadeCloak/IP-GRM) | Training scripts and IP-GRM process functions |

## Key Idea

Standard generative reward models (GRMs) couple principle generation with response observation, causing Principle Drift: the evaluation criteria shift to accommodate the response being judged. IP-GRM removes this dependence through a two-stage factorization:

- Stage 1, `P(P | Q)`: generate evaluation principles from the question only
- Stage 2, `P(J, r | Q, P, R)`: judge the response under the pre-defined principles

Because Stage 1 never sees the response, this ensures conditional independence, `I(P; R | Q) = 0`, and enables Principle Cache: principles are generated once per prompt and reused across all sampled responses in a GRPO group.
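
The two stages and the Principle Cache can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the IP-GRM repository: `DecoupledRewarder`, `toy_principles`, and `toy_judge` are hypothetical names, and the toy functions stand in for real calls to a generative reward model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures standing in for reward-model calls.
PrincipleFn = Callable[[str], str]                      # Stage 1: Q -> P
JudgeFn = Callable[[str, str, str], Tuple[str, float]]  # Stage 2: (Q, P, R) -> (J, r)


@dataclass
class DecoupledRewarder:
    generate_principles: PrincipleFn
    judge: JudgeFn
    _cache: Dict[str, str] = field(default_factory=dict)  # Principle Cache, keyed by question
    stage1_calls: int = 0  # counts how often Stage 1 actually runs

    def principles_for(self, question: str) -> str:
        # Stage 1 receives only the question, so the principles
        # cannot depend on any response.
        if question not in self._cache:
            self._cache[question] = self.generate_principles(question)
            self.stage1_calls += 1
        return self._cache[question]

    def score_group(self, question: str, responses: List[str]) -> List[float]:
        # Every response in the group is judged under the same fixed principles.
        principles = self.principles_for(question)
        return [self.judge(question, principles, r)[1] for r in responses]


def toy_principles(question: str) -> str:
    # Assumption: a placeholder for sampling principles from the model.
    return f"Principles for: {question}"


def toy_judge(question: str, principles: str, response: str) -> Tuple[str, float]:
    # Assumption: a toy reward that counts distinct words in the response.
    reward = float(len(set(response.split())))
    return f"Judged under [{principles}]", reward


rewarder = DecoupledRewarder(toy_principles, toy_judge)
group = ["a quiet storm", "rain rain rain", "the lighthouse keeper waits"]
rewards = rewarder.score_group("Write a story opening.", group)
print(rewards)                # [3.0, 1.0, 4.0]
print(rewarder.stage1_calls)  # 1
```

Keying the cache on the question alone is what makes reuse safe here: since Stage 1 takes no response as input, one set of principles serves the whole group, and Stage 1 runs once per prompt rather than once per response.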