---
title: IP-GRM for Creative Writing
---

# IP-GRM: Unbiased Principles, Robust Rewards

IP-GRM (Independent Principle Generative Reward Model) is a decoupled reward-modeling framework for open-ended RLHF tasks. It explicitly separates **principle generation** from **response judging**, reducing response-conditioned bias (*Principle Drift*) and improving reward robustness in GRPO training.

## Resources

| Resource | Description |
|----------|-------------|
| [IP-GRM](https://huggingface.co/IP-GRM/IP-GRM) | 16B generative reward model with decoupled principle-judgment pipeline |
| [CreativeWriting-8B](https://huggingface.co/IP-GRM/CreativeWriting-8B) | 8B creative writing model trained via GRPO with IP-GRM rewards |
| [IP-rewarding-8K](https://huggingface.co/datasets/IP-GRM/IP-rewarding-8K) | 8K decoupled reward SFT dataset (principle + judgment pairs) |
| [Paper](https://arxiv.org/abs/) | arXiv preprint |
| [Code](https://github.com/ShadeCloak/IP-GRM) | Training scripts and IP-GRM process functions |

## Key Idea

Standard generative reward models (GRMs) couple principle generation with response observation, causing **Principle Drift** — the evaluation criteria shift to accommodate the response being judged. IP-GRM eliminates this bias through a two-stage factorization:

- **Stage 1** `P(P | Q)` — generate evaluation principles from the question only
- **Stage 2** `P(J, r | Q, P, R)` — judge the response under pre-defined principles

This enforces conditional independence, `I(P; R | Q) = 0`, and enables a **Principle Cache**: principles are generated once per prompt and reused across all sampled responses in a GRPO group, so Stage 1 runs only once per group.
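The two-stage factorization and the Principle Cache can be sketched in a few lines. This is a minimal illustration, not the released implementation: `generate_principles` and `judge_response` are hypothetical stand-ins for the Stage-1 and Stage-2 model calls, and the toy scoring rule is invented for demonstration.

```python
from functools import lru_cache

# Hypothetical stand-in for the Stage-1 call P(P | Q):
# principles are generated from the question alone and never see a response,
# so they cannot drift toward any particular candidate.
def generate_principles(question: str) -> str:
    return f"Criteria for {question!r}: clarity, coherence, originality"

# Hypothetical stand-in for the Stage-2 call P(J, r | Q, P, R):
# the judgment conditions on the question, the fixed principles, and one
# candidate response, returning a scalar reward.
def judge_response(question: str, principles: str, response: str) -> float:
    # Toy rule for illustration only; a real GRM produces a learned score.
    return min(len(response) / 100.0, 1.0)

# Principle Cache: Stage 1 runs once per prompt; every response in the
# GRPO group is judged under the same cached principles.
@lru_cache(maxsize=None)
def cached_principles(question: str) -> str:
    return generate_principles(question)

def score_grpo_group(question: str, responses: list[str]) -> list[float]:
    principles = cached_principles(question)  # single Stage-1 call per group
    return [judge_response(question, principles, r) for r in responses]
```

Because the principles are fixed before any response is observed, all responses in a group are ranked against identical criteria, which is what makes the cached rewards comparable within GRPO.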