ShadeCloak committed · Commit 38ac236 · verified · 1 Parent(s): 5dc78a1

Update README.md

Files changed (1): README.md +29 -7
README.md CHANGED
@@ -1,10 +1,32 @@
  ---
- title: README
- emoji: 🔥
- colorFrom: pink
- colorTo: purple
- sdk: static
- pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.
  ---
+ title: IP-GRM for Creative Writing
  ---

+ # IP-GRM: Unbiased Principles, Robust Rewards
+
+ IP-GRM (Independent Principle Generative Reward Model) is a decoupled reward-modeling framework for open-ended RLHF tasks. It explicitly separates **principle generation** from **response judging**, reducing response-conditioned bias (*Principle Drift*) and improving reward robustness in GRPO training.
+
+ ## Resources
+
+ | Resource | Description |
+ |----------|-------------|
+ | [IP-GRM](https://huggingface.co/IP-GRM/IP-GRM) | 16B generative reward model with a decoupled principle-judgment pipeline |
+ | [CreativeWriting-8B](https://huggingface.co/IP-GRM/CreativeWriting-8B) | 8B creative writing model trained via GRPO with IP-GRM rewards |
+ | [IP-rewarding-8K](https://huggingface.co/datasets/IP-GRM/IP-rewarding-8K) | 8K-example decoupled reward SFT dataset (principle + judgment pairs) |
+ | [Paper](https://arxiv.org/abs/2602.10019) | arXiv preprint |
+ | [Code](https://github.com/IP-GRM/IP-GRM) | Training scripts and IP-GRM reward-processing functions |
+
+ ## Key Idea
+
+ Standard generative reward models (GRMs) couple principle generation with response observation, causing **Principle Drift**: the evaluation criteria shift to accommodate the response being judged. IP-GRM eliminates this bias through a two-stage factorization:
+
+ - **Stage 1** `P(P | Q)`: generate evaluation principles from the question only
+ - **Stage 2** `P(J, r | Q, P, R)`: judge the response under the pre-defined principles
+
+ This ensures conditional independence `I(P; R | Q) = 0` and enables a **Principle Cache**: principles are generated once per prompt and reused across all sampled responses in a GRPO group.
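Because Stage 1 depends only on the question, caching falls out of the factorization for free. A minimal sketch of the two-stage pipeline with a principle cache (the `generate` helper and prompt templates here are hypothetical placeholders, not the repo's actual API):

```python
from functools import lru_cache

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a call to the reward model; here we
    # just echo the prompt so the control flow is runnable.
    return f"model output for: {prompt}"

@lru_cache(maxsize=None)  # Principle Cache: one principle set per question
def principles(question: str) -> str:
    # Stage 1: P(P | Q) -- principles condition on the question only,
    # so they cannot drift toward any particular response.
    return generate(f"List evaluation principles for: {question}")

def reward(question: str, response: str) -> str:
    # Stage 2: P(J, r | Q, P, R) -- judge the response under fixed principles.
    p = principles(question)  # cache hit for every response after the first
    return generate(
        f"Question: {question}\nPrinciples: {p}\n"
        f"Response: {response}\nJudge and score."
    )

# All sampled responses in a GRPO group reuse the same cached principles,
# so Stage 1 runs once per prompt instead of once per response.
scores = [reward("Write a poem about rain.", r) for r in ["resp A", "resp B"]]
```

Reusing one Stage 1 call per group is what makes the reported reward-computation speedup possible without changing the judging distribution.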
+
+ ## Results
+
+ - **WritingBench 87.6** / **CW-v3 77.8** with Qwen3-8B + IP-GRM (competitive with GPT-5.2 and Claude-Sonnet-4)
+ - **23.66% faster** reward computation than a baseline GRM, via the Principle Cache