---
title: IP-GRM for Creative Writing
---

# IP-GRM: Unbiased Principles, Robust Rewards

IP-GRM (Independent Principle Generative Reward Model) is a decoupled reward-modeling framework for open-ended RLHF tasks. It explicitly separates **principle generation** from **response judging**, reducing response-conditioned bias (*Principle Drift*) and improving reward robustness in GRPO training.

## Resources

| Resource | Description |
|----------|-------------|
| [IP-GRM](https://huggingface.co/IP-GRM/IP-GRM) | 16B generative reward model with a decoupled principle-judgment pipeline |
| [CreativeWriting-8B](https://huggingface.co/IP-GRM/CreativeWriting-8B) | 8B creative writing model trained via GRPO with IP-GRM rewards |
| [IP-rewarding-8K](https://huggingface.co/datasets/IP-GRM/IP-rewarding-8K) | 8K decoupled reward SFT dataset (principle + judgment pairs) |
| [Paper](https://arxiv.org/abs/2602.10019) | arXiv preprint |
| [Code](https://github.com/IP-GRM/IP-GRM) | Training scripts and IP-GRM process functions |

## Key Idea

Standard generative reward models (GRMs) couple principle generation with response observation, causing **Principle Drift**: the evaluation criteria shift to accommodate the response being judged. IP-GRM eliminates this bias through a two-stage factorization:

- **Stage 1** `P(P | Q)`: generate evaluation principles from the question only
- **Stage 2** `P(J, r | Q, P, R)`: judge the response under the pre-defined principles

This enforces conditional independence `I(P; R | Q) = 0` and enables a **Principle Cache**: principles are generated once per prompt and reused across all sampled responses in a GRPO group.

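The two-stage pipeline and the Principle Cache can be sketched as below. This is a minimal illustration, not the released IP-GRM API: `generate_principles`, `judge`, and the toy scoring rule are all placeholder assumptions standing in for calls to the reward model.

```python
from functools import lru_cache

def generate_principles(question: str) -> tuple[str, ...]:
    # Stage 1: P(P | Q) -- principles are conditioned on the question only,
    # never on a response, so they cannot drift toward the response.
    # Hypothetical placeholder for a reward-model call.
    return (f"Relevance to: {question}", "Coherence", "Originality")

def judge(question: str, principles: tuple[str, ...], response: str) -> float:
    # Stage 2: P(J, r | Q, P, R) -- score the response under the
    # pre-committed principles. Toy keyword scoring, for illustration only.
    return sum(p.split()[0] in response for p in principles) / len(principles)

@lru_cache(maxsize=None)
def cached_principles(question: str) -> tuple[str, ...]:
    # Principle Cache: one Stage-1 generation per prompt, reused for
    # every sampled response in the GRPO group.
    return generate_principles(question)

def score_group(question: str, responses: list[str]) -> list[float]:
    principles = cached_principles(question)  # single Stage-1 call per group
    return [judge(question, principles, r) for r in responses]
```

Because Stage 1 never sees a response, every response in a group is judged against identical criteria, which is what makes the per-prompt cache sound.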
## Results

- **WritingBench 87.6** / **CW-v3 77.8** with Qwen3-8B + IP-GRM (competitive with GPT-5.2 and Claude-Sonnet-4)
- **23.66% faster** reward computation than the baseline GRM via the Principle Cache