---
datasets:
- HannahRoseKirk/prism-alignment
base_model:
- Skywork/Skywork-Reward-Llama-3.1-8B-v0.2
---

# Meta Reward Modeling (MRM)

## Overview

**Meta Reward Modeling (MRM)** is a personalized reward modeling framework designed to adapt to diverse user preferences from limited feedback.
Instead of learning a single global reward function, MRM treats each user as a separate learning task and uses meta-learning to find a shared initialization that enables fast, few-shot personalization.

MRM represents each user-specific reward as an adaptive combination of shared base reward functions and optimizes this structure through a bi-level meta-learning framework.
To improve robustness across heterogeneous users, MRM introduces a **Robust Personalization Objective (RPO)** that emphasizes hard-to-learn users during meta-training.

This repository provides trained checkpoints for reward modeling and user-level preference evaluation.
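The structure above — a user's reward as an adaptive combination of shared base rewards, personalized in a few gradient steps — can be sketched as follows. This is a minimal illustration under assumed shapes and a Bradley-Terry preference loss; the names (`user_reward`, `adapt`) are hypothetical, not the repository's API.

```python
import torch

# Hypothetical sketch: a user's reward is a softmax-weighted combination of
# K shared base reward scores, and personalization takes a few gradient
# steps on that user's preference pairs.

def user_reward(base_scores: torch.Tensor, user_logits: torch.Tensor) -> torch.Tensor:
    """base_scores: (N, K) scores from K shared base reward heads.
    user_logits: (K,) per-user mixing logits, adapted from the meta-learned init."""
    mix = torch.softmax(user_logits, dim=0)  # convex combination over base rewards
    return base_scores @ mix                 # (N,) personalized reward scores

def adapt(chosen_scores, rejected_scores, init_logits, lr=1e-3, steps=5):
    """Few-shot inner loop: start from the meta-learned initialization and take
    a few gradient steps on a Bradley-Terry loss over this user's pairs."""
    w = init_logits.clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        margin = user_reward(chosen_scores, w) - user_reward(rejected_scores, w)
        loss = -torch.nn.functional.logsigmoid(margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```

In the full bi-level setup, the outer loop would additionally update the shared initialization (and base rewards) across users; the sketch shows only the inner, per-user adaptation.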

---

## Links

- 📄 **arXiv Paper**: https://arxiv.org/abs/XXXX.XXXXX
- 🤗 **Hugging Face Paper**: https://huggingface.co/papers/XXXX.XXXXX
- 💻 **GitHub Code**: https://github.com/ModalityDance/MRM
- 📦 **Hugging Face Collection**: https://huggingface.co/collections/ModalityDance/mrm

---

## Evaluation

The model is evaluated on user-level preference accuracy with few-shot personalization.
Inference follows the same adaptation procedure used during training: for each user, the reward weights start from the meta-learned initialization and are updated with a small number of gradient steps on that user's preference data.
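User-level preference accuracy can be sketched as the fraction of a user's held-out pairs where the adapted reward ranks the chosen response above the rejected one. A minimal illustration — the helper name is hypothetical, not something `inference.py` actually exposes:

```python
def preference_accuracy(pairs, reward_fn):
    """pairs: iterable of (chosen, rejected) inputs for one user;
    reward_fn: that user's adapted reward model."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)
```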

### Example evaluation script

```bash
python inference.py \
  --embed_pt data/emb/prism/V1.pt \
  --meta_json data/emb/prism/V1.json \
  --ckpt path/to/checkpoint.pt \
  --dataset PRISM \
  --seen_train_limit -1 \
  --unseen_train_limit -1 \
  --hidden_layers 2 \
  --inner_lr 1e-3 \
  --eval_inner_epochs 1 \
  --val_ratio 0.9 \
  --score_threshold -1 \
  --seed 42 \
  --device cuda:0
```
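The per-user evaluation flow implied by the flags above (adapt on a few of a user's pairs, score the rest, average accuracy across users) might look like this sketch; all names are illustrative assumptions, not `inference.py`'s internals:

```python
import random

def evaluate_users(users, adapt_fn, n_adapt=5, seed=42):
    """users: dict mapping user_id -> list of (chosen, rejected) pairs.
    adapt_fn: takes a user's support pairs, returns an adapted reward function.
    Returns the average of per-user held-out preference accuracies."""
    rng = random.Random(seed)
    accs = []
    for pairs in users.values():
        pairs = pairs[:]
        rng.shuffle(pairs)
        support, held_out = pairs[:n_adapt], pairs[n_adapt:]
        reward = adapt_fn(support)  # few-shot personalization on this user
        correct = sum(reward(c) > reward(r) for c, r in held_out)
        accs.append(correct / max(len(held_out), 1))
    return sum(accs) / len(accs)    # user-level average
```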

---

## Citation

If you use this model or code in your research, please cite:

```bibtex
@article{mrm2025,
  title   = {Meta Reward Modeling for Personalized Alignment},
  author  = {Author Names},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
```

---

## License

This model is released under the **MIT License**.