---
datasets:
- HannahRoseKirk/prism-alignment
base_model:
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
---

# Meta Reward Modeling (MRM)

## Overview

**Meta Reward Modeling (MRM)** is a personalized reward modeling framework designed to adapt to diverse user preferences from limited feedback.
Instead of learning a single global reward function, MRM treats each user as a separate learning task and uses meta-learning to learn a shared initialization that enables fast, few-shot personalization.

MRM represents each user-specific reward as an adaptive combination of shared base reward functions and optimizes this structure through a bi-level meta-learning framework.
To improve robustness across heterogeneous users, MRM introduces a **Robust Personalization Objective (RPO)** that emphasizes hard-to-learn users during meta-training.

This repository provides trained checkpoints for reward modeling and user-level preference evaluation.

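The adaptive-combination and RPO ideas above can be sketched in code. Everything below is an illustrative assumption rather than the repository's actual implementation: the function names (`user_reward`, `adapt`, `rpo_meta_loss`), the softmax-weighted mixing of base reward scores, and the softmax reweighting of per-user losses are hypothetical stand-ins for the parameterization described in the paper.

```python
import torch
import torch.nn.functional as F

def user_reward(base_scores: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """User-specific reward: an adaptive (here, softmax-weighted)
    combination of scores from shared base reward functions."""
    return torch.softmax(w, dim=-1) @ base_scores

def bt_loss(w, chosen_scores, rejected_scores):
    """Bradley-Terry loss: the chosen response should out-score the rejected one."""
    margin = user_reward(chosen_scores, w) - user_reward(rejected_scores, w)
    return -F.logsigmoid(margin)

def adapt(w_meta, pairs, inner_lr=1e-3, steps=5):
    """Inner loop: few-shot adaptation of one user's mixing weights,
    starting from the shared meta-learned initialization."""
    w = w_meta.clone().requires_grad_(True)
    for _ in range(steps):
        loss = torch.stack([bt_loss(w, c, r) for c, r in pairs]).mean()
        (grad,) = torch.autograd.grad(loss, w)
        w = (w - inner_lr * grad).detach().requires_grad_(True)
    return w.detach()

def rpo_meta_loss(user_losses, tau=1.0):
    """Robust Personalization Objective (sketch): softmax-reweight per-user
    post-adaptation losses so hard-to-learn users dominate the meta-update."""
    losses = torch.stack(user_losses)
    weights = torch.softmax(losses / tau, dim=0).detach()
    return (weights * losses).sum()
```

A full bi-level setup would differentiate the outer objective through the inner loop; this sketch detaches the adapted weights for simplicity, i.e. a first-order approximation.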
---

## Links

- 📄 **arXiv Paper**: https://arxiv.org/abs/XXXX.XXXXX
- 🤗 **Hugging Face Paper**: https://huggingface.co/papers/XXXX.XXXXX
- 💻 **GitHub Code**: https://github.com/ModalityDance/MRM
- 📦 **Hugging Face Collection**: https://huggingface.co/collections/ModalityDance/mrm

---

## Evaluation

The model is evaluated on user-level preference accuracy with few-shot personalization.
Inference follows the same adaptation procedure used during training: for each user, the reward weights start from the meta-learned initialization and are updated with a small number of gradient steps on that user's preference data.

### Example evaluation script

```bash
python inference.py \
  --embed_pt data/emb/prism/V2.pt \
  --meta_json data/emb/prism/V2.json \
  --ckpt path/to/checkpoint.pt \
  --dataset PRISM \
  --seen_train_limit -1 \
  --unseen_train_limit -1 \
  --hidden_layers 2 \
  --inner_lr 1e-3 \
  --eval_inner_epochs 1 \
  --val_ratio 0.9 \
  --score_threshold -1 \
  --seed 42 \
  --device cuda:0
```
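The user-level preference accuracy reported by the evaluation is, conceptually, the fraction of a user's held-out pairs that the adapted reward ranks correctly. A minimal sketch of that metric (the helper name and the `reward_fn` interface are assumptions for illustration, not the internals of `inference.py`):

```python
def preference_accuracy(reward_fn, eval_pairs):
    """Fraction of held-out (chosen, rejected) pairs where the adapted
    user reward scores the chosen response strictly higher."""
    correct = sum(1 for chosen, rejected in eval_pairs
                  if reward_fn(chosen) > reward_fn(rejected))
    return correct / len(eval_pairs)
```

In the script above, each user's pairs are split into an adaptation set and a held-out set; accuracy is computed on the held-out pairs after the inner-loop updates.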

---

## Citation

If you use this model or code in your research, please cite:

```bibtex
@article{mrm2025,
  title   = {Meta Reward Modeling for Personalized Alignment},
  author  = {Author Names},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
```

---

## License

This model is released under the **MIT License**.