Text Classification · Transformers

Commit 90e6fcb (verified) by HongruCai · Parent: dd1c69e · Update README.md · Files changed (1): README.md (+70 −1)
metrics:
- accuracy
base_model:
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
---

# Meta Reward Modeling (MRM)

## Overview

**Meta Reward Modeling (MRM)** is a personalized reward modeling framework designed to adapt to diverse user preferences from limited feedback. Instead of learning a single global reward function, MRM treats each user as a separate learning task and meta-learns a shared initialization that enables fast, few-shot personalization.

MRM represents user-specific rewards as adaptive combinations of shared base reward functions and optimizes this structure through a bi-level meta-learning framework. To improve robustness across heterogeneous users, MRM introduces a **Robust Personalization Objective (RPO)** that up-weights hard-to-learn users during meta-training.
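As a minimal sketch of this structure (the class and function names, softmax mixing, and temperature below are assumptions for illustration, not the repository's implementation), a user-specific reward can be written as a weighted combination of frozen base reward functions, with an RPO-style outer loss that up-weights users whose post-adaptation loss remains high:

```python
import torch

# Illustrative sketch only -- names, softmax mixing, and the temperature
# are assumptions, not the repository's implementation.
class UserReward(torch.nn.Module):
    """User-specific reward as an adaptive combination of shared base rewards."""

    def __init__(self, base_rewards, init_weights):
        super().__init__()
        self.base_rewards = base_rewards                          # K shared base reward fns
        self.weights = torch.nn.Parameter(init_weights.clone())   # per-user, adapted

    def forward(self, x):
        # r_u(x) = sum_k softmax(w_u)_k * b_k(x)
        scores = torch.stack([b(x) for b in self.base_rewards], dim=-1)  # [..., K]
        return (scores * torch.softmax(self.weights, dim=-1)).sum(dim=-1)


def rpo_meta_loss(per_user_losses, tau=1.0):
    """RPO-style outer objective (assumed form): emphasize hard-to-learn users.

    Reweights each user's post-adaptation loss by a detached softmax over the
    losses, so users the shared initialization serves poorly dominate meta-training.
    """
    losses = torch.stack(per_user_losses)
    hardness = torch.softmax(losses / tau, dim=0).detach()
    return (hardness * losses).sum()
```

With equal (zero-initialized) mixing weights, the user reward reduces to the mean of the base rewards; adaptation then shifts the weights toward the bases that explain the user's preferences.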

This repository provides trained checkpoints for reward modeling and user-level preference evaluation.

---

## Links

- 📄 **arXiv Paper**: https://arxiv.org/abs/XXXX.XXXXX
- 🤗 **Hugging Face Paper**: https://huggingface.co/papers/XXXX.XXXXX
- 💻 **GitHub Code**: https://github.com/ModalityDance/MRM
- 📦 **Hugging Face Collection**: https://huggingface.co/collections/ModalityDance/mrm

---

## Evaluation

The model is evaluated with user-level preference accuracy under few-shot personalization. Inference follows the same adaptation procedure used during training: for each user, the reward weights are initialized from the meta-learned initialization and updated with a small number of gradient steps on that user's preference data.
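This inference-time adaptation can be sketched as follows (an illustrative sketch, not the repository's exact code: the function name, the Bradley-Terry preference loss, and the precomputed per-pair base-reward scores are assumptions; the learning rate and epoch count mirror the `--inner_lr` and `--eval_inner_epochs` flags):

```python
import torch

def adapt_user_weights(meta_init, pos_scores, neg_scores,
                       inner_lr=5e-3, inner_epochs=1):
    """Illustrative few-shot adaptation (assumed form, not the repo's code).

    meta_init:  [K] meta-learned initialization of the mixing weights.
    pos_scores: [N, K] base-reward scores of each pair's preferred response.
    neg_scores: [N, K] base-reward scores of each pair's rejected response.
    """
    w = meta_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=inner_lr)
    for _ in range(inner_epochs):
        mix = torch.softmax(w, dim=-1)
        margin = (pos_scores - neg_scores) @ mix
        # Bradley-Terry preference loss: -log sigmoid(r_pos - r_neg)
        loss = -torch.nn.functional.logsigmoid(margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```

After a few steps the mixing weights shift toward the base rewards that rank the user's preferred responses higher; the adapted weights are then used to score held-out pairs for user-level accuracy.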

### Example evaluation script

```bash
python inference.py \
  --embed_pt data/emb/reddit/V2.pt \
  --meta_json data/emb/reddit/V2.json \
  --ckpt path/to/checkpoint.pt \
  --dataset REDDIT \
  --seen_train_limit 150 \
  --unseen_train_limit 50 \
  --hidden_layers 2 \
  --inner_lr 5e-3 \
  --eval_inner_epochs 1 \
  --val_ratio 0.9 \
  --score_threshold -1 \
  --seed 42 \
  --device cuda:0
```

---

## Citation

If you use this model or code in your research, please cite:

```bibtex
@article{mrm2025,
  title   = {Meta Reward Modeling for Personalized Alignment},
  author  = {Author Names},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
```

---

## License

This model is released under the **MIT License**.