---
license: mit
datasets:
- openai/summarize_from_feedback
metrics:
- accuracy
base_model:
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
---

# Meta Reward Modeling (MRM)

## Overview

**Meta Reward Modeling (MRM)** is a personalized reward modeling framework designed to adapt to diverse user preferences with limited feedback.
Instead of learning a single global reward function, MRM treats each user as a separate learning task and applies a meta-learning approach to learn a shared initialization that enables fast, few-shot personalization.

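The few-shot personalization step can be sketched in a few lines of Python. This is only an illustrative MAML-style inner loop under simplifying assumptions (a linear reward over fixed item embeddings and a Bradley-Terry preference loss); `personalize` and all shapes here are hypothetical, not the repository's API:

```python
import numpy as np

def personalize(w0, emb, prefs, lr=0.1, steps=5):
    """Adapt the shared initialization w0 to one user with a few
    gradient steps on that user's preference pairs.

    emb:   (n_items, d) fixed item embeddings
    prefs: (i, j) pairs meaning the user prefers item i over item j
    """
    w = w0.copy()
    for _ in range(steps):
        grad = np.zeros_like(w)
        for i, j in prefs:
            d = emb[i] - emb[j]                 # embedding difference
            p = 1.0 / (1.0 + np.exp(-(w @ d)))  # Bradley-Terry P(i > j)
            grad += (p - 1.0) * d               # gradient of -log p
        w -= lr * grad / max(len(prefs), 1)
    return w

# Two items; the user prefers item 0 in a single feedback pair.
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
w_user = personalize(np.zeros(2), emb, prefs=[(0, 1)])
# After adaptation, the user-specific reward ranks item 0 above item 1.
```

Meta-training would then optimize the initialization `w0` so that a handful of such inner steps suffice for a new user.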
MRM represents user-specific rewards as adaptive combinations over shared base reward functions and optimizes this structure through a bi-level meta-learning framework.
To improve robustness across heterogeneous users, MRM introduces a **Robust Personalization Objective (RPO)** that emphasizes hard-to-learn users during meta-training.

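The exact RPO formulation is not spelled out in this card; as one hedged illustration of "emphasizing hard-to-learn users", the per-user meta-losses can be softmax-reweighted so that users with high post-adaptation loss contribute more to the meta-objective. The function name and the temperature `tau` are assumptions for exposition:

```python
import numpy as np

def robust_user_weights(user_losses, tau=1.0):
    """Upweight hard-to-learn users: softmax over per-user
    post-adaptation losses (tau controls how sharp the emphasis is)."""
    z = np.asarray(user_losses, dtype=float) / tau
    z -= z.max()                  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# The user with the largest adaptation loss (0.9) gets the largest
# weight in a meta-objective of the form sum(w * per_user_losses).
weights = robust_user_weights([0.2, 0.9, 0.5])
```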
This repository provides trained checkpoints for reward modeling and user-level preference evaluation.

---

## Links

- 📄 **arXiv Paper**: https://arxiv.org/abs/XXXX.XXXXX
- 🤗 **Hugging Face Paper**: https://huggingface.co/papers/XXXX.XXXXX
- 💻 **GitHub Code**: https://github.com/ModalityDance/MRM
- 📦 **Hugging Face Collection**: https://huggingface.co/collections/ModalityDance/mrm

---

## Evaluation

The model is evaluated using user-level preference accuracy with few-shot personalization.
Inference follows the same adaptation procedure used during training: for each user, the reward weights are initialized from the meta-learned initialization and updated with a small number of gradient steps on user-specific preference data.

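User-level preference accuracy can then be computed as the fraction of held-out pairs that the adapted reward ranks correctly. The sketch below assumes a linear user-specific reward over item embeddings; `preference_accuracy` and the shapes are illustrative, not the repository's API:

```python
import numpy as np

def preference_accuracy(w_user, emb, eval_pairs):
    """Fraction of held-out pairs (i, j), with i the preferred item,
    that the adapted user reward ranks correctly."""
    scores = emb @ w_user                  # reward score for each item
    correct = sum(scores[i] > scores[j] for i, j in eval_pairs)
    return correct / len(eval_pairs)

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
w = np.array([1.0, -1.0])                  # adapted weights for one user
# Item scores are [1, -1, 0]; both held-out pairs are ranked correctly.
acc = preference_accuracy(w, emb, [(0, 1), (2, 1)])  # acc == 1.0
```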
### Example evaluation script

```bash
python inference.py \
  --embed_pt data/emb/reddit/V2.pt \
  --meta_json data/emb/reddit/V2.json \
  --ckpt path/to/checkpoint.pt \
  --dataset REDDIT \
  --seen_train_limit 100 \
  --unseen_train_limit 50 \
  --hidden_layers 2 \
  --inner_lr 5e-3 \
  --eval_inner_epochs 1 \
  --val_ratio 0.9 \
  --score_threshold -1 \
  --seed 42 \
  --device cuda:0
```

---

## Citation

If you use this model or code in your research, please cite:

```bibtex
@article{mrm2025,
  title   = {Meta Reward Modeling for Personalized Alignment},
  author  = {Author Names},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
```

---

## License

This model is released under the **MIT License**.