---
base_model:
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
datasets:
- HannahRoseKirk/prism-alignment
license: mit
pipeline_tag: text-classification
---

# Meta Reward Modeling (MRM)

## Overview

**Meta Reward Modeling (MRM)** is a personalized reward modeling framework designed to adapt to diverse user preferences with limited feedback. This repository provides trained checkpoints as described in the paper [One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment](https://huggingface.co/papers/2601.18731).

Instead of learning a single global reward function, MRM treats each user as a separate learning task and applies a meta-learning approach to learn a shared initialization that enables fast, few-shot personalization.

MRM represents user-specific rewards as adaptive combinations over shared base reward functions and optimizes this structure through a bi-level meta-learning framework. To improve robustness across heterogeneous users, MRM introduces a **Robust Personalization Objective (RPO)** that emphasizes hard-to-learn users during meta-training.
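
The adaptive-combination structure can be sketched as follows. This is a conceptual illustration, not the repository's actual `MRM` class: the number of base reward functions, the linear form of each base, and the softmax combination are all assumptions made for clarity; only the idea of adapting a small per-user weight vector (here named `shared_weight`, matching the usage example below) over frozen shared bases comes from the description above.

```python
import torch
import torch.nn as nn

class MRMHeadSketch(nn.Module):
    """Conceptual sketch of MRM's reward head: a user's reward is a
    learned combination of shared base reward functions. Names, shapes,
    and the softmax mixing are illustrative assumptions."""

    def __init__(self, in_dim=4096, num_bases=8):
        super().__init__()
        # Shared base reward functions f_1..f_K, meta-learned across users.
        self.bases = nn.ModuleList(nn.Linear(in_dim, 1) for _ in range(num_bases))
        # Per-user combination weights; only these are updated when
        # adapting to a new user's few-shot preference data.
        self.shared_weight = nn.Parameter(torch.zeros(num_bases))

    def forward(self, emb):
        # emb: (batch, in_dim) response embeddings from the backbone LLM.
        base_scores = torch.cat([f(emb) for f in self.bases], dim=-1)   # (batch, K)
        return base_scores @ torch.softmax(self.shared_weight, dim=0)  # (batch,)
```

Freezing the bases and adapting only the combination weights keeps the per-user inner loop cheap, which is what makes few-shot personalization tractable.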

---

## Links

- 📄 **arXiv Paper**: https://arxiv.org/abs/2601.18731  
- 🤗 **Hugging Face Paper**: https://huggingface.co/papers/2601.18731  
- 💻 **GitHub Code**: https://github.com/ModalityDance/MRM  
- 📦 **Hugging Face Collection**: https://huggingface.co/collections/ModalityDance/mrm

---

## Evaluation

The model is evaluated using user-level preference accuracy with few-shot personalization.  
Inference follows the same adaptation procedure used during training: for each user, the reward weights are initialized from the meta-learned initialization and updated with a small number of gradient steps on user-specific preference data.
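
Concretely, the metric can be sketched as the fraction of held-out pairs for which the adapted model scores the chosen response above the rejected one, averaged per user so that each user contributes equally. This is a minimal sketch of the metric as described above; the repository's own evaluation code may differ in detail.

```python
import torch

def preference_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where reward(chosen) > reward(rejected)."""
    return (chosen_scores > rejected_scores).float().mean().item()

def user_level_accuracy(per_user_scores):
    """Average per-user accuracies so every user counts equally,
    regardless of how many held-out pairs they have."""
    accs = [preference_accuracy(c, r) for c, r in per_user_scores]
    return sum(accs) / len(accs)
```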

### Example evaluation script

```bash
python inference.py \
  --embed_pt data/emb/prism/V2.pt \
  --meta_json data/emb/prism/V2.json \
  --ckpt path/to/checkpoint.pt \
  --dataset PRISM \
  --seen_train_limit -1 \
  --unseen_train_limit -1 \
  --hidden_layers 2 \
  --inner_lr 1e-3 \
  --eval_inner_epochs 1 \
  --val_ratio 0.9 \
  --score_threshold -1 \
  --seed 42 \
  --device cuda:0
```

---

## Usage Example

This example shows a typical workflow for a **single user**:
1) encode text pairs with Skywork/Skywork-Reward-V2-Llama-3.1-8B into embeddings,
2) adapt the MRM on the user's few-shot examples (update `shared_weight` only),
3) run inference on new pairs for that same user.

```python
import torch
from copy import deepcopy
from transformers import AutoTokenizer, AutoModelForSequenceClassification

from utils import bt_loss
from train import MRM
from inference import load_ckpt_into_model


@torch.no_grad()
def encode_pairs(model, tokenizer, pairs, device="cuda"):
    model.eval()
    ch, rj = [], []
    for ex in pairs:
        conv = ex["prompt"]
        for key, buf in [("chosen", ch), ("rejected", rj)]:
            ids = tokenizer.apply_chat_template(
                conv + [{"role": "assistant", "content": ex[key]}],
                tokenize=True, return_tensors="pt"
            ).to(device)
            out = model(ids, output_hidden_states=True)
            buf.append(out.hidden_states[-1][0, -1].float().cpu())
    return torch.stack(ch), torch.stack(rj)


def adapt_single_user(base_model, support_ch, support_rj, inner_lr=1e-3, inner_epochs=5, device="cuda"):
    model = deepcopy(base_model).to(device).train()
    opt = torch.optim.Adam([model.shared_weight], lr=inner_lr)
    support_ch, support_rj = support_ch.to(device), support_rj.to(device)
    for _ in range(inner_epochs):
        opt.zero_grad()
        loss = bt_loss(model(support_ch), model(support_rj))
        loss.backward()
        opt.step()
    return model.eval()


@torch.no_grad()
def infer_on_pairs(model, ch, rj, device="cuda"):
    return model(ch.to(device)), model(rj.to(device))


device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_PATH = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
llm = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH, num_labels=1, torch_dtype=torch.bfloat16, device_map=device
)

CKPT_PATH = "ckpt/model.pt"
mrm = MRM(in_dim=4096, hidden_sizes=[2], use_bias=False)
load_ckpt_into_model(mrm, CKPT_PATH, device)

support_pairs = [
    {
        "prompt": [{"role": "user", "content": "TL;DR this post: I tried waking up at 5am for a month and tracked my productivity."}],
        "chosen": "Waking up early helped at first, but long-term productivity depended more on sleep quality than wake-up time.",
        "rejected": "The post is about waking up early and productivity.",
    },
    {
        "prompt": [{"role": "user", "content": "Summarize the main point: I switched from iPhone to Android after 10 years."}],
        "chosen": "The author values customization and battery life more than ecosystem lock-in, which motivated the switch.",
        "rejected": "The author bought a new phone.",
    },
]

sup_ch, sup_rj = encode_pairs(llm, tokenizer, support_pairs, device)
user_mrm = adapt_single_user(mrm, sup_ch, sup_rj, device=device)

test_pairs = [
    {
        "prompt": [{"role": "user", "content": "TL;DR: I quit my job to freelance and here is what I learned in 6 months."}],
        "chosen": "Freelancing offers flexibility but requires strong self-discipline and financial planning to be sustainable.",
        "rejected": "The author talks about quitting a job and freelancing.",
    }
]

test_ch, test_rj = encode_pairs(llm, tokenizer, test_pairs, device)
s_ch, s_rj = infer_on_pairs(user_mrm, test_ch, test_rj, device)

print("reward(chosen)  =", s_ch.tolist())
print("reward(rejected)=", s_rj.tolist())
```
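
For reference, the `bt_loss` imported from `utils` above is the standard Bradley-Terry preference loss used for the inner-loop adaptation. A minimal version would look like the sketch below; the repository's implementation may differ in detail (e.g. reduction or numerical handling).

```python
import torch
import torch.nn.functional as F

def bt_loss_sketch(chosen_scores, rejected_scores):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected),
    averaged over pairs. A sketch, not the repo's exact bt_loss."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Minimizing this loss pushes the adapted reward model to score each user's chosen response above the rejected one by a growing margin.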

---

## Citation

If you use this model or code in your research, please cite:

```bibtex
@misc{cai2026adaptsanymetareward,
      title={One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment}, 
      author={Hongru Cai and Yongqi Li and Tiezheng Yu and Fengbin Zhu and Wenjie Wang and Fuli Feng and Wenjie Li},
      year={2026},
      eprint={2601.18731},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.18731}, 
}
```

---

## License

This model is released under the **MIT License**.