Add model card for Personalized-Qwen2.5-7B-Instruct
#1
by nielsr HF Staff - opened
README.md
ADDED
@@ -0,0 +1,34 @@
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# Personalized-Qwen2.5-7B-Instruct
This repository hosts the `Personalized-Qwen2.5-7B-Instruct` model, developed as part of the research presented in the paper:

**[Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning](https://huggingface.co/papers/2510.18849)**

## Abstract

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. The personalized Qwen2.5-7B model achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
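The critique-then-revise loop described in the abstract can be sketched schematically. The function names, signatures, and return shapes below are illustrative assumptions for exposition, not the paper's actual training code:

```python
from typing import Callable, Dict, Tuple

def critique_post_edit_step(
    generate: Callable[[str], str],
    grm: Callable[[str, str], Tuple[Dict[str, float], str]],
    post_edit: Callable[[str, str, str], str],
    prompt: str,
) -> Tuple[str, str, Dict[str, float]]:
    """One schematic rollout step: draft -> critique -> self-revision."""
    draft = generate(prompt)                      # policy model produces an initial response
    scores, critique = grm(prompt, draft)         # GRM returns multi-dimensional scores + a textual critique
    revised = post_edit(prompt, draft, critique)  # policy revises its own output guided by the critique
    return draft, revised, scores
```

In the actual framework, the revised outputs and GRM scores feed back into PPO updates; this sketch only shows the data flow of a single rollout.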
| 15 |
+
|
| 16 |
+
## Project Overview
|
| 17 |
+
This project provides a complete pipeline for training and evaluating large language models using the **Critique-Post-Edit** method. It includes scripts and configurations for Supervised Fine-Tuning (SFT) and Reinforcement Learning (PPO), leveraging powerful open-source frameworks like `LLaMA-Factory` and `verl`. The evaluation is conducted using `AlpacaEval` to ensure fair and comprehensive assessment of model performance. Our released models, including `Personalized-Qwen2.5-7B-Instruct` and `Personalized-Qwen2.5-14B-Instruct`, demonstrate significant improvements over the baseline models.
|
| 18 |
+
|
| 19 |
+
For more details on the framework, training, and evaluation, please refer to the [official GitHub repository](https://github.com/OPPO-PersonalAI/Critique-Post-Edit).
|
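Since the card's metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, the checkpoint should load with the standard `transformers` chat API. A minimal sketch (the Hub organization in the repo id is a placeholder, and using the system message to carry the persona/preference description is an assumption, not something the card specifies):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def chat(model_id: str, system: str, user: str, max_new_tokens: int = 256) -> str:
    """Generate one chat-formatted reply from a Hub checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    messages = [
        {"role": "system", "content": system},  # persona / user-preference description
        {"role": "user", "content": user},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Example call (replace `<org>` with the actual Hub organization hosting this model):
`chat("<org>/Personalized-Qwen2.5-7B-Instruct", "You prefer concise, bullet-point answers.", "Explain unit testing.")`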
## License

This project is licensed under the Apache License, Version 2.0.
## Citation

If you find this work helpful, please consider citing the original paper:

```bibtex
@article{zhu2025towards,
  title={Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning},
  author={Zhu, Chenghao and Tao, Meiling and Wang, Tiannan and Ding, Dongyi and Jiang, Yuchen Eleanor and Zhou, Wangchunshu},
  journal={arXiv preprint arXiv:2510.18849},
  year={2025}
}
```