This is the model checkpoint for the ACL 2025 paper "Aligning Large Language Models with Implicit Preferences from User-Generated Content" (https://arxiv.org/abs/2506.04463).

The model is fine-tuned from Mistral-7B-Instruct-v0.2 with Direct Preference Optimization (DPO), using preference data generated from unlabeled user-generated content (UGC) with the PUGC framework described in the paper.
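
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss for a single (chosen, rejected) response pair. It is not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the inputs are assumed to be summed token log-probabilities of each response under the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All arguments except beta are sequence log-probabilities.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the function.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

When the policy and reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that as the policy favors the chosen response more strongly than the reference model does.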

If you find this model helpful to your research, please cite the following paper:

@inproceedings{tan-etal-2025-aligning,
    title = "Aligning Large Language Models with Implicit Preferences from User-Generated Content",
    author = "Tan, Zhaoxuan  and
      Li, Zheng  and
      Liu, Tianyi  and
      Wang, Haodong  and
      Yun, Hyokun  and
      Zeng, Ming  and
      Chen, Pei  and
      Zhang, Zhihan  and
      Gao, Yifan  and
      Wang, Ruijie  and
      Nigam, Priyanka  and
      Yin, Bing  and
      Jiang, Meng",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.384/",
    doi = "10.18653/v1/2025.acl-long.384",
    pages = "7792--7820",
    ISBN = "979-8-89176-251-0",
    abstract = "Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that has the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37{\%} performance improvement over traditional methods, setting a 35.93{\%} state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/."
}