An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
Abstract
Preference tuning of language models shows varying generalization capabilities under domain shift, with pseudo-labeling adaptation strategies effectively reducing performance degradation in summarization and question-answering tasks.
Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference tuning degrades performance and reduces helpfulness when models are evaluated outside their training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this gap with a comprehensive, systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and several source-to-target adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.
Community
Our paper presents a systematic study of preference optimization under domain shift. We compare five popular alignment objectives and several source-to-target adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. We found:
- The adaptation strategy is more influential than the alignment objective.
- Synthetic supervision is a double-edged sword: while pseudo-labeling yields the highest target-domain win rates, it also induces severe mode collapse. This "diversity tax" produces models that are highly reliable but linguistically monotonous, mirroring the latent templates of the teacher model.
- Our findings suggest a deployment recommendation: use pseudo-labeling for high-stakes and constrained tasks where reliability is paramount, but favor mixed-domain SFT and online RL for applications requiring creative or varied linguistic expression.
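The pseudo-labeling strategy above can be sketched in a few lines. This is a minimal illustration, not the paper's exact pipeline: the generator and judge below are toy stand-ins (the paper does not specify these function names), and in practice they would be a target-domain sampling policy and a source-trained reward model or teacher.

```python
import random

def make_pseudo_preference_pairs(prompts, generate, score):
    """For each target-domain prompt, sample two candidate responses,
    score them with a source-trained judge, and keep the scored pair as a
    pseudo-labeled (chosen, rejected) preference example for DPO-style tuning."""
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        # Ties default to the first sample; a real pipeline might
        # discard low-margin pairs instead.
        chosen, rejected = (a, b) if score(prompt, a) >= score(prompt, b) else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

if __name__ == "__main__":
    random.seed(0)
    # Toy stand-ins: a "generator" emitting variable-length answers and a
    # "judge" that prefers longer ones (a placeholder for a reward model).
    gen = lambda p: p + " answer" * random.randint(1, 5)
    judge = lambda p, r: len(r)
    for ex in make_pseudo_preference_pairs(["q1", "q2"], gen, judge):
        assert judge(ex["prompt"], ex["chosen"]) >= judge(ex["prompt"], ex["rejected"])
```

The resulting `{"prompt", "chosen", "rejected"}` records match the usual preference-pair format, which is one reason this adaptation route composes cleanly with existing offline alignment objectives; the mode-collapse caveat above comes from every "chosen" response reflecting the same teacher's preferences.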
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs (2025)
- SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment (2025)
- Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting (2025)
- Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision (2025)
- Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans (2025)
- PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation (2025)
- AMIR-GRPO: Inducing Implicit Preference Signals into GRPO (2026)