An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
Abstract
Preference tuning of language models shows varying generalization capabilities under domain shift, with pseudo-labeling adaptation strategies effectively reducing performance degradation in summarization and question-answering tasks.
Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference tuning degrades performance and reduces helpfulness when models are evaluated outside their training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this gap with a comprehensive, systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and several source-to-target adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.
Community
Our paper presents a systematic study of preference optimization under domain shift. We compare five popular alignment objectives and several source-to-target adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. We found:
- The adaptation strategy is more influential than the alignment objective.
- Synthetic supervision is a double-edged sword: while pseudo-labeling yields the highest target-domain win rates, it also induces severe mode collapse. This "diversity tax" produces models that are highly reliable but linguistically monotonous, mirroring the latent templates of the teacher model.
- Our findings suggest a deployment recommendation: use pseudo-labeling for high-stakes and constrained tasks where reliability is paramount, but favor mixed-domain SFT and online RL for applications requiring creative or varied linguistic expression.
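The pseudo-labeling strategy above can be sketched in a few lines. This is a minimal illustration, not the paper's exact pipeline: the generator and judge below are toy stand-ins (the paper does not specify these function names), and in practice they would be a target-domain sampling policy and a source-trained reward model or teacher.

```python
import random

def make_pseudo_preference_pairs(prompts, generate, score):
    """For each target-domain prompt, sample two candidate responses,
    score them with a source-trained judge, and keep the scored pair as a
    pseudo-labeled (chosen, rejected) preference example for DPO-style tuning."""
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        # Ties default to the first sample; a real pipeline might
        # discard low-margin pairs instead.
        chosen, rejected = (a, b) if score(prompt, a) >= score(prompt, b) else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

if __name__ == "__main__":
    random.seed(0)
    # Toy stand-ins: a "generator" emitting variable-length answers and a
    # "judge" that prefers longer ones (a placeholder for a reward model).
    gen = lambda p: p + " answer" * random.randint(1, 5)
    judge = lambda p, r: len(r)
    for ex in make_pseudo_preference_pairs(["q1", "q2"], gen, judge):
        assert judge(ex["prompt"], ex["chosen"]) >= judge(ex["prompt"], ex["rejected"])
```

The resulting `{"prompt", "chosen", "rejected"}` records match the usual preference-pair format, which is one reason this adaptation route composes cleanly with existing offline alignment objectives; the mode-collapse caveat above comes from every "chosen" response reflecting the same teacher's preferences.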
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs (2025)
- SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment (2025)
- Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting (2025)
- Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision (2025)
- Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans (2025)
- PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation (2025)
- AMIR-GRPO: Inducing Implicit Preference Signals into GRPO (2026)