Model Card for psp-dada/Qwen2.5-Math-7B-Uni-DPO | ICLR 2026 | Uni-DPO:
A Unified Paradigm for Dynamic Preference Optimization of LLMs
News
- [2026.02.16] Code, data, and models are released!
- [2026.01.26] Our Uni-DPO is accepted by ICLR 2026!
Overview
Uni-DPO introduces a unified dynamic preference optimization paradigm for training large language models (LLMs) from preference data. Unlike prior DPO-based methods that treat all preference pairs equally, Uni-DPO jointly considers intrinsic data quality and model learning dynamics, enabling more effective and robust preference learning.
Key advantages:
- Quality-aware: Adaptively prioritizes high-quality preference pairs while down-weighting ambiguous ones.
- Dynamics-aware: Shifts training focus toward under-fitted samples to mitigate overfitting.
- Unified & lightweight: Seamlessly integrates dual-perspective weighting and calibrated NLL into standard DPO with minimal overhead (a rough sketch of the resulting objective follows this list).
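Read together, these properties suggest an objective of roughly the following shape. This is an illustrative sketch only, not the paper's exact formulation: `w_qual`, `w_perf`, and `lambda` are assumed names for the quality weight, the performance weight, and the calibrated-NLL coefficient.

```latex
% Illustrative sketch, not the paper's notation: w_qual, w_perf, and
% lambda are assumed names for the two weights and the NLL coefficient.
\mathcal{L}_{\text{Uni-DPO}}
  \approx -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      w_{\text{qual}} \cdot w_{\text{perf}} \cdot
      \log \sigma\!\big(\beta\,\Delta_\theta\big)
    \right] + \lambda\,\mathcal{L}_{\text{NLL}}^{\text{cal}},
\qquad
\Delta_\theta = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
              - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
```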
Key Features
- Dual-perspective dynamic weighting for preference optimization.
Uni-DPO jointly models what data is worth learning (intrinsic quality) and what the model still struggles with (learning dynamics). By combining a quality-aware weight and a performance-aware weight, Uni-DPO dynamically reallocates training focus throughout optimization; a code sketch of this combined weighting appears after this list.
- Quality-aware weighting filters ambiguous preference pairs.
Preference data varies widely in reliability. Uni-DPO leverages score margins between preferred and rejected responses to assign higher weights to clear, high-quality pairs while suppressing noisy or ambiguous ones.
- Performance-aware weighting mitigates overfitting during training.
High-quality samples are not always the most informative once the model has already mastered them. Uni-DPO introduces a stabilized focal-style performance weight that down-weights well-fitted pairs and emphasizes hard-but-informative examples, effectively reducing overfitting.
- Decoupling data quality from learning difficulty.
Empirical analysis reveals that data quality (score margin) and learning difficulty (reward margin) are weakly correlated. Uni-DPO explicitly models this mismatch, ensuring that optimization is guided by both dimensions rather than relying on either alone.
- State-of-the-art performance across text, math, and multimodal benchmarks.
Uni-DPO consistently outperforms DPO and SimPO across diverse settings.
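To make the dual-perspective weighting above concrete, here is a minimal, hypothetical sketch: a sigmoid quality weight over the annotator score margin combined with a focal-style performance weight over the model's implicit reward margin. All names and functional forms below are illustrative assumptions, not the released code's API.

```python
import torch
import torch.nn.functional as F

def uni_dpo_style_loss(policy_logps_w, policy_logps_l,
                       ref_logps_w, ref_logps_l,
                       score_margin, beta=0.1, gamma=2.0):
    """Illustrative sketch of a dual-perspective weighted DPO loss.

    NOT the released Uni-DPO implementation; it only mirrors the
    description above. `score_margin` is the annotator score gap
    between the chosen and rejected responses for each pair.
    """
    # Implicit reward margin, as in standard DPO.
    reward_margin = beta * ((policy_logps_w - ref_logps_w)
                            - (policy_logps_l - ref_logps_l))

    # Quality-aware weight: larger score margins => clearer, more
    # reliable pairs get higher weight; ambiguous pairs are suppressed.
    w_quality = torch.sigmoid(score_margin)

    # Performance-aware weight (focal-style): pairs the model already
    # fits well (sigma(reward_margin) near 1) are down-weighted.
    p_fit = torch.sigmoid(reward_margin)
    w_perf = (1.0 - p_fit).pow(gamma)

    # Weighted DPO logistic loss; the combined weight is detached so it
    # rescales gradients rather than being optimized itself.
    per_pair = -F.logsigmoid(reward_margin)
    weights = (w_quality * w_perf).detach()
    return (weights * per_pair).mean()
```

Detaching the combined weight is an assumption made here: it keeps the weights acting as gradient rescalers rather than optimization targets. The calibrated NLL term mentioned above would be added on top of this pairwise loss.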
How to use
For details about this model, please refer to the documentation in the GitHub repository.
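As a minimal sketch, the checkpoint can be loaded with the Hugging Face transformers library as shown below (generation settings are illustrative, and the chat template is assumed to ship with the checkpoint, as is typical for Qwen2.5 models; see the GitHub repo for the recommended prompting format):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "psp-dada/Qwen2.5-Math-7B-Uni-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```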
Citation
If you find our model, code, data, or paper helpful, please consider citing our paper and giving us a star!
```bibtex
@article{peng2025omni,
  title={Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs},
  author={Peng, Shangpin and Wang, Weinong and Tian, Zhuotao and Yang, Senqiao and Wu, Xing and Xu, Haotian and Zhang, Chengquan and Isobe, Takashi and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2506.10054},
  year={2025}
}
```
Contact us
If you have any questions, comments, or suggestions, please do not hesitate to submit an issue or PR to help advance research in this area.