Model Card for psp-dada/Qwen2.5-Math-7B-Uni-DPO | ICLR 2026 | Uni-DPO:
A Unified Paradigm for Dynamic Preference Optimization of LLMs
News
- [2026.02.16] Code, data, and models are released!
- [2026.01.26] Our Uni-DPO is accepted by ICLR 2026!
Overview
Uni-DPO introduces a unified dynamic preference optimization paradigm for training large language models (LLMs) from preference data. Unlike prior DPO-based methods that treat all preference pairs equally, Uni-DPO jointly considers intrinsic data quality and model learning dynamics, enabling more effective and robust preference learning.
Key advantages:
- Quality-aware: Adaptively prioritizes high-quality preference pairs while down-weighting ambiguous ones.
- Dynamics-aware: Shifts training focus toward under-fitted samples to mitigate overfitting.
- Unified & lightweight: Seamlessly integrates dual-perspective weighting and calibrated NLL into standard DPO with minimal overhead (a rough sketch of the resulting objective follows this list).
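Read together, these properties suggest an objective of roughly the following shape. This is an illustrative sketch only, not the paper's exact formulation: `w_qual`, `w_perf`, and `lambda` are assumed names for the quality weight, the performance weight, and the calibrated-NLL coefficient.

```latex
% Illustrative sketch, not the paper's notation: w_qual, w_perf, and
% lambda are assumed names for the two weights and the NLL coefficient.
\mathcal{L}_{\text{Uni-DPO}}
  \approx -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      w_{\text{qual}} \cdot w_{\text{perf}} \cdot
      \log \sigma\!\big(\beta\,\Delta_\theta\big)
    \right] + \lambda\,\mathcal{L}_{\text{NLL}}^{\text{cal}},
\qquad
\Delta_\theta = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
              - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
```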
Key Features
- Dual-perspective dynamic weighting for preference optimization.
Uni-DPO jointly models what data is worth learning (intrinsic quality) and what the model still struggles with (learning dynamics). By combining a quality-aware weight and a performance-aware weight, Uni-DPO dynamically reallocates training focus throughout optimization; a code sketch of this combined weighting appears after this list.
- Quality-aware weighting filters ambiguous preference pairs.
Preference data varies widely in reliability. Uni-DPO leverages score margins between preferred and rejected responses to assign higher weights to clear, high-quality pairs while suppressing noisy or ambiguous ones.
- Performance-aware weighting mitigates overfitting during training.
High-quality samples are not always the most informative once the model has already mastered them. Uni-DPO introduces a stabilized focal-style performance weight that down-weights well-fitted pairs and emphasizes hard-but-informative examples, effectively reducing overfitting.
- Decoupling data quality from learning difficulty.
Empirical analysis reveals that data quality (score margin) and learning difficulty (reward margin) are weakly correlated. Uni-DPO explicitly models this mismatch, ensuring that optimization is guided by both dimensions rather than relying on either alone.
- State-of-the-art performance across text, math, and multimodal benchmarks.
Uni-DPO consistently outperforms DPO and SimPO across diverse settings.
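To make the dual-perspective weighting above concrete, here is a minimal, hypothetical sketch: a sigmoid quality weight over the annotator score margin combined with a focal-style performance weight over the model's implicit reward margin. All names and functional forms below are illustrative assumptions, not the released code's API.

```python
import torch
import torch.nn.functional as F

def uni_dpo_style_loss(policy_logps_w, policy_logps_l,
                       ref_logps_w, ref_logps_l,
                       score_margin, beta=0.1, gamma=2.0):
    """Illustrative sketch of a dual-perspective weighted DPO loss.

    NOT the released Uni-DPO implementation; it only mirrors the
    description above. `score_margin` is the annotator score gap
    between the chosen and rejected responses for each pair.
    """
    # Implicit reward margin, as in standard DPO.
    reward_margin = beta * ((policy_logps_w - ref_logps_w)
                            - (policy_logps_l - ref_logps_l))

    # Quality-aware weight: larger score margins => clearer, more
    # reliable pairs get higher weight; ambiguous pairs are suppressed.
    w_quality = torch.sigmoid(score_margin)

    # Performance-aware weight (focal-style): pairs the model already
    # fits well (sigma(reward_margin) near 1) are down-weighted.
    p_fit = torch.sigmoid(reward_margin)
    w_perf = (1.0 - p_fit).pow(gamma)

    # Weighted DPO logistic loss; the combined weight is detached so it
    # rescales gradients rather than being optimized itself.
    per_pair = -F.logsigmoid(reward_margin)
    weights = (w_quality * w_perf).detach()
    return (weights * per_pair).mean()
```

Detaching the combined weight is an assumption made here: it keeps the weights acting as gradient rescalers rather than optimization targets. The calibrated NLL term mentioned above would be added on top of this pairwise loss.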
How to use
For details about this model, please refer to the documentation in the GitHub repository.
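As a minimal sketch, the checkpoint can be loaded with the Hugging Face transformers library as shown below (generation settings are illustrative, and the chat template is assumed to ship with the checkpoint, as is typical for Qwen2.5 models; see the GitHub repo for the recommended prompting format):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "psp-dada/Qwen2.5-Math-7B-Uni-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```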
Citation
If you find our model, code, data, or paper helpful, please consider citing our paper and giving us a star!
```bibtex
@article{peng2025omni,
  title={Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs},
  author={Peng, Shangpin and Wang, Weinong and Tian, Zhuotao and Yang, Senqiao and Wu, Xing and Xu, Haotian and Zhang, Chengquan and Isobe, Takashi and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2506.10054},
  year={2025}
}
```
Contact us
If you have any questions, comments, or suggestions, please do not hesitate to submit an issue or PR to help advance research in this area.