---
license: other
license_name: usc-research
license_link: LICENSE
language:
- en
library_name: transformers
base_model: Qwen/Qwen2-Audio-7B-Instruct
tags:
- audio
- speech
- audio-llm
- paralinguistic
- pclm
- dpo
- voxparadox
pipeline_tag: audio-text-to-text
---
# Qwen2-Audio + PCLM + DPO
[![ICML 2026](https://img.shields.io/badge/ICML-2026-1d4ed8.svg)](https://icml.cc/Conferences/2026)
[![Paper](https://img.shields.io/badge/Paper-arXiv-AD1C18.svg)](https://arxiv.org/abs/2605.27772)
[![Project Page](https://img.shields.io/badge/Project-Page-0EA5E9.svg)](https://voxparadox.github.io/)
[![Code](https://img.shields.io/badge/GitHub-ihp--lab%2FVoxParadox-181717.svg?logo=github)](https://github.com/ihp-lab/VoxParadox)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-IHP--Lab%2FVoxParadox-FFD21E.svg)](https://huggingface.co/datasets/IHP-Lab/VoxParadox)
[![AF3 + PCLM + DPO](https://img.shields.io/badge/🤗%20Sibling%20model-AF3+PCLM+DPO-FFD21E.svg)](https://huggingface.co/IHP-Lab/AF3_PCLM_DPO)
[![License](https://img.shields.io/badge/License-USC%20Research-228B22.svg)](LICENSE)
PCLM- and DPO-finetuned [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) from
*Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox*
(ICML 2026).
The base model is augmented with the **Prompt-Conditioned Layer Mixer (PCLM)**, a lightweight module that
adaptively mixes representations from intermediate audio-encoder layers based on the user prompt. It is then
post-trained with **Direct Preference Optimization (DPO)** to prefer acoustically grounded answers over
language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
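
For readers unfamiliar with the DPO stage, the sketch below computes the standard DPO loss for a single preference pair, where the "chosen" answer would be the acoustically grounded option and the "rejected" answer the language-implied one. It is a minimal illustration of the general objective, not the training code from the release repo: the function names, the `beta=0.1` default, and the use of summed sequence log-probabilities are all assumptions made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the log-probability of a full answer under either the
    policy being trained or the frozen reference model. `beta` is a
    hypothetical default, not a value taken from the paper.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has shifted probability toward the chosen answer: the loss drops.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss falls below its `log(2)` baseline only when the policy raises the chosen answer's likelihood relative to the rejected one by more than the reference model does.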
## Usage
This checkpoint cannot be loaded with stock `transformers`: PCLM requires the custom
modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).
```bash
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox
conda create -n qwen2audio python=3.10 -y && conda activate qwen2audio
pip install torch torchaudio transformers accelerate librosa soundfile
```
Inference on VoxParadox (or any MCQ JSON in the same schema):
```bash
python -m qwen2audio.eval.run_eval \
--model_path IHP-Lab/Qwen2-Audio_PCLM_DPO \
--data_path /path/to/voxparadox.json \
--audio_base /path/to/audio_root \
--output_dir runs/eval/qwen2audio_pclm_dpo
```
Score with the dataset-shipped `eval.py`:
```bash
python eval.py --predictions runs/eval/qwen2audio_pclm_dpo/predictions.jsonl
```
The loader auto-detects `use_pclm=True` from `config.json` and activates PCLM with
`expose_layers=[5, 15, 25, 30]` over the audio encoder.
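
To make the idea of prompt-conditioned mixing over the exposed layers concrete, here is a dependency-free sketch that blends features from the four layers with softmax weights derived from a prompt embedding. This is only one plausible reading of the description in this card. The gating form (one dot product per layer followed by a softmax), the names `mix_layers` and `gate_weights`, and the data layout are assumptions; the actual PCLM module lives in the release repo and may differ.

```python
import math

EXPOSE_LAYERS = [5, 15, 25, 30]  # layer indices quoted in this card


def softmax(xs):
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states, prompt_vec, gate_weights):
    """Blend per-layer audio features using prompt-dependent weights.

    layer_states: dict mapping layer index -> list of frames (lists of floats)
    prompt_vec:   pooled prompt embedding (list of floats)
    gate_weights: dict mapping layer index -> gate vector with the same length
                  as prompt_vec (hypothetical learned parameters)
    """
    # One scalar logit per exposed layer: dot(gate vector, prompt embedding).
    logits = [
        sum(w * p for w, p in zip(gate_weights[layer], prompt_vec))
        for layer in EXPOSE_LAYERS
    ]
    alphas = softmax(logits)

    first = layer_states[EXPOSE_LAYERS[0]]
    n_frames, dim = len(first), len(first[0])
    mixed = [[0.0] * dim for _ in range(n_frames)]
    # Weighted sum of the exposed layers, frame by frame.
    for alpha, layer in zip(alphas, EXPOSE_LAYERS):
        for t in range(n_frames):
            for d in range(dim):
                mixed[t][d] += alpha * layer_states[layer][t][d]
    return mixed, alphas


# Toy input: one frame of width 1 per layer, valued by its layer index.
states = {layer: [[float(layer)]] for layer in EXPOSE_LAYERS}
zero_gates = {layer: [0.0, 0.0] for layer in EXPOSE_LAYERS}
mixed, alphas = mix_layers(states, [1.0, -1.0], zero_gates)
```

With all-zero gates the weights are uniform, so the mixed feature is the plain average of the four layers. Non-zero gates let different prompts emphasize different encoder depths.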
## Project resources
| Resource | Link |
|---|---|
| Paper (arXiv) | <https://arxiv.org/abs/2605.27772> |
| Project page | <https://voxparadox.github.io/> |
| Code | <https://github.com/ihp-lab/VoxParadox> |
| Benchmark | <https://huggingface.co/datasets/IHP-Lab/VoxParadox> |
| Sibling model (AF3) | <https://huggingface.co/IHP-Lab/AF3_PCLM_DPO> |
## Citation
```bibtex
@inproceedings{pang2026voxparadox,
title = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
author = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026}
}
```
## License
USC Research License (research / non-profit only). See [`LICENSE`](LICENSE).
The base model (`Qwen/Qwen2-Audio-7B-Instruct`) carries its own Tongyi Qianwen license terms,
which continue to apply to the inherited weights.