---
license: other
license_name: usc-research
license_link: LICENSE
language:
- en
library_name: transformers
base_model: Qwen/Qwen2-Audio-7B-Instruct
tags:
- audio
- speech
- audio-llm
- paralinguistic
- pclm
- dpo
- voxparadox
pipeline_tag: audio-text-to-text
---
# Qwen2-Audio + PCLM + DPO

PCLM- and DPO-finetuned [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) from
*Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox*
(ICML 2026).

The base model is augmented with the **Prompt-Conditioned Layer Mixer (PCLM)**, a lightweight module that
adaptively mixes representations from intermediate audio-encoder layers based on the user prompt. It is then
post-trained with **Direct Preference Optimization (DPO)** to prefer acoustically grounded answers over
language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
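
For readers unfamiliar with DPO, the preference objective mentioned above can be sketched in a few lines. This is the standard DPO loss for a single preference pair, not the authors' training code. It assumes the inputs are summed token log-probabilities of the acoustically grounded ("chosen") and language-implied ("rejected") answers under the trained policy and a frozen reference model. The function names and the `beta=0.1` default are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) answer pair."""
    # Implicit rewards: how much more likely each answer is under the
    # policy than under the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen answer gains margin over the rejected one.
    margin = beta * (chosen_reward - rejected_reward)
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen answer's likelihood (and lowering the rejected one's) reduces the loss.
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

Minimizing this loss over pairs pushes the model toward the answer supported by the audio and away from the answer suggested by the words alone.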

## Usage

This checkpoint cannot be loaded with stock `transformers`: PCLM requires the custom
modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).

```bash
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox
conda create -n qwen2audio python=3.10 -y && conda activate qwen2audio
pip install torch torchaudio transformers accelerate librosa soundfile
```

Inference on VoxParadox (or any MCQ JSON in the same schema):

```bash
python -m qwen2audio.eval.run_eval \
  --model_path IHP-Lab/Qwen2-Audio_PCLM_DPO \
  --data_path /path/to/voxparadox.json \
  --audio_base /path/to/audio_root \
  --output_dir runs/eval/qwen2audio_pclm_dpo
```

Score with the dataset-shipped `eval.py`:

```bash
python eval.py --predictions runs/eval/qwen2audio_pclm_dpo/predictions.jsonl
```

The loader auto-detects `use_pclm=True` from `config.json` and activates PCLM with
`expose_layers=[5, 15, 25, 30]` over the audio encoder.
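
To give a rough intuition for what mixing over the exposed layers means, here is a minimal pure-Python sketch. It is not the PCLM implementation from the release repo. It assumes a single linear gate per exposed layer that scores a prompt embedding, a softmax over those scores, and a weighted sum of the layers' hidden states. `mix_layers`, `gate_weights`, and the toy shapes are hypothetical names and values chosen for illustration. Only the layer indices come from this card.

```python
import math
from typing import Dict, List, Sequence

EXPOSE_LAYERS = [5, 15, 25, 30]  # layer indices quoted in this model card


def softmax(logits: Sequence[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(
    layer_states: Dict[int, List[List[float]]],  # layer -> [frames][dim]
    prompt_vec: Sequence[float],                 # prompt embedding (assumed given)
    gate_weights: Dict[int, Sequence[float]],    # hypothetical linear gate per layer
) -> List[List[float]]:
    """Prompt-conditioned convex combination of the exposed layers' states."""
    layers = sorted(layer_states)
    # One scalar score per layer, computed from the prompt embedding.
    logits = [
        sum(w * p for w, p in zip(gate_weights[layer], prompt_vec))
        for layer in layers
    ]
    weights = softmax(logits)
    n_frames = len(layer_states[layers[0]])
    dim = len(layer_states[layers[0]][0])
    mixed = [[0.0] * dim for _ in range(n_frames)]
    for weight, layer in zip(weights, layers):
        for t in range(n_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer_states[layer][t][d]
    return mixed


# Toy input: every feature of layer k equals k, with 2 frames and 3 dimensions.
states = {layer: [[float(layer)] * 3 for _ in range(2)] for layer in EXPOSE_LAYERS}
uniform_gate = {layer: [0.0, 0.0] for layer in EXPOSE_LAYERS}
mixed = mix_layers(states, [1.0, -1.0], uniform_gate)
```

With an all-zero gate the weights are uniform, so each mixed feature is the plain average of the four layers. A gate that scores one layer much higher for a given prompt moves the output toward that layer's representation.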

## Project resources

| Resource | Link |
|---|---|
| Paper (arXiv) | <https://arxiv.org/abs/2605.27772> |
| Project page | <https://voxparadox.github.io/> |
| Code | <https://github.com/ihp-lab/VoxParadox> |
| Benchmark | <https://huggingface.co/datasets/IHP-Lab/VoxParadox> |
| Sibling model (AF3) | <https://huggingface.co/IHP-Lab/AF3_PCLM_DPO> |

## Citation

```bibtex
@inproceedings{pang2026voxparadox,
  title     = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author    = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```

## License

USC Research License (research / non-profit only). See [`LICENSE`](LICENSE).
The base model (`Qwen/Qwen2-Audio-7B-Instruct`) carries its own Tongyi Qianwen license terms,
which continue to apply to the inherited weights.