---
license: other
license_name: usc-research
license_link: LICENSE
language:
- en
base_model: nvidia/audio-flamingo-3
tags:
- audio
- speech
- audio-llm
- paralinguistic
- pclm
- dpo
- voxparadox
pipeline_tag: audio-text-to-text
---

# Audio Flamingo 3 + PCLM + DPO

[![ICML 2026](https://img.shields.io/badge/ICML-2026-1d4ed8.svg)](https://icml.cc/Conferences/2026)
[![Paper](https://img.shields.io/badge/Paper-arXiv-AD1C18.svg)](https://arxiv.org/abs/2605.27772)
[![Project Page](https://img.shields.io/badge/Project-Page-0EA5E9.svg)](https://voxparadox.github.io/)
[![Code](https://img.shields.io/badge/GitHub-ihp--lab%2FVoxParadox-181717.svg?logo=github)](https://github.com/ihp-lab/VoxParadox)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-IHP--Lab%2FVoxParadox-FFD21E.svg)](https://huggingface.co/datasets/IHP-Lab/VoxParadox)
[![Qwen2-Audio + PCLM + DPO](https://img.shields.io/badge/🤗%20Sibling%20model-Qwen2--Audio+PCLM+DPO-FFD21E.svg)](https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO)
[![License](https://img.shields.io/badge/License-USC%20Research-228B22.svg)](LICENSE)

PCLM- and DPO-finetuned [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3) from
*Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox*
(ICML 2026).

The base model is augmented with the **Prompt-Conditioned Layer Mixer (PCLM)**, a lightweight module that
adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt. It is
then post-trained with **Direct Preference Optimization (DPO)** to prefer acoustically grounded answers over
language-implied alternatives on paralinguistic MCQs.
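
To make the mixing idea concrete, here is a minimal pure-Python sketch of prompt-conditioned layer mixing. It is an illustration only, not the released PCLM code: the single linear gate followed by a softmax, the function names (`gate_weights`, `mix_layers`), and the toy tensors are all assumptions. Only the exposed layer indices come from this card.

```python
import math

EXPOSED_LAYERS = [5, 15, 25, 30]  # from this checkpoint's config (expose_layers)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(prompt_embedding, gate_matrix):
    """Map a prompt embedding to one mixing weight per exposed layer.

    ASSUMPTION: one linear map plus softmax stands in for the gate MLP;
    the real gate may be deeper or use a different normalisation.
    """
    scores = [sum(w * x for w, x in zip(row, prompt_embedding)) for row in gate_matrix]
    return softmax(scores)


def mix_layers(layer_features, weights):
    """Weighted sum of per-layer features, each shaped [frames][dim]."""
    frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    mixed = [[0.0] * dim for _ in range(frames)]
    for features, weight in zip(layer_features, weights):
        for t in range(frames):
            for d in range(dim):
                mixed[t][d] += weight * features[t][d]
    return mixed


# Toy inputs (hypothetical values): 2-dim prompt embedding, 1 frame, 2-dim features.
prompt_embedding = [1.0, 0.0]
gate_matrix = [[0.0, 0.0] for _ in EXPOSED_LAYERS]  # all-zero gate gives uniform weights
layer_features = [[[float(i), 0.0]] for i in range(len(EXPOSED_LAYERS))]

weights = gate_weights(prompt_embedding, gate_matrix)
mixed = mix_layers(layer_features, weights)
```

In this sketch a different prompt embedding changes `weights`, so the same audio features are blended differently depending on what the prompt asks for.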

## Layout

Unlike a stock HF model, AF3 ships its weights split across subfolders:

```
.
├── config.json                             # top-level AF3 config (PCLM fields included)
├── llm/                                    # frozen + DPO-tuned Qwen2 LLM
├── sound_tower/                            # AF-Whisper audio encoder
├── sound_mm_projector/                     # final-layer audio→LLM projector
├── sound_mid_mm_projector_{5,15,25,30}/    # intermediate-layer projectors (PCLM)
├── sound_pclm/                             # BERT-small prompt encoder + gate MLP
└── tokenizer files (vocab.json, merges.txt, …)
```
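
A quick sanity check of a local copy of the checkpoint can catch an incomplete download before running evaluation. The sketch below assumes the layout listed above, with `sound_mid_mm_projector_{5,15,25,30}` expanding to four separate folders. The helper name `missing_components` is hypothetical and not part of the release repo.

```python
from pathlib import Path

# Subfolders taken from the layout above; the brace pattern is expanded
# into four folder names (assumption about how the pattern is meant).
EXPECTED_SUBDIRS = [
    "llm",
    "sound_tower",
    "sound_mm_projector",
    *[f"sound_mid_mm_projector_{layer}" for layer in (5, 15, 25, 30)],
    "sound_pclm",
]


def missing_components(checkpoint_dir):
    """Return the names of expected folders/files absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
    if not (root / "config.json").is_file():
        missing.append("config.json")
    return missing
```

Calling `missing_components("/path/to/AF3_PCLM_DPO")` returns an empty list when every listed component is present. Tokenizer files are not checked because the layout above does not enumerate them fully.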

## Usage

This checkpoint cannot be loaded with stock `transformers`: AF3 + PCLM requires the
custom modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).

```bash
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox/af3/audio-flamingo
bash environment_setup.sh af3
conda activate af3
```

Inference on VoxParadox (or any MCQ JSON in the same schema):

```bash
bash scripts/eval_voxparadox.sh \
    IHP-Lab/AF3_PCLM_DPO \
    /path/to/voxparadox.json \
    /path/to/audio_root \
    runs/eval/af3_pclm_dpo
```

Score with the dataset-shipped `eval.py`:

```bash
python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl
```

PCLM activation is read from this checkpoint's `config.json`
(`expose_layers=[5, 15, 25, 30]`, `use_sound_pclm=true`).
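
To confirm that a local copy of the checkpoint has PCLM switched on, the two fields can be read directly from `config.json`. This is a minimal sketch that assumes both fields sit at the top level of the JSON object. The helper names `pclm_settings` and `pclm_is_active` are hypothetical.

```python
import json
from pathlib import Path


def pclm_settings(config_path):
    """Read the two PCLM fields named in this card from a config.json file.

    ASSUMPTION: both keys are top-level; a missing key is returned as None.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "use_sound_pclm": config.get("use_sound_pclm"),
        "expose_layers": config.get("expose_layers"),
    }


def pclm_is_active(config_path):
    """True when PCLM is enabled and at least one layer is exposed."""
    settings = pclm_settings(config_path)
    return settings["use_sound_pclm"] is True and bool(settings["expose_layers"])
```

For this checkpoint, `pclm_settings` would be expected to report `use_sound_pclm` as `True` and `expose_layers` as `[5, 15, 25, 30]`, provided the top-level assumption holds.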

## Project resources

| Resource | Link |
|---|---|
| Paper (arXiv) | <https://arxiv.org/abs/2605.27772> |
| Project page | <https://voxparadox.github.io/> |
| Code | <https://github.com/ihp-lab/VoxParadox> |
| Benchmark | <https://huggingface.co/datasets/IHP-Lab/VoxParadox> |
| Sibling model (Qwen2-Audio) | <https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO> |

## Citation

```bibtex
@inproceedings{pang2026voxparadox,
  title     = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author    = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```

## License

USC Research License (research / non-profit only). See [`LICENSE`](LICENSE).

The base model (`nvidia/audio-flamingo-3`) carries the NVIDIA non-commercial license
terms, which continue to apply to the inherited weights.