---
license: other
license_name: usc-research
license_link: LICENSE
language:
- en
base_model: nvidia/audio-flamingo-3
tags:
- audio
- speech
- audio-llm
- paralinguistic
- pclm
- dpo
- voxparadox
pipeline_tag: audio-text-to-text
---
# Audio Flamingo 3 + PCLM + DPO
PCLM- and DPO-finetuned [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3) from
*Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox*
(ICML 2026).
The base model is augmented with the **Prompt-Conditioned Layer Mixer (PCLM)**, a lightweight module that
adaptively mixes representations from intermediate AF-Whisper encoder layers based on the user prompt. It is
then post-trained with **Direct Preference Optimization (DPO)** to prefer acoustically grounded answers over
language-implied alternatives on paralinguistic multiple-choice questions (MCQs).
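To make the two ideas above concrete, here is a minimal pure-Python sketch, not the released implementation. It assumes the gate produces one logit per exposed layer and that mixing is a softmax-weighted sum; the actual PCLM gate may be parameterized differently. The `dpo_loss` function is the standard DPO objective for a single preference pair, and `beta=0.1` is only an illustrative default. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_features, gate_logits):
    """Softmax-weighted sum of per-layer feature vectors.

    layer_features: one feature vector (list of floats) per exposed layer.
    gate_logits: one score per layer, standing in for the output of a gate
                 computed from the prompt embedding (assumed form).
    """
    if len(layer_features) != len(gate_logits):
        raise ValueError("need exactly one gate logit per layer")
    weights = softmax(gate_logits)
    dim = len(layer_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With equal gate logits, `mix_layers` reduces to a plain average of the layers. The loss falls below `log(2)` once the policy favors the chosen (here, acoustically grounded) answer more strongly than the reference model does.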
## Layout
Unlike a stock HF model, AF3 ships its weights split across subfolders:
```
.
├── config.json                            # top-level AF3 config (PCLM fields included)
├── llm/                                   # frozen + DPO-tuned Qwen2 LLM
├── sound_tower/                           # AF-Whisper audio encoder
├── sound_mm_projector/                    # final-layer audio→LLM projector
├── sound_mid_mm_projector_{5,15,25,30}/   # intermediate-layer projectors (PCLM)
├── sound_pclm/                            # BERT-small prompt encoder + gate MLP
└── tokenizer files (vocab.json, merges.txt, …)
```
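After downloading the checkpoint, a quick sanity check on the layout can catch an incomplete copy before the custom loader fails. The sketch below uses only the standard library. It assumes the brace pattern in the listing expands to one projector folder per exposed layer, and it ignores the tokenizer files. `missing_entries` is a hypothetical helper, not part of the release repo.

```python
from pathlib import Path

# Entries taken from the layout listing above. The brace pattern is assumed
# to expand to one projector folder per exposed layer (5, 15, 25, 30).
EXPECTED_ENTRIES = [
    "config.json",
    "llm",
    "sound_tower",
    "sound_mm_projector",
    *[f"sound_mid_mm_projector_{i}" for i in (5, 15, 25, 30)],
    "sound_pclm",
]


def missing_entries(checkpoint_dir):
    """Return the expected files/folders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty list means every expected entry is present; anything else names what to re-download.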
## Usage
This checkpoint cannot be loaded with stock `transformers`: AF3 + PCLM requires the
custom modeling code shipped in the [release repo](https://github.com/ihp-lab/VoxParadox).
```bash
git clone https://github.com/ihp-lab/VoxParadox
cd VoxParadox/af3/audio-flamingo
bash environment_setup.sh af3
conda activate af3
```
Inference on VoxParadox (or any MCQ JSON in the same schema):
```bash
bash scripts/eval_voxparadox.sh \
IHP-Lab/AF3_PCLM_DPO \
/path/to/voxparadox.json \
/path/to/audio_root \
runs/eval/af3_pclm_dpo
```
Score with the dataset-shipped `eval.py`:
```bash
python eval.py --predictions runs/eval/af3_pclm_dpo/predictions.jsonl
```
PCLM activation is read from this checkpoint's `config.json`
(`expose_layers=[5, 15, 25, 30]`, `use_sound_pclm=true`).
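To confirm that a local copy of the checkpoint has PCLM enabled, the two fields can be read directly from `config.json`. This is a minimal sketch that assumes both fields are top-level keys in the file; `pclm_settings` is a hypothetical helper name.

```python
import json
from pathlib import Path


def pclm_settings(config_path):
    """Read the two PCLM-related fields from an AF3 config.json.

    Assumes both fields are top-level keys; missing keys fall back to
    "PCLM disabled" defaults.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "use_sound_pclm": config.get("use_sound_pclm", False),
        "expose_layers": config.get("expose_layers", []),
    }
```

For this checkpoint, the expected values are `use_sound_pclm` set to true and `expose_layers` equal to `[5, 15, 25, 30]`, as stated above.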
## Project resources
| Resource | Link |
|---|---|
| Paper (arXiv) | <https://arxiv.org/abs/2605.27772> |
| Project page | <https://voxparadox.github.io/> |
| Code | <https://github.com/ihp-lab/VoxParadox> |
| Benchmark | <https://huggingface.co/datasets/IHP-Lab/VoxParadox> |
| Sibling model (Qwen2-Audio) | <https://huggingface.co/IHP-Lab/Qwen2-Audio_PCLM_DPO> |
## Citation
```bibtex
@inproceedings{pang2026voxparadox,
title = {Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
author = {Pang, Jiacheng and Chaubey, Ashutosh and Soleymani, Mohammad},
booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
year = {2026}
}
```
## License
USC Research License (research / non-profit only). See [`LICENSE`](LICENSE).
The base model (`nvidia/audio-flamingo-3`) carries the NVIDIA non-commercial license
terms, which continue to apply to the inherited weights.