---
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
tags:
- audio
- speech
- music
---
# DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding
[arXiv](https://arxiv.org/abs/2601.23161v1)
[Hugging Face](https://huggingface.co/zhoujiaming777/DIFFA-2)
[GitHub](https://github.com/NKU-HLT/DIFFA)
We introduce DIFFA-2, a practical diffusion-based large audio language model (LALM) for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong autoregressive (AR) LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding.
We have open-sourced the checkpoints for stage 1 and stage 4. The files in the root directory of the repository are the stage 4 checkpoint, and the stage 1 checkpoint is located in the `stage1` folder.
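
Given that layout, one way to fetch a single stage is to filter the download by path. The sketch below is only an illustration: it assumes the `huggingface_hub` package is installed, uses the repository ID from the model link above, and assumes the stage 1 files match the pattern `stage1/*`. The helper names `checkpoint_patterns` and `download_checkpoint` are hypothetical and not part of the DIFFA codebase.

```python
from typing import Optional

# Repository ID taken from the Hugging Face link in this card.
REPO_ID = "zhoujiaming777/DIFFA-2"


def checkpoint_patterns(stage: int) -> dict:
    """Return path filters for snapshot_download (hypothetical helper)."""
    if stage == 4:
        # Stage 4 files sit in the repository root, so skip the stage1 folder.
        return {"ignore_patterns": ["stage1/*"]}
    if stage == 1:
        # Stage 1 files are assumed to live under the stage1 folder only.
        return {"allow_patterns": ["stage1/*"]}
    raise ValueError("only stage 1 and stage 4 checkpoints are published")


def download_checkpoint(stage: int, local_dir: Optional[str] = None) -> str:
    """Download one stage and return the local snapshot path."""
    # Imported lazily so the filter helper works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        **checkpoint_patterns(stage),
    )
```

Calling `download_checkpoint(1)` would then fetch only the `stage1` folder, while `download_checkpoint(4)` would fetch everything except it. Loading the weights afterwards depends on the code in the GitHub repository and is not covered here.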