---
base_model: Qwen/Qwen2.5-Omni-7B
datasets:
- ASU-GSL/AHA
library_name: peft
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- lora
- qwen2.5-omni
- multimodal
- audio
---
# Qwen-Audio-AHA (LoRA Adapter)
This repository contains the official LoRA adapter for **Qwen2.5-Omni-7B** (Thinker), fine-tuned using the **AHA (Audio Hallucination Alignment)** framework.
## Model Description
AHA is a framework designed to mitigate hallucinations in Large Audio-Language Models (LALMs) by focusing on fine-grained temporal reasoning and counterfactual alignment. By leveraging counterfactual hard negative mining, the pipeline constructs high-quality preference data that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications.
- **Paper:** [AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives](https://huggingface.co/papers/2512.24052)
- **GitHub Repository:** [https://github.com/LLM-VLM-GSL/AHA](https://github.com/LLM-VLM-GSL/AHA)
- **Base Model:** [Qwen/Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
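To make the counterfactual hard-negative idea concrete, each preference record pairs a grounded answer with a fluent but acoustically wrong one about the same clip. The exact data schema used by the AHA pipeline is not shown in this card; the snippet below is a hypothetical illustration with invented field names:

```python
# Hypothetical illustration of a counterfactual preference pair.
# Field names are invented for this example, not the AHA data schema.
preference_pair = {
    "audio": "clip_0001.wav",
    "question": "What happens first in this recording?",
    # Chosen: grounded in the actual acoustic evidence.
    "chosen": "A door creaks open, and footsteps follow about two seconds later.",
    # Rejected: a counterfactual hard negative -- linguistically plausible,
    # but it reverses the temporal order heard in the audio.
    "rejected": "Footsteps approach the door, which then creaks open.",
}
```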
## Intended Use
- **Primary Task:** Audio reasoning and reducing hallucinations in audio-to-text tasks.
- **Languages Supported:** All languages supported by the base Qwen2.5-Omni-7B model.
## Sample Usage
You can load this model using the `peft` and `transformers` libraries. Note that `librosa` is required for audio loading in this example.
```python
import torch
import librosa
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from peft import PeftModel
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "Qwen/Qwen2.5-Omni-7B"
adapter_id = "ASU-GSL/Qwen-Audio-AHA"
# Load base model and processor
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
# Load Audio
# Replace "example.wav" with the path to your audio file
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)
prompt = "<|audio|>
Describe the temporal order of events in this audio."
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(device)
# Generate, then decode only the newly generated tokens (drop the prompt)
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```
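If you want to serve the model without the PEFT wrapper overhead, you can optionally fold the adapter weights into the base model. `merge_and_unload` is a standard `peft` method for LoRA adapters and returns a plain `transformers` model:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()
```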
## Citation
```bibtex
@article{chen2025aha,
  title={AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives},
  author={Chen, Yanxi and Zhu, Wenhui and Chen, Xiwen and Wang, Zhipeng and Li, Xin and Qiu, Peijie and Wang, Hao and Dong, Xuanzhao and Xiong, Yujian and Schneider, Anderson and others},
  journal={arXiv preprint arXiv:2512.24052},
  year={2025}
}
```