---
base_model: Qwen/Qwen2.5-Omni-7B
datasets:
- ASU-GSL/AHA
library_name: peft
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- lora
- qwen2.5-omni
- multimodal
- audio
---
# Qwen-Audio-AHA (LoRA Adapter)

This repository contains the official LoRA adapter for **Qwen2.5-Omni-7B** (Thinker), fine-tuned using the **AHA (Audio Hallucination Alignment)** framework.

## Model Description

AHA is a framework designed to mitigate hallucinations in Large Audio-Language Models (LALMs) by focusing on fine-grained temporal reasoning and counterfactual alignment. By leveraging counterfactual hard negative mining, the pipeline constructs high-quality preference data that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications.
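
To make the counterfactual preference data concrete, the following is a minimal, hypothetical sketch of what a single alignment example could look like; the field names and contents are illustrative assumptions, not the actual AHA dataset schema.

```python
# Hypothetical counterfactual preference pair (illustrative only; the real
# AHA dataset schema may differ).
preference_example = {
    "audio": "door_slam_then_dog_bark.wav",  # clip containing two ordered events
    "question": "Which event happens first in this audio?",
    # Chosen response: grounded in the acoustic evidence.
    "chosen": "A door slams first, and then a dog starts barking.",
    # Counterfactual hard negative: fluent and plausible, but reverses the order.
    "rejected": "A dog barks first, and then a door slams shut.",
}
```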
- **Paper:** [AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives](https://huggingface.co/papers/2512.24052)
- **GitHub Repository:** [https://github.com/LLM-VLM-GSL/AHA](https://github.com/LLM-VLM-GSL/AHA)
- **Base Model:** [Qwen/Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)

## Intended Use

- **Primary Task:** Audio reasoning and hallucination reduction in audio-to-text tasks.
- **Languages Supported:** All languages supported by the base Qwen2.5-Omni-7B model.

## Sample Usage

You can load this model using the `peft` and `transformers` libraries. Note that `librosa` is required for audio loading in this example.

```python
import torch
import librosa
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "Qwen/Qwen2.5-Omni-7B"
adapter_id = "ASU-GSL/Qwen-Audio-AHA"

# Load base model and processor
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)

# Load audio; replace "example.wav" with the path to your audio file
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)

prompt = "<|audio|>\nDescribe the temporal order of events in this audio."
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```
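
If you want to deploy without the PEFT wrapper, you can optionally fold the adapter into the base weights. This is a minimal sketch assuming the model was loaded as above; the output directory name is just an example.

```python
# Optional: merge the LoRA weights into the base model for simpler deployment.
# merge_and_unload() returns the underlying transformers model with the
# adapter deltas folded into its weights.
merged_model = model.merge_and_unload()

# Save the merged model and processor to a local directory of your choice.
merged_model.save_pretrained("qwen-audio-aha-merged")
processor.save_pretrained("qwen-audio-aha-merged")
```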

## Citation

```bibtex
@article{chen2025aha,
  title={AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives},
  author={Chen, Yanxi and Zhu, Wenhui and Chen, Xiwen and Wang, Zhipeng and Li, Xin and Qiu, Peijie and Wang, Hao and Dong, Xuanzhao and Xiong, Yujian and Schneider, Anderson and others},
  journal={arXiv preprint arXiv:2512.24052},
  year={2025}
}
```