---
base_model: Qwen/Qwen2.5-Omni-7B
datasets:
- ASU-GSL/AHA
library_name: peft
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- lora
- qwen2.5-omni
- multimodal
- audio
---
# Qwen-Audio-AHA (LoRA Adapter)
This repository contains the official LoRA adapter for **Qwen2.5-Omni-7B** (Thinker), fine-tuned using the **AHA (Audio Hallucination Alignment)** framework.
## Model Description
AHA is a framework designed to mitigate hallucinations in Large Audio-Language Models (LALMs) by focusing on fine-grained temporal reasoning and counterfactual alignment. By leveraging counterfactual hard negative mining, the pipeline constructs high-quality preference data that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications.
- **Paper:** [AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives](https://huggingface.co/papers/2512.24052)
- **GitHub Repository:** [https://github.com/LLM-VLM-GSL/AHA](https://github.com/LLM-VLM-GSL/AHA)
- **Base Model:** [Qwen/Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
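To make the counterfactual hard-negative idea concrete, each preference record pairs a grounded answer with a fluent but acoustically wrong one about the same clip. The exact data schema used by the AHA pipeline is not shown in this card; the snippet below is a hypothetical illustration with invented field names:

```python
# Hypothetical illustration of a counterfactual preference pair.
# Field names are invented for this example, not the AHA data schema.
preference_pair = {
    "audio": "clip_0001.wav",
    "question": "What happens first in this recording?",
    # Chosen: grounded in the actual acoustic evidence.
    "chosen": "A door creaks open, and footsteps follow about two seconds later.",
    # Rejected: a counterfactual hard negative -- linguistically plausible,
    # but it reverses the temporal order heard in the audio.
    "rejected": "Footsteps approach the door, which then creaks open.",
}
```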
## Intended Use
- **Primary Task:** Audio reasoning and reducing hallucinations in audio-to-text tasks.
- **Languages Supported:** All languages supported by the base Qwen2.5-Omni-7B model.
## Sample Usage
You can load this model using the `peft` and `transformers` libraries. Note that `librosa` is required for audio loading in this example.
```python
import torch
import librosa
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from peft import PeftModel
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "Qwen/Qwen2.5-Omni-7B"
adapter_id = "ASU-GSL/Qwen-Audio-AHA"
# Load base model and processor
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
# Load LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
# Load Audio
# Replace "example.wav" with the path to your audio file
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)
prompt = "<|audio|>
Describe the temporal order of events in this audio."
inputs = processor(text=prompt, audios=audio, return_tensors="pt").to(device)
# Generate, then decode only the newly generated tokens (drop the prompt)
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```
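If you want to serve the model without the PEFT wrapper overhead, you can optionally fold the adapter weights into the base model. `merge_and_unload` is a standard `peft` method for LoRA adapters and returns a plain `transformers` model:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()
```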
## Citation
```bibtex
@article{chen2025aha,
  title={AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives},
  author={Chen, Yanxi and Zhu, Wenhui and Chen, Xiwen and Wang, Zhipeng and Li, Xin and Qiu, Peijie and Wang, Hao and Dong, Xuanzhao and Xiong, Yujian and Schneider, Anderson and others},
  journal={arXiv preprint arXiv:2512.24052},
  year={2025}
}
```