---
license: mit
language:
- en
metrics:
- f1
- accuracy
base_model:
- Qwen/Qwen2.5-Omni-7B
pipeline_tag: text-generation
library_name: transformers
tags:
- audio
- geospatial
- environmental-sound-classification
- multimodal
- chain-of-thought
- reinforcement-learning
- grpo
datasets:
- shiran-yu/SpaAudioLM-Dataset
---
# SpaAudioLM
**Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding**
<p>
<a href="#"><img alt="Paper" src="https://img.shields.io/badge/Paper-PDF-red"></a>
<a href="https://yushiran.github.io/SpaAudioLM/"><img alt="Page" src="https://img.shields.io/badge/Project-Page-orange"></a>
<a href="https://huggingface.co/shiran-yu/SpaAudioLM/tree/main"><img alt="Model" src="https://img.shields.io/badge/HuggingFace-Model-yellow"></a>
<a href="https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset"><img alt="Dataset" src="https://img.shields.io/badge/HuggingFace-Dataset-blue"></a>
<a href="#license"><img alt="License" src="https://img.shields.io/badge/License-MIT-green"></a>
</p>
## Model Summary
SpaAudioLM is a multimodal audio language model fine-tuned from [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) for **geospatially aware environmental sound classification**. It jointly reasons over audio signals and geospatial Point-of-Interest (POI) metadata across 28 environmental sound categories.
Most existing environmental sound classification (ESC) methods treat sounds as isolated signals and ignore *where* they occur. SpaAudioLM addresses this gap by conditioning its predictions on the POI context supplied with each audio clip.
### Training Hyperparameters
| Phase | Base Model | Epochs | Learning Rate | Key Details |
|-------|-----------|--------|--------------|-------------|
| SFT | Qwen2.5-Omni-7B | 6 | 1e-5 | DeepSpeed ZeRO-2, batch size 4 per GPU, full-parameter fine-tuning |
| GRPO | SFT checkpoint | 3 | 1e-6 | Group size 8, KL coefficient 0.05, reward weights: F1 (1.0) + format (0.1) + POI (0.3) |
**Hardware:** 4× GPUs, 32GB+ VRAM each
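
The GRPO row lists three weighted reward terms and a group size of 8. The sketch below shows one way these could combine, assuming the total reward is a plain weighted sum and that each term is already scaled to [0, 1]. The function names, the scaling of the POI term, and the `eps` constant are assumptions for illustration and are not taken from the training scripts.

```python
from statistics import mean, pstdev

# Weights copied from the GRPO row of the hyperparameter table.
REWARD_WEIGHTS = {"f1": 1.0, "format": 0.1, "poi": 0.3}


def combined_reward(f1, format_ok, poi_score):
    """Weighted sum of the three reward terms (each assumed to lie in [0, 1])."""
    return (
        REWARD_WEIGHTS["f1"] * f1
        + REWARD_WEIGHTS["format"] * float(format_ok)
        + REWARD_WEIGHTS["poi"] * poi_score
    )


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of sampled completions.

    GRPO scores each completion relative to the other completions sampled
    for the same prompt; eps (hypothetical value) avoids division by zero
    when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 completions for a single prompt (made-up scores).
group = [
    combined_reward(0.8, True, 1.0),
    combined_reward(0.5, True, 0.0),
    combined_reward(0.0, False, 0.0),
    combined_reward(1.0, True, 1.0),
    combined_reward(0.6, True, 0.5),
    combined_reward(0.4, False, 0.5),
    combined_reward(0.9, True, 1.0),
    combined_reward(0.2, True, 0.0),
]
advantages = group_relative_advantages(group)
```

With these weights the F1 term dominates, so the format and POI terms act as smaller shaping signals rather than primary objectives.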
## Results
Comparison on multi-label audio event classification. All values are percentages averaged over 5 runs (the table lists means only):
| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
|-------|----------|----------|-------------|---------|-------------|
| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
| **SpaAudioLM (Ours)** | **73.36** | **63.48** | **72.98** | **53.57** | **54.47** |
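
For readers unfamiliar with the five columns, the following is a minimal pure-Python sketch of how such multi-label metrics are commonly defined. It is not the project's evaluation code. In particular, it assumes Jaccard is averaged per sample and scores two empty label sets as a perfect match; the repository's script may make different choices. The three-label example is made up.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred, labels):
    """y_true and y_pred are equal-length lists of label sets, one set per clip."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, pred in zip(y_true, y_pred):
        for label in labels:
            if label in truth and label in pred:
                tp[label] += 1
            elif label in pred:
                fp[label] += 1
            elif label in truth:
                fn[label] += 1

    per_label = {l: f1_from_counts(tp[l], fp[l], fn[l]) for l in labels}
    support = {l: tp[l] + fn[l] for l in labels}  # true occurrences per label
    total_support = sum(support.values())

    def jaccard(truth, pred):
        union = truth | pred
        # Assumption: two empty sets count as a perfect match.
        return len(truth & pred) / len(union) if union else 1.0

    n = len(y_true)
    weighted = sum(per_label[l] * support[l] for l in labels)
    return {
        "f1_micro": f1_from_counts(
            sum(tp.values()), sum(fp.values()), sum(fn.values())
        ),
        "f1_macro": sum(per_label.values()) / len(labels),
        "f1_weighted": weighted / total_support if total_support else 0.0,
        "jaccard": sum(jaccard(t, p) for t, p in zip(y_true, y_pred)) / n,
        "exact_match": sum(t == p for t, p in zip(y_true, y_pred)) / n,
    }


# Toy example with three clips and three labels (not from the dataset).
labels = ["bird", "car", "rain"]
y_true = [{"bird", "car"}, {"rain"}, {"car"}]
y_pred = [{"bird"}, {"rain"}, {"car", "rain"}]
result = multilabel_metrics(y_true, y_pred, labels)
```

Exact Match is the strictest column because a clip only counts when the full predicted label set equals the reference set, which is why it can sit far below the F1 scores.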
## Quick Start
### Download & Inference
```bash
# Download model weights
huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM
```
```bash
# Clone the repo for inference scripts
git clone https://github.com/<your-username>/SpaAudioLM.git
cd SpaAudioLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
# Run inference
bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh
```
### Dataset
```bash
git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data
```
The dataset contains 3,854 WAV files with POI metadata, split into train (2,697), validation (578), and test (579) samples.
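
As a quick arithmetic check, the split sizes above sum to the stated total and correspond to roughly a 70/15/15 split:

```python
# Split sizes as stated in this README.
splits = {"train": 2697, "validation": 578, "test": 579}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} files ({frac:.1%})")
```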
### Training
```bash
# Phase 1: SFT
bash app/src/sft/GeoOmniR1Strength-sft.sh
# Phase 2: GRPO (requires SFT checkpoint)
bash app/src/grpo/GeoOmniR1-grpo-strength.sh
```
### Evaluation
```bash
# Single run
uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results
# 5-run aggregation (mean ± std)
uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
```
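
The aggregation step reports a mean and standard deviation per metric. The sketch below illustrates that calculation only; it does not reproduce `evaluateAverageScore.py`. The input layout (one dict of metric scores per run), the metric key, and the use of the sample standard deviation are assumptions, and the scores are made up.

```python
from statistics import mean, stdev


def aggregate_runs(runs):
    """Return {metric: (mean, sample std)} across a list of per-run score dicts."""
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = (mean(values), stdev(values))
    return summary


def format_summary(summary):
    """Render each metric as 'name: mean ± std' with two decimals."""
    return [f"{name}: {mu:.2f} ± {sd:.2f}" for name, (mu, sd) in summary.items()]


# Hypothetical scores from 5 runs (not real results).
runs = [{"f1_micro": v} for v in (70.0, 72.0, 74.0, 76.0, 78.0)]
summary = aggregate_runs(runs)
lines = format_summary(summary)
```

`statistics.stdev` divides by n - 1; swap in `statistics.pstdev` if the population standard deviation is wanted instead.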
## Intended Use
This model is designed for multi-label environmental sound classification in geospatial contexts. It takes audio input along with POI metadata and produces chain-of-thought reasoning followed by sound event labels.
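
Downstream code usually needs to separate the reasoning from the final labels. This README does not specify the output format, so the sketch below assumes a hypothetical layout in which reasoning is wrapped in `<think>` tags and a comma-separated label list is wrapped in `<answer>` tags. Check real model outputs and adjust the tag names and separator before relying on it.

```python
import re


def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or ''."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_response(text):
    """Split a model response into (reasoning, labels).

    Assumes the hypothetical <think>/<answer> layout described above;
    returns an empty reasoning string and an empty label list when a
    tag is missing.
    """
    reasoning = extract_tag(text, "think")
    answer = extract_tag(text, "answer")
    labels = [part.strip() for part in answer.split(",") if part.strip()]
    return reasoning, labels


# Made-up response, for illustration only.
example = (
    "<think>The clip was recorded near a park, so birdsong is likely.</think>"
    "<answer>bird, wind</answer>"
)
reasoning, labels = parse_response(example)
```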
### Limitations
- Requires POI metadata for optimal performance; accuracy may drop with audio-only input.
- Trained on 28 environmental sound categories; may not generalize to other sound taxonomies.
- Requires significant GPU resources (4× 32GB+ VRAM) for training.
## Citation
```bibtex
@article{hou2025spaaudioLM,
title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
year={2025}
}
```