

	---
	license: mit
	language:
	- en
	metrics:
	- f1
	- accuracy
	base_model:
	- Qwen/Qwen2.5-Omni-7B
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- audio
	- geospatial
	- environmental-sound-classification
	- multimodal
	- chain-of-thought
	- reinforcement-learning
	- grpo
	datasets:
	- shiran-yu/SpaAudioLM-Dataset
	---

	# SpaAudioLM

	Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding

	<p>
	<a href="#"><img alt="Paper" src="https://img.shields.io/badge/Paper-PDF-red"></a>
	<a href="https://yushiran.github.io/SpaAudioLM/"><img alt="Page" src="https://img.shields.io/badge/Project-Page-orange"></a>
	<a href="https://huggingface.co/shiran-yu/SpaAudioLM/tree/main"><img alt="Model" src="https://img.shields.io/badge/HuggingFace-Model-yellow"></a>
	<a href="https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset"><img alt="Dataset" src="https://img.shields.io/badge/HuggingFace-Dataset-blue"></a>
	<a href="#license"><img alt="License" src="https://img.shields.io/badge/License-MIT-green"></a>
	</p>

	## Model Summary

	SpaAudioLM is a multimodal audio language model fine-tuned from [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) for geospatially aware environmental sound classification. It jointly reasons over audio signals and geospatial Point-of-Interest (POI) metadata across 28 environmental sound categories.

	Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring where they occur. SpaAudioLM bridges this gap by enabling spatially grounded sound understanding.

	### Training Hyperparameters

	| Phase | Base Model | Epochs | Learning Rate | Key Details |
	|-------|-----------|--------|--------------|-------------|
	| SFT | Qwen2.5-Omni-7B | 6 | 1e-5 | DeepSpeed ZeRO-2, batch size 4/GPU, full parameter fine-tuning |
	| GRPO | SFT checkpoint | 3 | 1e-6 | Group size 8, KL coefficient 0.05, rewards: F1 (1.0) + format (0.1) + POI (0.3) |

	Hardware: 4 GPUs with at least 32 GB of VRAM each.
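
The GRPO row above lists its rewards as F1 (1.0) + format (0.1) + POI (0.3). The following is a minimal sketch of one way such terms could be combined, reading the "+" as a weighted sum. The function names and the set-based F1 definition are assumptions made for illustration; the actual reward functions are defined in the training scripts and may differ.

```python
# Weights taken from the GRPO row of the hyperparameter table.
REWARD_WEIGHTS = {"f1": 1.0, "format": 0.1, "poi": 0.3}


def set_f1(predicted: set, reference: set) -> float:
    """F1 between two label sets (assumed definition; 1.0 when both are empty)."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def combined_reward(f1_score: float, format_score: float, poi_score: float) -> float:
    """Weighted sum of the three reward terms (hypothetical helper)."""
    return (
        REWARD_WEIGHTS["f1"] * f1_score
        + REWARD_WEIGHTS["format"] * format_score
        + REWARD_WEIGHTS["poi"] * poi_score
    )
```

Under this reading, a response that scores 1.0 on every term receives a total reward of 1.4, so the F1 term dominates while the format and POI terms act as smaller shaping signals.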

	## Results

	Comparison on multi-label audio event classification (mean over 5 runs, %; standard deviations are not shown in this table):

	| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
	|-------|----------|----------|-------------|---------|-------------|
	| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
	| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
	| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
	| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
	| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
	| SpaAudioLM (Ours) | 73.36 | 63.48 | 72.98 | 53.57 | 54.47 |
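
For readers unfamiliar with the column names in the table above, here is a minimal pure-Python sketch of the five multi-label metrics. It is not the repository's evaluation code. Two details are assumptions: Jaccard is averaged per clip, and the macro and weighted averages run over the labels that appear in the references or predictions rather than over a fixed list of 28 categories. Scores are returned as fractions in [0, 1], not percentages.

```python
from collections import Counter


def multilabel_scores(references, predictions):
    """Compute multi-label metrics from two equal-length lists of label sets."""
    if not references:
        raise ValueError("at least one sample is required")
    tp, fp, fn, support = Counter(), Counter(), Counter(), Counter()
    jaccard_sum = 0.0
    exact_matches = 0
    for ref, pred in zip(references, predictions):
        for label in ref & pred:
            tp[label] += 1
        for label in pred - ref:
            fp[label] += 1
        for label in ref - pred:
            fn[label] += 1
        for label in ref:
            support[label] += 1
        union = ref | pred
        jaccard_sum += len(ref & pred) / len(union) if union else 1.0
        exact_matches += int(ref == pred)

    def f1(t, p, n):
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    labels = sorted(set(tp) | set(fp) | set(fn))
    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    total_support = sum(support.values())
    n_samples = len(references)
    return {
        # Micro: pool true/false positives and false negatives over all labels.
        "f1_micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Macro: unweighted mean of per-label F1.
        "f1_macro": sum(per_label.values()) / len(labels) if labels else 0.0,
        # Weighted: per-label F1 weighted by the label's reference count.
        "f1_weighted": (
            sum(per_label[label] * support[label] for label in labels) / total_support
            if total_support
            else 0.0
        ),
        "jaccard": jaccard_sum / n_samples,
        "exact_match": exact_matches / n_samples,
    }
```

Exact Match is the strictest of the five because a clip only counts when the predicted label set equals the reference set, which is why it tends to be the lowest column for most rows.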

	## Quick Start

	### Download & Inference

	```bash
	# Download model weights
	huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM
	```

	```bash
	# Clone the repo for inference scripts
	git clone https://github.com/<your-username>/SpaAudioLM.git
	cd SpaAudioLM

	curl -LsSf https://astral.sh/uv/install.sh | sh
	uv sync
	source .venv/bin/activate

	# Run inference
	bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh
	```

	### Dataset

	```bash
	git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data
	```

	The dataset contains 3,854 WAV files with POI metadata, split into train (2,697), validation (578), and test (579) samples.
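
As a quick arithmetic check of the split sizes quoted above, the following sketch sums the three counts and derives each split's share of the total. The variable names are illustrative only.

```python
# Split sizes as stated in the dataset description.
splits = {"train": 2697, "validation": 578, "test": 579}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} files ({fractions[name]:.1%})")
print(f"total: {total} files")
```

The counts add up to 3,854, and the shares work out to roughly 70% / 15% / 15%.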

	### Training

	```bash
	# Phase 1: SFT
	bash app/src/sft/GeoOmniR1Strength-sft.sh

	# Phase 2: GRPO (requires SFT checkpoint)
	bash app/src/grpo/GeoOmniR1-grpo-strength.sh
	```

	### Evaluation

	```bash
	# Single run
	uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results

	# 5-run aggregation (mean ± std)
	uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
	```
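
The 5-run aggregation step reports a mean and standard deviation per metric. A minimal sketch of that kind of aggregation is shown below. It is not the contents of `evaluateAverageScore.py`; the input layout (one dict of metric values per run) and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def aggregate_runs(runs):
    """Return {metric: (mean, sample_std)} across a list of per-run metric dicts."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        summary[name] = (mean(values), stdev(values))
    return summary


# Made-up values for illustration only, not results from the model.
example_runs = [
    {"f1_micro": 70.0, "exact_match": 50.0},
    {"f1_micro": 74.0, "exact_match": 54.0},
]
for metric, (avg, std) in aggregate_runs(example_runs).items():
    print(f"{metric}: {avg:.2f} ± {std:.2f}")
```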

	## Intended Use

	This model is designed for multi-label environmental sound classification in geospatial contexts. It takes audio input along with POI metadata and produces chain-of-thought reasoning followed by sound event labels.
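
Because the model emits reasoning first and labels afterwards, downstream code usually needs to separate the two. The sketch below shows one way to do that. The `<think>`/`<answer>` tag layout and the comma-separated label list are assumptions for illustration; this card does not specify the output format, so check the inference script for the real one.

```python
import re

# Hypothetical output layout: "<think>reasoning</think><answer>label, label</answer>".
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_labels(response: str) -> list:
    """Return the lowercased labels from the answer block, or [] if none is found."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return []
    return [
        label.strip().lower()
        for label in match.group(1).split(",")
        if label.strip()
    ]
```

Returning an empty list for a malformed response keeps the evaluation loop running, at the cost of scoring that clip as if nothing had been predicted.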

	### Limitations

	- Requires POI metadata for optimal performance; audio-only inference may degrade results.
	- Trained on 28 environmental sound categories; may not generalize to other sound taxonomies.
	- Requires significant GPU resources (4× 32GB+ VRAM) for training.

	## Citation

	```bibtex
	@article{hou2025spaaudioLM,
	title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
	author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
	year={2025}
	}
	```