---
license: mit
language:
- en
metrics:
- f1
- accuracy
base_model:
- Qwen/Qwen2.5-Omni-7B
pipeline_tag: text-generation
library_name: transformers
tags:
- audio
- geospatial
- environmental-sound-classification
- multimodal
- chain-of-thought
- reinforcement-learning
- grpo
datasets:
- shiran-yu/SpaAudioLM-Dataset
---
# SpaAudioLM

**Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding**

<p>
<a href="#"><img alt="Paper" src="https://img.shields.io/badge/Paper-PDF-red"></a>
<a href="https://yushiran.github.io/SpaAudioLM/"><img alt="Page" src="https://img.shields.io/badge/Project-Page-orange"></a>
<a href="https://huggingface.co/shiran-yu/SpaAudioLM/tree/main"><img alt="Model" src="https://img.shields.io/badge/HuggingFace-Model-yellow"></a>
<a href="https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset"><img alt="Dataset" src="https://img.shields.io/badge/HuggingFace-Dataset-blue"></a>
<a href="#license"><img alt="License" src="https://img.shields.io/badge/License-MIT-green"></a>
</p>

## Model Summary

SpaAudioLM is a multimodal audio language model fine-tuned from [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) for **geospatially aware environmental sound classification**. It jointly reasons over audio signals and geospatial Point-of-Interest (POI) metadata across 28 environmental sound categories.

Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring *where* they occur. SpaAudioLM bridges this gap by enabling spatially grounded sound understanding.
### Training Hyperparameters

| Phase | Base Model | Epochs | Learning Rate | Key Details |
|-------|-----------|--------|--------------|-------------|
| SFT | Qwen2.5-Omni-7B | 6 | 1e-5 | DeepSpeed ZeRO-2, batch size 4 per GPU, full-parameter fine-tuning |
| GRPO | SFT checkpoint | 3 | 1e-6 | Group size 8, KL coefficient 0.05, rewards: F1 (1.0) + format (0.1) + POI (0.3) |

**Hardware:** 4× GPUs, 32GB+ VRAM each
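
The GRPO row can be read as a weighted sum of three reward components, scored over groups of 8 sampled responses. The sketch below illustrates that reading only; it is not the training code from the repository. It assumes the parenthesized numbers are weights applied to component scores in [0, 1], and it uses the common GRPO convention of standardizing rewards within each group. The function names and the example scores are hypothetical, and the KL term is not modeled.

```python
from statistics import mean, pstdev

# Weights copied from the GRPO row of the table. Treating them as
# multipliers of per-component scores in [0, 1] is an assumption.
REWARD_WEIGHTS = {"f1": 1.0, "format": 0.1, "poi": 0.3}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of the component scores for one sampled response."""
    return sum(REWARD_WEIGHTS[name] * scores[name] for name in REWARD_WEIGHTS)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical component scores for one group of 8 responses (group size 8).
group_scores = [
    {"f1": 1.0, "format": 1.0, "poi": 1.0},
    {"f1": 0.8, "format": 1.0, "poi": 1.0},
    {"f1": 0.5, "format": 1.0, "poi": 0.0},
    {"f1": 0.0, "format": 0.0, "poi": 0.0},
    {"f1": 0.67, "format": 1.0, "poi": 1.0},
    {"f1": 0.4, "format": 1.0, "poi": 0.0},
    {"f1": 1.0, "format": 0.0, "poi": 1.0},
    {"f1": 0.0, "format": 1.0, "poi": 0.0},
]

rewards = [total_reward(s) for s in group_scores]
advantages = group_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Under this reading, a response with a perfect F1, valid format, and correct POI usage receives the maximum reward of 1.4, and responses are ranked only relative to the other samples in their group.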
## Results

Comparison on multi-label audio event classification (mean over 5 runs, %):

| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
|-------|----------|----------|-------------|---------|-------------|
| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
| **SpaAudioLM (Ours)** | **73.36** | **63.48** | **72.98** | **53.57** | **54.47** |
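
For readers unfamiliar with the multi-label metrics in the table, the following sketch shows how three of them are commonly defined over sets of predicted and reference labels. It is not the repository's evaluation script. The sample-averaged form of Jaccard is an assumption about how the table's column was computed, the label names are hypothetical, and macro and weighted F1 are omitted for brevity.

```python
def micro_f1(y_true: list[set[str]], y_pred: list[set[str]]) -> float:
    """F1 computed from true/false positives pooled over all clips and labels."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sample_jaccard(y_true: list[set[str]], y_pred: list[set[str]]) -> float:
    """Mean over clips of |intersection| / |union| of the label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def exact_match(y_true: list[set[str]], y_pred: list[set[str]]) -> float:
    """Fraction of clips whose predicted label set equals the reference set."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for three clips.
y_true = [{"bird", "traffic"}, {"rain"}, {"speech", "music"}]
y_pred = [{"bird"}, {"rain"}, {"speech", "traffic"}]

print(f"F1-Micro:    {micro_f1(y_true, y_pred):.3f}")        # 0.667
print(f"Jaccard:     {sample_jaccard(y_true, y_pred):.3f}")  # 0.611
print(f"Exact Match: {exact_match(y_true, y_pred):.3f}")     # 0.333
```

Exact match is the strictest of the three: a clip counts only if every label is right and none is spurious, which is why that column tends to sit well below the F1 scores.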
## Quick Start

### Download & Inference

```bash
# Download model weights
huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM
```

```bash
# Clone the repo for inference scripts
git clone https://github.com/<your-username>/SpaAudioLM.git
cd SpaAudioLM

# Install uv, then create and activate the project environment
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate

# Run inference
bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh
```
### Dataset

```bash
git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data
```

The dataset contains 3,854 WAV files with POI metadata, split into train (2,697), validation (578), and test (579) samples.
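
As a quick sanity check on the split sizes quoted above, this small sketch confirms that they sum to the stated total and shows the resulting proportions, which come out to roughly 70/15/15. Only the counts come from this card; the variable names are illustrative.

```python
# Split sizes as stated in this model card.
splits = {"train": 2697, "validation": 578, "test": 579}
total = sum(splits.values())

# The three splits should account for every WAV file.
assert total == 3854

fractions = {name: count / total for name, count in splits.items()}
for name, count in splits.items():
    print(f"{name:<10} {count:>5}  ({fractions[name]:.1%})")
```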
### Training

```bash
# Phase 1: SFT
bash app/src/sft/GeoOmniR1Strength-sft.sh

# Phase 2: GRPO (requires SFT checkpoint)
bash app/src/grpo/GeoOmniR1-grpo-strength.sh
```
### Evaluation

```bash
# Single run
uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results

# 5-run aggregation (mean ± std)
uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
```
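
The 5-run aggregation step amounts to computing a per-metric mean and standard deviation across runs. The sketch below shows that calculation in isolation; it does not reproduce `evaluateAverageScore.py`, whose input layout is not described here. The per-run dictionaries, the metric keys, and the use of the sample standard deviation are all assumptions, and the scores are made up for illustration.

```python
from statistics import mean, stdev


def aggregate_runs(runs: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Return {metric: (mean, sample std)} across a list of per-run scores."""
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = (mean(values), stdev(values))
    return summary


# Hypothetical scores from 5 runs (not results reported in this card).
runs = [
    {"f1_micro": 70.0, "exact_match": 50.0},
    {"f1_micro": 72.0, "exact_match": 52.0},
    {"f1_micro": 74.0, "exact_match": 54.0},
    {"f1_micro": 71.0, "exact_match": 51.0},
    {"f1_micro": 73.0, "exact_match": 53.0},
]

summary = aggregate_runs(runs)
for metric, (mu, sigma) in summary.items():
    print(f"{metric}: {mu:.2f} ± {sigma:.2f}")
```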
## Intended Use

This model is designed for multi-label environmental sound classification in geospatial contexts. It takes audio input along with POI metadata and produces chain-of-thought reasoning followed by sound event labels.
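
Because the output combines free-text reasoning with the final labels, downstream code usually needs to separate the two. This card does not specify the exact output format, so the sketch below assumes a hypothetical layout: reasoning inside `<think>` tags and a comma-separated label list inside `<answer>` tags. Check the inference scripts for the real format before relying on it; the function name and the sample response are illustrative only.

```python
import re


def parse_response(text: str) -> tuple[str, list[str]]:
    """Split a response into (reasoning, labels) under the assumed tag format."""
    # ASSUMPTION: <think>...</think> and <answer>...</answer> tags are hypothetical.
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    raw_labels = answer.group(1) if answer else ""
    labels = [label.strip() for label in raw_labels.split(",") if label.strip()]
    return reasoning, labels


# Hypothetical model output for a clip recorded near a park and a main road.
response = (
    "<think>The POI metadata lists a park and a main road; the audio has "
    "chirping and engine noise.</think>"
    "<answer>bird, traffic</answer>"
)

reasoning, labels = parse_response(response)
print(labels)  # ['bird', 'traffic']
```

Returning an empty label list when the tags are missing keeps malformed generations from raising errors during batch evaluation; they simply score as having no predicted labels.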
### Limitations

- Requires POI metadata for optimal performance; audio-only inference may degrade results.
- Trained on 28 environmental sound categories; may not generalize to other sound taxonomies.
- Requires significant GPU resources (4× 32GB+ VRAM) for training.
## Citation

```bibtex
@article{hou2025spaaudioLM,
  title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
  author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
  year={2025}
}
```