	---
	license: mit
	language:
	- en
	base_model:
	- nvidia/audio-flamingo-3-hf
	tags:
	- audio
	- audio temporal grounding
	- large audio-language models
	---

	# SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

	[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/LoieSun/SpotSound)
	[![Paper](https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv)](https://arxiv.org/abs/2604.13023)
	[![Benchmark](https://img.shields.io/badge/🤗_HuggingFace-Benchmark-yellow)](https://huggingface.co/datasets/Loie/SpotSound-Bench)

	## Model Summary

	SpotSound adds fine-grained temporal grounding to Large Audio-Language Models (ALMs). Built on top of [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3), it takes a natural language query and predicts the start and end timestamps of the matching acoustic event within a long, untrimmed audio recording.

	This model is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded within complex background noise.
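Since a grounding result is a `(start, end)` pair in seconds, one common way to compare a prediction against a reference annotation is temporal intersection-over-union (IoU). The sketch below is illustrative only: this card does not state which metric SpotSound is evaluated with, and the function name and example values are hypothetical.

```python
def temporal_iou(pred, ref):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    if pred_end < pred_start or ref_end < ref_start:
        raise ValueError("each interval must satisfy start <= end")
    # Overlap is clamped at zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical values: a 2 s event predicted 0.5 s too early.
print(temporal_iou((12.0, 14.0), (12.5, 14.5)))  # → 0.6
```

For short events inside long recordings, even a small timing offset lowers the IoU noticeably, which is why this kind of task rewards precise boundaries.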


	## Usage / Quick Start

	To use SpotSound for inference, you need to download both the base Audio Flamingo 3 model and the SpotSound checkpoint.

	### 1. Installation

	First, clone the official [SpotSound GitHub repository](https://github.com/LoieSun/SpotSound) and set up the environment:

	```bash
	git clone https://github.com/LoieSun/SpotSound.git
	cd SpotSound
	conda create -n SpotSound python=3.10
	conda activate SpotSound
	pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
	pip install -r requirements.txt
	```
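With the environment ready, the two sets of weights can be fetched from the Hugging Face Hub. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The repository ids are assumptions: `nvidia/audio-flamingo-3-hf` is taken from this card's `base_model` field, and `Loie/SpotSound` is assumed to be the id of this model repository. Which Audio Flamingo 3 variant `inference.py` expects is not stated here, so check the GitHub repository before relying on it. The helper name `download_all` is hypothetical.

```python
# Assumed repo ids mapped to local directories. The directories mirror the
# placeholder paths used in the inference command of this card.
DOWNLOADS = {
    "nvidia/audio-flamingo-3-hf": "path_to_audioflamingo3",
    "Loie/SpotSound": "ckpt/spotsound",
}


def download_all(downloads=DOWNLOADS):
    """Fetch each repository snapshot into its local directory."""
    # Imported lazily so the mapping can be inspected without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in downloads.items():
        path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
        print(f"{repo_id} -> {path}")
```

Calling `download_all()` starts the transfer. Gated or private repositories may additionally require logging in with a Hugging Face access token.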

	### 2. Run Inference

	You can run inference directly from the command line using the provided script in the GitHub repository. Specify the path to the downloaded Audio Flamingo 3 model, your SpotSound checkpoint, the target audio file, and your text query.

	```bash
	export CUDA_VISIBLE_DEVICES=0
	python inference.py \
	--pretrain_model path_to_audioflamingo3 \
	--checkpoint ckpt/spotsound \
	--audio_path data/audio.wav \
	--query "dog barking"
	```
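To ground several queries in the same recording, the command above can be wrapped in a small Python helper. This is a sketch under stated assumptions: it only reuses the four flags shown in this card, it must be run from the repository root so that `inference.py` is found, and the helper names `build_command` and `run_queries` are hypothetical. The output format of `inference.py` is not documented here, so the script's output is passed through to the terminal rather than parsed.

```python
import subprocess
import sys


def build_command(audio_path, query,
                  pretrain_model="path_to_audioflamingo3",
                  checkpoint="ckpt/spotsound"):
    """Assemble the inference.py invocation using the flags from this card."""
    return [
        sys.executable, "inference.py",
        "--pretrain_model", pretrain_model,
        "--checkpoint", checkpoint,
        "--audio_path", audio_path,
        "--query", query,
    ]


def run_queries(audio_path, queries):
    """Run one inference process per query; output goes straight to stdout."""
    for query in queries:
        # check=True raises CalledProcessError if inference.py exits non-zero.
        subprocess.run(build_command(audio_path, query), check=True)
```

For example, `run_queries("data/audio.wav", ["dog barking", "door slamming"])` runs the script twice. Each call reloads the model, which is simple but slow; for many queries, loading the model once inside a modified script would be faster.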

	## Citation

	If you use SpotSound or our benchmark in your research, please cite our paper:

	```bibtex
	@article{sun2026spotsound,
	title={SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding},
	author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
	journal={arXiv preprint arXiv:2604.13023},
	year={2026}
	}
	```

	## Acknowledgements

	This project builds upon several excellent open-source efforts, notably:
	- [Audio Flamingo 3](https://github.com/NVIDIA/audio-flamingo/tree/audio_flamingo_3)
	- [UniTime](https://github.com/Lzq5/UniTime)

	## Contact

	For any questions or issues, please contact: loiesun411@gmail.com.