---
license: mit
language:
- en
base_model:
- nvidia/audio-flamingo-3-hf
tags:
- audio
- audio temporal grounding
- large audio-language models
---

# SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/LoieSun/SpotSound)
[![Paper](https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv)](https://arxiv.org/abs/2604.13023)
[![Benchmark](https://img.shields.io/badge/🤗_HuggingFace-Benchmark-yellow)](https://huggingface.co/datasets/Loie/SpotSound-Bench)

## Model Summary

**SpotSound** enhances Large Audio-Language Models (ALMs) with fine-grained temporal grounding. Built on top of [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3), SpotSound accurately pinpoints the start and end timestamps of specific acoustic events within long, untrimmed audio recordings, given a natural language query. It is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded in complex background noise.

## Usage / Quick Start

To run SpotSound for inference, download both the base **Audio Flamingo 3** model and the **SpotSound** checkpoint.

### 1. Installation

First, clone the official [SpotSound GitHub repository](https://github.com/LoieSun/SpotSound) and set up the environment:

```bash
conda create -n SpotSound python=3.10
conda activate SpotSound
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

### 2. Run Inference

You can run inference directly from the command line using the script provided in the GitHub repository. Specify the path to the downloaded Audio Flamingo 3 model, your SpotSound checkpoint, the target audio file, and your text query.
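Before launching inference, it can help to confirm that the CUDA build of PyTorch installed above is actually usable. The snippet below is an optional sanity check, not part of the SpotSound repository:

```python
# Optional sanity check (not part of the official repo): confirm that the
# CUDA build of PyTorch is importable and sees at least one GPU.
import importlib.util


def cuda_ready() -> bool:
    """Return True if torch is importable and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch is not installed in this environment
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("CUDA ready:", cuda_ready())
```

If this prints `False`, re-check the `--index-url https://download.pytorch.org/whl/cu121` wheel selection against your local CUDA driver before running the command below.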
```bash
export CUDA_VISIBLE_DEVICES=0
python inference.py \
  --pretrain_model path_to_audioflamingo3 \
  --checkpoint ckpt/spotsound \
  --audio_path data/audio.wav \
  --query "dog barking"
```

## Citation

If you use SpotSound or our benchmark in your research, please cite our paper:

```bibtex
@article{sun2026spotsound,
  title={SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding},
  author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang, Yanking and Xie, Weidi},
  journal={arXiv preprint arXiv:2604.13023},
  year={2026}
}
```

## Acknowledgements

This project builds upon several excellent open-source efforts, notably:

- **[Audio Flamingo 3](https://github.com/NVIDIA/audio-flamingo/tree/audio_flamingo_3)**
- **[UniTime](https://github.com/Lzq5/UniTime)**

## Contact

For any questions or issues, please contact loiesun411@gmail.com.