| --- |
| license: mit |
| language: |
| - en |
| base_model: |
| - nvidia/audio-flamingo-3-hf |
| tags: |
| - audio |
| - audio temporal grounding |
| - large audio-language models |
| --- |
| |
| # SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding |
|
|
| [GitHub](https://github.com/LoieSun/SpotSound) |
| [arXiv](https://arxiv.org/abs/2604.13023) |
| [SpotSound-Bench dataset](https://huggingface.co/datasets/Loie/SpotSound-Bench) |
|
|
| ## Model Summary |
|
|
| **SpotSound** extends Large Audio-Language Models (ALMs) with fine-grained temporal grounding. Built on top of [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3), it takes a natural language query and predicts the start and end timestamps of the matching acoustic event in long, untrimmed audio recordings. |
|
|
| This model is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded within complex background noise. |
|
|
|
|
| ## Usage / Quick Start |
|
|
| To use SpotSound for inference, you need to download both the base **Audio Flamingo 3** model and the **SpotSound** checkpoint. |
|
|
| ### 1. Installation |
|
|
| First, clone the official [SpotSound GitHub repository](https://github.com/LoieSun/SpotSound) and set up the environment: |
|
|
| ```bash |
| git clone https://github.com/LoieSun/SpotSound.git |
| cd SpotSound |
| conda create -n SpotSound python=3.10 |
| conda activate SpotSound |
| pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121 |
| pip install -r requirements.txt |
| ``` |
|
|
| ### 2. Run Inference |
|
|
| You can run inference directly from the command line using the provided script in the GitHub repository. Specify the path to the downloaded Audio Flamingo 3 model, your SpotSound checkpoint, the target audio file, and your text query. |
|
|
| ```bash |
| export CUDA_VISIBLE_DEVICES=0 |
| python inference.py \ |
| --pretrain_model path_to_audioflamingo3 \ |
| --checkpoint ckpt/spotsound \ |
| --audio_path data/audio.wav \ |
| --query "dog barking" |
| ``` |
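To ground several queries in the same recording, the command above can be wrapped in a small shell function. This is a minimal sketch, not part of the repository: it assumes `inference.py` accepts exactly the four flags shown above, and the names `ground_queries`, `AF3_PATH`, `CKPT_PATH`, and `DRY_RUN` are hypothetical helpers introduced here for illustration.

```shell
# Placeholder paths; override them via the environment if yours differ.
AF3_PATH="${AF3_PATH:-path_to_audioflamingo3}"
CKPT_PATH="${CKPT_PATH:-ckpt/spotsound}"

# Run inference.py once per query on a single audio file.
# Usage: ground_queries <audio_path> <query> [<query> ...]
# With DRY_RUN=1 the commands are printed instead of executed.
ground_queries() {
  audio_path="$1"
  shift
  for query in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'python inference.py --pretrain_model %s --checkpoint %s --audio_path %s --query "%s"\n' \
        "$AF3_PATH" "$CKPT_PATH" "$audio_path" "$query"
    else
      python inference.py \
        --pretrain_model "$AF3_PATH" \
        --checkpoint "$CKPT_PATH" \
        --audio_path "$audio_path" \
        --query "$query"
    fi
  done
}
```

For example, `DRY_RUN=1 ground_queries data/audio.wav "dog barking" "car horn"` prints the two commands without running them. Each call starts a fresh Python process, so the model is reloaded for every query; that keeps the sketch simple but is slow for large batches.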
|
|
| ## Citation |
|
|
| If you use SpotSound or our benchmark in your research, please cite our paper: |
|
|
| ```bibtex |
| @article{sun2026spotsound, |
| title={SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding}, |
| author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang, Yanking and Xie, Weidi}, |
| journal={arXiv preprint arXiv:2604.13023}, |
| year={2026} |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| This project builds upon several excellent open-source efforts, notably: |
| - **[Audio Flamingo 3](https://github.com/NVIDIA/audio-flamingo/tree/audio_flamingo_3)** |
| - **[UniTime](https://github.com/Lzq5/UniTime)** |
|
|
| ## Contact |
|
|
| For questions or issues, please contact loiesun411@gmail.com. |