Add pipeline tag, library name, and project links
#1
by nielsr HF Staff - opened

README.md CHANGED

@@ -1,9 +1,10 @@
 ---
-
+base_model: nvidia/audio-flamingo-3-hf
 language:
 - en
-
-
+license: mit
+pipeline_tag: audio-text-to-text
+library_name: peft
 tags:
 - audio
 - audio temporal grounding
@@ -14,14 +15,14 @@ tags:
 
 [](https://github.com/LoieSun/SpotSound)
 [](https://arxiv.org/abs/2604.13023)
+[](https://loiesun.github.io/spotsound/)
 [](https://huggingface.co/datasets/Loie/SpotSound-Bench)
 
 ## Model Summary
 
 **SpotSound** is a model designed to enhance Large Audio-Language Models (ALMs) with fine-grained temporal grounding capabilities. Built on top of [Audio Flamingo 3](https://huggingface.co/nvidia/audio-flamingo-3), SpotSound is capable of accurately pinpointing the exact start and end timestamps of specific acoustic events within long, untrimmed audio recordings based on natural language queries.
 
-This model is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded within complex background noise.
-
+This model is particularly effective for "needle-in-a-haystack" audio retrieval tasks, where short target sounds are embedded within complex background noise. For more details, see the paper: [SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding](https://huggingface.co/papers/2604.13023).
 
 ## Usage / Quick Start
 
@@ -29,7 +30,7 @@ To use SpotSound for inference, you need to download both the base **Audio Flami
 
 ### 1. Installation
 
-First, clone the official [SpotSound GitHub repository](
+First, clone the official [SpotSound GitHub repository](https://github.com/LoieSun/SpotSound) and set up the environment:
 
 ```bash
 conda create -n SpotSound python=3.10
@@ -53,13 +54,13 @@ python inference.py \
 
 ## Citation
 
-If you use SpotSound or
+If you use SpotSound or the benchmark in your research, please cite the paper:
 
 ```bibtex
 @inproceedings{sun2026spotsound,
 title={SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding},
-author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang,
-
+author={Sun, Luoyi and Zhou, Xiao and Li, Zeqian and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
+journal={arXiv preprint arXiv:2604.13023},
 year={2026}
 }
 ```
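For reference, this is the full YAML frontmatter the card carries after the change, assembled from the added and unchanged lines of the first hunk; the closing `---` delimiter falls outside the diff context and is assumed:

```yaml
---
base_model: nvidia/audio-flamingo-3-hf
language:
- en
license: mit
pipeline_tag: audio-text-to-text
library_name: peft
tags:
- audio
- audio temporal grounding
---
```

`pipeline_tag` and `library_name` drive how the Hub surfaces the model (task filter and the auto-generated "Use this model" snippet), and `base_model` links the card back to `nvidia/audio-flamingo-3-hf`.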