---
license: cc-by-nc-4.0
---

# DSpAST: Disentangled Spatial Audio Spectrogram Transformer

[arXiv](https://arxiv.org/abs/2509.13927) | [GitHub](https://github.com/wilkinghoff/DSpAST)

Checkpoints of [DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models](https://arxiv.org/abs/2509.13927).
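
As a minimal sketch of how the checkpoint files could be fetched programmatically, the snippet below derives the repository id, revision and file name from the checkpoint links listed in the Performance section and passes them to `hf_hub_download` from the `huggingface_hub` package. The helper names `split_blob_url` and `download_checkpoint` and the dictionary `DSPAST_CHECKPOINT_URLS` are hypothetical and not part of the DSpAST code base. How the downloaded file is loaded into the model is not shown here; please refer to the GitHub repository for that.

```python
from urllib.parse import urlparse

# Checkpoint links as listed in the Performance table of this model card.
# The dictionary name and its keys are illustrative assumptions.
DSPAST_CHECKPOINT_URLS = {
    "stage1": "https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage1",
    "stage2": "https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage2",
    "stage3": "https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage3",
}


def split_blob_url(url):
    """Split a model-repo 'blob' URL into (repo_id, revision, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: <user>/<repo>/blob/<revision>/<path to file>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a model-repo blob URL: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def download_checkpoint(stage="stage3"):
    """Download one checkpoint and return its path in the local cache."""
    # Imported lazily so the URL helper above works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    repo_id, revision, filename = split_blob_url(DSPAST_CHECKPOINT_URLS[stage])
    return hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
```

Calling `download_checkpoint("stage3")` requires network access and an installed `huggingface_hub`; it returns the local path of the cached file.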

***

## Performance

On our system, the provided checkpoints achieve the following performance:

| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
| :---: | :---: | :---: | :---: | :---: |
| [SpatialAST](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/SpatialAST/finetuned.pth) | 49.90 | 24.43 | 17.87 | 32.50 |
| [DSpAST (stage 1)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage1) | 53.05 | 98.56 | 95.57 | 97.58 |
| [DSpAST (stage 2)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage2) | 52.64 | 20.31 | **14.44** | 28.35 |
| [DSpAST (stage 3)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage3) | **54.53** | **20.28** | **14.44** | **28.03** |

Similar improvements can be observed when DSpAST is used as the binaural encoder for spatial audio reasoning with LLMs. See our [paper](https://arxiv.org/abs/2509.13927) for further details.

***

## References

If you use the checkpoints in your work, we kindly ask you to cite the following papers:

``` latex
@article{wilkinghoff2025dspast,
  author  = {Wilkinghoff, Kevin and
             Tan, Zheng-Hua},
  title   = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
  journal = {arXiv:2509.13927},
  year    = {2025}
}
```

and the original [BAT](https://zhishengzheng.com/bat/) paper, which is the foundation of this work:

``` latex
@inproceedings{zheng2024bat,
  author    = {Zheng, Zhisheng and
               Peng, Puyuan and
               Ma, Ziyang and
               Chen, Xie and
               Choi, Eunsol and
               Harwath, David},
  title     = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
  booktitle = {Proc. ICML},
  year      = {2024}
}
```