license: cc-by-nc-4.0

DSpAST: Disentangled Spatial Audio Spectrogram Transformer

arXiv | GitHub

This repository provides the model checkpoints for the paper "DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models".


Performance

On our system, the provided checkpoints achieve the following performance:

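| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
|------------------|---------|-----------|---------|---------|
| SpatialAST       | 49.90   | 24.43     | 17.87   | 32.50   |
| DSpAST (stage 1) | 53.05   | 98.56     | 95.57   | 97.58   |
| DSpAST (stage 2) | 52.64   | 20.31     | 14.44   | 28.35   |
| DSpAST (stage 3) | 54.53   | 20.28     | 14.44   | 28.03   |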
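
To put the differences in perspective, the change of the final DSpAST checkpoint relative to the SpatialAST baseline can be computed directly from the reported numbers. The following is only an illustrative sketch: the values are copied from the results above, while the variable names and the helper `relative_change` are hypothetical and not part of the released code.

```python
# Values copied from the reported results: (mAP, ER20°, MAE, DER).
results = {
    "SpatialAST": (49.90, 24.43, 17.87, 32.50),
    "DSpAST (stage 3)": (54.53, 20.28, 14.44, 28.03),
}
metrics = ("mAP", "ER20", "MAE", "DER")


def relative_change(baseline, value):
    """Return the change of `value` relative to `baseline` in percent."""
    return 100.0 * (value - baseline) / baseline


# Relative change per metric, DSpAST (stage 3) versus SpatialAST.
changes = {
    name: relative_change(base, new)
    for name, base, new in zip(
        metrics, results["SpatialAST"], results["DSpAST (stage 3)"]
    )
}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

Since mAP is a higher-is-better metric and the other three are lower-is-better, a positive change for mAP and negative changes for ER20°, MAE, and DER all indicate an improvement over the baseline.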

Similar performance improvements can also be observed when using DSpAST as a binaural encoder for spatial audio reasoning with LLMs. See our paper for details.


References

If you use the checkpoints for your work, we kindly ask you to cite the following papers:

@article{wilkinghoff2025dspast,
    author     = {Wilkinghoff, Kevin and
                  Tan, Zheng-Hua},
    title      = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
    journal    = {arXiv:2509.13927},
    year       = {2025}
}

and the original BAT paper, which is the foundation of this work:

@inproceedings{zheng2024bat,
  author       = {Zheng, Zhisheng and
                  Peng, Puyuan and
                  Ma, Ziyang and
                  Chen, Xie and
                  Choi, Eunsol and
                  Harwath, David},
  title        = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
  booktitle    = {Proc. ICML},
  year         = {2024}
}