	---
	license: cc-by-nc-4.0
	---
	# DSpAST: Disentangled Spatial Audio Spectrogram Transformer

	[arXiv](https://arxiv.org/abs/2509.13927) | [GitHub](https://github.com/wilkinghoff/DSpAST)

	Checkpoints of [DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models](https://arxiv.org/abs/2509.13927).
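A minimal sketch of how one of the checkpoints might be fetched programmatically, assuming the `huggingface_hub` package is installed. The repository ID and file names are taken from the checkpoint links in the Performance table; the helper names `checkpoint_filename` and `download_checkpoint` are hypothetical and not part of the DSpAST code base.

```python
from pathlib import Path

# Repository ID and file names as they appear in the checkpoint links of this model card.
REPO_ID = "kwilk90/DSpAST"
CHECKPOINTS = {1: "DSpAST-stage1", 2: "DSpAST-stage2", 3: "DSpAST-stage3"}


def checkpoint_filename(stage: int) -> str:
    """Return the repository file name for a training stage (1, 2 or 3)."""
    if stage not in CHECKPOINTS:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[stage]


def download_checkpoint(stage: int = 3) -> Path:
    """Download one checkpoint into the local Hugging Face cache and return its path.

    Hypothetical helper: it only wraps hf_hub_download and does not load the model.
    """
    # Imported lazily so checkpoint_filename() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(stage)))
```

Calling `download_checkpoint(3)` would then return the local path of the stage 3 checkpoint. How the file is loaded into the model is not covered by this model card; the GitHub repository is the place to check for that.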

	***

	## Performance

	On our system, the provided checkpoints achieve the following performance:

	| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
	| :---: | :---: | :---: | :---: | :---: |
	| [SpatialAST](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/SpatialAST/finetuned.pth) | 49.90 | 24.43 | 17.87 | 32.50 |
	| [DSpAST (stage 1)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage1) | 53.05 | 98.56 | 95.57 | 97.58 |
	| [DSpAST (stage 2)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage2) | 52.64 | 20.31 | 14.44 | 28.35 |
	| [DSpAST (stage 3)](https://huggingface.co/kwilk90/DSpAST/blob/main/DSpAST-stage3) | 54.53 | 20.28 | 14.44 | 28.03 |

	Similar performance improvements can also be observed when using DSpAST as a binaural encoder for spatial audio reasoning with LLMs. Please have a look at our [paper](https://arxiv.org/abs/2509.13927) for further information.
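To make the size of the improvements concrete, the following sketch computes the difference between each DSpAST checkpoint and the SpatialAST baseline. The numbers are copied from the table above; the names `RESULTS`, `METRICS` and `deltas` are illustrative only. A positive difference is better for mAP, a negative one is better for the other three metrics.

```python
# Scores copied from the performance table, in the order (mAP, ER20°, MAE, DER).
RESULTS = {
    "SpatialAST": (49.90, 24.43, 17.87, 32.50),
    "DSpAST (stage 1)": (53.05, 98.56, 95.57, 97.58),
    "DSpAST (stage 2)": (52.64, 20.31, 14.44, 28.35),
    "DSpAST (stage 3)": (54.53, 20.28, 14.44, 28.03),
}
METRICS = ("mAP", "ER20°", "MAE", "DER")


def deltas(model: str, baseline: str = "SpatialAST") -> dict:
    """Return model score minus baseline score for every metric, rounded to 2 decimals."""
    return {
        metric: round(score - reference, 2)
        for metric, score, reference in zip(METRICS, RESULTS[model], RESULTS[baseline])
    }


for name in RESULTS:
    if name != "SpatialAST":
        print(name, deltas(name))
```

For stage 3 this gives +4.63 mAP, -4.15 ER20°, -3.43 MAE and -4.47 DER relative to SpatialAST.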

	***

	## References

	If you use the checkpoints for your work, we kindly ask you to cite the following papers:

	```bibtex
	@article{wilkinghoff2025dspast,
	  author  = {Wilkinghoff, Kevin and
	             Tan, Zheng-Hua},
	  title   = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
	  journal = {arXiv:2509.13927},
	  year    = {2025}
	}
	```
	and the original [BAT](https://zhishengzheng.com/bat/) paper, which is the foundation of this work:
	```bibtex
	@inproceedings{zheng2024bat,
	  author    = {Zheng, Zhisheng and
	               Peng, Puyuan and
	               Ma, Ziyang and
	               Chen, Xie and
	               Choi, Eunsol and
	               Harwath, David},
	  title     = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
	  booktitle = {Proc. ICML},
	  year      = {2024}
	}
	```