	---
	language: en
	license: mit
	tags:
	- text-to-speech
	- visual-tts
	- spatial-audio
	- speech-synthesis
	- icassp2025
	datasets:
	- soundspaces-speech
	pipeline_tag: text-to-speech
	---

	<div align="center">

	<h1>MS<sup>2</sup>KU-VTTS</h1>
	<h3>Multi-Source Spatial Knowledge Understanding <br> for Immersive Visual Text-to-Speech</h3>

	[Shuwei He](https://he-shuwei.github.io/), [Rui Liu](https://ttslr.github.io/people.html)<sup>*</sup>

	Inner Mongolia University    <sup>*</sup> Corresponding Author

	(Accepted by ICASSP 2025)

	</div>

	<div align="center">
	<a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
	<img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
	</a>
	<a href="LICENSE">
	<img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
	</a>
	</div>

	<br>

	## Abstract

	Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics. To address these issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS<sup>2</sup>KU-VTTS. Specifically, we first prioritize RGB image as the dominant source and consider depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. Afterwards, we propose a serial interaction mechanism to effectively integrate both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on the respective contributions of each source. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive speech experience. Experimental results demonstrate that MS<sup>2</sup>KU-VTTS surpasses existing baselines in generating immersive speech.

	## Overview

	<p align="center">
	<img src="assets/model.png" width="100%" alt="MS2KU-VTTS Architecture">
	</p>

	The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
	- Multi-source Spatial Knowledge: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
	- Dominant-Supplement Serial Interaction (D-SSI): RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
	- Dynamic Fusion: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
	- Speech Generation: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
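The Dynamic Fusion component can be illustrated with a small sketch. This is not the authors' implementation: the README only states that the weighting is entropy-based, so the specific rule used here (a softmax over negative entropies, so that a lower-entropy source gets a larger weight) is an assumption, as are the function names and the toy source names. It uses only the Python standard library.

```python
import math
from typing import Dict, List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(
    features: Dict[str, List[float]]
) -> Tuple[Dict[str, float], List[float]]:
    """Fuse per-source feature vectors with entropy-derived weights.

    Assumption (not taken from the paper): a source whose softmax-normalized
    features have lower entropy is treated as more informative, so the
    weights are a softmax over the negative entropies.
    """
    names = list(features)
    entropies = [entropy(softmax(features[n])) for n in names]
    weights = softmax([-h for h in entropies])
    dim = len(features[names[0]])
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


if __name__ == "__main__":
    # Toy 4-dimensional features; the source names are illustrative only.
    toy = {
        "rgb_depth": [4.0, 0.0, 0.0, 0.0],         # peaked: low entropy
        "speaker_position": [1.0, 1.0, 1.0, 1.0],  # flat: maximum entropy
        "rgb_semantic": [2.0, 1.0, 0.0, 0.0],
    }
    weights, fused = entropy_weighted_fusion(toy)
    print(weights)
    print(fused)
```

In this toy setup the peaked `rgb_depth` vector receives the largest weight and the flat `speaker_position` vector the smallest. The real model operates on learned feature tensors and may define the entropy and the weighting differently.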

	## Installation

	```bash
	git clone https://github.com/he-shuwei/MS2KU-VTTS.git
	cd MS2KU-VTTS
	pip install -r requirements.txt
	```

	Checkpoints and data: download the following resources from [HuggingFace](https://huggingface.co/he-shuwei/MS2KU-VTTS):

	| Resource | Path | Description |
	|---|---|---|
	| MS2KU-VTTS (finetuned) | `checkpoints/ms2ku_vtts/` | Finetuned model for inference |
	| Pretrain Encoder | `checkpoints/pretrain_encoder/` | Pretrained TTS encoder |
	| Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
	| BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
	| Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
	| MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |

	The following third-party checkpoints are also required. Download them from their official sources:

	| Model | Path | Source |
	|---|---|---|
	| BERT-large-uncased | `checkpoints/bert-large-uncased/` | [Google](https://huggingface.co/google-bert/bert-large-uncased) |
	| ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
	| RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
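Before preprocessing or inference, it can help to confirm that every resource listed in the two tables above is where the scripts expect it. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and it only checks that each path exists, not that its contents are complete or valid.

```python
from pathlib import Path
from typing import List

# Paths copied from the two tables above, relative to the repository root.
EXPECTED_PATHS = [
    "checkpoints/ms2ku_vtts",
    "checkpoints/pretrain_encoder",
    "checkpoints/pretrain_decoder",
    "checkpoints/bigvgan",
    "data/raw_data/captions",
    "data/processed_data/mfa/mfa_outputs.tar.gz",
    "checkpoints/bert-large-uncased",
    "checkpoints/resnet-18",
    "checkpoints/RMVPE/rmvpe.pt",
]


def find_missing(root: str = ".") -> List[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing resources:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected checkpoints and data files are present.")
```

Run it from the repository root (for example after saving it as a standalone script) to list anything that still needs to be downloaded.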

	Data: this project uses the [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) dataset. Follow the instructions in that repository to obtain the raw data, then run the preprocessing pipeline:

	1. Download pretrained models:
	```bash
	python scripts/download_bert.py
	python scripts/download_resnet18.py
	```

	2. Extract ResNet-18 features (RGB and depth):
	```bash
	bash scripts/extract_resnet18_features/run.sh start
	```

	3. Generate captions and extract caption features (Gemini + BERT):
	```bash
	python scripts/generate_gemini_captions.py --api_key YOUR_KEY --image_dir data/processed_data/images --output_dir data/processed_data/captions
	bash scripts/extract_caption_features/run.sh start
	```

	4. Extract speaker position features:
	```bash
	bash scripts/extract_speaker_position/run.sh start
	```

	5. Binarize data:
	```bash
	bash scripts/binarize/run.sh start
	```

	## Training

	```bash
	bash scripts/train/run.sh start
	```

	Monitor training:
	```bash
	bash scripts/train/run.sh log
	```

	Check status:
	```bash
	bash scripts/train/run.sh status
	```

	## Inference

	```bash
	bash scripts/infer/run_infer.sh \
	--ckpt checkpoints/ms2ku_vtts/model_ckpt_best.pt \
	--outdir results/ms2ku_vtts/test_seen \
	--batch_size 16
	```

	## Citation

	If you find this work useful, please consider citing:

	```bibtex
	@inproceedings{he2025multi,
	title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
	author={He, Shuwei and Liu, Rui},
	booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
	pages={1--5},
	year={2025},
	organization={IEEE}
	}
	```

	## Acknowledgements

	This work was funded by the Young Scientists Fund (No. 62206136) and the General Program (No. 62476146) of the National Natural Science Foundation of China, and by the "Inner Mongolia Science and Technology Achievement Transfer and Transformation Demonstration Zone, University Collaborative Innovation Base, and University Entrepreneurship Training Base" Construction Project (Supercomputing Power Project) (No. 21300-231510).

	This project builds upon several excellent open-source projects. We gratefully acknowledge:

	Model Architectures & Code
	- [F5-TTS](https://github.com/SWivid/F5-TTS) — Diffusion Transformer (DiT) architecture
	- [BigVGAN](https://github.com/NVIDIA/BigVGAN) — Neural vocoder by NVIDIA
	- [RMVPE](https://github.com/Dream-high/RMVPE) — Robust pitch (F0) estimation
	- [x-transformers](https://github.com/lucidrains/x-transformers) — Rotary positional embeddings
	- [FlashAttention](https://github.com/Dao-AILab/flash-attention) — Memory-efficient attention kernels

	Pretrained Models
	- [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google) — Caption feature extraction
	- [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft) — RGB and depth visual feature extraction

	Datasets & Tools
	- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research) — Audio-visual spatial speech dataset
	- [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/) — Phoneme-level forced alignment
	- [Google Gemini](https://ai.google.dev/) — Panoramic scene caption generation

	Libraries
	- [PyTorch](https://pytorch.org/) — Deep learning framework
	- [librosa](https://librosa.org/) — Audio analysis and processing
	- [HuggingFace Transformers](https://github.com/huggingface/transformers) — Pretrained model loading
	- [matplotlib](https://matplotlib.org/) — Visualization