he-shuwei/MS2KU-VTTS

@@ -26,13 +26,10 @@ Inner Mongolia University &nbsp;&nbsp; <sup>*</sup> Corresponding Author
 </div>
 <div align="center">
-  <a href="https://github.com/he-shuwei/MS2KU-VTTS">
-    <img src='https://img.shields.io/badge/GitHub-Code-black?style=flat&logo=github' alt='github'>
-  </a>
   <a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
     <img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
   </a>
-  <a href="https://github.com/he-shuwei/MS2KU-VTTS/blob/main/LICENSE">
     <img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
   </a>
 </div>
@@ -45,13 +42,25 @@ Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt
 ## Overview
 The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
 - **Multi-source Spatial Knowledge**: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
 - **Dominant-Supplement Serial Interaction (D-SSI)**: RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
 - **Dynamic Fusion**: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
 - **Speech Generation**: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
-## Files
 | Resource | Path | Description |
 |---|---|---|
@@ -60,7 +69,7 @@ The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
 | Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
 | BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
 | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
-| MFA alignment results | `data/processed_data/mfa/outputs/` | Pre-computed forced alignment (TextGrid) |
 The following third-party checkpoints are also required. Please download them from their official sources:
@@ -70,18 +79,64 @@ The following third-party checkpoints are also required. Please download from th
 | ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
 | RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
-## Usage
 ```bash
-git clone https://github.com/he-shuwei/MS2KU-VTTS.git
-cd MS2KU-VTTS
-pip install -r requirements.txt
 ```
-See the [GitHub repository](https://github.com/he-shuwei/MS2KU-VTTS) for full training and inference instructions.
 ## Citation
 ```bibtex
 @inproceedings{he2025multi,
   title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
@@ -100,17 +155,23 @@ This work was funded by the Young Scientists Fund (No. 62206136) and the General
 This project builds upon several excellent open-source projects. We gratefully acknowledge:
 **Model Architectures & Code**
-- [F5-TTS](https://github.com/SWivid/F5-TTS) — Diffusion Transformer (DiT) architecture
-- [BigVGAN](https://github.com/NVIDIA/BigVGAN) — Neural vocoder by NVIDIA
-- [RMVPE](https://github.com/Dream-high/RMVPE) — Robust pitch (F0) estimation
-- [x-transformers](https://github.com/lucidrains/x-transformers) — Rotary positional embeddings
-- [FlashAttention](https://github.com/Dao-AILab/flash-attention) — Memory-efficient attention kernels
 **Pretrained Models**
-- [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google) — Caption feature extraction
-- [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft) — RGB and depth visual feature extraction
 **Datasets & Tools**
-- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research) — Audio-visual spatial speech dataset
-- [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/) — Phoneme-level forced alignment
-- [Google Gemini](https://ai.google.dev/) — Panoramic scene caption generation

 </div>
 <div align="center">
   <a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
     <img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
   </a>
+  <a href="LICENSE">
     <img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
   </a>
 </div>
 ## Overview
+<p align="center">
+  <img src="assets/model.png" width="100%" alt="MS2KU-VTTS Architecture">
+</p>
 The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
 - **Multi-source Spatial Knowledge**: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
 - **Dominant-Supplement Serial Interaction (D-SSI)**: RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
 - **Dynamic Fusion**: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
 - **Speech Generation**: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
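
The Dynamic Fusion component above is described only as entropy-based dynamic weighting. As a rough illustration of that idea (not the authors' implementation), the sketch below treats each source's feature vector as logits, measures the entropy of its softmax distribution, and gives lower-entropy sources larger fusion weights. The function names, the softmax over negative entropies, and the toy vectors are all assumptions made for this example; the formulation in the paper may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(features):
    """Fuse K equal-length feature vectors (hypothetical helper).

    Assumption: a source whose softmax distribution has lower entropy is
    treated as more informative and receives a larger weight.
    """
    entropies = [entropy(softmax(f)) for f in features]
    weights = softmax([-h for h in entropies])  # weights sum to 1
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Toy input: one peaked source and one flat source.
peaked_source = [4.0, 0.0, 0.0]
flat_source = [1.0, 1.0, 1.0]
fused, weights = entropy_weighted_fusion([peaked_source, flat_source])
print("weights:", weights)
print("fused:", fused)
```

In this toy run the peaked source gets the larger weight because its distribution is less uncertain. A trained model would typically compute such weights per sample on learned feature tensors rather than on hand-written lists.
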
+## Installation
+```bash
+git clone https://github.com/he-shuwei/MS2KU-VTTS.git
+cd MS2KU-VTTS
+pip install -r requirements.txt
+```
+**Checkpoints & Data** &mdash; download from [HuggingFace](https://huggingface.co/he-shuwei/MS2KU-VTTS):
 | Resource | Path | Description |
 |---|---|---|
 | Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
 | BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
 | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
+| MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |
 The following third-party checkpoints are also required. Please download them from their official sources:
 | ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
 | RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
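
Before preprocessing, it can help to confirm that the downloaded files sit where the tables above expect them. The following is a minimal sketch, assuming it is run from the repository root; `find_missing` is a hypothetical helper, and the list covers only the paths visible in the tables above, so the full README may require more entries.

```python
from pathlib import Path

# Paths copied from the tables above; entries ending in "/" are directories.
REQUIRED_PATHS = [
    "checkpoints/pretrain_decoder/",
    "checkpoints/bigvgan/",
    "data/raw_data/captions/",
    "data/processed_data/mfa/mfa_outputs.tar.gz",
    "checkpoints/resnet-18/",
    "checkpoints/RMVPE/rmvpe.pt",
]


def find_missing(root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under `root`."""
    base = Path(root)
    missing = []
    for rel in required:
        path = base / rel
        # Directories and files are checked differently on purpose.
        present = path.is_dir() if rel.endswith("/") else path.is_file()
        if not present:
            missing.append(rel)
    return missing


missing = find_missing(".")
if missing:
    for rel in missing:
        print(f"missing: {rel}")
else:
    print("all listed checkpoints and data files are present")
```

The check only tests for existence; it does not verify file contents or checksums.
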
+**Data** &mdash; This project uses the [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) dataset. Follow the dataset's instructions to obtain the raw data, then run the preprocessing pipeline:
+1. **Download pretrained models**:
+   ```bash
+   python scripts/download_bert.py
+   python scripts/download_resnet18.py
+   ```
+2. **ResNet-18 features** (RGB & depth):
+   ```bash
+   bash scripts/extract_resnet18_features/run.sh start
+   ```
+3. **Caption features** (Gemini + BERT):
+   ```bash
+   python scripts/generate_gemini_captions.py --api_key YOUR_KEY --image_dir data/processed_data/images --output_dir data/processed_data/captions
+   bash scripts/extract_caption_features/run.sh start
+   ```
+4. **Speaker position features**:
+   ```bash
+   bash scripts/extract_speaker_position/run.sh start
+   ```
+5. **Binarize data**:
+   ```bash
+   bash scripts/binarize/run.sh start
+   ```
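
The preprocessing steps above can also be chained from a small driver. The sketch below is one possible approach, not part of the repository: `PIPELINE` and `run_pipeline` are hypothetical names, the commands are copied from steps 1 to 5, and the Gemini caption generation command from step 3 is left out because it needs an API key. The driver defaults to a dry run that only prints the commands.

```python
import subprocess

# Commands copied from the numbered steps above, in order.
# The Gemini caption generation command (step 3) is omitted: it requires an API key.
PIPELINE = [
    ["python", "scripts/download_bert.py"],
    ["python", "scripts/download_resnet18.py"],
    ["bash", "scripts/extract_resnet18_features/run.sh", "start"],
    ["bash", "scripts/extract_caption_features/run.sh", "start"],
    ["bash", "scripts/extract_speaker_position/run.sh", "start"],
    ["bash", "scripts/binarize/run.sh", "start"],
]


def run_pipeline(steps=PIPELINE, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    Execution stops at the first failing step because check=True makes
    subprocess.run raise CalledProcessError on a non-zero exit code.
    Returns the list of commands that were handled.
    """
    handled = []
    for cmd in steps:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
        handled.append(cmd)
    return handled


# Dry run: shows the order of the steps without executing anything.
run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the steps. One caveat: if `run.sh start` launches a background job and returns immediately, each step would need to be confirmed as finished before the next one starts, which this sketch does not do.
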
+## Training
 ```bash
+bash scripts/train/run.sh start
+```
+Monitor training:
+```bash
+bash scripts/train/run.sh log
 ```
+Check status:
+```bash
+bash scripts/train/run.sh status
+```
+## Inference
+```bash
+bash scripts/infer/run_infer.sh \
+    --ckpt checkpoints/ms2ku_vtts/model_ckpt_best.pt \
+    --outdir results/ms2ku_vtts/test_seen \
+    --batch_size 16
+```
 ## Citation
+If you find this work useful, please consider citing:
 ```bibtex
 @inproceedings{he2025multi,
   title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
 This project builds upon several excellent open-source projects. We gratefully acknowledge:
 **Model Architectures & Code**
+- [F5-TTS](https://github.com/SWivid/F5-TTS) &mdash; Diffusion Transformer (DiT) architecture
+- [BigVGAN](https://github.com/NVIDIA/BigVGAN) &mdash; Neural vocoder by NVIDIA
+- [RMVPE](https://github.com/Dream-High/RMVPE) &mdash; Robust pitch (F0) estimation
+- [x-transformers](https://github.com/lucidrains/x-transformers) &mdash; Rotary positional embeddings
+- [FlashAttention](https://github.com/Dao-AILab/flash-attention) &mdash; Memory-efficient attention kernels
 **Pretrained Models**
+- [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google) &mdash; Caption feature extraction
+- [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft) &mdash; RGB and depth visual feature extraction
 **Datasets & Tools**
+- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research) &mdash; Audio-visual spatial speech dataset
+- [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/) &mdash; Phoneme-level forced alignment
+- [Google Gemini](https://ai.google.dev/) &mdash; Panoramic scene caption generation
+**Libraries**
+- [PyTorch](https://pytorch.org/) &mdash; Deep learning framework
+- [librosa](https://librosa.org/) &mdash; Audio analysis and processing
+- [HuggingFace Transformers](https://github.com/huggingface/transformers) &mdash; Pretrained model loading
+- [matplotlib](https://matplotlib.org/) &mdash; Visualization