Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -26,13 +26,10 @@ Inner Mongolia University <sup>*</sup> Corresponding Author
|
|
| 26 |
</div>
|
| 27 |
|
| 28 |
<div align="center">
|
| 29 |
-
<a href="https://github.com/he-shuwei/MS2KU-VTTS">
|
| 30 |
-
<img src='https://img.shields.io/badge/GitHub-Code-black?style=flat&logo=github' alt='github'>
|
| 31 |
-
</a>
|
| 32 |
<a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
|
| 33 |
<img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
|
| 34 |
</a>
|
| 35 |
-
<a href="
|
| 36 |
<img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
|
| 37 |
</a>
|
| 38 |
</div>
|
|
@@ -45,13 +42,25 @@ Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt
|
|
| 45 |
|
| 46 |
## Overview
|
| 47 |
|
| 48 |
The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
|
| 49 |
- **Multi-source Spatial Knowledge**: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
|
| 50 |
- **Dominant-Supplement Serial Interaction (D-SSI)**: RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
|
| 51 |
- **Dynamic Fusion**: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
|
| 52 |
- **Speech Generation**: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
|
| 53 |
|
| 54 |
-
##
|
| 55 |
|
| 56 |
| Resource | Path | Description |
|
| 57 |
|---|---|---|
|
|
@@ -60,7 +69,7 @@ The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
|
|
| 60 |
| Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
|
| 61 |
| BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
|
| 62 |
| Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
|
| 63 |
-
| MFA alignment results | `data/processed_data/mfa/
|
| 64 |
|
| 65 |
The following third-party checkpoints are also required. Please download from their official sources:
|
| 66 |
|
|
@@ -70,18 +79,64 @@ The following third-party checkpoints are also required. Please download from th
|
|
| 70 |
| ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
|
| 71 |
| RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
|
| 72 |
|
| 73 |
-
|
| 74 |
|
| 75 |
```bash
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
```
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
## Citation
|
| 84 |
|
| 85 |
```bibtex
|
| 86 |
@inproceedings{he2025multi,
|
| 87 |
title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
|
|
@@ -100,17 +155,23 @@ This work was funded by the Young Scientists Fund (No. 62206136) and the General
|
|
| 100 |
This project builds upon several excellent open-source projects. We gratefully acknowledge:
|
| 101 |
|
| 102 |
**Model Architectures & Code**
|
| 103 |
-
- [F5-TTS](https://github.com/SWivid/F5-TTS)
|
| 104 |
-
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
|
| 105 |
-
- [RMVPE](https://github.com/Dream-high/RMVPE)
|
| 106 |
-
- [x-transformers](https://github.com/lucidrains/x-transformers)
|
| 107 |
-
- [FlashAttention](https://github.com/Dao-AILab/flash-attention)
|
| 108 |
|
| 109 |
**Pretrained Models**
|
| 110 |
-
- [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google)
|
| 111 |
-
- [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft)
|
| 112 |
|
| 113 |
**Datasets & Tools**
|
| 114 |
-
- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research)
|
| 115 |
-
- [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/)
|
| 116 |
-
- [Google Gemini](https://ai.google.dev/)
|
| 26 |
</div>
|
| 27 |
|
| 28 |
<div align="center">
|
| 29 |
<a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
|
| 30 |
<img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
|
| 31 |
</a>
|
| 32 |
+
<a href="LICENSE">
|
| 33 |
<img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
|
| 34 |
</a>
|
| 35 |
</div>
|
|
|
|
| 42 |
|
| 43 |
## Overview
|
| 44 |
|
| 45 |
+
<p align="center">
|
| 46 |
+
<img src="assets/model.png" width="100%" alt="MS2KU-VTTS Architecture">
|
| 47 |
+
</p>
|
| 48 |
+
|
| 49 |
The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
|
| 50 |
- **Multi-source Spatial Knowledge**: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
|
| 51 |
- **Dominant-Supplement Serial Interaction (D-SSI)**: RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
|
| 52 |
- **Dynamic Fusion**: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
|
| 53 |
- **Speech Generation**: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
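
The README does not spell out how the entropy-based weighting in **Dynamic Fusion** is computed. The following is only a minimal sketch of one plausible reading, assuming that each interaction branch yields a score vector, that a lower-entropy (more peaked) branch should receive a larger weight, and that weights come from a softmax over negative entropies. The function names, branch names, and the weighting rule are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weights(source_scores):
    """Map each source name to a fusion weight.

    source_scores: dict of name -> raw score list (hypothetical input).
    Assumption: lower entropy means a more confident source, so it
    gets a larger weight via softmax over negative entropies.
    """
    names = list(source_scores)
    entropies = [entropy(softmax(source_scores[n])) for n in names]
    weights = softmax([-h for h in entropies])
    return dict(zip(names, weights))


def fuse(features, weights):
    """Weighted sum of equally sized feature vectors, one per source."""
    dim = len(next(iter(features.values())))
    return [sum(weights[n] * features[n][i] for n in features) for i in range(dim)]
```

With this rule, a branch whose scores are sharply peaked contributes more to the fused representation than a branch whose scores are nearly uniform, and the weights always sum to one.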
|
| 54 |
|
| 55 |
+
## Installation
|
| 56 |
+
|
| 57 |
+
```bash
|
| 58 |
+
git clone https://github.com/he-shuwei/MS2KU-VTTS.git
|
| 59 |
+
cd MS2KU-VTTS
|
| 60 |
+
pip install -r requirements.txt
|
| 61 |
+
```
|
| 62 |
+
|
| 63 |
+
**Checkpoints & Data** — download from [HuggingFace](https://huggingface.co/he-shuwei/MS2KU-VTTS):
|
| 64 |
|
| 65 |
| Resource | Path | Description |
|
| 66 |
|---|---|---|
|
|
|
|
| 69 |
| Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
|
| 70 |
| BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
|
| 71 |
| Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
|
| 72 |
+
| MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |
|
| 73 |
|
| 74 |
The following third-party checkpoints are also required. Please download from their official sources:
|
| 75 |
|
|
|
|
| 79 |
| ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
|
| 80 |
| RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
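
Before preprocessing or training, it can help to confirm that the downloaded resources sit where the tables expect them. The sketch below checks only the paths visible in the rows above (the diff elides some rows, so the list is incomplete); `EXPECTED` and `missing_resources` are hypothetical names introduced for illustration and are not part of the repository.

```python
from pathlib import Path

# Paths copied from the visible table rows; rows elided by the diff
# are not included, so extend this list from the full README.
EXPECTED = [
    "checkpoints/pretrain_decoder",
    "checkpoints/bigvgan",
    "data/raw_data/captions",
    "data/processed_data/mfa/mfa_outputs.tar.gz",
    "checkpoints/resnet-18",
    "checkpoints/RMVPE/rmvpe.pt",
]


def missing_resources(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing resources:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All listed resources found.")
```

Run it from the repository root; an empty result means every listed path is present.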
|
| 81 |
|
| 82 |
+
**Data** — this project uses the [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) dataset. Please follow their instructions to obtain the raw data, then run the preprocessing pipeline:
|
| 83 |
+
|
| 84 |
+
1. **Download pretrained models**:
|
| 85 |
+
```bash
|
| 86 |
+
python scripts/download_bert.py
|
| 87 |
+
python scripts/download_resnet18.py
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
2. **ResNet-18 features** (RGB & depth):
|
| 91 |
+
```bash
|
| 92 |
+
bash scripts/extract_resnet18_features/run.sh start
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
3. **Caption features** (Gemini + BERT):
|
| 96 |
+
```bash
|
| 97 |
+
python scripts/generate_gemini_captions.py --api_key YOUR_KEY --image_dir data/processed_data/images --output_dir data/processed_data/captions
|
| 98 |
+
bash scripts/extract_caption_features/run.sh start
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
4. **Speaker position features**:
|
| 102 |
+
```bash
|
| 103 |
+
bash scripts/extract_speaker_position/run.sh start
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
5. **Binarize data**:
|
| 107 |
+
```bash
|
| 108 |
+
bash scripts/binarize/run.sh start
|
| 109 |
+
```
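
The five preprocessing steps above can be chained in a small driver. This is a sketch under assumptions: `build_steps` and `run_pipeline` are hypothetical helpers, the commands are copied from the steps as written, and it assumes each `run.sh start` blocks until its work is finished. If a script instead launches a background job (the training script also exposes `log` and `status` subcommands, which suggests this may be the case), wait for each stage to complete before starting the next rather than relying on this driver.

```python
import subprocess


def build_steps(api_key):
    """Return the preprocessing commands in the order the README lists them."""
    return [
        ["python", "scripts/download_bert.py"],
        ["python", "scripts/download_resnet18.py"],
        ["bash", "scripts/extract_resnet18_features/run.sh", "start"],
        [
            "python", "scripts/generate_gemini_captions.py",
            "--api_key", api_key,
            "--image_dir", "data/processed_data/images",
            "--output_dir", "data/processed_data/captions",
        ],
        ["bash", "scripts/extract_caption_features/run.sh", "start"],
        ["bash", "scripts/extract_speaker_position/run.sh", "start"],
        ["bash", "scripts/binarize/run.sh", "start"],
    ]


def run_pipeline(api_key, dry_run=True):
    """Run each step in order, stopping at the first non-zero exit code.

    With dry_run=True nothing is executed; the planned commands are
    returned so they can be inspected first.
    """
    steps = build_steps(api_key)
    if not dry_run:
        for cmd in steps:
            # check=True raises CalledProcessError on failure.
            subprocess.run(cmd, check=True)
    return steps
```

Calling `run_pipeline("YOUR_KEY")` returns the command list without running anything; pass `dry_run=False` from the repository root to execute it.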
|
| 110 |
+
|
| 111 |
+
## Training
|
| 112 |
|
| 113 |
```bash
|
| 114 |
+
bash scripts/train/run.sh start
|
| 115 |
+
```
|
| 116 |
+
|
| 117 |
+
Monitor training:
|
| 118 |
+
```bash
|
| 119 |
+
bash scripts/train/run.sh log
|
| 120 |
```
|
| 121 |
|
| 122 |
+
Check status:
|
| 123 |
+
```bash
|
| 124 |
+
bash scripts/train/run.sh status
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
## Inference
|
| 128 |
+
|
| 129 |
+
```bash
|
| 130 |
+
bash scripts/infer/run_infer.sh \
|
| 131 |
+
--ckpt checkpoints/ms2ku_vtts/model_ckpt_best.pt \
|
| 132 |
+
--outdir results/ms2ku_vtts/test_seen \
|
| 133 |
+
--batch_size 16
|
| 134 |
+
```
|
| 135 |
|
| 136 |
## Citation
|
| 137 |
|
| 138 |
+
If you find this work useful, please consider citing:
|
| 139 |
+
|
| 140 |
```bibtex
|
| 141 |
@inproceedings{he2025multi,
|
| 142 |
title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
|
|
|
|
| 155 |
This project builds upon several excellent open-source projects. We gratefully acknowledge:
|
| 156 |
|
| 157 |
**Model Architectures & Code**
|
| 158 |
+
- [F5-TTS](https://github.com/SWivid/F5-TTS) — Diffusion Transformer (DiT) architecture
|
| 159 |
+
- [BigVGAN](https://github.com/NVIDIA/BigVGAN) — Neural vocoder by NVIDIA
|
| 160 |
+
- [RMVPE](https://github.com/Dream-High/RMVPE) — Robust pitch (F0) estimation
|
| 161 |
+
- [x-transformers](https://github.com/lucidrains/x-transformers) — Rotary positional embeddings
|
| 162 |
+
- [FlashAttention](https://github.com/Dao-AILab/flash-attention) — Memory-efficient attention kernels
|
| 163 |
|
| 164 |
**Pretrained Models**
|
| 165 |
+
- [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google) — Caption feature extraction
|
| 166 |
+
- [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft) — RGB and depth visual feature extraction
|
| 167 |
|
| 168 |
**Datasets & Tools**
|
| 169 |
+
- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research) — Audio-visual spatial speech dataset
|
| 170 |
+
- [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/) — Phoneme-level forced alignment
|
| 171 |
+
- [Google Gemini](https://ai.google.dev/) — Panoramic scene caption generation
|
| 172 |
+
|
| 173 |
+
**Libraries**
|
| 174 |
+
- [PyTorch](https://pytorch.org/) — Deep learning framework
|
| 175 |
+
- [librosa](https://librosa.org/) — Audio analysis and processing
|
| 176 |
+
- [HuggingFace Transformers](https://github.com/huggingface/transformers) — Pretrained model loading
|
| 177 |
+
- [matplotlib](https://matplotlib.org/) — Visualization
|