he-shuwei committed
Commit 888eeeb · verified · 1 Parent(s): 6397a4b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +82 -21
README.md CHANGED
@@ -26,13 +26,10 @@ Inner Mongolia University &nbsp;&nbsp; <sup>*</sup> Corresponding Author
26
  </div>
27
 
28
  <div align="center">
29
- <a href="https://github.com/he-shuwei/MS2KU-VTTS">
30
- <img src='https://img.shields.io/badge/GitHub-Code-black?style=flat&logo=github' alt='github'>
31
- </a>
32
  <a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
33
  <img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
34
  </a>
35
- <a href="https://github.com/he-shuwei/MS2KU-VTTS/blob/main/LICENSE">
36
  <img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
37
  </a>
38
  </div>
@@ -45,13 +42,25 @@ Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt
45
 
46
  ## Overview
47
 
48
  The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
49
  - **Multi-source Spatial Knowledge**: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
50
  - **Dominant-Supplement Serial Interaction (D-SSI)**: RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
51
  - **Dynamic Fusion**: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
52
  - **Speech Generation**: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
53
 
54
- ## Files
55
 
56
  | Resource | Path | Description |
57
  |---|---|---|
@@ -60,7 +69,7 @@ The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
60
  | Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
61
  | BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
62
  | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
63
- | MFA alignment results | `data/processed_data/mfa/outputs/` | Pre-computed forced alignment (TextGrid) |
64
 
65
  The following third-party checkpoints are also required. Please download from their official sources:
66
 
@@ -70,18 +79,64 @@ The following third-party checkpoints are also required. Please download from th
70
  | ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
71
  | RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
72
 
73
- ## Usage
74
 
75
  ```bash
76
- git clone https://github.com/he-shuwei/MS2KU-VTTS.git
77
- cd MS2KU-VTTS
78
- pip install -r requirements.txt
79
  ```
80
 
81
- See the [GitHub repository](https://github.com/he-shuwei/MS2KU-VTTS) for full training and inference instructions.
82
 
83
  ## Citation
84
 
85
  ```bibtex
86
  @inproceedings{he2025multi,
87
  title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
@@ -100,17 +155,23 @@ This work was funded by the Young Scientists Fund (No. 62206136) and the General
100
  This project builds upon several excellent open-source projects. We gratefully acknowledge:
101
 
102
  **Model Architectures & Code**
103
- [F5-TTS](https://github.com/SWivid/F5-TTS) — Diffusion Transformer (DiT) architecture
104
- [BigVGAN](https://github.com/NVIDIA/BigVGAN) — Neural vocoder by NVIDIA
105
- [RMVPE](https://github.com/Dream-high/RMVPE) — Robust pitch (F0) estimation
106
- [x-transformers](https://github.com/lucidrains/x-transformers) — Rotary positional embeddings
107
- [FlashAttention](https://github.com/Dao-AILab/flash-attention) — Memory-efficient attention kernels
108
 
109
  **Pretrained Models**
110
- [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google) — Caption feature extraction
111
- [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft) — RGB and depth visual feature extraction
112
 
113
  **Datasets & Tools**
114
- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research) — Audio-visual spatial speech dataset
115
- [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/) — Phoneme-level forced alignment
116
- [Google Gemini](https://ai.google.dev/) — Panoramic scene caption generation
 
26
  </div>
27
 
28
  <div align="center">
29
  <a href="https://huggingface.co/he-shuwei/MS2KU-VTTS">
30
  <img src='https://img.shields.io/badge/HuggingFace-Checkpoints-orange?style=flat&logo=huggingface&logoColor=white' alt='huggingface'>
31
  </a>
32
+ <a href="LICENSE">
33
  <img src='https://img.shields.io/badge/License-MIT-yellow.svg' alt='license'>
34
  </a>
35
  </div>
 
42
 
43
  ## Overview
44
 
45
+ <p align="center">
46
+ <img src="assets/model.png" width="100%" alt="MS2KU-VTTS Architecture">
47
+ </p>
48
+
49
  The proposed MS<sup>2</sup>KU-VTTS architecture consists of four components:
50
  - **Multi-source Spatial Knowledge**: RGB image (dominant), depth image, speaker position, and Gemini-generated semantic captions (supplementary)
51
  - **Dominant-Supplement Serial Interaction (D-SSI)**: RGB-Depth Interaction, Speaker Position Enhanced Interaction, and RGB-Semantic Interaction
52
  - **Dynamic Fusion**: Entropy-based dynamic weighting to aggregate multi-source spatial knowledge
53
  - **Speech Generation**: ControlNet-style DiT denoiser (based on F5-TTS) with BigVGAN vocoder
54
 
55
+ ## Installation
56
+
57
+ ```bash
58
+ git clone https://github.com/he-shuwei/MS2KU-VTTS.git
59
+ cd MS2KU-VTTS
60
+ pip install -r requirements.txt
61
+ ```
62
+
63
+ **Checkpoints & Data** &mdash; download from [HuggingFace](https://huggingface.co/he-shuwei/MS2KU-VTTS):
64
 
65
  | Resource | Path | Description |
66
  |---|---|---|
 
69
  | Pretrain Decoder | `checkpoints/pretrain_decoder/` | Pretrained DiT decoder (ControlNet backbone) |
70
  | BigVGAN v2 | `checkpoints/bigvgan/` | Retrained vocoder (16 kHz) |
71
  | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
72
+ | MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |
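
Note on the changed path: the MFA alignments are now listed as a single archive rather than a directory, so they presumably need unpacking before use. A minimal sketch, assuming a standard gzip-compressed tar file; `extract_archive` is a hypothetical helper that is not part of the repository, and the archive's internal layout is not documented in this diff, so inspect the result before pointing other scripts at it.

```bash
# Hypothetical helper (not part of MS2KU-VTTS): unpack a .tar.gz archive
# into the directory that contains it.
extract_archive() {
  archive=$1
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  dest=$(dirname "$archive")
  # -x extract, -z gunzip, -f archive file, -C target directory
  tar -xzf "$archive" -C "$dest"
}

# Usage (path taken from the table row above):
#   extract_archive data/processed_data/mfa/mfa_outputs.tar.gz
```

Extracting next to the archive keeps the files under `data/processed_data/mfa/`, which is where the previous revision of the README located them.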
73
 
74
  The following third-party checkpoints are also required. Please download from their official sources:
75
 
 
79
  | ResNet-18 | `checkpoints/resnet-18/` | [Microsoft](https://huggingface.co/microsoft/resnet-18) |
80
  | RMVPE | `checkpoints/RMVPE/rmvpe.pt` | [RMVPE](https://github.com/Dream-High/RMVPE) |
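
Before preprocessing or training, it can help to confirm that the resources from the two tables are in place. The sketch below assumes the repository root is the working directory; `check_paths` is a hypothetical helper, not something the project ships, and the usage lists only the rows visible in this diff.

```bash
# Hypothetical helper (not part of MS2KU-VTTS): report every path that does
# not exist and return non-zero if at least one is missing.
check_paths() {
  missing=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the paths shown in the tables above (rows elided from the
# diff are not listed):
#   check_paths \
#     checkpoints/pretrain_decoder checkpoints/bigvgan \
#     data/raw_data/captions data/processed_data/mfa/mfa_outputs.tar.gz \
#     checkpoints/resnet-18 checkpoints/RMVPE/rmvpe.pt
```

The helper only tests for existence; it does not verify that a checkpoint is complete or matches the expected version.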
81
 
82
+ **Data** &mdash; this project uses the [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) dataset. Please follow their instructions to obtain the raw data, then run the preprocessing pipeline:
83
+
84
+ 1. **Download pretrained models**:
85
+ ```bash
86
+ python scripts/download_bert.py
87
+ python scripts/download_resnet18.py
88
+ ```
89
+
90
+ 2. **ResNet18 features** (RGB & depth):
91
+ ```bash
92
+ bash scripts/extract_resnet18_features/run.sh start
93
+ ```
94
+
95
+ 3. **Caption features** (Gemini + BERT):
96
+ ```bash
97
+ python scripts/generate_gemini_captions.py --api_key YOUR_KEY --image_dir data/processed_data/images --output_dir data/processed_data/captions
98
+ bash scripts/extract_caption_features/run.sh start
99
+ ```
100
+
101
+ 4. **Speaker position features**:
102
+ ```bash
103
+ bash scripts/extract_speaker_position/run.sh start
104
+ ```
105
+
106
+ 5. **Binarize data**:
107
+ ```bash
108
+ bash scripts/binarize/run.sh start
109
+ ```
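
Step 3 above passes the Gemini key as a literal (`--api_key YOUR_KEY`). One way to keep the key out of shell history and shared scripts is to read it from the environment. This is a sketch under assumptions: `GEMINI_API_KEY` and `PYTHON` are variable names chosen here for illustration, and `run_caption_generation` is a hypothetical wrapper; only the script path and its three flags come from the README.

```bash
# Hypothetical wrapper (not part of MS2KU-VTTS). GEMINI_API_KEY and PYTHON
# are assumed variable names; the flags are the ones shown in step 3.
run_caption_generation() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set" >&2
    return 1
  fi
  "${PYTHON:-python}" scripts/generate_gemini_captions.py \
    --api_key "$GEMINI_API_KEY" \
    --image_dir data/processed_data/images \
    --output_dir data/processed_data/captions
}

# Usage:
#   export GEMINI_API_KEY=...   # set once, outside any committed file
#   run_caption_generation
```

The key is still visible in the process list while the script runs, so this mainly guards against committing or logging it by accident.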
110
+
111
+ ## Training
112
 
113
  ```bash
114
+ bash scripts/train/run.sh start
115
+ ```
116
+
117
+ Monitor training:
118
+ ```bash
119
+ bash scripts/train/run.sh log
120
  ```
121
 
122
+ Check status:
123
+ ```bash
124
+ bash scripts/train/run.sh status
125
+ ```
126
+
127
+ ## Inference
128
+
129
+ ```bash
130
+ bash scripts/infer/run_infer.sh \
131
+ --ckpt checkpoints/ms2ku_vtts/model_ckpt_best.pt \
132
+ --outdir results/ms2ku_vtts/test_seen \
133
+ --batch_size 16
134
+ ```
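
The inference command above can be wrapped so that a missing checkpoint fails early instead of partway through a run. A minimal sketch: `run_inference` and the `INFER_SCRIPT` override are hypothetical and not part of the repository, while the three flags are the ones shown in the README. Whether `run_infer.sh` creates the output directory itself is not stated, so the wrapper creates it as a precaution.

```bash
# Hypothetical wrapper (not part of MS2KU-VTTS). INFER_SCRIPT is an assumed
# override hook; --ckpt, --outdir and --batch_size come from the README.
run_inference() {
  ckpt=$1
  outdir=$2
  batch_size=${3:-16}   # default matches the README example
  if [ ! -f "$ckpt" ]; then
    echo "checkpoint not found: $ckpt" >&2
    return 1
  fi
  mkdir -p "$outdir"
  bash "${INFER_SCRIPT:-scripts/infer/run_infer.sh}" \
    --ckpt "$ckpt" \
    --outdir "$outdir" \
    --batch_size "$batch_size"
}

# Usage (values from the README example):
#   run_inference checkpoints/ms2ku_vtts/model_ckpt_best.pt results/ms2ku_vtts/test_seen 16
```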
135
 
136
  ## Citation
137
 
138
+ If you find this work useful, please consider citing:
139
+
140
  ```bibtex
141
  @inproceedings{he2025multi,
142
  title={Multi-source spatial knowledge understanding for immersive visual text-to-speech},
 
155
  This project builds upon several excellent open-source projects. We gratefully acknowledge:
156
 
157
  **Model Architectures & Code**
158
+ - [F5-TTS](https://github.com/SWivid/F5-TTS) &mdash; Diffusion Transformer (DiT) architecture
159
+ - [BigVGAN](https://github.com/NVIDIA/BigVGAN) &mdash; Neural vocoder by NVIDIA
160
+ - [RMVPE](https://github.com/Dream-high/RMVPE) &mdash; Robust pitch (F0) estimation
161
+ - [x-transformers](https://github.com/lucidrains/x-transformers) &mdash; Rotary positional embeddings
162
+ - [FlashAttention](https://github.com/Dao-AILab/flash-attention) &mdash; Memory-efficient attention kernels
163
 
164
  **Pretrained Models**
165
+ - [BERT-large-uncased](https://huggingface.co/google-bert/bert-large-uncased) (Google) &mdash; Caption feature extraction
166
+ - [ResNet-18](https://huggingface.co/microsoft/resnet-18) (Microsoft) &mdash; RGB and depth visual feature extraction
167
 
168
  **Datasets & Tools**
169
+ - [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) (Meta Research) &mdash; Audio-visual spatial speech dataset
170
+ - [Montreal Forced Aligner (MFA)](https://montreal-forced-aligner.readthedocs.io/) &mdash; Phoneme-level forced alignment
171
+ - [Google Gemini](https://ai.google.dev/) &mdash; Panoramic scene caption generation
172
+
173
+ **Libraries**
174
+ - [PyTorch](https://pytorch.org/) &mdash; Deep learning framework
175
+ - [librosa](https://librosa.org/) &mdash; Audio analysis and processing
176
+ - [HuggingFace Transformers](https://github.com/huggingface/transformers) &mdash; Pretrained model loading
177
+ - [matplotlib](https://matplotlib.org/) &mdash; Visualization