---
license: mit
language:
- en
tags:
- text-to-speech
- visual-tts
- speech-synthesis
- diffusion
- spatial-audio
pipeline_tag: text-to-speech
---

# M<sup>2</sup>SE-VTTS: Multi-Modal and Multi-Scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

[![Paper](https://img.shields.io/badge/AAAI%202025-Paper-blue)](https://arxiv.org/abs/2412.11409)
[![GitHub](https://img.shields.io/badge/GitHub-Code-green)](https://github.com/he-shuwei/M2SE-VTTS)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
Visual Text-to-Speech (VTTS) takes an image of an environment as the prompt and synthesizes reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Prior work has focused on extracting global spatial information from the RGB space of the image, yet the local and depth cues that are crucial for understanding the spatial environment have been ignored. To address this, we propose a novel multi-modal and multi-scale spatial environment understanding scheme for immersive VTTS, termed M<sup>2</sup>SE-VTTS. The multi-modal branch exploits both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, while the multi-scale branch models local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and use Gemini-generated environment captions to guide local spatial understanding. The multi-modal and multi-scale features are then integrated by a local-aware global spatial understanding module. In this way, M<sup>2</sup>SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations show that our model outperforms advanced baselines in environmental speech generation.
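As a rough illustration of the pipeline above, the sketch below shows how caption-guided local patch features from the RGB and Depth modalities could be fused into a single global environment embedding. It is a minimal PyTorch sketch, not the authors' implementation: every module name, dimension, and attention layout here is an assumption for exposition only.

```python
# Illustrative sketch of multi-modal (RGB + Depth), multi-scale
# (local patches -> global token) environment encoding.
# All names and shapes are invented; NOT the authors' implementation.
import torch
import torch.nn as nn

class SpatialEnvironmentEncoder(nn.Module):
    def __init__(self, patch=16, dim=512, heads=8):
        super().__init__()
        # Patch embeddings per modality (3-channel RGB, 1-channel depth).
        self.rgb_patch = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_patch = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Local scale: patch tokens attend to a precomputed caption
        # embedding (e.g. from a Gemini-generated environment caption).
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global scale: a learned token attends to the caption-refined
        # local tokens of both modalities ("local-aware global").
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, rgb, depth, caption_emb):
        # rgb: (B, 3, H, W); depth: (B, 1, H, W); caption_emb: (B, T, dim)
        r = self.rgb_patch(rgb).flatten(2).transpose(1, 2)      # (B, N, dim)
        d = self.depth_patch(depth).flatten(2).transpose(1, 2)  # (B, N, dim)
        local = torch.cat([r, d], dim=1)       # multi-modal patch tokens
        # Local understanding guided by the caption.
        local, _ = self.local_attn(local, caption_emb, caption_emb)
        # Global aggregation over the caption-refined local context.
        g = self.global_token.expand(rgb.size(0), -1, -1)
        g, _ = self.global_attn(g, local, local)
        return g.squeeze(1)  # one embedding to condition speech synthesis

enc = SpatialEnvironmentEncoder()
emb = enc(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224),
          torch.randn(2, 4, 512))
print(emb.shape)  # torch.Size([2, 512])
```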
## Repository Contents

| Resource | Path | Description |
|---|---|---|
| M<sup>2</sup>SE-VTTS | `m2se_vtts/` | Fine-tuned model checkpoint |
| BigVGAN v2 | `bigvgan/` | Retrained 16 kHz vocoder |
| Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
| MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignments (Praat TextGrid files; see the sketch below) |
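Once `mfa_outputs.tar.gz` is extracted, each alignment is a standard Praat TextGrid. Below is a minimal sketch of inspecting one, assuming the third-party `textgrid` Python package and a placeholder file name; adapt both to your setup.

```python
# Minimal sketch: read one extracted MFA alignment.
# Assumes `pip install textgrid`; the file name below is a placeholder.
import textgrid

tg = textgrid.TextGrid.fromFile(
    "data/processed_data/mfa/outputs/example.TextGrid"
)
for tier in tg:  # MFA outputs typically contain "words" and "phones" tiers
    print(tier.name)
    for interval in tier:
        if interval.mark:  # skip empty (silence) intervals
            print(f"  {interval.minTime:.2f}-{interval.maxTime:.2f}  {interval.mark}")
```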
## Usage

Please refer to the [GitHub repository](https://github.com/he-shuwei/M2SE-VTTS) for installation, training, and inference instructions.

```bash
git clone https://github.com/he-shuwei/M2SE-VTTS.git
cd M2SE-VTTS

# Download checkpoints:
# - place m2se_vtts/ and bigvgan/ under checkpoints/
# - place captions under data/raw_data/captions/
# - extract mfa_outputs.tar.gz to data/processed_data/mfa/outputs/

# Inference
bash scripts/infer/run_infer.sh \
    --ckpt checkpoints/m2se_vtts/model_ckpt_best.pt \
    --outdir results/m2se_vtts/test_seen \
    --batch_size 16
```
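The checkpoint directories listed above can also be fetched programmatically. The snippet below is a sketch using `huggingface_hub`; the repo id is an assumption based on this card's author, so substitute the actual Hub repo id if it differs.

```python
# Sketch: download the released files with huggingface_hub.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="he-shuwei/M2SE-VTTS",  # assumed repo id; replace if different
    local_dir="checkpoints",        # matches the layout expected by the scripts
)
print(f"Files downloaded to: {path}")
```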
## Citation

```bibtex
@inproceedings{liu2025multi,
  title={Multi-modal and multi-scale spatial environment understanding for immersive visual text-to-speech},
  author={Liu, Rui and He, Shuwei and Hu, Yifan and Li, Haizhou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={23},
  pages={24632--24640},
  year={2025}
}
```
## Acknowledgments

This project builds upon several excellent open-source projects:

- [NATSpeech](https://github.com/NATSpeech/NATSpeech) — non-autoregressive TTS framework
- [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger) — diffusion-based acoustic model
- [F5-TTS](https://github.com/SWivid/F5-TTS) — Diffusion Transformer (DiT) architecture
- [BigVGAN](https://github.com/NVIDIA/BigVGAN) — neural vocoder by NVIDIA
- [SoundSpaces-Speech](https://github.com/facebookresearch/learning-audio-visual-dereverberation) — audio-visual dataset by Meta Research
- [CLIP](https://github.com/openai/CLIP) — visual-language encoder by OpenAI
- [RMVPE](https://github.com/Dream-High/RMVPE) — robust pitch extractor