Update model card: add pretrain encoder/decoder to contents table
README.md CHANGED
@@ -27,8 +27,10 @@ Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt
 
 | Resource | Path | Description |
 |---|---|---|
-| M<sup>2</sup>SE-VTTS | `m2se_vtts/` | Finetuned model |
-| BigVGAN v2 | `bigvgan/` | Retrained vocoder (16 kHz) |
+| M<sup>2</sup>SE-VTTS (finetuned) | `m2se_vtts/` | Finetuned model for inference |
+| Pretrain Encoder | `pretrain_encoder/` | Pretrained encoder (Emilia, MLM) |
+| Pretrain Decoder | `pretrain_decoder/` | Pretrained decoder (Emilia, Diffusion) |
+| BigVGAN v2 | `bigvgan/` | Retrained vocoder (16 kHz) |
 | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
 | MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |
 
@@ -41,7 +43,7 @@ git clone https://github.com/he-shuwei/M2SE-VTTS.git
 cd M2SE-VTTS
 
 # Download checkpoints
-# Place m2se_vtts/ and bigvgan/ under checkpoints/
+# Place m2se_vtts/, pretrain_encoder/, pretrain_decoder/, and bigvgan/ under checkpoints/
 # Place captions under data/raw_data/captions/
 # Extract mfa_outputs.tar.gz to data/processed_data/mfa/outputs/
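The setup comments in the updated README can be sketched as a runnable shell snippet. The directory paths come from the diff itself; the README excerpt shows no download command, so none is assumed here, and the extraction step is guarded so the snippet does not fail if the archive has not been downloaded yet.

```shell
# Sketch of the target layout described in the README setup comments.
# Paths are taken from the diff; the download step itself is not shown
# in the README excerpt and is not invented here.
mkdir -p checkpoints \
         data/raw_data/captions \
         data/processed_data/mfa/outputs

# After downloading, checkpoints/ should contain:
#   checkpoints/m2se_vtts/          (finetuned model)
#   checkpoints/pretrain_encoder/   (pretrained encoder)
#   checkpoints/pretrain_decoder/   (pretrained decoder)
#   checkpoints/bigvgan/            (retrained 16 kHz vocoder)

# Extract the pre-computed MFA alignments, if the archive is present.
if [ -f data/processed_data/mfa/mfa_outputs.tar.gz ]; then
    tar -xzf data/processed_data/mfa/mfa_outputs.tar.gz \
        -C data/processed_data/mfa/outputs/
fi
```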