he-shuwei committed on
Commit 3ec558d · verified · 1 Parent(s): 63affaf

Update model card: add pretrain encoder/decoder to contents table

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -27,8 +27,10 @@ Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt
 
 | Resource | Path | Description |
 |---|---|---|
-| M<sup>2</sup>SE-VTTS | `m2se_vtts/` | Finetuned model checkpoint |
-| BigVGAN v2 | `bigvgan/` | Retrained 16 kHz vocoder |
+| M<sup>2</sup>SE-VTTS (finetuned) | `m2se_vtts/` | Finetuned model for inference |
+| Pretrain Encoder | `pretrain_encoder/` | Pretrained encoder (Emilia, MLM) |
+| Pretrain Decoder | `pretrain_decoder/` | Pretrained decoder (Emilia, Diffusion) |
+| BigVGAN v2 | `bigvgan/` | Retrained vocoder (16 kHz) |
 | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
 | MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |
 
@@ -41,7 +43,7 @@ git clone https://github.com/he-shuwei/M2SE-VTTS.git
 cd M2SE-VTTS
 
 # Download checkpoints
-# Place m2se_vtts/ and bigvgan/ under checkpoints/
+# Place m2se_vtts/, pretrain_encoder/, pretrain_decoder/, and bigvgan/ under checkpoints/
 # Place captions under data/raw_data/captions/
 # Extract mfa_outputs.tar.gz to data/processed_data/mfa/outputs/
 
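
With this change the setup comment lists four directories to place under `checkpoints/`, so a quick layout check may catch a missed download before inference or finetuning. The following is a minimal sketch, assuming `checkpoints/` sits at the root of the cloned repository as the comment describes. The helper name `missing_checkpoints` and the constant `EXPECTED_CHECKPOINTS` are hypothetical and not part of the M2SE-VTTS repository.

```python
from pathlib import Path

# Directory names taken from the updated contents table in README.md.
EXPECTED_CHECKPOINTS = ("m2se_vtts", "pretrain_encoder", "pretrain_decoder", "bigvgan")


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint directories that are absent under checkpoints/."""
    ckpt_root = Path(repo_root) / "checkpoints"  # assumed location, per the README comment
    return [name for name in EXPECTED_CHECKPOINTS if not (ckpt_root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing under checkpoints/: " + ", ".join(missing))
    else:
        print("All checkpoint directories are in place.")
```

Run from the repository root, the script prints the names of any directories still to be placed. It only checks that the directories exist, not that their contents are complete.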