Update model card: add pretrain encoder/decoder to contents table
README.md CHANGED
@@ -27,8 +27,10 @@ Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt
 
 | Resource | Path | Description |
 |---|---|---|
-| M<sup>2</sup>SE-VTTS | `m2se_vtts/` | Finetuned model |
-| BigVGAN v2 | `bigvgan/` | Retrained vocoder (16 kHz) |
+| M<sup>2</sup>SE-VTTS (finetuned) | `m2se_vtts/` | Finetuned model for inference |
+| Pretrain Encoder | `pretrain_encoder/` | Pretrained encoder (Emilia, MLM) |
+| Pretrain Decoder | `pretrain_decoder/` | Pretrained decoder (Emilia, Diffusion) |
+| BigVGAN v2 | `bigvgan/` | Retrained vocoder (16 kHz) |
 | Spatial environment captions | `data/raw_data/captions/` | Gemini-generated captions for all splits |
 | MFA alignment results | `data/processed_data/mfa/mfa_outputs.tar.gz` | Pre-computed forced alignment (TextGrid) |
 
@@ -41,7 +43,7 @@ git clone https://github.com/he-shuwei/M2SE-VTTS.git
 cd M2SE-VTTS
 
 # Download checkpoints
-# Place m2se_vtts/ and bigvgan/ under checkpoints/
+# Place m2se_vtts/, pretrain_encoder/, pretrain_decoder/, and bigvgan/ under checkpoints/
 # Place captions under data/raw_data/captions/
 # Extract mfa_outputs.tar.gz to data/processed_data/mfa/outputs/
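The setup comments in the updated README can be sketched as a runnable shell snippet. The directory paths come from the diff itself; the README excerpt shows no download command, so none is assumed here, and the extraction step is guarded so the snippet does not fail if the archive has not been downloaded yet.

```shell
# Sketch of the target layout described in the README setup comments.
# Paths are taken from the diff; the download step itself is not shown
# in the README excerpt and is not invented here.
mkdir -p checkpoints \
         data/raw_data/captions \
         data/processed_data/mfa/outputs

# After downloading, checkpoints/ should contain:
#   checkpoints/m2se_vtts/          (finetuned model)
#   checkpoints/pretrain_encoder/   (pretrained encoder)
#   checkpoints/pretrain_decoder/   (pretrained decoder)
#   checkpoints/bigvgan/            (retrained 16 kHz vocoder)

# Extract the pre-computed MFA alignments, if the archive is present.
if [ -f data/processed_data/mfa/mfa_outputs.tar.gz ]; then
    tar -xzf data/processed_data/mfa/mfa_outputs.tar.gz \
        -C data/processed_data/mfa/outputs/
fi
```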