---
block_size: 2048
sample_rate: 44100
latent_size: 12
vocoder: "042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts"
dataset: "John Van Stan (LibriTTS)"
vocoder_type: "RAVE"
alignment_type: "DCA"
likelihood_type: "NSF"
text_encoder_type: "CANINE"
---

# tungnaa_116_jvs

### dimensions

block size: 2048

sample rate: 44100

latent size: 12
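
As a rough sanity check on these numbers, the sketch below derives the latent frame rate and the amount of audio covered by one block. It assumes (this is not stated above) that the vocoder produces one 12-dimensional latent frame per block of 2048 samples; the helper names `frame_rate` and `latent_shape` are hypothetical and not part of tungnaa.

```python
import math

# Values copied from the dimensions listed above.
BLOCK_SIZE = 2048
SAMPLE_RATE = 44100
LATENT_SIZE = 12


def frame_rate(sample_rate=SAMPLE_RATE, block_size=BLOCK_SIZE):
    """Latent frames per second, assuming one latent frame per block."""
    return sample_rate / block_size


def latent_shape(num_samples, block_size=BLOCK_SIZE, latent_size=LATENT_SIZE):
    """(frames, latent_size) needed to cover num_samples, rounding up."""
    return (math.ceil(num_samples / block_size), latent_size)


if __name__ == "__main__":
    print(f"{frame_rate():.2f} frames/s")
    print(f"{1000 * BLOCK_SIZE / SAMPLE_RATE:.1f} ms per frame")
    print(latent_shape(SAMPLE_RATE))  # shape for one second of audio
```

Under that assumption, this works out to roughly 21.5 latent frames per second, or about 46 ms of audio per frame.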

### dataset

JVS (Hi-Fi TTS speaker 9017)

### vocoder

`models/vocoder/042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts`

### training

tungnaa commit `09ecdcd532eac3d454a8b4e28e896bca5bccbf9f`

```bash
tungnaa trainer \
  --experiment 117-jvs-e2emulti-mask-ends \
  --model-dir /data/users/victor/ivoice-models \
  --log-dir /data/users/victor/ivoice-logs \
  --manifest /data/users/victor/tmp/ivoice_prep_100m_0abe_multi/9017_manifest_clean_train.json \
  --rave-model /data/users/victor/rave-v2/runs/042-jvs-100m-xfermulti_0abe2b072b/version_0/checkpoints/042-jvs-100m-xfermulti_0abe2b072b_streaming_norm.ts \
  --lr 3e-4 --lr-text 3e-5 \
  --epoch-size 200 --save-epochs 20 \
  --device cuda:0 \
  train
```

### notes

trained with the full JVS dataset, no annotations.

uses a 12-dimensional vocoder trained with a subset of JVS, fine-tuned from a multivoice model.

this model uses a neural spline flow likelihood.