metadata
block_size: 2048
sample_rate: 48000
latent_size: 11
vocoder: 046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts
dataset: VCTK
vocoder_type: RAVE
alignment_type: DCA
likelihood_type: NSF
text_encoder_type: CANINE

tungnaa_119_vctk

dimensions

block size: 2048

sample rate: 48000

latent size: 11
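
as a rough check on what these numbers imply, the sketch below derives the latent frame rate. it assumes that block size is the number of audio samples covered by one latent frame, which this card does not state explicitly. `frame_stats` is a hypothetical helper for illustration, not part of tungnaa.

```python
def frame_stats(block_size: int, sample_rate: int, latent_size: int) -> dict:
    """Derive timing figures from the card's dimensions.

    Assumption: one latent frame corresponds to `block_size` audio samples.
    """
    frames_per_second = sample_rate / block_size
    return {
        # latent frames generated per second of audio
        "frames_per_second": frames_per_second,
        # duration of audio covered by one latent frame, in milliseconds
        "ms_per_frame": 1000.0 * block_size / sample_rate,
        # scalar latent values per second of audio
        "latent_values_per_second": frames_per_second * latent_size,
    }


stats = frame_stats(block_size=2048, sample_rate=48000, latent_size=11)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

under that assumption, the model produces roughly 23.4 latent frames per second, each covering about 42.7 ms of audio.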

dataset

VCTK

vocoder

models/vocoder/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts

training

tungnaa prep --datasets '{kind:"vctk", path:"/data/datasets/VCTK"}' --rave-path /data/users/victor/rave-v2/runs/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc/version_0/checkpoints/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts --out-path /data/users/victor/tmp/ivoice_prep_824a/

tungnaa trainer --experiment 119-vctk --model-dir /data/users/victor/ivoice-models --log-dir /data/users/victor/ivoice-logs --manifest /data/users/victor/tmp/ivoice_prep_824a/vctk.json --concat-speakers 2 --speaker-annotate --device cuda:1 --batch-size 32 --rave-model /data/users/victor/rave-v2/runs/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc/version_0/checkpoints/046-multivoice-2048-48k-vlobeta-specdis-noise_824a15d4dc_streaming_norm.ts --lr 3e-4 --lr-text 3e-5 --epoch-size 200 --save-epochs 20 train

notes

trained on concatenated pairs of utterances, each prefixed with a speaker annotation. example syntax: [p225] this is an utterance. [p330] this is another.
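
a minimal sketch of building and splitting prompts in this syntax. `annotate` and `split_annotated` are hypothetical helpers for illustration, not part of the tungnaa API, and the exact formatting the model expects beyond the example above is an assumption.

```python
import re

# matches "[speaker] text" segments; text runs until the next "[" or end of string
_SEGMENT = re.compile(r"\[([^\]\s]+)\]\s*([^\[]*)")


def annotate(utterances):
    """Join (speaker_id, text) pairs into one speaker-annotated prompt."""
    return " ".join(f"[{speaker}] {text}" for speaker, text in utterances)


def split_annotated(prompt):
    """Split a speaker-annotated prompt back into (speaker_id, text) pairs."""
    return [(speaker, text.strip()) for speaker, text in _SEGMENT.findall(prompt)]


prompt = annotate([
    ("p225", "this is an utterance."),
    ("p330", "this is another."),
])
print(prompt)
print(split_annotated(prompt))
```

the speaker ids here (p225, p330) are the VCTK speaker labels used in the example above.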

uses a multi-dataset vocoder that was not fine-tuned on VCTK alone, so the latent biases should have a lot of room for adjustment.

this model uses a neural spline flow likelihood.