Spaces: mnhatdaous/learnable-speech

primepake committed on Jul 20

Commit 0108756

1 parent: abebdb4

update readme

Files changed (2)

README.md CHANGED

@@ -41,7 +41,7 @@ Maps discrete tokens to a continuous latent space using a Variational Autoencode
 Before training the main model:
 1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
-2. Generate continuous latent representations using the trained DAC-VAE - the pretrained I provided here: [DAC-VAE](https://drive.google.com/file/d/1iwZhPlcdDwvPjeON3bFAeYarsV4ZtI2E/view?usp=sharing)
+2. Generate continuous latent representations using the trained DAC-VAE - the pretrained I provided [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
 ### 3. Two-Stage Training
@@ -59,9 +59,9 @@ pip install -r requirements.txt
 ### Training Pipeline
-1. **Extracting FSQ** (if not using pretrained)
+1. **Extracting FSQ**
    ```bash
-   pip install
+   pip install s3tokenizer
    s3tokenizer --wav_scp data.scp \
             --device "cuda" \
             --output_dir "./data" \

speech/config.yaml CHANGED

@@ -198,7 +198,7 @@ sort: !name:cosyvoice.dataset.processor.sort
     sort_size: 500  # sort_size should be less than shuffle_size
 batch: !name:cosyvoice.dataset.processor.batch
     batch_type: 'dynamic'
-    max_frames_in_batch: 40000
+    max_frames_in_batch: 30000
 padding: !name:cosyvoice.dataset.processor.padding
     use_spk_embedding: False # change to True during sft
     use_speaker_encoder: !ref <use_speaker_encoder>
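
Note on the `max_frames_in_batch` change: with `batch_type: 'dynamic'`, lowering the frame budget from 40000 to 30000 should yield smaller batches and lower peak memory per step. The sketch below illustrates that effect under one assumption: that the budget caps the padded size of a batch (longest sample times number of samples). The function name `dynamic_batches` is hypothetical, and this is not the code in `cosyvoice.dataset.processor.batch`.

```python
from typing import Iterable, Iterator, List


def dynamic_batches(
    frame_counts: Iterable[int], max_frames_in_batch: int
) -> Iterator[List[int]]:
    """Group sample lengths so the padded batch size stays within a frame budget.

    Assumption: padded size = longest sample in the batch * number of samples.
    A single sample longer than the budget still forms its own batch.
    """
    batch: List[int] = []
    longest = 0
    for n in frame_counts:
        new_longest = max(longest, n)
        # Close the current batch if adding this sample would exceed the budget.
        if batch and new_longest * (len(batch) + 1) > max_frames_in_batch:
            yield batch
            batch, longest = [n], n
        else:
            batch.append(n)
            longest = new_longest
    if batch:
        yield batch


if __name__ == "__main__":
    lengths = [10000] * 7  # seven samples of 10000 frames each (made-up data)
    for budget in (40000, 30000):
        sizes = [len(b) for b in dynamic_batches(lengths, budget)]
        print(budget, sizes)
```

For the made-up lengths above, the 40000 budget groups the samples into batches of 4 and 3, while the 30000 budget groups them into batches of 3, 3 and 1.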