Commit 0108756 by primepake (1 parent: abebdb4)

update readme

Files changed (2)
  1. README.md +3 -3
  2. speech/config.yaml +1 -1
README.md CHANGED
@@ -41,7 +41,7 @@ Maps discrete tokens to a continuous latent space using a Variational Autoencode
 
  Before training the main model:
  1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
- 2. Generate continuous latent representations using the trained DAC-VAE - the pretrained I provided here: [DAC-VAE](https://drive.google.com/file/d/1iwZhPlcdDwvPjeON3bFAeYarsV4ZtI2E/view?usp=sharing)
+ 2. Generate continuous latent representations using the trained DAC-VAE - the pretrained I provided [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
 
  ### 3. Two-Stage Training
 
@@ -59,9 +59,9 @@ pip install -r requirements.txt
 
  ### Training Pipeline
 
- 1. **Extracting FSQ** (if not using pretrained)
+ 1. **Extracting FSQ**
  ```bash
- pip install
+ pip install s3tokenizer
  s3tokenizer --wav_scp data.scp \
  --device "cuda" \
  --output_dir "./data" \
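
The `s3tokenizer` command in the hunk above reads its input list from `data.scp`. The following is a minimal sketch of how such a file might be generated, assuming the Kaldi-style layout of one `<utterance_id> <wav_path>` pair per line. The helper name `write_wav_scp` and the `./wavs` directory are hypothetical and not part of this repository; check the S3Tokenizer README for the exact format it expects.

```python
from pathlib import Path


def write_wav_scp(wav_dir, scp_path):
    """Write one '<utt_id> <absolute_wav_path>' line per .wav file under wav_dir.

    Assumptions (not confirmed by this commit):
      * the utterance id is the file name without its extension, so files that
        share a stem in different subfolders would produce duplicate ids;
      * paths contain no whitespace, since the fields are space-separated.
    Returns the number of entries written.
    """
    wavs = sorted(Path(wav_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as f:
        for wav in wavs:
            f.write(f"{wav.stem} {wav.resolve()}\n")
    return len(wavs)


# Example call (hypothetical paths):
# write_wav_scp("./wavs", "data.scp")
```

The resulting file can then be passed to `--wav_scp` as shown in the README hunk.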
speech/config.yaml CHANGED
@@ -198,7 +198,7 @@ sort: !name:cosyvoice.dataset.processor.sort
  sort_size: 500 # sort_size should be less than shuffle_size
  batch: !name:cosyvoice.dataset.processor.batch
  batch_type: 'dynamic'
- max_frames_in_batch: 40000
+ max_frames_in_batch: 30000
  padding: !name:cosyvoice.dataset.processor.padding
  use_spk_embedding: False # change to True during sft
  use_speaker_encoder: !ref <use_speaker_encoder>
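
To illustrate what lowering `max_frames_in_batch` from 40000 to 30000 does under `batch_type: 'dynamic'`, here is a minimal sketch of frame-capped batching. It is not the `cosyvoice.dataset.processor.batch` implementation: the function name `dynamic_batch`, the use of plain integer lengths instead of feature tensors, and the "longest item times batch size" padding rule are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def dynamic_batch(lengths: Iterable[int], max_frames_in_batch: int) -> Iterator[List[int]]:
    """Group utterance lengths (in frames) so the padded batch stays under a cap.

    Assumed cost model: every item is padded to the longest item in the batch,
    so a batch costs longest_length * number_of_items frames.
    """
    batch: List[int] = []
    longest = 0
    for n in lengths:
        new_longest = max(longest, n)
        # Close the current batch if adding this item would exceed the cap.
        # An item longer than the cap still gets a batch of its own.
        if batch and new_longest * (len(batch) + 1) > max_frames_in_batch:
            yield batch
            batch, longest = [n], n
        else:
            batch.append(n)
            longest = new_longest
    if batch:
        yield batch


# 100 utterances of 1000 frames each, batched under the old and new caps.
lengths = [1000] * 100
old_sizes = [len(b) for b in dynamic_batch(lengths, 40000)]
new_sizes = [len(b) for b in dynamic_batch(lengths, 30000)]
print(old_sizes)  # [40, 40, 20]
print(new_sizes)  # [30, 30, 30, 10]
```

Under these assumptions the smaller cap yields more, smaller batches, which would be expected to lower peak memory per step at the cost of more steps per epoch. The commit itself does not state the motivation for the change.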