primepake committed
Commit · 0108756
1 Parent(s): abebdb4
update readme
- README.md +3 -3
- speech/config.yaml +1 -1
README.md
CHANGED
@@ -41,7 +41,7 @@ Maps discrete tokens to a continuous latent space using a Variational Autoencode
 
 Before training the main model:
 1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
-2. Generate continuous latent representations using the trained DAC-VAE - the pretrained I provided
+2. Generate continuous latent representations using the trained DAC-VAE - the pretrained I provided [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
 
 ### 3. Two-Stage Training
 
@@ -59,9 +59,9 @@ pip install -r requirements.txt
 
 ### Training Pipeline
 
-1. **Extracting FSQ**
+1. **Extracting FSQ**
 ```bash
-pip install
+pip install s3tokenizer
 s3tokenizer --wav_scp data.scp \
 --device "cuda" \
 --output_dir "./data" \
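The `--wav_scp data.scp` argument in the hunk above points at a list of audio files. As a minimal sketch, assuming the Kaldi-style layout of one `utt_id path` pair per line (an assumption worth checking against the S3Tokenizer README), such a file could be generated as follows. The helper name `write_wav_scp` is hypothetical and not part of this repository or of `s3tokenizer`.

```python
from pathlib import Path


def write_wav_scp(wav_dir, scp_path):
    """Write one '<utt_id> <absolute wav path>' line per .wav file under wav_dir.

    Assumptions (not confirmed by the commit): the file stem is used as the
    utterance id, and ids must be unique across the whole list.
    """
    seen = set()
    lines = []
    for wav_path in sorted(Path(wav_dir).rglob("*.wav")):
        utt_id = wav_path.stem
        if utt_id in seen:
            raise ValueError(f"duplicate utterance id: {utt_id}")
        seen.add(utt_id)
        lines.append(f"{utt_id} {wav_path.resolve()}")
    # One entry per line, each terminated by a newline.
    Path(scp_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Calling `write_wav_scp("./wavs", "data.scp")` would then produce the file passed to `--wav_scp`. Paths containing spaces would need different handling, since a whitespace-separated format cannot represent them unambiguously.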
speech/config.yaml
CHANGED
@@ -198,7 +198,7 @@ sort: !name:cosyvoice.dataset.processor.sort
     sort_size: 500 # sort_size should be less than shuffle_size
 batch: !name:cosyvoice.dataset.processor.batch
     batch_type: 'dynamic'
-    max_frames_in_batch:
+    max_frames_in_batch: 30000
 padding: !name:cosyvoice.dataset.processor.padding
     use_spk_embedding: False # change to True during sft
     use_speaker_encoder: !ref <use_speaker_encoder>
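With `batch_type: 'dynamic'`, `max_frames_in_batch` caps each batch by frame count rather than by number of utterances. The sketch below shows one common way such a budget is applied, assuming it is measured on the padded batch (longest utterance times batch size). The function name `dynamic_batches` and the use of plain integer lengths are illustrative; this is not the code behind `cosyvoice.dataset.processor.batch`.

```python
from typing import Iterable, Iterator, List


def dynamic_batches(
    frame_lengths: Iterable[int], max_frames_in_batch: int = 30000
) -> Iterator[List[int]]:
    """Group utterance lengths so each padded batch stays within the frame budget."""
    batch: List[int] = []
    longest = 0
    for n_frames in frame_lengths:
        new_longest = max(longest, n_frames)
        # Padded size if this utterance joined the current batch.
        padded = new_longest * (len(batch) + 1)
        if batch and padded > max_frames_in_batch:
            # Budget exceeded: emit the current batch and start a new one.
            yield batch
            batch, longest = [n_frames], n_frames
        else:
            # Fits, or the batch is empty (an oversized utterance gets its own batch).
            batch.append(n_frames)
            longest = new_longest
    if batch:
        yield batch
```

For example, `list(dynamic_batches([9000, 10000, 11000, 12000, 29000, 500]))` yields `[[9000, 10000], [11000, 12000], [29000], [500]]`. Sorting by length before batching, as the `sort` step in the config does, keeps utterances of similar length together so less of the budget is spent on padding.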