nhat-dao-tpv-clv committed

Commit 0216954 · 1 parent: e5448db

update readme

Files changed (1): README.md (+16 −2)

README.md CHANGED
This repository provides an implementation of the Learnable-Speech model, featuring:

- [x] **Immiscible assignment**: Supports immiscible noise assignment while training
- [x] **Contrastive flow matching**: Supports contrastive training
- [ ] **Checkpoint release**: Release the LLM and contrastive FM checkpoints
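
The immiscible assignment idea above can be sketched in a few lines. This is a hedged illustration of the technique (re-pairing each training sample with a nearby noise sample inside the batch), not this repository's implementation; `immiscible_pair` is a hypothetical name, and the greedy matching is a cheap stand-in for a full linear assignment.

```python
import numpy as np

def immiscible_pair(x, rng=None):
    """Greedily re-pair each data row with a nearby freshly drawn noise row."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(x.shape)
    # Pairwise squared distances between data rows and noise rows.
    d = ((x[:, None, :] - noise[None, :, :]) ** 2).sum(-1)
    paired = np.empty_like(noise)
    free = list(range(len(noise)))
    for i in range(len(x)):
        j = min(free, key=lambda k: d[i, k])  # nearest still-unused noise row
        paired[i] = noise[j]
        free.remove(j)
    return paired
```

Each sample then diffuses toward noise it is already close to, which is the intuition behind "immiscible" noising.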
 
## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete token representations using the FSQ (S3Tokenizer) framework.
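
Finite scalar quantization (FSQ) can be sketched minimally as follows; this is an illustrative toy version (bound each latent dimension, round it to a small number of levels, pack per-dimension codes into one token id), not the S3Tokenizer code itself, and `fsq_token` is a hypothetical name.

```python
import numpy as np

def fsq_token(z, levels=(8, 8, 8)):
    """Quantize one latent vector to a single FSQ token id."""
    z = np.tanh(np.asarray(z, dtype=float))        # bound each dim to (-1, 1)
    half = (np.asarray(levels) - 1) / 2
    codes = np.round(z * half + half).astype(int)  # per-dim code in [0, L-1]
    token = 0
    for code, L in zip(codes, levels):             # mixed-radix packing
        token = token * L + int(code)
    return token
```

With three dimensions of 8 levels each, the codebook has 8³ = 512 implicit entries and needs no learned embedding table for the quantizer itself.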
### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

> **Note**: This implementation uses the standard DAC-VAE instead of Flow-VAE.
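
The VAE's sampling step relies on the standard reparameterization trick; a minimal sketch, assuming a diagonal-Gaussian posterior as is typical for DAC-VAE-style models (the function name is illustrative):

```python
import numpy as np

def reparameterize(mu, logvar, rng=None):
    """Sample z ~ N(mu, exp(logvar)) as mu + sigma * eps, so the sample
    stays differentiable with respect to mu and logvar."""
    if rng is None:
        rng = np.random.default_rng(0)
    std = np.exp(0.5 * np.asarray(logvar))
    return np.asarray(mu) + std * rng.standard_normal(np.shape(mu))
```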
### 1. Model Training

#### BPE tokens to FSQ tokens

- Based on FSQ tokenization
- Uses an autoregressive transformer with a learnable speaker extractor to predict FSQ tokens
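
The autoregressive objective for this stage is standard teacher-forced next-token prediction; a minimal sketch (the speaker-extractor conditioning is omitted, and `next_token_loss` is an illustrative name, not a function from this repository):

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Cross-entropy where the logits at position i predict token i+1."""
    logits, targets = logits[:, :-1], tokens[:, 1:]
    z = logits - logits.max(axis=-1, keepdims=True)        # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    b, t = targets.shape
    return -logp[np.arange(b)[:, None], np.arange(t)[None, :], targets].mean()
```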
#### FSQ tokens to DAC-VAE latent

- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens
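
The flow-matching target construction, plus one simple reading of the contrastive variant, can be sketched as follows. This is an illustration under standard OT-CFM assumptions (straight noise-to-data path, constant velocity target), not the CosyVoice2 decoder code; the rolled-batch negative and both function names are assumptions.

```python
import numpy as np

def cfm_example(x1, rng=None):
    """One conditional flow matching training example: x_t on the straight
    path from noise x0 to data x1, with regression target v = x1 - x0."""
    if rng is None:
        rng = np.random.default_rng(0)
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0], 1))
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def contrastive_fm_loss(v_pred, v_target, lam=0.05):
    """Standard CFM regression minus a small repulsive term toward a
    mismatched (rolled-batch) target."""
    v_neg = np.roll(v_target, 1, axis=0)
    pos = ((v_pred - v_target) ** 2).mean()
    neg = ((v_pred - v_neg) ** 2).mean()
    return pos - lam * neg
```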
### 2. Feature Extraction

Before training the main model:

1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided in the [DAC-VAE release](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
### 3. Two-Stage Training

Train the models sequentially:

- **Stage 1**: BPE tokens → discrete FSQ tokens
- **Stage 2**: discrete FSQ tokens → DAC-VAE continuous latent space
## Getting Started

### Prerequisites

```bash
pip install -r requirements.txt
```
### Training Pipeline

1. **Extracting FSQ tokens**

```bash
pip install s3tokenizer
s3tokenizer --wav_scp data.scp \
    ...  # remaining arguments elided in this diff
```
2. **Extracting DAC-VAE latents**

```bash
cd dac-vae
python extract_dac_latents.py \
    --checkpoint checkpoint.pt \
    --config config.yml \
    --root_path dataset \
    --output_dir dataset/dac
```
3. **Stage 1: Autoregressive Transformer**

```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
# ... remaining script elided in this diff
```
4. **Stage 2: Flow matching decoder**

```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
# ... remaining script elided in this diff
```
## Project Structure

```
minimax-speech/
├── assets/
...
```
## License

This project follows the licensing terms of its dependencies:

- CosyVoice2 components: [CosyVoice2 license](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 license](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.