This repository provides an implementation of the Learnable-Speech model, featuring:

- [x] **Immiscible assignment**: Support immiscible noise assignment while adding noise during training
- [x] **Contrastive Flow Matching**: Support contrastive flow matching training
- [ ] **Checkpoint release**: Release the LLM and Contrastive FM checkpoints
## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.

### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

> **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
### 1. Model Training

#### BPE tokens to FSQ tokens

- Based on the FSQ tokenizer
- Uses an autoregressive model with a learnable speaker extractor to predict the FSQ tokens

#### FSQ tokens to DAC-VAE latent

- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens
### 2. Feature Extraction

Before training the main model:

1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided in the [DAC-VAE release](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)

### 3. Two-Stage Training

Train the models sequentially:

- **Stage 1**: BPE tokens → discrete FSQ tokens
- **Stage 2**: Discrete FSQ tokens → DAC-VAE continuous latent space
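The two stages run back-to-back, so a minimal driver for that schedule could look like the sketch below. The script names `train_stage1.sh` and `train_stage2.sh` are placeholders, not files shipped in this repository:

```shell
#!/bin/bash
# Hypothetical driver for the sequential schedule; stage script names are placeholders.
set -e                      # abort stage 2 if stage 1 fails
: > training.log            # start a fresh run log
for stage in train_stage1.sh train_stage2.sh; do
    echo "launching ${stage}" | tee -a training.log
    # bash "${stage}"       # uncomment once the real stage scripts are in place
done
```

Run as-is, the sketch only records the plan in `training.log`; the actual stage invocations stay commented out until the scripts exist.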
## Getting Started

### Prerequisites

```bash
pip install -r requirements.txt
```
### Training Pipeline

1. **Extracting FSQ**

```bash
pip install s3tokenizer
s3tokenizer --wav_scp data.scp \
    ...
```
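The `--wav_scp` file follows the Kaldi scp convention of one `<utterance_id> <wav path>` pair per line (an assumption based on the flag name; check the S3Tokenizer docs). The ids and paths below are made up for illustration:

```shell
# Kaldi-style scp: "<utterance_id> <absolute wav path>" per line (entries are illustrative)
cat > data.scp <<'EOF'
utt_0001 /data/wavs/utt_0001.wav
utt_0002 /data/wavs/utt_0002.wav
EOF
wc -l < data.scp   # one line per utterance
```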
2. **Extracting DAC-VAE latents**

```bash
cd dac-vae
python extract_dac_latents.py --checkpoint checkpoint.pt --config config.yml --root_path dataset --output_dir dataset/dac
```
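With `--root_path dataset --output_dir dataset/dac` as above, the extracted latents land under `dataset/dac/`. The tree below is a hypothetical sketch of that layout; the exact file names and latent file format depend on `extract_dac_latents.py`:

```
dataset/
├── wavs/                # source audio
│   └── utt_0001.wav
└── dac/                 # extracted DAC-VAE latents
    └── utt_0001.latent  # naming and format are illustrative
```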
3. **Stage 1: Autoregressive Transformer**

```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
# ...
```
4. **Stage 2: Flow matching decoder**

```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
# ...
```
## Project Structure

```
minimax-speech/
├── assets/
...
```
## License

This project follows the licensing terms of its dependencies:

- CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.