Commit `797c99c` (parent: `7215932`) by primepake — update readme

README.md (changed):
Maps discrete tokens to a continuous latent space using a Variational Autoencoder.
### 1. Model Training

#### BPE tokens to DAC codec tokens

- Based on the
- Uses an autoregressive model to predict the DAC codec tokens, with a learnable speaker extractor
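The autoregressive stage amounts to standard next-token prediction over codec indices, conditioned on the text and a speaker embedding. A minimal sketch with a stand-in `next_token_logits` model (all names and sizes here are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 1024  # hypothetical DAC codebook size


def next_token_logits(prefix, speaker_emb):
    """Stand-in for the AR transformer: returns logits over codec tokens.

    A real model would attend over the BPE text tokens, the codec-token
    prefix, and the speaker-extractor embedding."""
    h = speaker_emb.sum() + len(prefix)       # dummy conditioning signal
    return rng.normal(loc=float(h) % 1.0, size=CODEBOOK_SIZE)


def generate_codec_tokens(bpe_tokens, speaker_emb, max_len=8):
    """Greedy autoregressive decoding of DAC codec tokens."""
    prefix = list(bpe_tokens)                 # text tokens as the prompt
    out = []
    for _ in range(max_len):
        logits = next_token_logits(prefix + out, speaker_emb)
        out.append(int(np.argmax(logits)))    # greedy; sampling also works
    return out


toks = generate_codec_tokens([5, 9, 2], np.zeros(16))
print(len(toks))  # 8 codec tokens
```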
#### DAC codec tokens to DAC-VAE latent

- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens
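A flow-matching decoder of this kind is trained to regress a velocity field between noise and the target DAC-VAE latent. A minimal sketch of the standard linear-interpolation training target (toy shapes chosen for illustration; conditioning on the codec tokens is abstracted away):

```python
import numpy as np

rng = np.random.default_rng(0)


def flow_matching_pair(latent, rng):
    """Build one conditional flow-matching training example.

    latent: target DAC-VAE latent x1, shape (frames, dim).
    Returns (x_t, t, v): the interpolated point, the time, and the
    velocity target the decoder should predict at (x_t, t, condition).
    """
    x1 = latent
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
    v = x1 - x0                          # constant velocity target
    return x_t, t, v


latent = rng.standard_normal((50, 64))   # toy (frames, latent_dim)
x_t, t, v = flow_matching_pair(latent, rng)
```

Following the velocity from `x_t` for the remaining time `1 - t` recovers the target exactly: `x_t + (1 - t) * v == x1`.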
### 2. Feature Extraction

Before training the main model:

1. Extract discrete tokens using the trained DAC codec ([Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec))
2. Generate continuous latent representations using the trained DAC-VAE (a pretrained checkpoint is provided here: [DAC-VAE](https://drive.google.com/file/d/1iwZhPlcdDwvPjeON3bFAeYarsV4ZtI2E/view?usp=sharing))
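The two extraction passes above amount to a single offline loop over the dataset, caching both training targets per utterance. A sketch with stand-in `codec_encode` / `vae_encode` functions (both hypothetical — the real calls depend on the DAC and DAC-VAE checkpoints and their APIs):

```python
import numpy as np

rng = np.random.default_rng(0)


def codec_encode(wav):
    """Stand-in for the trained DAC codec: waveform -> discrete token ids."""
    n_frames = len(wav) // 320                   # hypothetical hop size
    return rng.integers(0, 1024, size=n_frames)  # hypothetical codebook size


def vae_encode(wav):
    """Stand-in for the trained DAC-VAE: waveform -> continuous latents."""
    n_frames = len(wav) // 320
    return rng.standard_normal((n_frames, 64))   # hypothetical latent dim


def extract_features(dataset):
    """Offline pass: cache both targets needed by the two training stages."""
    cache = {}
    for utt_id, wav in dataset.items():
        cache[utt_id] = {
            "codec_tokens": codec_encode(wav),   # Stage 1 targets
            "vae_latent": vae_encode(wav),       # Stage 2 targets
        }
    return cache


dataset = {"utt0": np.zeros(3200), "utt1": np.zeros(6400)}
cache = extract_features(dataset)
```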
### 3. Two-Stage Training

Train the models sequentially:

- **Stage 1**: BPE tokens → discrete DAC codec tokens
- **Stage 2**: discrete DAC codec tokens → DAC-VAE continuous latent space
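At inference time the two stages simply compose: text → codec tokens → continuous latent (a decoder/vocoder would then produce audio). A shape-level sketch with dummy stages, all names and dimensions hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)


def stage1_ar(bpe_tokens):
    """Stage 1 stand-in: BPE tokens -> discrete DAC codec tokens (AR model)."""
    return rng.integers(0, 1024, size=4 * len(bpe_tokens))  # ~4 tokens/BPE


def stage2_flow(codec_tokens):
    """Stage 2 stand-in: codec tokens -> DAC-VAE latents (flow matching)."""
    return rng.standard_normal((len(codec_tokens), 64))     # one frame/token


bpe = [17, 3, 99]
codec = stage1_ar(bpe)       # Stage 1 output
latent = stage2_flow(codec)  # Stage 2 output; VAE decoder would follow
```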
## Getting Started

```bash
pip install -r requirements.txt
```
### Training Pipeline

1. **Extracting DAC Codec** (if not using pretrained)

   ```bash
   # Add training command
   ```

2. **Extracting DAC-VAE latent**

   ```bash
   python inference.py
   ```

3. **Stage 1: Autoregressive Transformer**

   ```bash
   # Add Stage 1 training command
   ```

4. **Stage 2: Flow Matching Decoder**

   ```bash
   # Add Stage 2 training command
   ```