# MiniMax-Speech Technical Implementation

An unofficial implementation based on the MiniMax-Speech technical report, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

![MiniMax-Speech Architecture](assets/image.png)

## Overview

This repository provides an implementation of the MiniMax-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

## Key Features

- [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- [x] **Flow Matching AE**: Flow-matching training for autoencoders
- [x] **Immiscible Assignment**: Supports immiscible noise assignment during training (see the sketch below)
- [x] **Contrastive Flow Matching**: Supports contrastive flow-matching training (see the sketch below)
- [ ] **Checkpoint Release**: Release of the LLM and contrastive FM checkpoints
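
The two training-time techniques above combine naturally into a single flow-matching step. The sketch below is a minimal, hypothetical illustration in PyTorch; the `model` signature, tensor shapes, and the `lambda_contrast` weight are assumptions, not this repository's actual API.

```python
import torch

def flow_matching_step(model, x1, cond, lambda_contrast=0.05):
    """One illustrative training step. x1: clean latents (B, T, D);
    cond: conditioning, e.g. FSQ-token embeddings."""
    B = x1.size(0)
    noise = torch.randn_like(x1)

    # Immiscible assignment: pair each sample with its nearest noise draw so
    # data/noise couplings do not mix across the batch (a greedy stand-in for
    # the full linear assignment).
    dist = torch.cdist(x1.flatten(1), noise.flatten(1))   # (B, B) pairwise distances
    x0 = noise[dist.argmin(dim=1)]

    t = torch.rand(B, 1, 1, device=x1.device)             # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                            # linear interpolation path
    v_target = x1 - x0                                    # conditional velocity target

    v_pred = model(xt, t, cond)
    loss_fm = (v_pred - v_target).pow(2).mean()

    # Contrastive flow matching: repel the prediction from the velocity of a
    # mismatched (batch-rolled) pair, weighted by a small coefficient.
    v_neg = x1.roll(shifts=1, dims=0) - x0
    return loss_fm - lambda_contrast * (v_pred - v_neg).pow(2).mean()
```
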
## Architecture

### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
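
For intuition, a minimal FSQ quantizer might look like the sketch below; the level counts are illustrative, not the ones S3Tokenizer actually uses. Each latent dimension is bounded, rounded to a small odd number of levels with a straight-through gradient, and the per-dimension level indices are packed into a single token id.

```python
import torch

def fsq_quantize(z, levels=(5, 5, 5, 5)):
    """z: (..., len(levels)) latents; returns quantized latents and token ids.
    Odd level counts keep the rounding grid symmetric around zero."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    z_bounded = torch.tanh(z) * half                      # bound each dim to its level range
    z_q = z_bounded + (z_bounded.round() - z_bounded).detach()  # straight-through rounding
    codes = (z_q + half).round().long()                   # non-negative level index per dim
    radix = torch.cat([torch.ones_like(L[:1]), L[:-1]]).cumprod(0).long()
    return z_q, (codes * radix).sum(dim=-1)               # pack indices into one token id
```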

### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

> **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
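
Concretely, the regression target for the Stage 2 decoder is a sample from the VAE posterior, drawn with the usual reparameterization trick. This is a generic sketch, not DAC-VAE's exact code:

```python
import torch

def reparameterize(mu, logvar):
    """Differentiably sample z ~ N(mu, diag(sigma^2)) from encoder outputs."""
    std = (0.5 * logvar).exp()
    return mu + std * torch.randn_like(std)
```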

## Implementation Pipeline

### 1. Model Training

#### BPE tokens to FSQ tokens
- Based on FSQ (finite scalar quantization)
- An autoregressive transformer predicts the FSQ tokens, conditioned on a learnable speaker extractor (see the sketch below)
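
A hypothetical sketch of this stage is shown below; layer sizes, names, and the exact prefix layout are assumptions, not the repository's actual model. A causal transformer consumes a speaker embedding and text BPE embeddings as a prefix, then predicts FSQ token logits autoregressively.

```python
import torch
import torch.nn as nn

class ARSpeechLM(nn.Module):
    """Toy autoregressive text-to-speech-token model (illustrative only;
    positional encodings omitted for brevity)."""

    def __init__(self, n_bpe, n_fsq, d_model=512):
        super().__init__()
        self.text_emb = nn.Embedding(n_bpe, d_model)
        self.fsq_emb = nn.Embedding(n_fsq, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, n_fsq)

    def forward(self, bpe_ids, fsq_prev, spk_emb):
        # Prefix = [speaker embedding, text tokens], followed by the
        # teacher-forced (shifted) speech tokens; a causal mask enforces
        # autoregression over the whole sequence.
        x = torch.cat([spk_emb.unsqueeze(1),
                       self.text_emb(bpe_ids),
                       self.fsq_emb(fsq_prev)], dim=1)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.backbone(x, mask=causal)
        return self.head(h[:, -fsq_prev.size(1):])  # next-token logits at speech positions
```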

#### FSQ tokens to DAC-VAE latent
- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens

### 2. Feature Extraction

Before training the main model:
1. Extract discrete tokens using the trained FSQ tokenizer, [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided here: [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)

### 3. Two-Stage Training

Train the models sequentially:
- **Stage 1**: BPE tokens → discrete FSQ tokens
- **Stage 2**: Discrete FSQ tokens → continuous DAC-VAE latent space

## Getting Started

### Prerequisites
```bash
# Install the project dependencies
pip install -r requirements.txt
```

### Training Pipeline

1. **Extracting FSQ tokens**
   ```bash
   pip install s3tokenizer
   s3tokenizer --wav_scp data.scp \
            --device "cuda" \
            --output_dir "./data" \
            --batch_size 32 \
            --model "speech_tokenizer_v2_25hz"
   ```
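
   The `--wav_scp` file is expected to be a Kaldi-style scp list, one `<utterance_id> <wav_path>` pair per line; the paths below are illustrative:
   ```
   utt0001 /data/wavs/utt0001.wav
   utt0002 /data/wavs/utt0002.wav
   ```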

2. **Extracting DAC-VAE latent**
   ```bash
   cd dac-vae
   python inference.py --checkpoint checkpoint.pt --config config.yml
   ```

3. **Stage 1: Autoregressive Transformer**
   ```bash
   #!/bin/bash
   pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B

   export CUDA_VISIBLE_DEVICES="0"
   num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
   job_id=1986
   dist_backend="nccl"
   num_workers=2
   prefetch=100
   train_engine=torch_ddp
   model=llm

   torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
   train.py \
   --train_engine $train_engine \
   --config config.yaml \
   --train_data data/data.list \
   --cv_data data/data.list \
   --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
   --model $model \
   --model_dir /data/checkpoint/$model/ \
   --num_workers ${num_workers} \
   --prefetch ${prefetch} \
   --pin_memory \
   --use_amp \
   --comet_disabled

   ```

4. **Stage 2: Flow Matching Decoder**
   ```bash
   #!/bin/bash
   pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
   export CUDA_VISIBLE_DEVICES="0"
   num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
   job_id=1986
   dist_backend="nccl"
   num_workers=2
   prefetch=100
   train_engine=torch_ddp
   model=flow  # stage 2 trains the flow matching decoder (stage 1 uses model=llm)

   torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
   train.py \
   --train_engine $train_engine \
   --config config.yaml \
   --train_data data/data.list \
   --cv_data data/data.list \
   --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
   --model $model \
   --model_dir /data/checkpoint/$model/ \
   --num_workers ${num_workers} \
   --prefetch ${prefetch} \
   --pin_memory \
   --use_amp \
   --comet_disabled

   ```

## Project Structure
```
minimax-speech/
├── assets/
├── dac-vae/
├── flowae/
├── speech/
│   ├── llm/
│   └── flow/
└── README.md
```

## Related Projects

This implementation builds upon several key projects:

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **MiniMax-Speech**: Original technical report and methodology

## Citation

If you use this code in your research, please cite:

```bibtex
@article{minimax-speech,
  title={MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder},
  author={MiniMax},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```

## License

This project follows the licensing terms of its dependencies:
- CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)

## Acknowledgments

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ tokenizer implementation
- **MiniMax team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.

## Contact

- Email: nguyennhutsam.math@gmail.com
- LinkedIn: https://www.linkedin.com/in/primepake/