# MiniMax-Speech Technical Implementation

An unofficial implementation based on the MiniMax-Speech technical report, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

![MiniMax-Speech Architecture](assets/image.png)

## Overview

This repository provides an implementation of the MiniMax-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

## Key Features

- [ ] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
- [ ] **FSQ tokenizer training**: Training FSQ from scratch
- [ ] **Two-Stage Architecture**: Optimized training pipeline with discrete and continuous representations
- [ ] **Modular Design**: Separate components for audio codec and variational autoencoder
- [ ] **CosyVoice2 Decoder**: Leverages proven components from the CosyVoice2's Decoder framework
- [ ] **Flow matching AE**: Flow matching training for autoencoders
- [ ] **Immiscible assignment**: Supports immiscible noise assignment during training
- [ ] **Contrastive flow matching**: Supports contrastive flow-matching training
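
The immiscible-assignment feature above can be sketched in a few lines. This is a minimal illustration of the idea (re-pairing each training sample with its nearest noise vector via a linear assignment before the flow-matching interpolation), not this repository's actual code; `immiscible_assign` is a hypothetical helper and it requires SciPy.

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_assign(x1: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Permute `noise` so each data sample x1[i] is paired with a nearby noise vector."""
    # pairwise L2 distances between flattened data and noise samples: (B, B)
    cost = torch.cdist(x1.flatten(1), noise.flatten(1))
    # solve the batch-level linear assignment problem (minimum total distance)
    _, cols = linear_sum_assignment(cost.cpu().numpy())
    return noise[torch.as_tensor(cols)]

x1 = torch.randn(8, 80, 100)        # dummy batch of latents
noise = torch.randn_like(x1)
x0 = immiscible_assign(x1, noise)   # noise endpoints, re-paired per sample
```

The re-paired noise `x0` then replaces i.i.d. noise as the starting point of the flow-matching path, which shortens the average transport distance within each batch.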

## Architecture

### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.

### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

> **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
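
The finite scalar quantization (FSQ) used in Stage 1 can be sketched generically as below. This is an illustration of the technique, not the S3Tokenizer implementation; odd level counts are used so plain rounding suffices (even counts need a half-step offset), and a straight-through estimator lets gradients pass the non-differentiable rounding.

```python
import torch

def fsq(z: torch.Tensor, levels=(5, 5, 5, 5)) -> torch.Tensor:
    """Quantize each channel of z (..., len(levels)) to the given (odd) level counts."""
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    bounded = torch.tanh(z) * half        # bound each channel to [-half, half]
    quantized = torch.round(bounded)      # snap to the integer grid
    # straight-through estimator: forward uses round(), backward is identity
    return bounded + (quantized - bounded).detach()

z = torch.randn(2, 10, 4, requires_grad=True)
codes = fsq(z)                            # implicit codebook of 5*5*5*5 = 625 codes
codes.sum().backward()                    # gradients reach z through the STE
```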

## Implementation Pipeline

### 1. Model Training

#### BPE tokens to FSQ tokens
- Based on the FSQ (S3Tokenizer) framework
- An autoregressive transformer predicts the FSQ tokens, conditioned on a learnable speaker extractor

#### FSQ tokens to DAC-VAE latent
- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens
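
The flow-matching objective behind this decoder can be sketched as follows, assuming the standard straight-path (conditional) formulation; `model` is a tiny stand-in for CosyVoice2's flow decoder, and the 80-dim latent size is illustrative.

```python
import torch
import torch.nn as nn

# stand-in velocity network: input is (latent, t), output is a velocity
model = nn.Sequential(nn.Linear(81, 256), nn.SiLU(), nn.Linear(256, 80))

def cfm_loss(x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                    # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)               # random time per sample
    xt = (1 - t) * x0 + t * x1                   # point on the straight path
    target_v = x1 - x0                           # conditional velocity target
    pred_v = model(torch.cat([xt, t], dim=-1))   # predicted velocity at (xt, t)
    return (pred_v - target_v).pow(2).mean()

loss = cfm_loss(torch.randn(16, 80))             # dummy batch of DAC-VAE latents
loss.backward()
```

At inference, integrating the learned velocity field from noise at t=0 to t=1 (e.g. with a few Euler steps) produces the continuous latent.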

### 2. Feature Extraction

Before training the main model:
1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE (a pretrained checkpoint is provided here: [DAC-VAE](https://drive.google.com/file/d/1iwZhPlcdDwvPjeON3bFAeYarsV4ZtI2E/view?usp=sharing))

### 3. Two-Stage Training

Train the models sequentially:
- **Stage 1**: BPE tokens → discrete FSQ tokens
- **Stage 2**: Discrete FSQ tokens → DAC-VAE continuous latent space

## Getting Started

### Prerequisites
```bash
# Install dependencies
pip install -r requirements.txt
```

### Training Pipeline

1. **Train the FSQ tokenizer** (if not using pretrained)
   ```bash
   # Add training command
   ```

2. **Extracting DAC-VAE latent**
   ```bash
   cd dac-vae
   python inference.py --checkpoint checkpoint.pt --config config.yml
   ```

3. **Stage 1: Autoregressive Transformer**
   ```bash
   # Add Stage 1 training command
   ```

4. **Stage 2: Flow Matching Decoder**
   ```bash
   # Add Stage 2 training command
   ```
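
The contrastive flow-matching option listed under Key Features can be sketched like this, assuming the common formulation in which the predicted velocity is pulled toward its own target and pushed away from a mismatched sample's target; the function name and the `lam` weight are illustrative, not this repository's API.

```python
import torch

def contrastive_cfm_loss(pred_v, x0, x1, lam=0.05):
    target_v = x1 - x0                              # matched (positive) velocity
    neg_v = target_v[torch.randperm(x1.shape[0])]   # a mismatched pair's velocity
    attract = (pred_v - target_v).pow(2).mean()     # standard CFM term
    repel = (pred_v - neg_v).pow(2).mean()          # contrastive push-away term
    return attract - lam * repel

pred_v = torch.randn(16, 80)                        # stand-in model prediction
x0, x1 = torch.randn(16, 80), torch.randn(16, 80)   # noise and data endpoints
loss = contrastive_cfm_loss(pred_v, x0, x1)
```

With `lam = 0` this reduces to the plain conditional flow-matching loss, so the contrastive term can be enabled incrementally.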

## Project Structure
```
minimax-speech/
├── assets/
│   └── image.png
├── configs/
│   └── dac_vae.yaml
├── models/
│   ├── fsq/
│   └── dac_vae/
├── cosyvoice/          # Components from CosyVoice2
│   ├── flow/
│   ├── transformer/
│   └── utils/
└── README.md
```

## Related Projects

This implementation builds upon several key projects:

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **MiniMax-Speech**: Original technical report and methodology

## Citation

If you use this code in your research, please cite:

```bibtex
@article{minimax-speech,
  title={MiniMax-Speech},
  author={MiniMax Team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```

## License

This project follows the licensing terms of its dependencies:
- CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
- Original contributions: [Specify your license here]

## Acknowledgments

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- **MiniMax team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.

## Contact

- Email: nguyennhutsam.math@gmail.com
- LinkedIn: https://www.linkedin.com/in/primepake/