# MiniMax-Speech Technical Implementation
An unofficial implementation based on the MiniMax-Speech technical report, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

## Overview
This repository provides an implementation of the MiniMax-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.
## Key Features
- [x] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
- [x] **Flow matching AE**: Flow matching training for autoencoders
- [x] **Immiscible assignment**: Nearest-neighbor noise assignment during flow matching training
- [x] **Contrastive flow matching**: Contrastive training objective for the flow matching decoder
- [ ] **Checkpoint release**: Release LLM and Contrastive FM checkpoint
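The immiscible assignment feature pairs each training sample with a nearby noise sample within the batch before the flow matching step, so noise-to-data paths do not mix. A minimal sketch (the function name is illustrative, and the greedy pairing below is a simple stand-in for the optimal linear assignment usually used):

```python
import torch

def immiscible_assign(x, noise):
    """Reorder `noise` so each data sample is paired with a nearby noise
    sample. Greedy nearest-neighbor pairing; the original technique uses an
    optimal linear assignment (e.g. a Hungarian solver) over the same costs."""
    cost = torch.cdist(x.flatten(1), noise.flatten(1))  # (B, B) L2 distances
    paired = torch.empty_like(noise)
    taken = torch.zeros(cost.size(1), dtype=torch.bool)
    for i in range(cost.size(0)):
        c = cost[i].masked_fill(taken, float("inf"))    # skip used noise rows
        j = int(c.argmin())                             # closest free noise
        taken[j] = True
        paired[i] = noise[j]
    return paired
```

The reordered noise then replaces the i.i.d. noise in the usual flow matching loss; everything else in training is unchanged.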
## Architecture
### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
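FSQ replaces codebook lookup with per-dimension rounding onto a small integer grid. A minimal sketch of the quantizer core (the level counts are illustrative, not the exact S3Tokenizer configuration):

```python
import torch

def fsq_quantize(z, levels=(5, 5, 5, 5)):
    """Finite Scalar Quantization: bound each latent dimension with tanh,
    round to a small odd number of integer levels, and pass gradients
    straight through the rounding op."""
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2  # e.g. 2 for 5 levels
    bounded = torch.tanh(z) * half           # each dim now lies in (-half, half)
    quantized = torch.round(bounded)         # snap to the integer grid
    return bounded + (quantized - bounded).detach()  # straight-through estimator
```

Each frame's code is then the tuple of integer grid positions, giving a fixed vocabulary of `prod(levels)` discrete tokens without a learned codebook.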
### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
> **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
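The DAC-VAE encoder head produces a continuous latent per frame via the standard reparameterization trick; a generic sketch (not this repo's exact module):

```python
import torch

def sample_latent(mean, logvar):
    """Reparameterized sample z = mean + sigma * eps. At inference time the
    mean alone is often used as the deterministic latent."""
    std = torch.exp(0.5 * logvar)
    return mean + std * torch.randn_like(mean)
```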
## Implementation Pipeline
### 1. Model Training
#### BPE tokens to FSQ tokens
- Based on the FSQ (S3Tokenizer) token space
- An autoregressive transformer predicts the FSQ tokens, conditioned on a learnable speaker extractor
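At inference this stage can be driven with a simple greedy decoding loop; a hedged sketch (the `llm` call signature, argument names, and `eos_id` are illustrative assumptions, not this repo's API):

```python
import torch

def ar_generate(llm, text_ids, spk_emb, max_len=100, eos_id=0):
    """Autoregressively emit FSQ speech tokens from text BPE ids and a
    speaker embedding, stopping at an end-of-speech token."""
    speech_ids = torch.empty(1, 0, dtype=torch.long)
    for _ in range(max_len):
        logits = llm(text_ids, speech_ids, spk_emb)   # (1, T, vocab)
        nxt = logits[:, -1].argmax(-1, keepdim=True)  # greedy next token
        if int(nxt) == eos_id:
            break
        speech_ids = torch.cat([speech_ids, nxt], dim=1)
    return speech_ids
```

In practice sampling with temperature/top-k typically replaces the greedy `argmax`.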
#### FSQ tokens to DAC-VAE latent
- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens
### 2. Feature Extraction
Before training the main model:
1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE (a pretrained checkpoint is provided: [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae))
### 3. Two-Stage Training
Train the models sequentially:
- **Stage 1**: BPE tokens → Discrete FSQ
- **Stage 2**: Discrete FSQ → DAC-VAE Continuous latent space
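The contrastive flow matching objective listed in the features can be written over a linear noise-to-latent path; a minimal sketch (the weight `lam`, shapes, and model signature are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_fm_loss(model, x1, cond, lam=0.05):
    """Regress the straight-path velocity toward the true latent x1, while
    pushing predictions away from velocities toward mismatched batch items."""
    x0 = torch.randn_like(x1)                # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)            # one timestep per sample
    xt = (1 - t) * x0 + t * x1               # point on the linear path
    v_pred = model(xt, t, cond)
    pos = F.mse_loss(v_pred, x1 - x0)        # match the true velocity
    neg_target = x1[torch.randperm(x1.size(0))] - x0  # mismatched targets
    neg = F.mse_loss(v_pred, neg_target)
    return pos - lam * neg                   # contrastive flow matching
```

Setting `lam=0` recovers the plain conditional flow matching loss used for Stage 2.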
## Getting Started
### Prerequisites
```bash
# Install Python dependencies
pip install -r requirements.txt
```
### Training Pipeline
1. **Extracting FSQ**
```bash
pip install s3tokenizer
s3tokenizer --wav_scp data.scp \
--device "cuda" \
--output_dir "./data" \
--batch_size 32 \
--model "speech_tokenizer_v2_25hz"
```
2. **Extracting DAC-VAE latent**
```bash
cd dac-vae
python inference.py --checkpoint checkpoint.pt --config config.yml
```
3. **Stage 1: Autoregressive Transformer**
```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=llm
torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
train.py \
--train_engine $train_engine \
--config config.yaml \
--train_data data/data.list \
--cv_data data/data.list \
--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
--model $model \
--model_dir /data/checkpoint/$model/ \
--num_workers ${num_workers} \
--prefetch ${prefetch} \
--pin_memory \
--use_amp \
--comet_disabled
```
4. **Stage 2: Flow matching decoder**
```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=flow
torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
train.py \
--train_engine $train_engine \
--config config.yaml \
--train_data data/data.list \
--cv_data data/data.list \
--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
--model $model \
--model_dir /data/checkpoint/$model/ \
--num_workers ${num_workers} \
--prefetch ${prefetch} \
--pin_memory \
--use_amp \
--comet_disabled
```
## Project Structure
```
minimax-speech/
├── assets/
├── dac-vae/
├── flowae/
├── speech/
│   ├── llm/
│   └── flow/
└── README.md
```
## Related Projects
This implementation builds upon several key projects:
- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **MiniMax-Speech**: Original technical report and methodology
## Citation
If you use this code in your research, please cite:
```bibtex
@article{minimax-speech,
  title={MiniMax-Speech},
  author={MiniMax team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}
@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```
## License
This project follows the licensing terms of its dependencies:
- CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
## Acknowledgments
- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- **MiniMax team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.
## Contact
- Email: nguyennhutsam.math@gmail.com
- LinkedIn: https://www.linkedin.com/in/primepake/