# MiniMax-Speech Technical Implementation
An unofficial implementation based on the MiniMax-Speech technical report, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

## Overview
This repository provides an implementation of the MiniMax-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.
## Key Features
- [ ] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
- [ ] **FSQ tokenizer training**: Training the FSQ tokenizer from scratch
- [ ] **Two-Stage Architecture**: Optimized training pipeline with discrete and continuous representations
- [ ] **Modular Design**: Separate components for audio codec and variational autoencoder
- [ ] **CosyVoice2 Decoder**: Leverages proven components from the CosyVoice2 decoder framework
- [ ] **Flow matching AE**: Flow matching training for autoencoders
- [ ] **Immiscible assignment**: Supports immiscible noise assignment during training
- [ ] **Contrastive Flow matching**: Supports contrastive flow matching training
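The immiscible assignment feature can be sketched as follows: draw a noise batch as usual, then re-pair noise with data via a linear assignment that minimizes the total distance, so each example diffuses toward nearby noise. `immiscible_assign` is a hypothetical helper written for illustration, not part of this repo:

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_assign(x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Re-pair a noise batch with a data batch by solving a linear
    assignment that minimizes the total L2 distance between pairs."""
    # pairwise distances between flattened data and noise samples
    dist = torch.cdist(x.flatten(1), noise.flatten(1))
    _, cols = linear_sum_assignment(dist.cpu().numpy())
    return noise[torch.as_tensor(cols)]  # noise reordered to match x row-by-row

# usage: re-assign the noise before building the interpolated training pair
x = torch.randn(8, 4, 16)       # batch of latents (hypothetical shape)
noise = torch.randn_like(x)
assigned = immiscible_assign(x, noise)
```

By optimality of the assignment, the matched pairs are never farther apart (in total) than the original random pairing.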
## Architecture
### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
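As a rough illustration of the discretization step: FSQ bounds each latent channel, rounds it to a small integer grid, and passes gradients through the rounding with a straight-through estimator. This is a simplified sketch (odd level counts only), not the S3Tokenizer code:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Finite Scalar Quantization (simplified, odd `levels` only):
    bound each channel to (-half, half), round to the integer grid,
    and use a straight-through estimator (STE) for gradients."""
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half                 # squash into (-half, half)
    rounded = torch.round(bounded)                 # snap to the grid
    return bounded + (rounded - bounded).detach()  # forward: rounded; backward: bounded

z = torch.randn(2, 50, 8)   # (batch, frames, channels) - hypothetical shape
codes = fsq_quantize(z)
```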
### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
> **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
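For reference, the VAE bottleneck follows the standard reparameterization trick; a minimal sketch of the sampling and KL term (not the DAC-VAE code):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def kl_divergence(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ), averaged over all elements."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
```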
## Implementation Pipeline
### 1. Model Training
#### BPE tokens to FSQ tokens
- Based on the FSQ (S3Tokenizer) discrete representation
- Uses an autoregressive transformer with a learnable speaker extractor to predict the FSQ tokens
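A minimal sketch of this stage, assuming a transformer backbone with hypothetical vocabulary sizes and dimensions; `spk_emb` stands in for the output of the learnable speaker extractor:

```python
import torch
import torch.nn as nn

class ARTokenPredictor(nn.Module):
    """Sketch: autoregressive prediction of FSQ tokens conditioned on a
    speaker embedding and BPE text tokens (all sizes are assumptions)."""
    def __init__(self, n_bpe=1000, n_fsq=6561, d=256):
        super().__init__()
        self.text_emb = nn.Embedding(n_bpe, d)
        self.fsq_emb = nn.Embedding(n_fsq, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_fsq)

    def forward(self, spk_emb, text, fsq):
        # sequence: [speaker] + text tokens + FSQ tokens generated so far
        x = torch.cat([spk_emb.unsqueeze(1),
                       self.text_emb(text),
                       self.fsq_emb(fsq)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        # each FSQ position is predicted from the tokens before it
        return self.head(h[:, -(fsq.size(1) + 1):-1])
```

Training would apply cross-entropy between these logits and the ground-truth FSQ tokens.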
#### FSQ tokens to DAC-VAE latent
- Based on the CosyVoice2 flow matching decoder
- Learns continuous latent representations from discrete tokens
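One training step of this objective can be sketched with a generic conditional flow-matching loss; `model` stands in for the decoder, and its signature is an assumption:

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching (sketch): interpolate between noise x0
    and the target latent x1 along a straight line, and regress the
    model's predicted velocity toward the target velocity x1 - x0."""
    x0 = torch.randn_like(x1)             # noise endpoint
    t = torch.rand(x1.size(0), 1, 1)      # one random time per example
    xt = (1 - t) * x0 + t * x1            # point on the straight path
    v_pred = model(xt, t.view(-1), cond)  # hypothetical decoder signature
    return torch.mean((v_pred - (x1 - x0)) ** 2)
```

The contrastive variant listed in the features extends this loss with a term that pushes predictions away from targets built from mismatched conditioning pairs.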
### 2. Feature Extraction
Before training the main model:
1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE - a pretrained checkpoint is provided here: [DAC-VAE](https://drive.google.com/file/d/1iwZhPlcdDwvPjeON3bFAeYarsV4ZtI2E/view?usp=sharing)
### 3. Two-Stage Training
Train the models sequentially:
- **Stage 1**: BPE tokens → Discrete FSQ
- **Stage 2**: Discrete FSQ → DAC-VAE Continuous latent space
## Getting Started
### Prerequisites
```bash
# List your dependencies here
pip install -r requirements.txt
```
### Training Pipeline
1. **Extracting FSQ tokens** (if not using a pretrained tokenizer)
```bash
# Add training command
```
2. **Extracting DAC-VAE latent**
```bash
cd dac-vae
python inference.py --checkpoint checkpoint.pt --config config.yml
```
3. **Stage 1: Autoregressive Transformer**
```bash
# Add feature extraction commands
```
4. **Stage 2: Flow Matching Decoder**
```bash
# Add main training command
```
## Project Structure
```
minimax-speech/
├── assets/
│   └── image.png
├── configs/
│   └── dac_vae.yaml
├── models/
│   ├── fsq/
│   └── dac_vae/
├── cosyvoice/   # Components from CosyVoice2
│   ├── flow/
│   ├── transformer/
│   └── utils/
└── README.md
```
## Related Projects
This implementation builds upon several key projects:
- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **MiniMax-Speech**: Original technical report and methodology
## Citation
If you use this code in your research, please cite:
```bibtex
@article{minimax-speech,
  title={MiniMax-Speech},
  author={MiniMax team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}
@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```
## License
This project follows the licensing terms of its dependencies:
- CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
- Original contributions: [Specify your license here]
## Acknowledgments
- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- **MiniMax team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.
## Contact
- Email: nguyennhutsam.math@gmail.com
- LinkedIn: https://www.linkedin.com/in/primepake/