primepake committed · Commit 4025348 · 0 parents

first commit
Files changed (3)
  1. .gitattributes +3 -0
  2. README.md +150 -0
  3. assets/image.png +3 -0
.gitattributes ADDED

*.png filter=lfs diff=lfs merge=lfs -text
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
README.md ADDED
# MiniMax-Speech Technical Implementation

An unofficial implementation based on the MiniMax-Speech technical report, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

![MiniMax-Speech Architecture](assets/image.png)

## Overview

This repository provides an implementation of the MiniMax-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

## Key Features

- **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- **Two-Stage Architecture**: Optimized training pipeline with discrete and continuous representations
- **Modular Design**: Separate components for the audio codec and the variational autoencoder
- **CosyVoice2 Integration**: Leverages proven components from the CosyVoice2 framework

## Architecture

### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the DAC (Descript Audio Codec) framework.

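DAC discretizes audio with residual vector quantization (RVQ): each codebook quantizes the residual left by the previous one, and reconstruction sums the chosen codewords. A toy sketch of that idea — the scalar values and hand-picked codebooks here are illustrative assumptions, not DAC's learned feature-space codebooks:

```python
# Toy residual vector quantization (RVQ), the discretization idea behind DAC.
# Illustrative only: real DAC quantizes learned encoder features, not scalars.

def quantize(value, codebook):
    """Return (index, codeword) of the nearest codebook entry."""
    index = min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))
    return index, codebook[index]

def rvq_encode(value, codebooks):
    """Each quantizer encodes the residual left by the previous one."""
    tokens, residual = [], value
    for codebook in codebooks:
        index, codeword = quantize(residual, codebook)
        tokens.append(index)
        residual -= codeword
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruction is the sum of the selected codewords."""
    return sum(cb[t] for t, cb in zip(tokens, codebooks))

# Coarse-to-fine codebooks: each stage covers a finer range of residuals.
codebooks = [
    [-1.0, 0.0, 1.0],                # coarse
    [-0.5, -0.25, 0.0, 0.25, 0.5],   # medium
    [-0.1, -0.05, 0.0, 0.05, 0.1],   # fine
]
tokens = rvq_encode(0.7, codebooks)   # e.g. [2, 1, 1]
approx = rvq_decode(tokens, codebooks)
```

Each extra codebook refines the approximation, which is why more codebooks trade bitrate for fidelity.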
### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

> **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.

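The VAE in this stage relies on the standard reparameterization trick plus a KL penalty toward a unit Gaussian prior. A minimal single-dimension sketch — the `mu`/`log_var` values are made up; a real DAC-VAE predicts them per frame with neural encoders:

```python
# The VAE's core mechanics in plain Python: reparameterized sampling and the
# per-dimension KL term against a standard normal prior.
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps so gradients can flow through mu and sigma."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0, 1)) for one diagonal-Gaussian dimension."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)

rng = random.Random(0)
z = reparameterize(mu=0.5, log_var=-1.0, rng=rng)
kl = kl_divergence(mu=0.5, log_var=-1.0)
```

The KL term is zero exactly when the posterior matches the prior (`mu=0`, `log_var=0`), which is what regularizes the latent space.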
## Implementation Pipeline

### 1. Model Training

#### DAC Codec
- Based on the [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)
- Provides efficient audio tokenization
- Utilizes CosyVoice2's optimized training pipeline

#### DAC-VAE
- Train using `train_dac_vae.py`
- Learns continuous latent representations from discrete tokens
- Architecture adapted from CosyVoice2's VAE implementation

### 2. Feature Extraction

Before training the main model:
1. Extract discrete tokens using the trained DAC codec
2. Generate continuous latent representations using the trained DAC-VAE

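The two extraction steps above amount to one offline caching pass over the dataset. A runnable sketch of that control flow — the `codec_encode`/`vae_encode` callables and the toy stand-ins below are hypothetical, standing in for the trained DAC codec and DAC-VAE:

```python
# Offline feature extraction: cache stage-1 tokens and stage-2 latents per
# utterance before main-model training. The encoders are hypothetical stubs.

def extract_features(utterances, codec_encode, vae_encode):
    """Cache discrete tokens and continuous latents for each utterance."""
    cache = {}
    for utt_id, audio in utterances.items():
        tokens = codec_encode(audio)     # step 1: discrete tokens from DAC
        latents = vae_encode(tokens)     # step 2: continuous latents from VAE
        cache[utt_id] = {"tokens": tokens, "latents": latents}
    return cache

# Stand-in encoders: a crude sign-based tokenizer and a scaling "VAE".
toy_codec = lambda audio: [1 if s > 0 else 0 for s in audio]
toy_vae = lambda tokens: [2.0 * t - 1.0 for t in tokens]

utterances = {"utt_001": [0.3, -0.2, 0.9]}
cache = extract_features(utterances, toy_codec, toy_vae)
```

In practice the cache would be written to disk so the main training loop never has to run the frozen codec or VAE online.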
### 3. Two-Stage Training

Train the models sequentially:
- **Stage 1**: Audio → Discrete token modeling
- **Stage 2**: Discrete token → Continuous latent space modeling

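The sequential schedule amounts to two independent training loops run back to back, with stage 2 starting only after stage 1 has finished. A minimal skeleton with hypothetical stand-in step functions:

```python
# Two-stage schedule as a runnable skeleton: each stage runs to completion
# before the next begins. The step functions are stand-ins for real
# optimizer steps on the stage-1 and stage-2 models.

def train_stage(step_fn, dataset, epochs):
    """Run one stage's loop to completion."""
    for _ in range(epochs):
        for batch in dataset:
            step_fn(batch)

log = []
batches = ["batch_0", "batch_1"]

# Stage 1: audio -> discrete token model.
train_stage(lambda b: log.append(("stage1", b)), batches, epochs=1)
# Stage 2: discrete token -> continuous latent model, only after stage 1.
train_stage(lambda b: log.append(("stage2", b)), batches, epochs=1)
```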
## Getting Started

### Prerequisites
```bash
# List your dependencies here
pip install -r requirements.txt
```

### Training Pipeline

1. **Train DAC Codec** (if not using pretrained)
   ```bash
   # Add training command
   ```

2. **Train DAC-VAE**
   ```bash
   python train_dac_vae.py --config configs/dac_vae.yaml
   ```

3. **Extract Features**
   ```bash
   # Add feature extraction commands
   ```

4. **Train MiniMax-Speech**
   ```bash
   # Add main training command
   ```

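The `configs/dac_vae.yaml` referenced in step 2 is not shown in this README. As a guess at the kind of fields such a config might carry, here it is written as the equivalent Python dict with a small sanity check — every field name below is an assumption, not the repository's actual schema:

```python
# Hypothetical sketch of what configs/dac_vae.yaml might contain, expressed
# as a Python dict. Field names are assumptions, not the real schema.
dac_vae_config = {
    "sample_rate": 24000,    # the README targets 24kHz audio
    "latent_dim": 64,        # size of the continuous latent per frame
    "kl_weight": 1e-2,       # weight on the KL term of the VAE loss
    "batch_size": 16,
    "learning_rate": 1e-4,
    "epochs": 100,
}

def validate(config, required=("sample_rate", "latent_dim", "kl_weight")):
    """Fail fast if a config is missing the fields training would need."""
    missing = [k for k in required if k not in config]
    if missing:
        raise KeyError(f"config missing fields: {missing}")
    return True
```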
## Project Structure

```
minimax-speech/
├── assets/
│   └── image.png
├── configs/
│   └── dac_vae.yaml
├── models/
│   ├── dac_codec/
│   └── dac_vae/
├── cosyvoice/          # Components from CosyVoice2
│   ├── flow/
│   ├── transformer/
│   └── utils/
├── train_dac_vae.py
└── README.md
```

## Related Projects

This implementation builds upon several key projects:

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **MiniMax-Speech**: Original technical report and methodology

## Citation

If you use this code in your research, please cite:

```bibtex
@article{minimax-speech,
  title={MiniMax-Speech},
  author={MiniMax team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```

## License

This project follows the licensing terms of its dependencies:
- CosyVoice2 components: [CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- DAC components: [MIT License](https://github.com/descriptinc/descript-audio-codec/blob/main/LICENSE)
- Original contributions: [Specify your license here]

## Acknowledgments

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: For the DAC implementation
- **MiniMax team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Contact

[Your contact information or links to issues/discussions]

assets/image.png ADDED

Git LFS details
- SHA256: f10f503661fd5331b31f6a2450391c12df4042ae1b7333d8b4c8646852d2ebae
- Pointer size: 130 bytes
- Size of remote file: 32.6 kB