
MiniMax-Speech Technical Implementation

An unofficial implementation based on the MiniMax-Speech technical report, with core components adapted from CosyVoice2.

MiniMax-Speech Architecture

Overview

This repository provides an implementation of the MiniMax-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

Key Features

  • 24kHz audio support: High-quality audio generation at a 24 kHz sampling rate
  • FSQ tokenizer training: Trains the FSQ tokenizer (S3Tokenizer) from scratch
  • Two-stage architecture: Training pipeline split into discrete and continuous representations
  • Modular design: Separate components for the audio codec and the variational autoencoder
  • CosyVoice2 decoder: Reuses proven components from the CosyVoice2 decoder framework
  • Flow matching AE: Flow matching training for the autoencoder
  • Immiscible assignment: Supports immiscible noise assignment during training
  • Contrastive flow matching: Supports a contrastive flow matching training objective
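
The immiscible assignment feature above pairs each data sample with its "closest" noise sample in the batch before noising, instead of drawing noise independently. As a minimal sketch (the function name and batch shapes are illustrative, not the repository's API), this can be done with a linear assignment over pairwise distances:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def immiscible_assign(x, noise):
    """Re-order a batch of noise samples so that each data sample is
    paired with a noise sample minimizing the total squared distance."""
    xf = x.reshape(len(x), -1)
    nf = noise.reshape(len(noise), -1)
    # Pairwise squared L2 distances between all data/noise pairs
    dist = ((xf[:, None, :] - nf[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(dist)   # optimal matching
    return noise[col]                        # same noise set, re-ordered

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
noise = rng.standard_normal((8, 4))
paired = immiscible_assign(x, noise)
```

The re-ordered noise is then used in place of the independent draw when constructing the noisy training input.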

Architecture

Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
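
The core FSQ operation can be sketched as follows: each latent dimension is bounded, then rounded to a small fixed grid of levels (odd level counts shown here for simplicity; the level values and shapes are illustrative, not the repository's configuration):

```python
import numpy as np

def fsq_quantize(z, levels=(5, 5, 5, 5)):
    """Finite Scalar Quantization: bound each latent dim with tanh,
    round to one of `levels[i]` grid points, renormalize to [-1, 1].
    (Training uses a straight-through estimator; plain rounding shown.)"""
    levels = np.asarray(levels, dtype=float)
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half   # squash each dim into [-half, half]
    codes = np.round(bounded)     # snap to the nearest integer level
    return codes / half           # back onto a grid inside [-1, 1]

q = fsq_quantize(np.random.default_rng(0).standard_normal((3, 4)))
```

The Cartesian product of the per-dimension grids forms an implicit codebook, so no codebook lookup or commitment loss is needed.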

Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

Note: This implementation uses standard DAC-VAE instead of Flow-VAE.
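
For reference, the VAE side of this stage relies on the standard reparameterization trick and closed-form KL term for diagonal Gaussians; a minimal sketch (function names are illustrative, not the DAC-VAE API):

```python
import numpy as np

def vae_sample(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps keeps sampling
    differentiable with respect to the encoder outputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) per sample, closed form for a diagonal Gaussian."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
```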

Implementation Pipeline

1. Model Training

BPE tokens to FSQ tokens

  • Based on the FSQ tokenizer
  • An autoregressive transformer predicts the FSQ tokens, conditioned on a learnable speaker extractor

FSQ tokens to DAC-VAE latent

  • Based on the CosyVoice2 flow matching decoder
  • Learns continuous latent representations from discrete tokens
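
The flow matching objective used here can be sketched as follows: interpolate linearly between Gaussian noise and a clean latent, and regress the constant velocity along that path (shapes and the function name are illustrative):

```python
import numpy as np

def flow_matching_step(x1, rng):
    """Build one conditional flow-matching training example: a point on
    the straight noise-to-data path and the velocity target at it."""
    x0 = rng.standard_normal(x1.shape)                  # noise endpoint
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1    # linear interpolation at time t
    v_target = x1 - x0              # constant velocity the decoder predicts
    return xt, t, v_target

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 6))
xt, t, v = flow_matching_step(x1, rng)
```

The decoder is trained with an MSE loss between its predicted velocity at `(xt, t)` and `v_target`; at inference, integrating the predicted velocity field from noise recovers the latent.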

2. Feature Extraction

Before training the main model:

  1. Extract discrete tokens using the trained FSQ S3Tokenizer
  2. Generate continuous latent representations using the trained DAC-VAE (a pretrained checkpoint is provided here: DAC-VAE)

3. Two-Stage Training

Train the models sequentially:

  • Stage 1: BPE tokens → Discrete FSQ
  • Stage 2: Discrete FSQ → DAC-VAE Continuous latent space
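
The contrastive flow matching variant listed in the features augments the Stage 2 objective with a repulsive term: the predicted velocity is pulled toward the matched pair's velocity and pushed away from a mismatched (shuffled) pair's velocity. A minimal sketch, with an assumed weight `lam` (the weight value and function name are illustrative):

```python
import numpy as np

def contrastive_fm_loss(v_pred, x0, x1, lam=0.05):
    """Flow-matching MSE plus a contrastive term penalizing similarity
    to the velocity of a mismatched (batch-rolled) target."""
    v_pos = x1 - x0                        # matched target velocity
    v_neg = np.roll(x1, 1, axis=0) - x0    # mismatched pair's velocity
    pos = ((v_pred - v_pos) ** 2).mean()
    neg = ((v_pred - v_neg) ** 2).mean()
    return pos - lam * neg

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))
x1 = rng.standard_normal((4, 3))
loss = contrastive_fm_loss(x1 - x0, x0, x1)
```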

Getting Started

Prerequisites

pip install -r requirements.txt

Training Pipeline

  1. Train the FSQ tokenizer (if not using a pretrained one)

    # Add training command
    
  2. Extract DAC-VAE latents

    cd dac-vae
    python inference.py --checkpoint checkpoint.pt --config config.yml
    
  3. Stage 1: Auto Regressive Transformer

    # Add Stage 1 training command
    
  4. Stage 2: Flow matching decoder

    # Add main training command
    

Project Structure

minimax-speech/
├── assets/
│   └── image.png
├── configs/
│   └── dac_vae.yaml
├── models/
│   ├── fsq/
│   └── dac_vae/
├── cosyvoice/          # Components from CosyVoice2
│   ├── flow/
│   ├── transformer/
│   └── utils/
└── README.md

Related Projects

This implementation builds upon several key projects:

  • CosyVoice2: Core model architectures and training pipelines
  • Descript Audio Codec: Audio tokenization framework
  • MiniMax-Speech: Original technical report and methodology

Citation

If you use this code in your research, please cite:

@article{minimax-speech,
  title={MiniMax-Speech},
  author={MiniMax},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}

License

This project follows the licensing terms of its dependencies (see the CosyVoice2 and Descript Audio Codec repositories).

Acknowledgments

  • CosyVoice2: This implementation extensively uses code and architectures from CosyVoice2
  • FSQ: For the FSQ implementation
  • MiniMax team: For the technical report and methodology
  • FunAudioLLM team: For the excellent CosyVoice2 codebase

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.

Contact

Email: nguyennhutsam.math@gmail.com
LinkedIn: https://www.linkedin.com/in/primepake/