# Learnable-Speech Technical Implementation

An unofficial implementation of Learnable-Speech, building on CosyVoice with a learnable speaker encoder and a DAC-VAE, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

![Architecture](assets/image.png)

## Overview

This repository provides an implementation of the Learnable-Speech model, featuring a two-stage training approach for high-quality 24 kHz audio generation.

## Key Features

- [x] **24 kHz Audio Support**: High-quality audio generation at a 24 kHz sampling rate
- [x] **Flow Matching AE**: Flow-matching training for autoencoders
- [x] **Immiscible Assignment**: Immiscible noise assignment during training
- [x] **Contrastive Flow Matching**: Contrastive flow-matching training
- [ ] **Checkpoint Release**: Release of the LLM and contrastive FM checkpoints

## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (finite scalar quantization) framework from S3Tokenizer.

### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a variational autoencoder (VAE).

> **Note**: This implementation uses a standard DAC-VAE instead of Flow-VAE.

## Implementation Pipeline

### 1. Model Training

#### BPE tokens to FSQ tokens

- Based on FSQ
- An autoregressive transformer with a learnable speaker extractor predicts the FSQ tokens

#### FSQ tokens to DAC-VAE latents

- Based on the CosyVoice2 flow-matching decoder
- Learns continuous latent representations from discrete tokens

### 2. Feature Extraction

Before training the main model:

1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided at [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)

### 3. Two-Stage Training

Train the models sequentially:

- **Stage 1**: BPE tokens → discrete FSQ tokens
- **Stage 2**: discrete FSQ tokens → DAC-VAE continuous latent space

## Getting Started

### Prerequisites

```bash
pip install -r requirements.txt
```

### Training Pipeline

1. **Extract FSQ tokens**

```bash
pip install s3tokenizer

s3tokenizer --wav_scp data.scp \
    --device "cuda" \
    --output_dir "./data" \
    --batch_size 32 \
    --model "speech_tokenizer_v2_25hz"
```

Alternatively, install the copy bundled with this repo. It reads `filelist.txt`, where each line contains an audio file path (see `files_test.txt` for an example):

```bash
cd speech/tools/S3Tokenizer
pip3 install .

# example command to run
torchrun --nproc_per_node=4 --nnodes=1 \
    --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    `which s3tokenizer` --root_path /data/dataset/ \
    --model speech_tokenizer_v2_25hz \
    --device "cuda" \
    --batch_size 64 \
    --file_list /speech/files_test.txt \
    --skip_existing
```

2. **Extract DAC-VAE latents**

```bash
cd dac-vae
python extract_dac_latents.py --checkpoint checkpoint.pt --config config.yml --root_path dataset --output_dir dataset/dac
```

After processing, the dataset root should contain the following files:

```
dataset_root/
├── audio_name.wav
├── audio_name.txt
├── audio_name_fsq.pt
├── audio_name_latent.pt
├── another_audio.wav
├── another_audio.txt
├── another_audio_fsq.pt
├── another_audio_latent.pt
└── ...
```
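For orientation, the sketch below shows one way the extracted pairs could be loaded; it is a hypothetical illustration, not code from this repo (the class name, and the assumption that each `.pt` file holds a single ready-to-use tensor, are ours):

```python
# Hypothetical sketch: pair the *_fsq.pt and *_latent.pt files produced
# by steps 1 and 2 above into (tokens, latent, text) training triples.
from pathlib import Path

import torch
from torch.utils.data import Dataset


class FsqLatentDataset(Dataset):
    """Yields (fsq_tokens, dac_latent, text) per utterance in dataset_root/."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.stems = []
        for p in sorted(self.root.glob("*_fsq.pt")):
            stem = p.name.removesuffix("_fsq.pt")
            # Keep only utterances for which both features were extracted.
            if (self.root / f"{stem}_latent.pt").exists():
                self.stems.append(stem)

    def __len__(self):
        return len(self.stems)

    def __getitem__(self, idx):
        stem = self.stems[idx]
        fsq = torch.load(self.root / f"{stem}_fsq.pt")        # discrete FSQ tokens
        latent = torch.load(self.root / f"{stem}_latent.pt")  # DAC-VAE latent
        text = (self.root / f"{stem}.txt").read_text().strip()
        return fsq, latent, text
```

Sequence lengths differ per utterance, so real batching would also need a padding collate function; note that the training scripts below instead read a prepared `data/data.list` manifest.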
3. **Stage 1: Autoregressive Transformer**

```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B

export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=llm

torchrun --nnodes=1 --nproc_per_node=$num_gpus \
    --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine $train_engine \
    --config config.yaml \
    --train_data data/data.list \
    --cv_data data/data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --model $model \
    --model_dir /data/checkpoint/$model/ \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --comet_disabled
```

4. **Stage 2: Flow Matching Decoder**

```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B

export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=flow  # stage 2 trains the flow-matching decoder

torchrun --nnodes=1 --nproc_per_node=$num_gpus \
    --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine $train_engine \
    --config config.yaml \
    --train_data data/data.list \
    --cv_data data/data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --model $model \
    --model_dir /data/checkpoint/$model/ \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --comet_disabled
```

## Project Structure

```
minimax-speech/
├── assets/
├── dac-vae/
├── flowae/
├── speech/
│   ├── llm/
│   └── flow/
└── README.md
```

## Related Projects

This implementation builds upon several key projects:

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **Learnable-Speech**: Original technical report and methodology

## Citation

If you use this code in your research, please cite:

```bibtex
@article{minimax-speech,
  title={Learnable-Speech},
  author={Learnable team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```

## License

This project follows the licensing terms of its dependencies:

- CosyVoice2 components: [CosyVoice2 license](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 license](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)

## Acknowledgments

- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- **Learnable team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.