# Learnable-Speech Technical Implementation
An unofficial implementation of Learnable-Speech, building on CosyVoice with a learnable encoder and DAC-VAE, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

## Overview
This repository provides an implementation of the Learnable-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.
## Key Features
- [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- [x] **Flow matching AE**: Flow-matching training for autoencoders
- [x] **Immiscible assignment**: Immiscible noise assignment during flow-matching training (see the sketch after this list)
- [x] **Contrastive Flow matching**: Contrastive flow-matching training objective
- [ ] **Checkpoint release**: Release of the LLM and contrastive FM checkpoints
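To make the last two features concrete, here is a minimal sketch of both tricks for a velocity-prediction flow-matching model. It is illustrative only: the function names, the model signature, and the loss weight `lam` are assumptions, not this repository's actual API.
```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def immiscible_noise(x1: torch.Tensor) -> torch.Tensor:
    """Pair each data sample with the nearest noise sample in the batch.

    A batch-level linear assignment keeps data-noise pairs "immiscible",
    so flow trajectories cross less than with arbitrary pairing.
    """
    x0 = torch.randn_like(x1)                          # candidate noise batch
    cost = torch.cdist(x1.flatten(1), x0.flatten(1))   # (B, B) pairwise L2 costs
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return x0[cols]

def contrastive_fm_loss(model, x1, cond, lam=0.05):
    """Flow matching with a contrastive term (one published formulation):
    match the velocity toward the true target while pushing the prediction
    away from velocities toward mismatched in-batch targets."""
    b = x1.size(0)
    x0 = immiscible_noise(x1)
    t = torch.rand(b, device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1                          # linear interpolation path
    v_pred = model(xt, t.flatten(), cond)               # hypothetical model signature
    v_pos = x1 - x0                                     # velocity toward true target
    v_neg = x1[torch.randperm(b, device=x1.device)] - x0  # toward mismatched target
    return F.mse_loss(v_pred, v_pos) - lam * F.mse_loss(v_pred, v_neg)
```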
## Architecture
### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
> **Note**: This implementation uses a standard DAC-VAE instead of Flow-VAE.
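End to end, the pipeline therefore runs roughly as:
```
text (BPE tokens)
  ──Stage 1: autoregressive transformer──▶ FSQ speech tokens
  ──Stage 2: flow-matching decoder──▶ DAC-VAE latents
  ──DAC-VAE decoder──▶ 24kHz waveform
```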
## Implementation Pipeline
### 1. Model Training
#### BPE tokens to FSQ tokens
- Based on FSQ (finite scalar quantization)
- Uses an autoregressive transformer with a learnable speaker extractor to predict the FSQ tokens
#### FSQ tokens to DAC-VAE latent
- Based on the CosyVoice2 flow-matching decoder
- Learns continuous latent representations from discrete tokens
### 2. Feature Extraction
Before training the main model:
1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided at [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
### 3. Two-Stage Training
Train the models sequentially:
- **Stage 1**: BPE tokens → Discrete FSQ tokens
- **Stage 2**: Discrete FSQ tokens → DAC-VAE continuous latent space
## Getting Started
### Prerequisites
```bash
pip install -r requirements.txt
```
### Training Pipeline
1. **Extracting FSQ**
```bash
pip install s3tokenizer
s3tokenizer --wav_scp data.scp \
            --device "cuda" \
            --output_dir "./data" \
            --batch_size 32 \
            --model "speech_tokenizer_v2_25hz"
```
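Here, `data.scp` is expected to follow the Kaldi `wav.scp` convention, one `utt_id path` pair per line (the IDs and paths below are illustrative):
```
utt_0001 /data/dataset/audio_name.wav
utt_0002 /data/dataset/another_audio.wav
```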
Alternatively, install the tokenizer from this repository. It uses `filelist.txt` for extraction, where each line contains an audio file path (see `files_test.txt` for an example):
```bash
cd speech/tools/S3Tokenizer
pip3 install .
# example command to run
torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" `which s3tokenizer` \
    --root_path /data/dataset/ \
    --model speech_tokenizer_v2_25hz \
    --device "cuda" \
    --batch_size 64 \
    --file_list /speech/files_test.txt \
    --skip_existing
```
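For reference, `filelist.txt` is plain text with one audio path per line (paths illustrative):
```
/data/dataset/audio_name.wav
/data/dataset/another_audio.wav
```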
2. **Extracting DAC-VAE latent**
```bash
cd dac-vae
python extract_dac_latents.py \
    --checkpoint checkpoint.pt \
    --config config.yml \
    --root_path dataset \
    --output_dir dataset/dac
```
After processing, your dataset root should contain the following files:
```
dataset_root/
├── audio_name.wav
├── audio_name.txt
├── audio_name_fsq.pt
├── audio_name_latent.pt
├── another_audio.wav
├── another_audio.txt
├── another_audio_fsq.pt
├── another_audio_latent.pt
└── ...
```
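As a quick sanity check before training, the extracted features can be loaded directly with `torch.load`. The comments on shapes below are assumptions (25Hz FSQ token rate, frame-level DAC-VAE latents), not guarantees of this repository:
```python
# Hypothetical sanity check for one utterance's extracted features.
import torch

fsq = torch.load("dataset_root/audio_name_fsq.pt", map_location="cpu")
latent = torch.load("dataset_root/audio_name_latent.pt", map_location="cpu")
print("FSQ tokens:", tuple(fsq.shape))         # discrete token ids, ~25 per second
print("DAC-VAE latent:", tuple(latent.shape))  # continuous latent frames
```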
3. **Stage 1: Autoregressive Transformer**
```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=llm
torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
  train.py \
  --train_engine $train_engine \
  --config config.yaml \
  --train_data data/data.list \
  --cv_data data/data.list \
  --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
  --model $model \
  --model_dir /data/checkpoint/$model/ \
  --num_workers ${num_workers} \
  --prefetch ${prefetch} \
  --pin_memory \
  --use_amp \
  --comet_disabled
```
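To train on more GPUs, list them in `CUDA_VISIBLE_DEVICES` (e.g. `"0,1,2,3"`); `num_gpus` is derived from that list, and `torchrun` spawns one process per GPU.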
4. **Stage 2: Flow Matching Decoder**
The launch script mirrors Stage 1; `model=flow` selects the flow-matching decoder instead of the LLM:
```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=flow
torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
  train.py \
  --train_engine $train_engine \
  --config config.yaml \
  --train_data data/data.list \
  --cv_data data/data.list \
  --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
  --model $model \
  --model_dir /data/checkpoint/$model/ \
  --num_workers ${num_workers} \
  --prefetch ${prefetch} \
  --pin_memory \
  --use_amp \
  --comet_disabled
```
## Project Structure
```
minimax-speech/
├── assets/
├── dac-vae/
├── flowae/
├── speech/
│   ├── llm/
│   └── flow/
└── README.md
```
## Related Projects
This implementation builds upon several key projects:
- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- **Learnable-Speech**: Original technical report and methodology
## Citation
If you use this code in your research, please cite:
```bibtex
@article{minimax-speech,
  title={Learnable-Speech},
  author={Learnable team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}
@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```
## License
This project follows the licensing terms of its dependencies:
- CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
## Acknowledgments
- **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- **Learnable team**: For the technical report and methodology
- **FunAudioLLM team**: For the excellent CosyVoice2 codebase
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.