CosyVoice3-0.5B-Candle

This is a Candle-compatible version of FunAudioLLM/CosyVoice3-0.5B, converted to safetensors format for use with the Candle framework in Rust.

Model Description

CosyVoice3 is a state-of-the-art text-to-speech (TTS) model developed by FunAudioLLM. It features:

Zero-shot voice cloning: Clone any voice with just a few seconds of reference audio
Streaming inference: Real-time speech synthesis with low latency
High quality: Natural prosody and expression
Multilingual: Supports Chinese and English

Architecture

The model consists of three main components:

Component	Description	Parameters
LLM	Qwen2-based language model (0.5B)	642M
Flow Decoder	DiT + Conditional Flow Matching	332M
HiFT Vocoder	Neural Source Filter + iSTFT	21M

Model Files

CosyVoice3-0.5B-Candle/
├── llm.safetensors          # LLM weights (2.4 GB)
├── flow.safetensors         # Flow decoder weights (1.3 GB)
├── hift.safetensors         # Vocoder weights (79 MB)
├── campplus.onnx            # Speaker encoder (27 MB)
├── speech_tokenizer_v3.onnx # Speech tokenizer (925 MB)
├── config.json              # Model configuration
└── tokenizer/               # Qwen2 tokenizer files
    ├── config.json
    ├── generation_config.json
    ├── tokenizer_config.json
    ├── vocab.json
    └── merges.txt
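
Before loading anything, it can help to confirm that a local copy of the repository is complete. The following is a minimal sketch using only the Rust standard library; `missing_files` and `EXPECTED_FILES` are hypothetical names introduced for illustration and are not part of Candle, and the directory name in `main` assumes the repository was downloaded into the current working directory.

```rust
use std::path::{Path, PathBuf};

/// Files from the listing above, relative to the model directory.
const EXPECTED_FILES: [&str; 11] = [
    "llm.safetensors",
    "flow.safetensors",
    "hift.safetensors",
    "campplus.onnx",
    "speech_tokenizer_v3.onnx",
    "config.json",
    "tokenizer/config.json",
    "tokenizer/generation_config.json",
    "tokenizer/tokenizer_config.json",
    "tokenizer/vocab.json",
    "tokenizer/merges.txt",
];

/// Returns every expected file that is not present under `model_dir`.
/// Hypothetical helper, not part of Candle.
fn missing_files(model_dir: &Path) -> Vec<PathBuf> {
    EXPECTED_FILES
        .iter()
        .map(|rel| model_dir.join(rel))
        .filter(|path| !path.is_file())
        .collect()
}

fn main() {
    // Assumed location of the downloaded repository.
    let model_dir = Path::new("CosyVoice3-0.5B-Candle");
    let missing = missing_files(model_dir);
    if missing.is_empty() {
        println!("all expected files are present");
    } else {
        for path in &missing {
            eprintln!("missing: {}", path.display());
        }
    }
}
```

The check only tests for presence; it does not validate file sizes or contents.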

Usage with Candle

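// Inside the Candle workspace the core crate is imported as `candle`;
// in a standalone project it is usually named `candle_core`.
use candle::{DType, Device, Result};
use candle_nn::VarBuilder;

fn main() -> Result<()> {
    // Use the first CUDA device if available, otherwise fall back to the CPU.
    let device = Device::cuda_if_available(0)?;
    let dtype = DType::F32;

    // Load model weights (paths are relative to the current directory).
    // Memory-mapping is unsafe because the files must not change while mapped.
    let llm_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["llm.safetensors"], dtype, &device)?
    };
    let flow_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["flow.safetensors"], dtype, &device)?
    };
    let hift_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["hift.safetensors"], dtype, &device)?
    };

    // Initialize model components from the three VarBuilders
    // (See candle-transformers/src/models/cosyvoice for full implementation)

    Ok(())
}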

Conversion Details

This model was converted from the original PyTorch weights using the following process:

LLM weights: Direct conversion with key renaming
Flow weights: Direct conversion with DiT key mapping
HiFT weights: Weight normalization fused into plain weights (w = g * v / ||v||), followed by conversion

The conversion script is available at: candle-examples/examples/cosyvoice3/convert_weights.py
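
To illustrate the weight norm fusion applied to the HiFT weights, here is a minimal sketch of w = g * v / ||v|| on flat `f32` slices. It assumes the norm is taken per output channel over all remaining dimensions (the PyTorch default for weight normalization); the actual conversion script may organize this differently. `fuse_weight_norm` is a hypothetical name, and the function does not guard against all-zero rows, which would produce NaN.

```rust
/// Fuses weight-norm parameters into a plain weight: w = g * v / ||v||.
/// `v` is row-major with one row of `row_len` values per output channel;
/// `g` holds one gain per output channel. Hypothetical helper for illustration.
fn fuse_weight_norm(g: &[f32], v: &[f32], row_len: usize) -> Vec<f32> {
    assert!(row_len > 0, "row_len must be non-zero");
    assert_eq!(v.len(), g.len() * row_len, "shape mismatch between g and v");
    let mut w = Vec::with_capacity(v.len());
    for (row, &gain) in v.chunks(row_len).zip(g.iter()) {
        // L2 norm of this output channel's direction vector.
        let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
        w.extend(row.iter().map(|x| gain * x / norm));
    }
    w
}

fn main() {
    // Two output channels with two weights each.
    let v = [3.0_f32, 4.0, 0.0, 2.0];
    let g = [10.0_f32, 1.0];
    let w = fuse_weight_norm(&g, &v, 2);
    println!("{:?}", w); // prints [6.0, 8.0, 0.0, 1.0]
}
```

After fusion only `w` needs to be stored, so the inference code can load an ordinary convolution weight without the separate `g` and `v` tensors.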

Technical Specifications

Parameter	Value
Sample Rate	24,000 Hz
Token Frame Rate	25 fps
Mel Channels	80
DiT Depth	22 layers
DiT Dimension	1024
DiT Heads	16
Upsample Rates	[8, 5, 3] (120x total)
iSTFT n_fft	16
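
A few quantities follow directly from the table above. The sketch below derives them with plain arithmetic; the constant and function names are illustrative, and the per-head dimension assumes the DiT dimension is split evenly across heads.

```rust
const SAMPLE_RATE_HZ: u32 = 24_000;
const TOKEN_FRAME_RATE_FPS: u32 = 25;
const UPSAMPLE_RATES: [u32; 3] = [8, 5, 3];
const DIT_DIM: u32 = 1024;
const DIT_HEADS: u32 = 16;

/// Audio samples generated per speech token.
fn samples_per_token() -> u32 {
    SAMPLE_RATE_HZ / TOKEN_FRAME_RATE_FPS
}

/// Combined factor of the listed upsample stages.
fn total_upsample() -> u32 {
    UPSAMPLE_RATES.iter().product()
}

/// Per-head dimension, assuming an even split across heads.
fn dit_head_dim() -> u32 {
    DIT_DIM / DIT_HEADS
}

/// Audio duration in seconds for a given number of speech tokens.
fn duration_secs(num_tokens: u32) -> f32 {
    num_tokens as f32 / TOKEN_FRAME_RATE_FPS as f32
}

fn main() {
    println!("samples per token: {}", samples_per_token()); // 960
    println!("total upsample: {}x", total_upsample()); // 120x
    println!("DiT head dim: {}", dit_head_dim()); // 64
    println!("250 tokens ~ {} s", duration_secs(250)); // 10 s
}
```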

Limitations

This is an inference-only conversion; training is not supported
The ONNX models (campplus.onnx, speech_tokenizer_v3.onnx) are not converted to safetensors and require candle-onnx
Speed and numerical output may differ from the original PyTorch implementation

Citation

@article{cosyvoice,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer},
  author={FunAudioLLM Team},
  year={2024}
}

License

This model is released under the Apache 2.0 License, following the original CosyVoice3 model.

Acknowledgments

FunAudioLLM for the original CosyVoice3 model
Hugging Face for the Candle framework
