CosyVoice3-0.5B-Candle
This is a Candle-compatible version of FunAudioLLM/CosyVoice3-0.5B, converted to safetensors format for use with the Candle framework in Rust.
Model Description
CosyVoice3 is a state-of-the-art text-to-speech (TTS) model developed by FunAudioLLM. It features:
- Zero-shot voice cloning: Clone any voice with just a few seconds of reference audio
- Streaming inference: Real-time speech synthesis with low latency
- High quality: Natural prosody and expression
- Multilingual: Supports Chinese and English
Architecture
The model consists of three main components:
| Component | Description | Parameters |
|---|---|---|
| LLM | Qwen2-based language model (0.5B) | 642M |
| Flow Decoder | DiT + Conditional Flow Matching | 332M |
| HiFT Vocoder | Neural Source Filter + iSTFT | 21M |
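As a rough guide to memory requirements, the parameter counts in the table can be summed and multiplied by the element width. The sketch below is only an estimate: it uses the rounded counts from the table and ignores activations, caches, and the ONNX models. The constant and function names are illustrative and not part of Candle.

```rust
/// Parameter counts in millions, copied from the table above (rounded values).
const PARAMS_MILLIONS: [(&str, u64); 3] = [
    ("LLM", 642),
    ("Flow Decoder", 332),
    ("HiFT Vocoder", 21),
];

/// Total parameter count in millions across the three components.
fn total_params_millions() -> u64 {
    PARAMS_MILLIONS.iter().map(|(_, n)| n).sum()
}

/// Approximate weight memory in decimal gigabytes.
/// `bytes_per_param` is 4 for F32 and 2 for F16/BF16.
fn weight_gigabytes(bytes_per_param: u64) -> f64 {
    (total_params_millions() * 1_000_000 * bytes_per_param) as f64 / 1e9
}
```

With these figures, `total_params_millions()` gives 995, and `weight_gigabytes(4)` gives about 3.98 GB of weights at F32.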
Model Files
CosyVoice3-0.5B-Candle/
├── llm.safetensors              # LLM weights (2.4 GB)
├── flow.safetensors             # Flow decoder weights (1.3 GB)
├── hift.safetensors             # Vocoder weights (79 MB)
├── campplus.onnx                # Speaker encoder (27 MB)
├── speech_tokenizer_v3.onnx     # Speech tokenizer (925 MB)
├── config.json                  # Model configuration
└── tokenizer/                   # Qwen2 tokenizer files
    ├── config.json
    ├── generation_config.json
    ├── tokenizer_config.json
    ├── vocab.json
    └── merges.txt
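Before loading anything, it can help to confirm that a download is complete. The following is a minimal sketch that checks a local directory for the top-level files listed above. `missing_files` and `EXPECTED_FILES` are hypothetical names introduced for illustration, and the check does not look inside `tokenizer/`.

```rust
use std::path::Path;

/// Top-level files from the layout above.
const EXPECTED_FILES: [&str; 6] = [
    "llm.safetensors",
    "flow.safetensors",
    "hift.safetensors",
    "campplus.onnx",
    "speech_tokenizer_v3.onnx",
    "config.json",
];

/// Returns the expected files that are not present in `model_dir`.
/// Hypothetical helper for illustration; it is not part of Candle.
fn missing_files(model_dir: &Path) -> Vec<&'static str> {
    EXPECTED_FILES
        .iter()
        .copied()
        .filter(|name| !model_dir.join(name).is_file())
        .collect()
}
```

Calling `missing_files(Path::new("CosyVoice3-0.5B-Candle"))` returns an empty vector when every listed file is present, which makes it easy to report a clear error before any weights are memory-mapped.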
Usage with Candle
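// Outside the Candle workspace the core crate is imported as `candle_core`;
// inside the workspace it is aliased to `candle`.
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;

fn main() -> Result<()> {
    // Load model weights
    let device = Device::cuda_if_available(0)?;
    let dtype = DType::F32;

    // Memory-mapping is `unsafe` because the files must not be modified
    // while they are mapped.
    let llm_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["llm.safetensors"], dtype, &device)?
    };
    let flow_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["flow.safetensors"], dtype, &device)?
    };
    let hift_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["hift.safetensors"], dtype, &device)?
    };

    // Initialize model components from llm_weights, flow_weights, and hift_weights
    // (See candle-transformers/src/models/cosyvoice for full implementation)
    Ok(())
}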
Conversion Details
This model was converted from the original PyTorch weights using the following process:
- LLM weights: Direct conversion with key renaming
- Flow weights: Direct conversion with DiT key mapping
- HiFT weights: Weight norm fusion (g * v / ||v||) + conversion
The conversion script is available at: candle-examples/examples/cosyvoice3/convert_weights.py
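To illustrate the weight norm fusion step for the HiFT weights, here is a minimal sketch of the formula g * v / ||v|| on plain vectors. It is not the conversion script itself. It assumes the L2 norm is taken per output channel (the usual default for weight normalization), with each row of `v` holding one flattened output channel. `fuse_weight_norm` is a hypothetical name.

```rust
/// Fuses weight-norm parameters into a plain weight: w = g * v / ||v||.
///
/// `v` holds one flattened row per output channel and `g` one gain per row.
/// Assumption: the L2 norm is computed separately for each output channel.
fn fuse_weight_norm(g: &[f32], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    assert_eq!(g.len(), v.len(), "expected one gain per output channel");
    g.iter()
        .zip(v)
        .map(|(&gain, row)| {
            // L2 norm of this output channel's direction vector.
            let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            // No epsilon is added here; an all-zero row would divide by zero.
            row.iter().map(|x| gain * x / norm).collect()
        })
        .collect()
}
```

For example, with `g = [10.0]` and a single row `v = [3.0, 4.0]`, the norm is 5 and the fused row is `[6.0, 8.0]`. After fusion only the single weight tensor needs to be stored, so the inference code does not have to recompute the normalization.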
Technical Specifications
| Parameter | Value |
|---|---|
| Sample Rate | 24,000 Hz |
| Token Frame Rate | 25 fps |
| Mel Channels | 80 |
| DiT Depth | 22 layers |
| DiT Dimension | 1024 |
| DiT Heads | 16 |
| Upsample Rates | [8, 5, 3] (120x total) |
| iSTFT n_fft | 16 |
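A few of these values can be related with simple arithmetic. The sketch below assumes that the token frame rate is the number of speech tokens per second of output audio. The constant and function names are illustrative only.

```rust
const SAMPLE_RATE_HZ: u32 = 24_000;
const TOKEN_RATE_HZ: u32 = 25;
const UPSAMPLE_RATES: [u32; 3] = [8, 5, 3];

/// Audio samples covered by one speech token: 24,000 / 25.
fn samples_per_token() -> u32 {
    SAMPLE_RATE_HZ / TOKEN_RATE_HZ
}

/// Product of the vocoder upsample rates listed in the table.
fn total_upsample() -> u32 {
    UPSAMPLE_RATES.iter().product()
}

/// Duration in seconds of the audio generated from `n_tokens` speech tokens.
fn duration_seconds(n_tokens: u32) -> f64 {
    f64::from(n_tokens) / f64::from(TOKEN_RATE_HZ)
}
```

Under that assumption each token corresponds to 960 samples (40 ms), the upsample rates multiply to 120 as stated in the table, and 250 tokens correspond to 10 seconds of audio.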
Limitations
- This is an inference-only conversion; training is not supported
- ONNX models (campplus, speech_tokenizer) require candle-onnx
- Performance may vary compared to the original PyTorch implementation
Citation
@article{cosyvoice,
title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer},
author={FunAudioLLM Team},
year={2024}
}
License
This model is released under the Apache 2.0 License, following the original CosyVoice3 model.
Acknowledgments
- FunAudioLLM for the original CosyVoice3 model
- Hugging Face for the Candle framework