CosyVoice3-0.5B-Candle

This is a Candle-compatible version of FunAudioLLM/CosyVoice3-0.5B, converted to safetensors format for use with the Candle framework in Rust.

Model Description

CosyVoice3 is a state-of-the-art text-to-speech (TTS) model developed by FunAudioLLM. It features:

  • Zero-shot voice cloning: Clone any voice with just a few seconds of reference audio
  • Streaming inference: Real-time speech synthesis with low latency
  • High quality: Natural prosody and expression
  • Multilingual: Supports Chinese and English

Architecture

The model consists of three main components:

Component      Description                          Parameters
LLM            Qwen2-based language model (0.5B)    642M
Flow Decoder   DiT + Conditional Flow Matching      332M
HiFT Vocoder   Neural Source Filter + iSTFT         21M

Model Files

CosyVoice3-0.5B-Candle/
├── llm.safetensors          # LLM weights (2.4 GB)
├── flow.safetensors         # Flow decoder weights (1.3 GB)
├── hift.safetensors         # Vocoder weights (79 MB)
├── campplus.onnx            # Speaker encoder (27 MB)
├── speech_tokenizer_v3.onnx # Speech tokenizer (925 MB)
├── config.json              # Model configuration
└── tokenizer/               # Qwen2 tokenizer files
    ├── config.json
    ├── generation_config.json
    ├── tokenizer_config.json
    ├── vocab.json
    └── merges.txt
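
As a rough sanity check, the safetensors sizes listed above are close to what the parameter counts in the Architecture table would occupy at 4 bytes per parameter (F32). The sketch below does that arithmetic. The helper name `f32_size_gib` is hypothetical, the parameter counts are the rounded figures from the table, and the safetensors header is ignored, so expect only approximate agreement.

```rust
/// Approximate size in GiB of `params` parameters stored as F32.
/// Hypothetical helper for illustration; ignores the safetensors header.
fn f32_size_gib(params: u64) -> f64 {
    const BYTES_PER_F32: u64 = 4;
    (params * BYTES_PER_F32) as f64 / (1u64 << 30) as f64
}

fn main() {
    // Parameter counts from the Architecture table, rounded to millions.
    let components = [
        ("llm", 642_000_000u64),
        ("flow", 332_000_000),
        ("hift", 21_000_000),
    ];
    for (name, params) in components {
        println!("{name}.safetensors ~ {:.2} GiB", f32_size_gib(params));
    }
}
```

Loading with a smaller dtype would reduce the in-memory footprint accordingly, but whether the model behaves well below F32 is not something this card states.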

Usage with Candle

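// Outside the Candle workspace the core crate is `candle_core`;
// inside the workspace it is aliased as `candle`.
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;

fn main() -> Result<()> {
    // Use the first CUDA device if available, otherwise fall back to CPU.
    let device = Device::cuda_if_available(0)?;
    let dtype = DType::F32;

    // Load model weights. The files are memory-mapped, which is why the
    // calls are `unsafe`: the files must not change while they are in use.
    let llm_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["llm.safetensors"], dtype, &device)?
    };
    let flow_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["flow.safetensors"], dtype, &device)?
    };
    let hift_weights = unsafe {
        VarBuilder::from_mmaped_safetensors(&["hift.safetensors"], dtype, &device)?
    };

    // Initialize model components from llm_weights, flow_weights, hift_weights
    // (See candle-transformers/src/models/cosyvoice for full implementation)
    Ok(())
}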
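
Memory-mapping fails with a fairly low-level error when a file is absent, so it can help to check the directory first. The following is a minimal sketch using only the standard library. The function `missing_files` and the directory name in `main` are hypothetical, and the file list is taken from the Model Files listing (tokenizer files omitted for brevity).

```rust
use std::path::{Path, PathBuf};

/// Top-level files from the Model Files listing.
const REQUIRED_FILES: [&str; 6] = [
    "llm.safetensors",
    "flow.safetensors",
    "hift.safetensors",
    "campplus.onnx",
    "speech_tokenizer_v3.onnx",
    "config.json",
];

/// Returns the required files that are not present in `model_dir`.
fn missing_files(model_dir: &Path) -> Vec<PathBuf> {
    REQUIRED_FILES
        .iter()
        .map(|name| model_dir.join(name))
        .filter(|path| !path.is_file())
        .collect()
}

fn main() {
    // Hypothetical location; adjust to wherever the repository was downloaded.
    let model_dir = Path::new("CosyVoice3-0.5B-Candle");
    let missing = missing_files(model_dir);
    if missing.is_empty() {
        println!("all model files found");
    } else {
        for path in &missing {
            eprintln!("missing: {}", path.display());
        }
    }
}
```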

Conversion Details

This model was converted from the original PyTorch weights using the following process:

  1. LLM weights: Direct conversion with key renaming
  2. Flow weights: Direct conversion with DiT key mapping
  3. HiFT weights: Weight norm fusion (g * v / ||v||) + conversion

The conversion script is available at: candle-examples/examples/cosyvoice3/convert_weights.py
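
To illustrate the weight norm fusion in step 3, here is a minimal sketch of the formula w = g * v / ||v|| on plain slices. It assumes one norm per output channel, computed over all remaining dimensions (the PyTorch default for weight norm); the actual conversion script is written in Python and may differ in detail. The function name `fuse_weight_norm` is hypothetical.

```rust
/// Fuse weight normalization: w = g * v / ||v||, one norm per output channel.
/// `v` is a row-major [out_channels, fan_in] matrix flattened into a slice,
/// and `g` holds one gain per output channel.
/// A row of all zeros has norm 0 and would produce NaN values.
fn fuse_weight_norm(g: &[f32], v: &[f32]) -> Vec<f32> {
    assert!(!g.is_empty() && !v.is_empty(), "g and v must be non-empty");
    assert!(v.len() % g.len() == 0, "v must contain g.len() rows of equal length");
    let fan_in = v.len() / g.len();
    let mut w = Vec::with_capacity(v.len());
    for (gain, row) in g.iter().zip(v.chunks(fan_in)) {
        // L2 norm of this output channel's direction vector.
        let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
        w.extend(row.iter().map(|x| gain * x / norm));
    }
    w
}

fn main() {
    // Two output channels with two inputs each.
    let g = [10.0, 1.0];
    let v = [3.0, 4.0, 0.0, 2.0];
    println!("{:?}", fuse_weight_norm(&g, &v)); // prints [6.0, 8.0, 0.0, 1.0]
}
```

Fusing at conversion time means the inference code only needs a single weight tensor per layer instead of the separate g and v parameters.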

Technical Specifications

Parameter          Value
Sample Rate        24,000 Hz
Token Frame Rate   25 fps
Mel Channels       80
DiT Depth          22 layers
DiT Dimension      1024
DiT Heads          16
Upsample Rates     [8, 5, 3] (120x total)
iSTFT n_fft        16
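
A few derived quantities follow directly from these values, and the sketch below works them out. The constant and function names are hypothetical; only the numbers come from the table. Note that each speech token covers 960 output samples while the listed upsample rates account for 120x, so the remaining factor of 8 presumably comes from stages not listed here (for example the token-to-mel ratio and the iSTFT hop size); that split is an assumption.

```rust
// Values copied from the Technical Specifications table.
const SAMPLE_RATE_HZ: u32 = 24_000;
const TOKEN_RATE_HZ: u32 = 25;
const UPSAMPLE_RATES: [u32; 3] = [8, 5, 3];
const DIT_DIM: u32 = 1024;
const DIT_HEADS: u32 = 16;

/// Output audio samples produced per speech token.
fn samples_per_token() -> u32 {
    SAMPLE_RATE_HZ / TOKEN_RATE_HZ
}

/// Combined factor of the vocoder upsampling stages.
fn total_upsample() -> u32 {
    UPSAMPLE_RATES.iter().product()
}

/// Width of each attention head in the DiT.
fn head_dim() -> u32 {
    DIT_DIM / DIT_HEADS
}

/// Audio duration in seconds for a given number of speech tokens.
fn duration_secs(num_tokens: u32) -> f32 {
    num_tokens as f32 / TOKEN_RATE_HZ as f32
}

fn main() {
    println!("samples per token: {}", samples_per_token());
    println!("total upsample:    {}x", total_upsample());
    println!("DiT head dim:      {}", head_dim());
    println!("250 tokens:        {} s", duration_secs(250));
}
```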

Limitations

  • This is an inference-only conversion; training is not supported
  • ONNX models (campplus, speech_tokenizer) require candle-onnx
  • Performance may vary compared to the original PyTorch implementation

Citation

@article{cosyvoice,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer},
  author={FunAudioLLM Team},
  year={2024}
}

License

This model is released under the Apache 2.0 License, following the original CosyVoice3 model.

Acknowledgments
