metadata
license: apache-2.0
tags:
- speaker-embedding
- coreml
- apple-silicon
- neural-engine
- cam++
- campplus
language:
- zh
- en
pipeline_tag: audio-classification
CAM++ Speaker Embedding (CoreML)
CoreML-converted CAM++ (Context-Aware Masking++) speaker embedding model for Apple Silicon.
Produces 192-dimensional speaker embeddings compatible with CosyVoice3 voice cloning.
Model Details
- Architecture: D-TDNN (Densely-connected Time Delay Neural Network) with context-aware masking and multi-granularity pooling
- Parameters: 6.9M
- Input: 80-dim log-mel features, variable length
- Output: 192-dim speaker embedding
- Format: CoreML .mlmodelc (compiled, FP16)
- Size: ~14 MB
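To give a rough sense of what "variable length" means in practice, the sketch below estimates the number of mel frames T for a clip from its sample count. The 25 ms window and 10 ms hop are assumptions (common Kaldi-style filterbank settings), not values stated in this card; the 10-3000 frame range is the one given for the input tensor. The function names are hypothetical.

```python
# Assumed front-end settings; this model card does not state them.
SAMPLE_RATE = 16_000          # Hz, matches the sampleRate used in the usage snippet
WIN = 400                     # 25 ms analysis window at 16 kHz (assumption)
HOP = 160                     # 10 ms frame shift at 16 kHz (assumption)
MIN_FRAMES, MAX_FRAMES = 10, 3000   # T range from the input description


def num_frames(num_samples: int) -> int:
    """Frame count when only complete windows are kept (no padding)."""
    if num_samples < WIN:
        return 0
    return 1 + (num_samples - WIN) // HOP


def fits_model(num_samples: int) -> bool:
    """True if the clip yields a frame count inside the supported range."""
    return MIN_FRAMES <= num_frames(num_samples) <= MAX_FRAMES


print(num_frames(3 * SAMPLE_RATE))   # frames for a 3 s clip under these assumptions
print(fits_model(60 * SAMPLE_RATE))  # a 60 s clip exceeds 3000 frames
```

Under these assumed settings the upper bound of 3000 frames corresponds to roughly 30 seconds of audio, so longer clips would need to be trimmed or split before embedding.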
Input/Output
| Tensor | Shape | Description |
|---|---|---|
| mel_features | [1, T, 80] | 80-dim log-mel spectrogram (T = 10-3000 frames) |
| embedding | [1, 192] | L2-normalizable speaker embedding |
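"L2-normalizable" suggests the raw output is not guaranteed to have unit length. A minimal sketch of normalizing an embedding and comparing two of them by cosine similarity follows; the short toy vectors stand in for real 192-dim outputs, and the helper names are hypothetical. Whether a downstream consumer expects normalized input is not stated in this card.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


# Toy 4-dim stand-ins for 192-dim embeddings.
speaker_a = [0.2, -0.1, 0.4, 0.3]
speaker_a_again = [0.4, -0.2, 0.8, 0.6]   # same direction, different scale
speaker_b = [-0.3, 0.5, 0.1, -0.2]

print(cosine_similarity(speaker_a, speaker_a_again))  # close to 1.0
print(cosine_similarity(speaker_a, speaker_b))        # noticeably lower
```

Normalizing first makes the comparison independent of embedding magnitude, which is why the two scaled copies of the same vector score as near-identical.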
Conversion
Converted from the official campplus.onnx shipped with Fun-CosyVoice3-0.5B-2512:
ONNX → onnx2torch (PyTorch) → torch.jit.trace → coremltools → CoreML FP16
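The pipeline above could look roughly like the following. This is a sketch under stated assumptions, not the script actually used for this model: the function name, file paths, and the example frame count are hypothetical, the flexible T range is taken from the input table, and the ReduceProd patch mentioned in this section is not shown.

```python
def convert_campplus(onnx_path: str, out_path: str, example_frames: int = 200):
    """Sketch: ONNX -> PyTorch (onnx2torch) -> TorchScript -> CoreML FP16."""
    # Heavy dependencies are imported lazily so this file loads without them.
    import torch
    import coremltools as ct
    from onnx2torch import convert

    # 1. ONNX graph -> torch.nn.Module
    torch_model = convert(onnx_path).eval()

    # 2. Trace with a representative [1, T, 80] input
    example = torch.randn(1, example_frames, 80)
    traced = torch.jit.trace(torch_model, example)

    # 3. Declare T as a flexible dimension (range assumed from the I/O table)
    mel_shape = ct.Shape(
        shape=(1, ct.RangeDim(lower_bound=10, upper_bound=3000,
                              default=example_frames), 80)
    )

    # 4. Convert to an ML Program with FP16 compute precision
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="mel_features", shape=mel_shape)],
        outputs=[ct.TensorType(name="embedding")],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT16,
    )
    mlmodel.save(out_path)  # e.g. a hypothetical "CamPlusPlus.mlpackage"
    return mlmodel
```

The saved package would still need to be compiled to obtain the .mlmodelc form, for instance with Xcode's coremlcompiler.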
One ONNX op patched: ReduceProd → ReduceSum in stats pooling (single-element tensor, mathematically equivalent).
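The equivalence is easy to check: over a single-element tensor, the product and the sum both return that one element, while over longer tensors they differ. A small illustration with plain Python lists standing in for tensors (the helper names are illustrative):

```python
import math


def reduce_prod(values):
    """Product of all elements, like ONNX ReduceProd over the full tensor."""
    return math.prod(values)


def reduce_sum(values):
    """Sum of all elements, like ONNX ReduceSum over the full tensor."""
    return sum(values)


single = [7.0]
print(reduce_prod(single), reduce_sum(single))  # 7.0 7.0

multi = [2.0, 5.0]
print(reduce_prod(multi), reduce_sum(multi))    # 10.0 7.0
```

The swap is therefore only safe because the tensor at that point in the graph holds exactly one element.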
Verified: CoreML vs ONNX max diff = 0.015 (FP16 precision).
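A check of this kind can be sketched as an element-wise maximum difference between two output vectors. The example below also rounds a few values through IEEE half precision to show where FP16 error comes from; it illustrates a single rounding step only, whereas the 0.015 figure reflects error accumulated across the whole network. The helper names and sample values are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up reference values and their FP16-rounded counterparts.
reference = [0.1234567, -1.5, 3.1415926, 12.75]
rounded = [to_fp16(x) for x in reference]

diff = max_abs_diff(reference, rounded)
print(rounded)
print(diff)  # small but non-zero: some values are not exactly representable
```

In a real comparison, `reference` and `rounded` would be replaced by the ONNX and CoreML embeddings computed from the same mel features.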
Usage
Used by speech-swift for CosyVoice3 voice cloning:
```swift
// Extract 192-dim speaker embedding for CosyVoice3 voice cloning
let embedding = try camPlusPlus.embed(audio: samples, sampleRate: 16000)
let audio = model.synthesize(text: "Hello", speakerEmbedding: embedding)
```
Original Model
- Source: 3D-Speaker / CAM++ (Alibaba DAMO Academy)
- Checkpoint: iic/speech_campplus_sv_zh-cn_16k-common (ModelScope)
- ONNX: campplus.onnx from FunAudioLLM/Fun-CosyVoice3-0.5B-2512
License
Apache-2.0 (same as original 3D-Speaker)
- Guide: soniqo.audio/guides/embed-speaker
- Docs: soniqo.audio
- GitHub: soniqo/speech-swift