CAM++ Speaker Embedding (CoreML)

CoreML-converted CAM++ (Context-Aware Masking++) speaker embedding model for Apple Silicon.

Produces 192-dimensional speaker embeddings compatible with CosyVoice3 voice cloning.

Model Details

  • Architecture: D-TDNN (Densely-connected Time Delay Neural Network) with context-aware masking and multi-granularity pooling
  • Parameters: 6.9M
  • Input: 80-dim log-mel features, variable length
  • Output: 192-dim speaker embedding
  • Format: CoreML .mlmodelc (compiled, FP16)
  • Size: ~14 MB
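The ~14 MB figure is consistent with the parameter count: a quick back-of-the-envelope check (assuming 2 bytes per FP16 weight, with a small overhead for the compiled .mlmodelc container) gives:

```python
# Sanity check: 6.9M parameters stored as FP16 (2 bytes each).
params = 6_900_000
size_mb = params * 2 / 1_000_000  # decimal megabytes

print(size_mb)  # 13.8 — matches the ~14 MB on-disk size
```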

Input/Output

Tensor        Shape        Description
mel_features  [1, T, 80]   80-dim log-mel spectrogram (T = 10-3000 frames)
embedding     [1, 192]     L2-normalizable speaker embedding
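"L2-normalizable" means the raw embedding can be scaled to unit length before cosine comparison, which is the usual way to compare speaker embeddings. A minimal sketch in plain Python (using a toy 4-dim vector in place of the real 192-dim embedding):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm, as done before cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy 4-dim stand-in for the 192-dim speaker embedding.
emb = [3.0, 4.0, 0.0, 0.0]
unit = l2_normalize(emb)

print(unit)                       # [0.6, 0.8, 0.0, 0.0]
print(sum(x * x for x in unit))   # 1.0 (unit length)
```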

Conversion

Converted from the official campplus.onnx shipped with Fun-CosyVoice3-0.5B-2512:

ONNX → onnx2torch (PyTorch) → torch.jit.trace → coremltools → CoreML FP16

One ONNX op was patched: ReduceProd was replaced with ReduceSum in the stats-pooling layer. The reduction there operates on a single-element tensor, so the two ops are mathematically equivalent.
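The equivalence is easy to see: over a single-element tensor, a product reduction and a sum reduction both return that one element. A minimal illustration:

```python
# For a single-element tensor, product and sum over the axis coincide,
# so swapping ReduceProd for ReduceSum changes nothing numerically.
x = [7.0]  # single-element tensor

prod = 1.0
for v in x:
    prod *= v
total = sum(x)

print(prod == total)  # True
```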

Verified: the maximum absolute difference between CoreML and ONNX outputs is 0.015, consistent with FP16 precision.

Usage

Used by speech-swift for CosyVoice3 voice cloning:

// Extract 192-dim speaker embedding for CosyVoice3 voice cloning
let embedding = try camPlusPlus.embed(audio: samples, sampleRate: 16000)
let audio = model.synthesize(text: "Hello", speakerEmbedding: embedding)

Original Model

License

Apache-2.0 (same as the original 3D-Speaker release)
