license: apache-2.0
tags:
  - speaker-embedding
  - coreml
  - apple-silicon
  - neural-engine
  - cam++
  - campplus
language:
  - zh
  - en
pipeline_tag: audio-classification

CAM++ Speaker Embedding (CoreML)

CoreML-converted CAM++ (Context-Aware Masking++) speaker embedding model for Apple Silicon.

Produces 192-dimensional speaker embeddings compatible with CosyVoice3 voice cloning.

Model Details

Architecture: D-TDNN (Densely-connected Time Delay Neural Network) with context-aware masking and multi-granularity pooling
Parameters: 6.9M
Input: 80-dim log-mel features, variable length
Output: 192-dim speaker embedding
Format: CoreML .mlmodelc (compiled, FP16)
Size: ~14 MB
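
The ~14 MB figure is consistent with the parameter count, since each FP16 weight takes 2 bytes. A quick back-of-the-envelope check, assuming the file is dominated by weights and using decimal megabytes (both are assumptions, not statements from this card):

```python
# Rough size estimate for an FP16 model; ignores metadata and any non-weight data.
PARAMS = 6.9e6          # parameter count from the model details
BYTES_PER_FP16 = 2      # IEEE 754 half precision


def estimated_size_mb(params: float, bytes_per_param: int) -> float:
    """Return the approximate weight size in decimal megabytes (1 MB = 1e6 bytes)."""
    return params * bytes_per_param / 1e6


size_mb = estimated_size_mb(PARAMS, BYTES_PER_FP16)
print(f"{size_mb:.1f} MB")  # 13.8 MB
```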

Input/Output

Tensor	Shape	Description
`mel_features`	`[1, T, 80]`	80-dim log-mel spectrogram (T = 10-3000 frames)
`embedding`	`[1, 192]`	L2-normalizable speaker embedding
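
The table describes the output as L2-normalizable, which suggests the model does not normalize it. A common way to compare two speakers is to L2-normalize both embeddings and take their dot product (cosine similarity). The sketch below uses plain Python lists and hypothetical helper names (`l2_normalize`, `cosine_similarity`). It is not part of speech-swift, and this card does not specify a decision threshold.

```python
import math
from typing import List, Sequence

EMBEDDING_DIM = 192  # output size from the table above


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy vectors standing in for real model outputs.
emb_a = [1.0] * EMBEDDING_DIM
emb_b = [2.0] * EMBEDDING_DIM                # same direction, different scale
emb_c = [1.0, -1.0] * (EMBEDDING_DIM // 2)   # orthogonal to emb_a

print(round(cosine_similarity(emb_a, emb_b), 6))  # 1.0
print(round(cosine_similarity(emb_a, emb_c), 6))  # 0.0
```

Normalizing first makes the score independent of embedding magnitude, which is why `emb_a` and `emb_b` score as identical.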

Conversion

Converted from the official campplus.onnx shipped with Fun-CosyVoice3-0.5B-2512:

ONNX → onnx2torch (PyTorch) → torch.jit.trace → coremltools → CoreML FP16
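
The pipeline above could look roughly like the following. This is a sketch under assumptions, not the script actually used for this model. The function name, output path, example length of 200 frames, and the choice of the `mlprogram` backend are all illustrative. It also assumes the ONNX file has already had its ReduceProd op patched.

```python
def convert_campplus(onnx_path: str, out_path: str = "CamPlusPlus.mlpackage") -> None:
    """Convert a (pre-patched) campplus.onnx to a CoreML FP16 model."""
    # Imports are local so the sketch can be read without the packages installed.
    import torch
    import coremltools as ct
    from onnx2torch import convert

    # ONNX -> PyTorch module
    torch_model = convert(onnx_path).eval()

    # PyTorch -> TorchScript via tracing; T = 200 is an arbitrary example length.
    example = torch.randn(1, 200, 80)
    with torch.no_grad():
        traced = torch.jit.trace(torch_model, example)

    # TorchScript -> CoreML, FP16, with a flexible frame axis T in [10, 3000].
    frames = ct.RangeDim(lower_bound=10, upper_bound=3000, default=200)
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="mel_features", shape=ct.Shape(shape=(1, frames, 80)))],
        outputs=[ct.TensorType(name="embedding")],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT16,
    )
    mlmodel.save(out_path)
```

The saved package would still need to be compiled to obtain a `.mlmodelc`, for example with Apple's `coremlcompiler` tool.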

One ONNX op was patched: ReduceProd was replaced with ReduceSum in the stats pooling layer. The op reduces a single-element tensor, so the two are mathematically equivalent there.
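
The equivalence can be checked directly: for a tensor with exactly one element, the product and the sum both return that element. A small illustration in plain Python follows. The values are arbitrary, and the real op runs inside the ONNX graph, not in Python.

```python
import math


def reduce_prod(values):
    """Product of all elements, mirroring ONNX ReduceProd over a flat tensor."""
    return math.prod(values)


def reduce_sum(values):
    """Sum of all elements, mirroring ONNX ReduceSum over a flat tensor."""
    return sum(values)


single = [7.5]           # single-element tensor: both reductions agree
multi = [2.0, 3.0, 4.0]  # more than one element: they generally differ

print(reduce_prod(single) == reduce_sum(single))  # True
print(reduce_prod(multi), reduce_sum(multi))      # 24.0 9.0
```

The swap is only safe because the reduced tensor has a single element. With more elements the two ops diverge, as the second line shows.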

Verified: the maximum difference between CoreML and ONNX outputs is 0.015, consistent with FP16 precision.
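
A check of this kind can be sketched as follows, assuming "max diff" means the largest element-wise absolute difference between the two embeddings (this card does not say how it was measured). The vectors are made-up placeholders, not real model outputs, and `max_abs_diff` is a hypothetical helper.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(candidate):
        raise ValueError("vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


TOLERANCE = 0.015  # max diff reported for CoreML FP16 vs ONNX

# Placeholder values standing in for ONNX (reference) and CoreML (candidate) outputs.
onnx_embedding = [0.50, -1.25, 2.00, 0.75]
coreml_embedding = [0.51, -1.25, 1.99, 0.75]

diff = max_abs_diff(onnx_embedding, coreml_embedding)
print(diff <= TOLERANCE)  # True
```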

Usage

Used by speech-swift for CosyVoice3 voice cloning:

// Extract 192-dim speaker embedding for CosyVoice3 voice cloning
let embedding = try camPlusPlus.embed(audio: samples, sampleRate: 16000)
let audio = model.synthesize(text: "Hello", speakerEmbedding: embedding)

Original Model

Source: 3D-Speaker / CAM++ (Alibaba DAMO Academy)
Checkpoint: iic/speech_campplus_sv_zh-cn_16k-common (ModelScope)
ONNX: campplus.onnx from FunAudioLLM/Fun-CosyVoice3-0.5B-2512

License

Apache-2.0 (same as original 3D-Speaker)

Guide: soniqo.audio/guides/embed-speaker
Docs: soniqo.audio
GitHub: soniqo/speech-swift