CoreML-converted CAM++ (Context-Aware Masking++) speaker embedding model for Apple Silicon.
Produces 192-dimensional speaker embeddings compatible with CosyVoice3 voice cloning.
`.mlmodelc` (compiled, FP16)

| Tensor | Shape | Description |
|---|---|---|
| `mel_features` | `[1, T, 80]` | 80-dim log-mel spectrogram (T = 10–3000 frames) |
| `embedding` | `[1, 192]` | L2-normalizable speaker embedding |
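The output embedding is L2-normalizable; a minimal sketch of that normalization in plain Swift (the `l2Normalize` helper is illustrative, not part of this model's API):

```swift
// L2-normalize an embedding so its norm is 1 and cosine similarity
// reduces to a dot product. `l2Normalize` is illustrative, not model API.
func l2Normalize(_ v: [Float]) -> [Float] {
    let norm = v.reduce(0) { $0 + $1 * $1 }.squareRoot()
    guard norm > 0 else { return v }   // leave a zero vector untouched
    return v.map { $0 / norm }
}

let raw: [Float] = [3, 0, 4]       // stand-in for a 192-dim embedding
let unit = l2Normalize(raw)        // ≈ [0.6, 0.0, 0.8]
```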
Converted from the official campplus.onnx shipped with Fun-CosyVoice3-0.5B-2512:
ONNX → onnx2torch (PyTorch) → torch.jit.trace → coremltools → CoreML FP16
One ONNX op patched: ReduceProd → ReduceSum in stats pooling (single-element tensor, mathematically equivalent).
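The equivalence is easy to sanity-check: reducing a single-element tensor by product or by sum returns the same lone element (an illustrative check, not the actual ONNX graph):

```swift
// Reducing a single-element tensor: product and sum agree.
let single: [Float] = [2.5]        // stand-in for the 1-element tensor
let prod = single.reduce(1, *)     // ReduceProd semantics
let sum  = single.reduce(0, +)     // ReduceSum semantics
// prod == sum == 2.5, so the patch is numerically a no-op here
```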
Verified: CoreML vs ONNX max diff = 0.015 (FP16 precision).
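The 0.015 figure is a maximum absolute elementwise difference between the two backends' outputs; a sketch of how such a check can be computed (toy vectors, hypothetical helper name):

```swift
// Max absolute elementwise difference between two output vectors,
// mirroring the CoreML-vs-ONNX comparison. `maxAbsDiff` is illustrative.
func maxAbsDiff(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { abs($0.0 - $0.1) }.max() ?? 0
}

// Toy stand-ins for the two 192-dim embeddings.
let coremlOut: [Float] = [0.10, 0.20, 0.30]
let onnxOut: [Float]   = [0.11, 0.19, 0.30]
let diff = maxAbsDiff(coremlOut, onnxOut)   // ≈ 0.01
```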
Used by speech-swift for CosyVoice3 voice cloning:
```swift
// Extract 192-dim speaker embedding for CosyVoice3 voice cloning
let embedding = try camPlusPlus.embed(audio: samples, sampleRate: 16000)
let audio = model.synthesize(text: "Hello", speakerEmbedding: embedding)
```
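Beyond voice cloning, speaker embeddings like this are commonly compared with cosine similarity, e.g. to check whether two clips share a speaker. A minimal sketch with toy vectors; `cosineSimilarity` is illustrative, not part of the speech-swift API:

```swift
// Cosine similarity between two embeddings; values near 1 suggest the
// same speaker. `cosineSimilarity` is illustrative, not speech-swift API.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0.0 * $0.1 }.reduce(0, +)
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    return dot / (normA * normB)
}

// Toy 3-dim stand-ins for 192-dim CAM++ embeddings.
let same: Float = cosineSimilarity([1, 0, 0], [1, 0, 0])       // 1.0
let orthogonal: Float = cosineSimilarity([1, 0, 0], [0, 1, 0]) // 0.0
```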
- Original model: `iic/speech_campplus_sv_zh-cn_16k-common` (ModelScope)
- ONNX source: `campplus.onnx` from `FunAudioLLM/Fun-CosyVoice3-0.5B-2512`
- License: Apache-2.0 (same as the original 3D-Speaker)