---
license: other
license_name: campplus-upstream
license_link: https://github.com/modelscope/FunASR
language: [zh]
library_name: coreml
tags: [coreml, ane, speaker-verification, speaker-diarization, campplus, funasr, fluidaudio]
pipeline_tag: audio-classification
---

# CAM++ — CoreML (Apple Neural Engine)

CoreML conversion of FunASR's **CAM++** speaker-embedding model (~7.2M params), for
on-device speaker verification / diarization on Apple Silicon. Upstream:
[iic/speech_campplus_sv_zh-cn_16k-common](https://www.modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common).

## Files

| File | Precision | Compute unit | Role |
|------|-----------|--------------|------|
| `CamPlusPreprocessor.mlmodelc` | FP32 | CPU | waveform → 80-d fbank features |
| `CamPlusPlus.mlmodelc` | FP16 | ANE | fbank → 192-d speaker embedding |

## Pipeline

```
waveform → [Preprocessor fp32/CPU] → fbank [1,T,80]
        → [CAM++ fp16/ANE] → embedding [1,192]  (L2-normalize, then cosine for verification/clustering)
```

CAM++ normalizes the fbank internally. The 192-d embedding is used with cosine
similarity for speaker verification and diarization clustering.
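
As a minimal sketch of the scoring step described above, the following pure-Python snippet L2-normalizes two embeddings and compares them with cosine similarity. The function names and the `threshold` default are hypothetical, chosen for illustration only. In practice the decision threshold should be tuned on held-out trials from the target domain.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_score(emb_a, emb_b):
    """Cosine similarity: dot product of the two L2-normalized embeddings."""
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    return sum(x * y for x, y in zip(a, b))


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the pair when the score reaches the threshold.

    The 0.5 default is an illustrative assumption, not a tuned value.
    """
    return cosine_score(emb_a, emb_b) >= threshold
```

For diarization, the same `cosine_score` could serve as the pairwise affinity passed to a clustering step.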

## Benchmark — AISHELL-1 speaker verification

| Metric | Value |
|--------|-------|
| **EER** | **0.48%** (20 speakers, 6000 same / 6000 diff trials) |
| same-speaker cosine | 0.805 |
| different-speaker cosine | 0.256 |

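AISHELL-1 (clean, read Mandarin) is an easier test set than CN-Celeb, the official benchmark, where EER is roughly 6-7%. Embeddings from the CoreML and PyTorch models agree closely: their cosine similarity ranges from 0.9997 to 0.99999.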
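
For readers who want to reproduce an EER figure from their own trial lists, here is a rough sketch of one common way to estimate it: sweep every observed score as a threshold and take the point where the false-accept and false-reject rates are closest. The function name and the tie-handling are assumptions made for this example. It is not necessarily the script used to produce the table above.

```python
from bisect import bisect_left


def equal_error_rate(same_scores, diff_scores):
    """Estimate (eer, threshold) from same-speaker and different-speaker scores.

    A trial is accepted when its score is >= the threshold.
    """
    same_sorted = sorted(same_scores)
    diff_sorted = sorted(diff_scores)
    best = None
    for t in sorted(set(same_sorted) | set(diff_sorted)):
        # False accepts: different-speaker trials scoring at or above t.
        far = (len(diff_sorted) - bisect_left(diff_sorted, t)) / len(diff_sorted)
        # False rejects: same-speaker trials scoring below t.
        frr = bisect_left(same_sorted, t) / len(same_sorted)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

With only a handful of trials the estimate is coarse, because the two rates move in steps of one trial each.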

## License

The weights derive from FunASR's CAM++, so the upstream license applies. This repository only converts the model format.