FluidInference / campplus-coreml

	---
	license: other
	license_name: campplus-upstream
	license_link: https://github.com/modelscope/FunASR
	language: [zh]
	library_name: coreml
	tags: [coreml, ane, speaker-verification, speaker-diarization, campplus, funasr, fluidaudio]
	pipeline_tag: audio-classification
	---

	# CAM++ — CoreML (Apple Neural Engine)

	CoreML conversion of FunASR's CAM++ speaker-embedding model (~7.2M params), for
	on-device speaker verification / diarization on Apple Silicon. Upstream:
	[iic/speech_campplus_sv_zh-cn_16k-common](https://www.modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common).

	## Files

	| File | Precision | Compute unit | Role |
	|------|-----------|--------------|------|
	| `CamPlusPreprocessor.mlmodelc` | FP32 | CPU | waveform → 80-d fbank features |
	| `CamPlusPlus.mlmodelc` | FP16 | ANE | fbank → 192-d speaker embedding |

	## Pipeline

	```
	waveform → [Preprocessor fp32/CPU] → fbank [1,T,80]
	→ [CAM++ fp16/ANE] → embedding [1,192] (L2-normalize, then cosine for verification/clustering)
	```
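The last step of the pipeline above (L2-normalize, then cosine) can be sketched in plain Python as follows. This is a minimal illustration, not code shipped with this repository: the function names and the `threshold=0.5` default are hypothetical, and the embeddings are assumed to already be available as flat lists of 192 floats.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_score(emb_a, emb_b):
    """Cosine similarity, computed as the dot product of L2-normalized vectors."""
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    return sum(x * y for x, y in zip(a, b))


def same_speaker(emb_a, emb_b, threshold=0.5):
    # threshold=0.5 is a placeholder, not a value from this model card;
    # it should be tuned on held-out trials for the target data.
    return cosine_score(emb_a, emb_b) >= threshold
```

Because both embeddings are normalized first, the score depends only on the angle between them, so embeddings of different magnitudes compare consistently.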

	CAM++ normalizes the fbank internally. The 192-d embedding is used with cosine
	similarity for speaker verification and diarization clustering.
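One simple way to use cosine similarity for diarization clustering is a greedy, threshold-based assignment. The sketch below is an assumption chosen for illustration, not the clustering method used by any particular diarization pipeline: `greedy_cluster` and its `threshold=0.5` default are hypothetical, and each segment embedding is assumed to be a flat list of floats.

```python
import math


def unit(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def greedy_cluster(embeddings, threshold=0.5):
    """Assign each embedding to its most similar cluster, or open a new cluster.

    threshold is a placeholder value, not one taken from this model card.
    Returns one integer speaker label per input embedding.
    """
    sums = []    # per-cluster running sum of unit-length embeddings
    labels = []
    for emb in embeddings:
        e = unit(emb)
        best, best_sim = -1, threshold
        for idx, total in enumerate(sums):
            centroid = unit(total)
            sim = sum(x * y for x, y in zip(e, centroid))
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # No existing cluster is similar enough: start a new speaker.
            sums.append(list(e))
            labels.append(len(sums) - 1)
        else:
            sums[best] = [a + b for a, b in zip(sums[best], e)]
            labels.append(best)
    return labels
```

The result depends on the order of the segments and on the threshold, so a sketch like this is mainly useful as a baseline before moving to agglomerative or spectral clustering.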

	## Benchmark — AISHELL-1 speaker verification

	| Metric | Value |
	|--------|-------|
	| EER | 0.48% (20 speakers, 6000 same / 6000 diff trials) |
	| same-speaker cosine | 0.805 |
	| different-speaker cosine | 0.256 |

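	AISHELL-1 (clean, read Mandarin) is an easier test set than CN-Celeb, the official benchmark (~6-7% EER), so the two figures are not directly comparable. Cosine similarity between CoreML and PyTorch embeddings is 0.9997-0.99999.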
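For readers who want to reproduce an EER figure from their own trial scores, the metric can be sketched as below. This is a generic illustration, not the evaluation script behind the numbers above: `equal_error_rate` is a hypothetical helper that takes cosine scores for same-speaker and different-speaker trials, and it reports the operating point where the false-accept and false-reject rates are closest.

```python
def equal_error_rate(same_scores, diff_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    A trial is accepted when its score is >= threshold. This brute-force
    sweep is quadratic in the number of trials, which is fine for a sketch
    but slow for very large trial lists.
    """
    thresholds = sorted(set(same_scores) | set(diff_scores))
    best = None
    for t in thresholds:
        # False accepts: different-speaker trials scored at or above t.
        far = sum(s >= t for s in diff_scores) / len(diff_scores)
        # False rejects: same-speaker trials scored below t.
        frr = sum(s < t for s in same_scores) / len(same_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

With a finite number of trials the two error rates rarely cross exactly, so the sketch returns their average at the closest point; other tools may interpolate instead and report slightly different values.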

	## License

	Weights derive from FunASR's CAM++, so the upstream license applies. This repository contains a format conversion only.