| --- |
| license: other |
| license_name: fsmn-vad-upstream |
| license_link: https://github.com/modelscope/FunASR |
| language: [zh] |
| library_name: coreml |
| tags: [coreml, ane, voice-activity-detection, fsmn, funasr, fluidaudio] |
| pipeline_tag: voice-activity-detection |
| --- |
| |
| # FSMN-VAD β CoreML (Apple Neural Engine) |
|
|
| CoreML conversion of FunASR's **FSMN-VAD** (~5.2M params), for on-device voice |
| activity detection on Apple Silicon. Upstream: |
| [iic/speech_fsmn_vad_zh-cn-16k-common-pytorch](https://www.modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch). |
|
|
| ## Files |
|
|
| | File | Precision | Compute unit | Role | |
| |------|-----------|--------------|------| |
| | `FsmnVadPreprocessor.mlmodelc` | FP32 | CPU | waveform β 400-d features (fbank80 + LFR m=5,n=1 + CMVN) | |
| | `FsmnVad.mlmodelc` | FP16 | ANE | FSMN scorer β per-frame scores `[1, T, 248]` | |
| | `vad_config.json` | β | β | decision params (`sil_pdf_ids`, thresholds) | |
|
|
| ## Pipeline |
|
|
| ``` |
| waveform β [Preprocessor fp32/CPU] β features [1,T,400] |
| β [FSMN fp16/ANE] β scores [1,T,248] |
| β host: silence_prob = softmax(scores)[:, sil_pdf_ids].sum() (sil_pdf_ids=[0]) |
| β state machine (thresholds in vad_config) β speech segments [start_ms, end_ms] |
| ``` |
|
|
| - Frame rate: 10 ms (LFR n=1, no downsampling). |
| - The segment **decision logic** (FunASR `FsmnVADStreaming`) runs on the host: |
| silence/speech hysteresis with `max_end_silence_time` (800 ms), |
| `max_start_silence_time` (3000 ms), `max_single_segment_time` (60 s), |
| `sil_to_speech_time_thres` (150 ms). See `vad_config.json`. |
|
|
| ## Benchmark β fidelity vs FunASR (FLEURS zh, n=50) |
|
|
| | Metric | Value | |
| |--------|-------| |
| | **Frame F1** | **97.4%** (P 100.0% / R 94.8%) | |
| | Median RTFx | 1209x | |
|
|
| Parity: preprocessor matches `WavFrontendOnline` max|Ξ|β3e-5; FSMN scorer max|Ξ| 0.0016. Boundaries match FunASR within ~50 ms. |
|
|
| ## License |
|
|
| Weights derive from FunASR's FSMN-VAD; upstream license applies. Format conversion only. |
|
|