---
license: other
license_name: fsmn-vad-upstream
license_link: https://github.com/modelscope/FunASR
language: [zh]
library_name: coreml
tags: [coreml, ane, voice-activity-detection, fsmn, funasr, fluidaudio]
pipeline_tag: voice-activity-detection
---
# FSMN-VAD: CoreML (Apple Neural Engine)
CoreML conversion of FunASR's **FSMN-VAD** (~5.2M params), for on-device voice
activity detection on Apple Silicon. Upstream:
[iic/speech_fsmn_vad_zh-cn-16k-common-pytorch](https://www.modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch).
## Files
| File | Precision | Compute unit | Role |
|------|-----------|--------------|------|
| `FsmnVadPreprocessor.mlmodelc` | FP32 | CPU | waveform → 400-d features (fbank80 + LFR m=5,n=1 + CMVN) |
| `FsmnVad.mlmodelc` | FP16 | ANE | FSMN scorer → per-frame scores `[1, T, 248]` |
| `vad_config.json` | - | - | decision params (`sil_pdf_ids`, thresholds) |
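
The 400-d feature size follows from stacking m=5 consecutive 80-d fbank frames with a stride of n=1. The sketch below illustrates that stacking in plain Python. The function name `apply_lfr` and the edge padding (repeating the first and last frames) are assumptions made for illustration, not a description of what the shipped preprocessor does internally.

```python
def apply_lfr(frames, m=5, n=1):
    """Stack m consecutive frames with stride n (low frame rate features).

    frames: list of T feature vectors (e.g. 80-d fbank).
    Returns ceil(T / n) vectors of size m * dim.
    Edge handling here is an assumption: the first frame is repeated
    (m - 1) // 2 times on the left and the last frame fills short windows.
    """
    if not frames:
        return []
    left = (m - 1) // 2
    padded = [frames[0]] * left + list(frames)
    num_out = -(-len(frames) // n)  # ceiling division
    out = []
    for i in range(num_out):
        window = padded[i * n : i * n + m]
        while len(window) < m:
            window.append(padded[-1])  # repeat the last frame at the tail
        out.append([x for frame in window for x in frame])
    return out


# 7 dummy fbank frames of 80 values each; frame t is filled with float(t).
fbank = [[float(t)] * 80 for t in range(7)]
features = apply_lfr(fbank)
```

With n=1 the output keeps one vector per input frame, so only the feature width changes (80 to 400).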
## Pipeline
```
waveform → [Preprocessor fp32/CPU] → features [1,T,400]
→ [FSMN fp16/ANE] → scores [1,T,248]
→ host: silence_prob = softmax(scores)[:, sil_pdf_ids].sum() (sil_pdf_ids=[0])
→ state machine (thresholds in vad_config) → speech segments [start_ms, end_ms]
```
```
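
The host-side silence probability step can be sketched without any numeric library. This is a minimal illustration that assumes the scorer output has already been copied into a list of T rows of 248 raw scores (batch dimension dropped). The helper names `softmax` and `silence_probs` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one frame of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def silence_probs(scores, sil_pdf_ids=(0,)):
    """Per-frame silence probability.

    scores: T rows of raw scorer outputs (248 values per row here).
    sil_pdf_ids: output indices treated as silence; [0] per this card.
    """
    result = []
    for row in scores:
        probs = softmax(row)
        result.append(sum(probs[i] for i in sil_pdf_ids))
    return result


# Two dummy frames: a flat one, and one dominated by the silence unit.
dummy_scores = [[0.0] * 248, [10.0] + [0.0] * 247]
probs = silence_probs(dummy_scores)
```

A speech probability per frame is then simply `1 - silence_prob`, which is what the decision logic consumes.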
- Frame rate: 10 ms (LFR n=1, no downsampling).
- The segment **decision logic** (FunASR `FsmnVADStreaming`) runs on the host:
silence/speech hysteresis with `max_end_silence_time` (800 ms),
`max_start_silence_time` (3000 ms), `max_single_segment_time` (60 s),
`sil_to_speech_time_thres` (150 ms). See `vad_config.json`.
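
To make the hysteresis idea concrete, here is a deliberately simplified sketch of a segmenter driven by per-frame silence probabilities. It is not a port of `FsmnVADStreaming`: the `speech_thres` cut-off of 0.5 is a placeholder assumption, `max_start_silence_time` is omitted, and the real decision logic has more states and look-back handling. Only the timing values (150 ms, 800 ms, 60 s, 10 ms frames) come from this card.

```python
FRAME_MS = 10  # frame rate stated above


def segment(silence_probs, speech_thres=0.5, sil_to_speech_ms=150,
            max_end_silence_ms=800, max_segment_ms=60_000):
    """Return [start_ms, end_ms] speech segments (simplified hysteresis)."""
    segments = []
    start = None     # first frame of the currently open segment
    run_speech = 0   # consecutive speech frames seen while in silence
    run_sil = 0      # consecutive silence frames seen while in speech
    for t, p_sil in enumerate(silence_probs):
        is_speech = (1.0 - p_sil) >= speech_thres
        if start is None:
            run_speech = run_speech + 1 if is_speech else 0
            # Open a segment only after enough continuous speech.
            if run_speech * FRAME_MS >= sil_to_speech_ms:
                start = t - run_speech + 1
                run_sil = 0
        else:
            run_sil = 0 if is_speech else run_sil + 1
            too_long = (t + 1 - start) * FRAME_MS >= max_segment_ms
            # Close after enough trailing silence, or at the length cap.
            if run_sil * FRAME_MS >= max_end_silence_ms or too_long:
                end = t + 1 - run_sil
                segments.append([start * FRAME_MS, end * FRAME_MS])
                start = None
                run_speech = 0
    if start is not None:
        # Input ended while a segment was still open.
        end = len(silence_probs) - run_sil
        segments.append([start * FRAME_MS, end * FRAME_MS])
    return segments


# 200 ms silence, 1 s speech, 1 s silence.
demo = segment([1.0] * 20 + [0.0] * 100 + [1.0] * 100)
```

The two thresholds are asymmetric on purpose: a short 150 ms run is enough to enter speech, while leaving it takes 800 ms of silence, so brief pauses inside an utterance do not split the segment.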
## Benchmark: fidelity vs FunASR (FLEURS zh, n=50)
| Metric | Value |
|--------|-------|
| **Frame F1** | **97.4%** (P 100.0% / R 94.8%) |
| Median RTFx | 1209x |
Parity: the preprocessor matches `WavFrontendOnline` to max|Δ| ≈ 3e-5; the FSMN scorer to max|Δ| 0.0016. Boundaries match FunASR within ~50 ms.
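
For readers who want to reproduce this kind of comparison, the sketch below shows one common way to compute frame-level precision, recall and F1 against a reference labelling, plus the max|Δ| parity measure. The function names are hypothetical, and how the figures above were aggregated across the 50 files is not stated here, so treat this as an illustration of the metrics rather than the exact evaluation script.

```python
def frame_prf(reference, hypothesis):
    """Frame-level precision, recall and F1 for binary speech labels.

    reference: per-frame labels from the reference system (1 = speech).
    hypothesis: per-frame labels from the system under test.
    """
    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r and h)
    fp = sum(1 for r, h in pairs if h and not r)
    fn = sum(1 for r, h in pairs if r and not h)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def max_abs_delta(a, b):
    """Largest absolute element-wise difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy example: the hypothesis misses one speech frame, adds no false alarms.
ref = [0, 1, 1, 1, 1, 0]
hyp = [0, 0, 1, 1, 1, 0]
p, r, f1 = frame_prf(ref, hyp)
```

A result with perfect precision and lower recall, as in the toy example, means the hypothesis never marks speech where the reference has silence but marks slightly less speech overall.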
## License
Weights derive from FunASR's FSMN-VAD; upstream license applies. Format conversion only.