---
license: other
license_name: fsmn-vad-upstream
license_link: https://github.com/modelscope/FunASR
language: [zh]
library_name: coreml
tags: [coreml, ane, voice-activity-detection, fsmn, funasr, fluidaudio]
pipeline_tag: voice-activity-detection
---

# FSMN-VAD — CoreML (Apple Neural Engine)

CoreML conversion of FunASR's **FSMN-VAD** (~5.2M params), for on-device voice activity detection on Apple Silicon.

Upstream: [iic/speech_fsmn_vad_zh-cn-16k-common-pytorch](https://www.modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch).

## Files

| File | Precision | Compute unit | Role |
|------|-----------|--------------|------|
| `FsmnVadPreprocessor.mlmodelc` | FP32 | CPU | waveform → 400-d features (fbank80 + LFR m=5,n=1 + CMVN) |
| `FsmnVad.mlmodelc` | FP16 | ANE | FSMN scorer → per-frame scores `[1, T, 248]` |
| `vad_config.json` | — | — | decision params (`sil_pdf_ids`, thresholds) |

## Pipeline

```
waveform → [Preprocessor fp32/CPU] → features [1,T,400]
         → [FSMN fp16/ANE] → scores [1,T,248]
         → host: silence_prob = softmax(scores)[:, sil_pdf_ids].sum()   (sil_pdf_ids=[0])
         → state machine (thresholds in vad_config) → speech segments [start_ms, end_ms]
```

- Frame rate: 10 ms (LFR n=1, no downsampling).
- The segment **decision logic** (FunASR `FsmnVADStreaming`) runs on the host: silence/speech hysteresis with `max_end_silence_time` (800 ms), `max_start_silence_time` (3000 ms), `max_single_segment_time` (60 s), `sil_to_speech_time_thres` (150 ms). See `vad_config.json`.

## Benchmark — fidelity vs FunASR (FLEURS zh, n=50)

| Metric | Value |
|--------|-------|
| **Frame F1** | **97.4%** (P 100.0% / R 94.8%) |
| Median RTFx | 1209x |

Parity: preprocessor matches `WavFrontendOnline` max|Δ|≈3e-5; FSMN scorer max|Δ| 0.0016. Boundaries match FunASR within ~50 ms.

## License

Weights derive from FunASR's FSMN-VAD; upstream license applies. Format conversion only.
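
## Example: host-side decision sketch

The host-side steps in the pipeline (softmax over the 248 scores, summing the `sil_pdf_ids` entries, then hysteresis over 10 ms frames) can be sketched in plain Python as below. This is a simplified illustration, not FunASR's `FsmnVADStreaming`: the function names and the `sil_threshold=0.5` cutoff are assumptions made for the example, and `max_start_silence_time` and `max_single_segment_time` are not modeled. Real thresholds should come from `vad_config.json`.

```python
import math

def silence_prob(frame_scores, sil_pdf_ids=(0,)):
    """Softmax over one frame's scores, then sum the silence entries."""
    m = max(frame_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in frame_scores]
    return sum(exps[i] for i in sil_pdf_ids) / sum(exps)

def segments(sil_probs, frame_ms=10, sil_threshold=0.5,
             sil_to_speech_ms=150, max_end_silence_ms=800):
    """Simplified hysteresis over per-frame silence probabilities.

    sil_threshold is a hypothetical cutoff chosen for this sketch.
    Returns a list of [start_ms, end_ms] speech segments.
    """
    need_speech = sil_to_speech_ms // frame_ms   # frames needed to open a segment
    need_sil = max_end_silence_ms // frame_ms    # frames needed to close a segment
    out, in_speech, run, start = [], False, 0, 0
    for t, p in enumerate(sil_probs):
        is_speech = p < sil_threshold
        if not in_speech:
            # Count consecutive speech frames while in the silence state.
            run = run + 1 if is_speech else 0
            if run >= need_speech:
                in_speech, start, run = True, t - run + 1, 0
        else:
            # Count consecutive silence frames while in the speech state.
            run = run + 1 if not is_speech else 0
            if run >= need_sil:
                out.append([start * frame_ms, (t - run + 1) * frame_ms])
                in_speech, run = False, 0
    if in_speech:
        # Close a segment still open at the end, trimming trailing silence.
        out.append([start * frame_ms, (len(sil_probs) - run) * frame_ms])
    return out

# 200 ms silence, 500 ms speech, 1000 ms silence
probs = [0.9] * 20 + [0.1] * 50 + [0.9] * 100
print(segments(probs))  # → [[200, 700]]
```

In practice `silence_prob` would be applied to each of the `T` rows of the `[1, T, 248]` score tensor to produce the list passed to `segments`.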