---
license: other
license_name: fsmn-vad-upstream
license_link: https://github.com/modelscope/FunASR
language: [zh]
library_name: coreml
tags: [coreml, ane, voice-activity-detection, fsmn, funasr, fluidaudio]
pipeline_tag: voice-activity-detection
---

# FSMN-VAD — CoreML (Apple Neural Engine)

CoreML conversion of FunASR's **FSMN-VAD** (~5.2M params), for on-device voice
activity detection on Apple Silicon. Upstream:
[iic/speech_fsmn_vad_zh-cn-16k-common-pytorch](https://www.modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch).

## Files

| File | Precision | Compute unit | Role |
|------|-----------|--------------|------|
| `FsmnVadPreprocessor.mlmodelc` | FP32 | CPU | waveform → 400-d features (fbank80 + LFR m=5,n=1 + CMVN) |
| `FsmnVad.mlmodelc` | FP16 | ANE | FSMN scorer → per-frame scores `[1, T, 248]` |
| `vad_config.json` | — | — | decision params (`sil_pdf_ids`, thresholds) |
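
The 400-d feature size follows from stacking m=5 consecutive 80-d fbank frames (80 × 5 = 400), advancing n=1 frame per output. A minimal sketch of that stacking step in plain Python is below. The function name `stack_lfr` and the edge-replication padding are assumptions made for illustration; the bundled preprocessor does this inside the CoreML graph and may treat the edges differently.

```python
def stack_lfr(frames, m=5, n=1):
    """Low-frame-rate stacking: concatenate m frames, advance n per output.

    frames: list of T feature vectors (each a list of floats).
    Padding by repeating the first/last frame is an assumption of this sketch.
    """
    left = (m - 1) // 2
    padded = [frames[0]] * left + list(frames)
    stacked = []
    for start in range(0, len(frames), n):
        window = padded[start:start + m]
        # Pad the tail with the last frame when fewer than m frames remain.
        window += [padded[-1]] * (m - len(window))
        stacked.append([value for frame in window for value in frame])
    return stacked


# Ten dummy 80-d frames; frame t is filled with the value t.
fbank = [[float(t)] * 80 for t in range(10)]
features = stack_lfr(fbank)
print(len(features), len(features[0]))
```

With n=1 the output keeps one vector per input frame, so only the feature dimension changes.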

## Pipeline

```
waveform → [Preprocessor fp32/CPU] → features [1,T,400]
        → [FSMN fp16/ANE] → scores [1,T,248]
        → host: silence_prob[t] = softmax(scores[0, t])[sil_pdf_ids].sum()  (sil_pdf_ids=[0])
        → state machine (thresholds in vad_config) → speech segments [start_ms, end_ms]
```
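
The host-side silence probability in the diagram can be sketched in plain Python as follows. This assumes the scorer output is raw (pre-softmax) scores, as the diagram implies, and that the batch dimension has already been dropped. The function name `silence_probs` is hypothetical.

```python
import math


def silence_probs(scores, sil_pdf_ids=(0,)):
    """Per-frame silence probability from per-frame scores.

    scores: list of T frames, each a list of class scores (248 for this model).
    sil_pdf_ids: class indices treated as silence (taken from vad_config.json).
    """
    probs = []
    for frame in scores:
        peak = max(frame)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in frame]
        total = sum(exps)
        probs.append(sum(exps[i] for i in sil_pdf_ids) / total)
    return probs


# Two toy frames with 3 classes each instead of 248.
toy_scores = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
print(silence_probs(toy_scores))
```

A frame is then treated as speech when its silence probability is low enough, using the thresholds in `vad_config.json`.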

- Frame shift: 10 ms (LFR n=1, so no temporal downsampling).
- The segment **decision logic** (FunASR `FsmnVADStreaming`) runs on the host:
  silence/speech hysteresis with `max_end_silence_time` (800 ms),
  `max_start_silence_time` (3000 ms), `max_single_segment_time` (60 s),
  `sil_to_speech_time_thres` (150 ms). See `vad_config.json`.
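
To illustrate how these timing parameters interact, here is a deliberately simplified hysteresis loop over per-frame speech flags. It is not FunASR's `FsmnVADStreaming` logic, which uses a windowed detector and additional state; the function name `segment_frames` is hypothetical, `max_start_silence_time` is omitted, and the per-frame speech/silence decision is assumed to have been made already.

```python
FRAME_MS = 10  # frame shift stated above


def segment_frames(is_speech, sil_to_speech_time_thres=150,
                   max_end_silence_time=800, max_single_segment_time=60_000):
    """Turn per-frame booleans into [start_ms, end_ms] segments (simplified)."""
    start_frames = sil_to_speech_time_thres // FRAME_MS
    end_frames = max_end_silence_time // FRAME_MS
    max_frames = max_single_segment_time // FRAME_MS
    segments = []
    seg_start = None  # frame index where the open segment began
    run_speech = 0    # current run of consecutive speech frames
    run_sil = 0       # current run of consecutive silence frames
    for t, speech in enumerate(is_speech):
        if speech:
            run_speech += 1
            run_sil = 0
        else:
            run_sil += 1
            run_speech = 0
        if seg_start is None:
            # Open a segment only after enough consecutive speech.
            if run_speech >= start_frames:
                seg_start = t - run_speech + 1
        elif run_sil >= end_frames:
            # Close at the first frame of the trailing silence run.
            seg_end = t - run_sil + 1
            segments.append([seg_start * FRAME_MS, seg_end * FRAME_MS])
            seg_start = None
        elif t + 1 - seg_start >= max_frames:
            # Force a split when the segment reaches the maximum length.
            segments.append([seg_start * FRAME_MS, (t + 1) * FRAME_MS])
            seg_start = None
            run_speech = 0
    if seg_start is not None:
        # End of input: close the open segment, trimming trailing silence.
        seg_end = len(is_speech) - run_sil
        segments.append([seg_start * FRAME_MS, seg_end * FRAME_MS])
    return segments


# 200 ms silence, 1 s speech, 1 s silence.
flags = [False] * 20 + [True] * 100 + [False] * 100
print(segment_frames(flags))
```

Requiring a run of speech before opening a segment and a longer run of silence before closing it is what keeps short noise bursts and brief pauses from fragmenting the output.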

## Benchmark — fidelity vs FunASR (FLEURS zh, n=50)

| Metric | Value |
|--------|-------|
| **Frame F1** | **97.4%** (P 100.0% / R 94.8%) |
| Median RTFx | 1209x |

Parity with FunASR: the preprocessor matches `WavFrontendOnline` to within max |Δ| ≈ 3e-5, and the FSMN scorer to within max |Δ| = 0.0016. Segment boundaries agree with FunASR to within ~50 ms.
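
For readers who want to reproduce this kind of comparison, the sketch below shows the standard definitions of frame-level precision, recall, and F1, plus RTFx as audio duration divided by processing time. It assumes the FunASR output serves as the reference labels, as the section heading suggests; the exact evaluation script used for the table is not part of this card, and the function names are hypothetical.

```python
def frame_prf(reference, predicted):
    """Frame-level precision, recall and F1 for the speech class.

    reference, predicted: equal-length sequences of per-frame booleans.
    """
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio processed per second."""
    return audio_seconds / processing_seconds


ref = [True, True, True, True, False, False]
hyp = [True, True, True, False, False, False]
print(frame_prf(ref, hyp))
```

High precision with lower recall, as in the table, means the converted model rarely marks a frame as speech that the reference does not, but misses some frames the reference marks as speech.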

## License

The weights derive from FunASR's FSMN-VAD and remain under the upstream license. This repository only converts the model format.