---
license: apache-2.0
base_model: FireRedTeam/FireRedVAD
tags:
- voice-activity-detection
- vad
- coreml
- apple
- ios
- macos
- streaming
- real-time
- dfsmn
- firered
pipeline_tag: voice-activity-detection
library_name: coremltools
language:
- multilingual
---

# FireRedVAD-CoreML

Core ML conversion of the [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) Stream-VAD model for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+). The weights come from the original PyTorch model published by FireRedTeam.

## Model Description

- **Original model:** FireRedVAD by the Xiaohongshu (小红书) FireRedTeam
- **Architecture:** DFSMN (Deep Feedforward Sequential Memory Network) with 8 DFSMN blocks and 1 DNN layer
- **Variant:** Stream-VAD (causal, lookahead = 0), suitable for real-time streaming
- **Parameters:** ~568K
- **Model size:** 2.2 MB (FP32)
- **Input:** 80-dimensional log-Mel filterbank features (16 kHz audio, 25 ms frame length, 10 ms frame shift)
- **Output:** Speech probability in [0, 1] per frame
- **Language support:** 100+ languages, 20+ Chinese dialects

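The framing parameters above determine how many feature frames a given amount of audio produces. A minimal sketch of that arithmetic, assuming frames are not padded at the clip edges (an assumption about the feature extractor, which this card does not specify). The constant and function names are illustrative and not part of the model or any library.

```swift
// Framing constants from the model description: 16 kHz audio, 25 ms window, 10 ms shift.
let sampleRate = 16_000
let frameLength = sampleRate * 25 / 1000   // 400 samples per frame
let frameShift = sampleRate * 10 / 1000    // 160 samples between frame starts

/// Number of complete frames in `sampleCount` samples, assuming no edge padding.
func frameCount(forSamples sampleCount: Int) -> Int {
    guard sampleCount >= frameLength else { return 0 }
    return 1 + (sampleCount - frameLength) / frameShift
}

let framesPerSecond = frameCount(forSamples: sampleRate)
print("Frames in 1 s of audio:", framesPerSecond)  // 98
```

Under these assumptions the model emits roughly one speech probability every 10 ms of audio.
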
## Performance

Results from the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):

| Metric | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
|--------|-----------|-----------|---------|-----------|-----------|
| AUC-ROC | **99.60** | 97.99 | 97.81 | - | - |
| F1 Score | **97.57** | 95.95 | 95.19 | 90.91 | 52.30 |
| False Alarm | **2.69%** | 9.41% | 15.47% | 44.03% | 2.83% |
| Miss Rate | 3.62% | 3.95% | 2.95% | 0.42% | 64.15% |

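For readers comparing the rows above, the following sketch shows the textbook frame-level definitions of false alarm rate, miss rate, and F1. It assumes the benchmark uses these standard definitions, which this card does not confirm. The `ConfusionCounts` type and the counts in the example are made up for illustration and are not benchmark data.

```swift
/// Frame-level confusion counts for a binary speech / non-speech decision.
/// Hypothetical helper type, not part of FireRedVAD or Core ML.
struct ConfusionCounts {
    var truePositives: Int    // speech frames detected as speech
    var falsePositives: Int   // non-speech frames detected as speech
    var trueNegatives: Int    // non-speech frames detected as non-speech
    var falseNegatives: Int   // speech frames detected as non-speech

    private func ratio(_ numerator: Int, _ denominator: Int) -> Double {
        denominator == 0 ? 0 : Double(numerator) / Double(denominator)
    }

    /// Share of non-speech frames wrongly flagged as speech.
    var falseAlarmRate: Double { ratio(falsePositives, falsePositives + trueNegatives) }
    /// Share of speech frames that were missed.
    var missRate: Double { ratio(falseNegatives, falseNegatives + truePositives) }
    /// Harmonic mean of precision and recall.
    var f1: Double { ratio(2 * truePositives, 2 * truePositives + falsePositives + falseNegatives) }
}

// Made-up counts, for illustration only.
let counts = ConfusionCounts(truePositives: 90, falsePositives: 5,
                             trueNegatives: 95, falseNegatives: 10)
print(counts.falseAlarmRate, counts.missRate, counts.f1)
```

The two error rates trade off against each other: a detector that rarely misses speech can still raise many false alarms, and the reverse.
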
## Core ML Model Specification

### Inputs

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| `feat` | `[1, 1..512, 80]` | Float32 | Log-Mel filterbank features (dynamic time axis, 1 to 512 frames) |
| `cache_0` to `cache_7` | `[1, 128, 19]` | Float32 | FSMN lookback cache, one per DFSMN block |

### Outputs

| Name | Type | Description |
|------|------|-------------|
| `probs` | Float32 | Speech probability, shape `[1, T, 1]` |
| `new_cache_0` to `new_cache_7` | Float32 | Updated lookback caches, to be passed as `cache_0` to `cache_7` on the next call |

- **Minimum deployment target:** iOS 16 / macOS 13
- **Compute units:** CPU + Neural Engine

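The shapes in the tables translate directly into buffer sizes and index arithmetic. The sketch below assumes a contiguous row-major layout for the `[1, T, 80]` feature buffer. The helper names are illustrative and do not belong to Core ML or to this model.

```swift
// Shapes taken from the specification tables.
let featureDim = 80
let allowedFrames = 1...512
let cacheShape = [1, 128, 19]
let cacheCount = 8

/// Flat offset of (frame, bin) in a `[1, T, 80]` buffer, assuming row-major layout.
func featOffset(frame: Int, bin: Int) -> Int {
    precondition((0..<featureDim).contains(bin), "bin out of range")
    return frame * featureDim + bin
}

/// Whether a chunk of `frames` frames fits the model's dynamic time axis.
func isValidChunk(frames: Int) -> Bool {
    allowedFrames.contains(frames)
}

let elementsPerCache = cacheShape.reduce(1, *)
let totalCacheBytes = elementsPerCache * cacheCount * MemoryLayout<Float>.size
print(elementsPerCache, totalCacheBytes)  // 2432 77824
```

All eight caches together hold under 80 KB of Float32 state, so carrying them between calls is cheap.
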
## Conversion

The model was converted from PyTorch with [coremltools](https://github.com/apple/coremltools), using the export script in [FireRedASR2S](https://github.com/FireRedTeam/FireRedASR2S). The Stream-VAD variant was chosen because it is causal (no lookahead), which makes it suitable for real-time streaming.

## Usage

```swift
import CoreML

// Load the model through the class Xcode generates from the model file
let model = try FireRedVAD(configuration: .init())

// Initialize one cache per DFSMN block (8 x [1, 128, 19]).
// Zero-filling is an assumption: it represents an empty history at stream start.
var caches: [MLMultiArray] = try (0..<8).map { _ -> MLMultiArray in
    let cache = try MLMultiArray(shape: [1, 128, 19], dataType: .float32)
    for i in 0..<cache.count { cache[i] = 0 }  // set explicitly rather than relying on initial contents
    return cache
}

// Run one chunk of T frames (1...512).
// fbankFeatures is a placeholder for an MLMultiArray of shape [1, T, 80].
let input = FireRedVADInput(
    feat: fbankFeatures,
    cache_0: caches[0], cache_1: caches[1],
    cache_2: caches[2], cache_3: caches[3],
    cache_4: caches[4], cache_5: caches[5],
    cache_6: caches[6], cache_7: caches[7]
)
let output = try model.prediction(input: input)
let speechProb = output.probs  // [1, T, 1]

// Carry the updated caches into the next call
caches = [
    output.new_cache_0, output.new_cache_1,
    output.new_cache_2, output.new_cache_3,
    output.new_cache_4, output.new_cache_5,
    output.new_cache_6, output.new_cache_7
]
```

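The snippet above handles a single chunk. For a longer stream, the same cache hand-off repeats for every chunk of at most 512 frames. The following sketch shows that loop in isolation. The `step` closure is a hypothetical stand-in for the Core ML prediction call, so the example runs without the model, and `chunkRanges` and `runStreaming` are illustrative names.

```swift
/// Splits `totalFrames` into consecutive ranges of at most `maxChunk` frames.
func chunkRanges(totalFrames: Int, maxChunk: Int = 512) -> [Range<Int>] {
    precondition(maxChunk > 0, "maxChunk must be positive")
    return stride(from: 0, to: totalFrames, by: maxChunk).map { start in
        start..<min(start + maxChunk, totalFrames)
    }
}

/// Runs `step` on each chunk in order, feeding the state returned by one call
/// into the next. In real use the state would be the eight cache arrays.
func runStreaming<State>(
    totalFrames: Int,
    initialState: State,
    maxChunk: Int = 512,
    step: (Range<Int>, State) throws -> (probs: [Float], state: State)
) rethrows -> [Float] {
    var state = initialState
    var allProbs: [Float] = []
    for range in chunkRanges(totalFrames: totalFrames, maxChunk: maxChunk) {
        let result = try step(range, state)
        allProbs.append(contentsOf: result.probs)
        state = result.state
    }
    return allProbs
}

// Stand-in step: emits a constant probability and counts the chunks it has seen.
let streamedProbs = runStreaming(totalFrames: 1200, initialState: 0) { range, chunksSeen in
    (probs: [Float](repeating: 0.5, count: range.count), state: chunksSeen + 1)
}
print(streamedProbs.count)  // 1200
```

Keeping the state opaque to the loop means the chunking logic can be tested separately from the model call.
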
For a complete implementation with feature extraction, CMVN, and a speech state machine, see [FireRedASRKit](https://github.com/leaker/firered_asr).

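To give an idea of what a speech state machine does with the per-frame probabilities, here is a minimal sketch of one with hysteresis. It is not the FireRedASRKit implementation. The type, the threshold, and the frame counts are hypothetical values chosen for illustration.

```swift
/// Hypothetical two-state smoother: the state changes only after enough
/// consecutive frames disagree with the current state.
struct SpeechStateMachine {
    enum State { case silence, speech }

    let threshold: Float        // assumed decision threshold on the speech probability
    let minSpeechFrames: Int    // consecutive speech frames needed to enter .speech
    let minSilenceFrames: Int   // consecutive silence frames needed to return to .silence

    private(set) var state: State = .silence
    private var run = 0         // consecutive frames that contradict the current state

    init(threshold: Float = 0.5, minSpeechFrames: Int = 2, minSilenceFrames: Int = 3) {
        self.threshold = threshold
        self.minSpeechFrames = minSpeechFrames
        self.minSilenceFrames = minSilenceFrames
    }

    mutating func update(probability: Float) -> State {
        let isSpeech = probability >= threshold
        // A frame contradicts the state when it is speech during silence, or silence during speech.
        let contradicts = (state == .silence) == isSpeech
        run = contradicts ? run + 1 : 0
        let needed = (state == .silence) ? minSpeechFrames : minSilenceFrames
        if run >= needed {
            state = (state == .silence) ? .speech : .silence
            run = 0
        }
        return state
    }
}

var vad = SpeechStateMachine()
let frameProbs: [Float] = [0.9, 0.1, 0.9, 0.9, 0.2, 0.2, 0.9, 0.1, 0.1, 0.1]
let states = frameProbs.map { vad.update(probability: $0) }
print(states)
```

With these settings a single high-probability frame does not open a speech segment, and a short dip below the threshold does not close one.
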
## References

- [FireRedVAD (Original Model)](https://huggingface.co/FireRedTeam/FireRedVAD)
- [FireRedASR2S GitHub](https://github.com/FireRedTeam/FireRedASR2S)
- [FireRedASR Paper (arXiv:2501.14350)](https://arxiv.org/abs/2501.14350)
- [DFSMN Paper (arXiv:1803.05030)](https://arxiv.org/abs/1803.05030)

## License

Apache 2.0, following the original [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) license.