illitan/FireRedVAD-CoreML

+---
+license: apache-2.0
+base_model: FireRedTeam/FireRedVAD
+tags:
+  - voice-activity-detection
+  - vad
+  - coreml
+  - apple
+  - ios
+  - macos
+  - streaming
+  - real-time
+  - dfsmn
+  - firered
+pipeline_tag: voice-activity-detection
+library_name: coremltools
+language:
+  - multilingual
+---
+# FireRedVAD-CoreML
+Core ML conversion of the Stream-VAD variant of [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+). The weights are converted from the original PyTorch model published by [FireRedTeam](https://huggingface.co/FireRedTeam/FireRedVAD).
+## Model Description
+- **Original model:** FireRedVAD by Xiaohongshu (小红书) FireRedTeam
+- **Architecture:** DFSMN (Deep Feedforward Sequential Memory Network) — 8 DFSMN blocks + 1 DNN layer
+- **Variant:** Stream-VAD (causal, lookahead=0), suitable for real-time streaming
+- **Parameters:** ~568K (extremely lightweight)
+- **Model size:** 2.2 MB (FP32)
+- **Input:** 80-dim log-Mel filterbank features (16 kHz audio, 25 ms frame, 10 ms shift)
+- **Output:** Speech probability in [0, 1] for each frame
+- **Language support:** 100+ languages, 20+ Chinese dialects
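+
+As a rough guide to how much audio the features cover, the 25 ms frame and 10 ms shift at 16 kHz correspond to 400 and 160 samples. The sketch below counts the frames produced from a given number of samples, assuming frames are taken only where a full window fits (no edge padding); the actual feature extractor may pad differently. The function name `fbankFrameCount` is illustrative and not part of the model or of any library.
+```swift
+/// Number of feature frames for `sampleCount` samples of 16 kHz audio,
+/// assuming a 25 ms window (400 samples), a 10 ms shift (160 samples),
+/// and no padding at the edges (an assumption, not taken from the model).
+func fbankFrameCount(sampleCount: Int,
+                     windowSamples: Int = 400,
+                     shiftSamples: Int = 160) -> Int {
+    guard sampleCount >= windowSamples else { return 0 }
+    return 1 + (sampleCount - windowSamples) / shiftSamples
+}
+
+// One second of audio: 1 + (16000 - 400) / 160 = 98 frames
+let framesPerSecond = fbankFrameCount(sampleCount: 16_000)
+```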
+## Performance
+Results from the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):
+| Metric | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
+|--------|-----------|-----------|---------|-----------|-----------|
+| AUC-ROC (%) | **99.60** | 97.99 | 97.81 | - | - |
+| F1 score (%) | **97.57** | 95.95 | 95.19 | 90.91 | 52.30 |
+| False alarm rate (%) | **2.69** | 9.41 | 15.47 | 44.03 | 2.83 |
+| Miss rate (%) | 3.62 | 3.95 | 2.95 | 0.42 | 64.15 |
+## Core ML Model Specification
+### Inputs
+| Name | Shape | Type | Description |
+|------|-------|------|-------------|
+| `feat` | `[1, 1..512, 80]` | Float32 | Log-Mel filterbank features (dynamic time axis) |
+| `cache_0` to `cache_7` | `[1, 128, 19]` | Float32 | FSMN lookback cache, one for each of the 8 layers |
+### Outputs
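+| Name | Shape | Type | Description |
+|------|-------|------|-------------|
+| `probs` | `[1, T, 1]` | Float32 | Speech probability for each of the `T` input frames |
+| `new_cache_0` to `new_cache_7` | `[1, 128, 19]` | Float32 | Updated lookback caches, fed back as `cache_0` to `cache_7` on the next call |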
+- **Minimum deployment target:** iOS 16 / macOS 13
+- **Compute units:** CPU + Neural Engine
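+
+Because the time axis of `feat` accepts between 1 and 512 frames, a longer feature sequence has to be split before prediction. A minimal sketch of the splitting step, using a hypothetical helper `chunkRanges` that is not part of the generated model class:
+```swift
+/// Splits `totalFrames` feature frames into consecutive ranges of at most
+/// `maxFrames` frames each, matching the model's 1...512 dynamic time axis.
+func chunkRanges(totalFrames: Int, maxFrames: Int = 512) -> [Range<Int>] {
+    precondition(maxFrames > 0, "maxFrames must be positive")
+    guard totalFrames > 0 else { return [] }
+    return stride(from: 0, to: totalFrames, by: maxFrames).map { start in
+        start..<min(start + maxFrames, totalFrames)
+    }
+}
+
+// 1200 frames are covered by the ranges 0..<512, 512..<1024 and 1024..<1200
+let ranges = chunkRanges(totalFrames: 1200)
+```
+Each range selects the feature rows to pass as `feat`; the `new_cache_*` outputs of one chunk become the `cache_*` inputs of the next.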
+## Conversion
+Converted from PyTorch using [coremltools](https://github.com/apple/coremltools) via the export script in [FireRedASR2S](https://github.com/FireRedTeam/FireRedASR2S). The Stream-VAD variant was chosen because it is causal (no lookahead): each frame's probability depends only on past audio, which suits real-time streaming applications.
+## Usage
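+```swift
+import CoreML
+
+// Load the model (assumes the Xcode-generated class for a model file named FireRedVAD)
+let model = try FireRedVAD(configuration: .init())
+
+// Initialize one lookback cache per layer (8 x [1, 128, 19]).
+// MLMultiArray does not promise zeroed memory, so clear each buffer explicitly.
+var caches: [MLMultiArray] = try (0..<8).map { _ -> MLMultiArray in
+    let cache = try MLMultiArray(shape: [1, 128, 19], dataType: .float32)
+    for i in 0..<cache.count { cache[i] = 0 }
+    return cache
+}
+
+// Run one chunk of T feature frames (1 <= T <= 512).
+// `fbankFeatures` is an MLMultiArray of shape [1, T, 80] prepared by the caller.
+let input = FireRedVADInput(
+    feat: fbankFeatures,       // [1, T, 80]
+    cache_0: caches[0], cache_1: caches[1],
+    cache_2: caches[2], cache_3: caches[3],
+    cache_4: caches[4], cache_5: caches[5],
+    cache_6: caches[6], cache_7: caches[7]
+)
+let output = try model.prediction(input: input)
+let speechProb = output.probs  // [1, T, 1]
+
+// Carry the updated caches over to the next chunk
+caches = [
+    output.new_cache_0, output.new_cache_1,
+    output.new_cache_2, output.new_cache_3,
+    output.new_cache_4, output.new_cache_5,
+    output.new_cache_6, output.new_cache_7
+]
+```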
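+`probs` comes back as an `MLMultiArray`, so the per-frame values still have to be read out. One way to do that, sketched with a hypothetical helper `frameProbabilities` and assuming the `[1, T, 1]` output shape listed in the specification:
+```swift
+import CoreML
+
+/// Copies the per-frame speech probabilities out of a `[1, T, 1]` array.
+func frameProbabilities(from probs: MLMultiArray) -> [Float] {
+    precondition(probs.shape.count == 3, "expected shape [1, T, 1]")
+    let frameCount = probs.shape[1].intValue
+    return (0..<frameCount).map { t in
+        // Index as [batch, frame, channel]
+        probs[[0, NSNumber(value: t), 0]].floatValue
+    }
+}
+```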
+For a complete implementation with feature extraction, CMVN (cepstral mean and variance normalization), and a speech state machine, see [FireRedASRKit](https://github.com/leaker/firered_asr).
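+
+A speech state machine typically turns the per-frame probabilities into speech segments by applying a threshold with hysteresis. The sketch below is one simple way to do this and is not taken from FireRedASRKit; the type `SpeechStateMachine`, the threshold of 0.5, and the frame counts are illustrative assumptions to be tuned for the application.
+```swift
+/// Minimal hysteresis-based speech/silence tracker (illustrative only).
+struct SpeechStateMachine {
+    enum Event: Equatable { case speechStart(frame: Int), speechEnd(frame: Int) }
+
+    var threshold: Float = 0.5      // assumed decision threshold
+    var minSpeechFrames = 10        // about 100 ms of speech needed to start a segment
+    var minSilenceFrames = 30       // about 300 ms of silence needed to end a segment
+
+    private(set) var isSpeaking = false
+    private var runLength = 0       // consecutive frames contradicting the current state
+    private var frameIndex = 0
+
+    /// Feeds one frame probability; returns an event when the state flips.
+    mutating func process(_ probability: Float) -> Event? {
+        defer { frameIndex += 1 }
+        let frameIsSpeech = probability >= threshold
+        // Count only frames that disagree with the current state
+        runLength = (frameIsSpeech != isSpeaking) ? runLength + 1 : 0
+        let needed = isSpeaking ? minSilenceFrames : minSpeechFrames
+        guard runLength >= needed else { return nil }
+        // Report the frame where the contradicting run began
+        let boundary = frameIndex - runLength + 1
+        isSpeaking.toggle()
+        runLength = 0
+        return isSpeaking ? .speechStart(frame: boundary) : .speechEnd(frame: boundary)
+    }
+}
+
+// Example: 5 silent frames followed by 10 speech frames
+var tracker = SpeechStateMachine()
+let probabilities = [Float](repeating: 0.1, count: 5) + [Float](repeating: 0.9, count: 10)
+for p in probabilities {
+    if let event = tracker.process(p) { print(event) }
+}
+```
+Requiring a run of frames before switching state keeps isolated spikes or dips in the probability from producing spurious segment boundaries, at the cost of reporting each boundary a few frames late.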
+## References
+- [FireRedVAD (Original Model)](https://huggingface.co/FireRedTeam/FireRedVAD)
+- [FireRedASR2S GitHub](https://github.com/FireRedTeam/FireRedASR2S)
+- [FireRedASR Paper (arXiv:2501.14350)](https://arxiv.org/abs/2501.14350)
+- [DFSMN Paper (arXiv:1803.05030)](https://arxiv.org/abs/1803.05030)
+## License
+Apache 2.0, following the original [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) license.