illitan committed on
Commit
df8717e
·
verified ·
1 Parent(s): 637a401

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +114 -0
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ license: apache-2.0
+ base_model: FireRedTeam/FireRedVAD
+ tags:
+ - voice-activity-detection
+ - vad
+ - coreml
+ - apple
+ - ios
+ - macos
+ - streaming
+ - real-time
+ - dfsmn
+ - firered
+ pipeline_tag: voice-activity-detection
+ library_name: coremltools
+ language:
+ - multilingual
+ ---
+
+ # FireRedVAD-CoreML
+
+ Core ML conversion of the [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) Stream-VAD model for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+), converted from the original PyTorch model published by FireRedTeam.
+
+ ## Model Description
+
+ - **Original model:** FireRedVAD by Xiaohongshu (小红书) FireRedTeam
+ - **Architecture:** DFSMN (Deep Feedforward Sequential Memory Network): 8 DFSMN blocks + 1 DNN layer
+ - **Variant:** Stream-VAD (causal, lookahead = 0), suitable for real-time streaming
+ - **Parameters:** ~568K
+ - **Model size:** 2.2 MB (FP32)
+ - **Input:** 80-dim log-Mel filterbank features (16 kHz audio, 25 ms frame length, 10 ms frame shift)
+ - **Output:** One speech probability in [0, 1] per frame
+ - **Language support:** 100+ languages, 20+ Chinese dialects
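The framing parameters in the list above fix how many feature frames, and therefore how many probabilities, a given amount of audio produces. The following is a minimal sketch of that arithmetic. It assumes no padding, so a trailing partial window is dropped; the actual feature extractor may pad differently. The function name `frameCount` is hypothetical.

```swift
// Framing constants from the model card: 16 kHz audio, 25 ms window, 10 ms shift.
let sampleRate = 16_000
let frameLength = sampleRate * 25 / 1000  // 400 samples per window
let frameShift = sampleRate * 10 / 1000   // 160 samples between window starts

/// Number of full windows that fit into `sampleCount` samples.
/// Assumption: no padding, so a trailing partial window is dropped.
func frameCount(forSampleCount sampleCount: Int) -> Int {
    guard sampleCount >= frameLength else { return 0 }
    return 1 + (sampleCount - frameLength) / frameShift
}

// One second of audio: 1 + (16000 - 400) / 160 = 98 frames.
print(frameCount(forSampleCount: sampleRate))  // 98
```

Under this assumption, each probability in the output corresponds to a 10 ms step of the input audio.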
+
+ ## Performance
+
+ Results on the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):
+
+ | Metric | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
+ |--------|-----------|-----------|---------|-----------|-----------|
+ | AUC-ROC (%) | **99.60** | 97.99 | 97.81 | - | - |
+ | F1 Score (%) | **97.57** | 95.95 | 95.19 | 90.91 | 52.30 |
+ | False Alarm Rate | **2.69%** | 9.41% | 15.47% | 44.03% | 2.83% |
+ | Miss Rate | 3.62% | 3.95% | 2.95% | 0.42% | 64.15% |
+
+ ## Core ML Model Specification
+
+ ### Inputs
+
+ | Name | Shape | Type | Description |
+ |------|-------|------|-------------|
+ | `feat` | `[1, 1..512, 80]` | Float32 | Log-Mel filterbank features; the time axis is dynamic (1 to 512 frames) |
+ | `cache_0` ~ `cache_7` | `[1, 128, 19]` | Float32 | FSMN lookback cache, one per DFSMN block |
+
+ ### Outputs
+
+ | Name | Shape | Type | Description |
+ |------|-------|------|-------------|
+ | `probs` | `[1, T, 1]` | Float32 | Speech probability for each of the `T` input frames |
+ | `new_cache_0` ~ `new_cache_7` | `[1, 128, 19]` | Float32 | Updated lookback cache, passed back as `cache_0` ~ `cache_7` on the next call |
+
+ - **Minimum deployment target:** iOS 16 / macOS 13
+ - **Compute units:** CPU + Neural Engine
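Because the time axis of `feat` is limited to 512 frames per call, longer feature sequences have to be split into chunks, with the caches carried from one call to the next. The sketch below shows only the chunk bookkeeping and the cache memory footprint implied by the shapes above. The helper name `chunkRanges` is hypothetical and not part of the model or of Core ML.

```swift
// Upper bound of the dynamic time axis of `feat`, from the input table.
let maxFramesPerCall = 512

/// Splits `totalFrames` into consecutive half-open ranges of at most `maxFrames` frames.
/// Hypothetical helper for illustration only.
func chunkRanges(totalFrames: Int, maxFrames: Int) -> [Range<Int>] {
    guard totalFrames > 0, maxFrames > 0 else { return [] }
    return stride(from: 0, to: totalFrames, by: maxFrames).map { start in
        start..<min(start + maxFrames, totalFrames)
    }
}

// 1200 frames (about 12 s of audio at a 10 ms shift) need three calls: 512 + 512 + 176.
let chunks = chunkRanges(totalFrames: 1200, maxFrames: maxFramesPerCall)
print(chunks.count)        // 3
print(chunks.last!.count)  // 176

// Streaming state: 8 caches of shape [1, 128, 19] stored as Float32.
let cacheBytes = 8 * 1 * 128 * 19 * MemoryLayout<Float>.size
print(cacheBytes)          // 77824
```

The state that must persist between calls is therefore about 76 KiB, independent of how long the stream runs.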
65
+
66
+ ## Conversion
67
+
68
+ Converted from PyTorch using [coremltools](https://github.com/apple/coremltools) via the export script in [FireRedASR2S](https://github.com/FireRedTeam/FireRedASR2S). The Stream-VAD variant was selected for its causal (no lookahead) property, making it suitable for real-time streaming applications.
69
+
70
+ ## Usage
71
+
72
+ ```swift
73
+ import CoreML
74
+
75
+ // Load model
76
+ let model = try FireRedVAD(configuration: .init())
77
+
78
+ // Initialize caches (8 layers x [1, 128, 19])
79
+ var caches = (0..<8).map { _ in
80
+ try! MLMultiArray(shape: [1, 128, 19], dataType: .float32)
81
+ }
82
+
83
+ // Process audio frame by frame
84
+ let input = FireRedVADInput(
85
+ feat: fbankFeatures, // [1, T, 80]
86
+ cache_0: caches[0], cache_1: caches[1],
87
+ cache_2: caches[2], cache_3: caches[3],
88
+ cache_4: caches[4], cache_5: caches[5],
89
+ cache_6: caches[6], cache_7: caches[7]
90
+ )
91
+ let output = try model.prediction(input: input)
92
+ let speechProb = output.probs // [1, T, 1]
93
+
94
+ // Update caches for next frame
95
+ caches = [
96
+ output.new_cache_0, output.new_cache_1,
97
+ output.new_cache_2, output.new_cache_3,
98
+ output.new_cache_4, output.new_cache_5,
99
+ output.new_cache_6, output.new_cache_7
100
+ ]
101
+ ```
102
+
103
+ For a complete implementation with feature extraction, CMVN normalization, and speech state machine, see [FireRedASRKit](https://github.com/leaker/firered_asr).
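To illustrate what a speech state machine does with the per-frame probabilities, here is a minimal offline sketch that turns a probability sequence into speech segments. It is not the FireRedASRKit implementation. The function name `speechSegments`, the threshold of 0.5, and the minimum speech and silence lengths are all hypothetical values chosen for illustration and would need tuning on real audio.

```swift
/// Converts per-frame speech probabilities into half-open ranges of frame indices.
/// All parameters are illustrative assumptions, not values shipped with the model.
func speechSegments(
    probs: [Float],
    threshold: Float,
    minSpeechFrames: Int,
    minSilenceFrames: Int
) -> [Range<Int>] {
    var segments: [Range<Int>] = []
    var start: Int? = nil  // first frame of the currently open segment, if any
    var silenceRun = 0     // consecutive below-threshold frames inside the open segment

    for (i, p) in probs.enumerated() {
        if p >= threshold {
            if start == nil { start = i }
            silenceRun = 0
        } else if let s = start {
            silenceRun += 1
            // Close the segment only after enough consecutive silent frames.
            if silenceRun >= minSilenceFrames {
                let end = i - silenceRun + 1  // index of the first silent frame
                if end - s >= minSpeechFrames { segments.append(s..<end) }
                start = nil
                silenceRun = 0
            }
        }
    }
    // Flush a segment that is still open when the input ends.
    if let s = start {
        let end = probs.count - silenceRun
        if end - s >= minSpeechFrames { segments.append(s..<end) }
    }
    return segments
}

let demoProbs: [Float] = [0.1, 0.9, 0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1]
let demoSegments = speechSegments(
    probs: demoProbs, threshold: 0.5, minSpeechFrames: 2, minSilenceFrames: 3
)
for segment in demoSegments {
    // Frame offsets follow the 10 ms frame shift.
    print("speech from \(segment.lowerBound * 10) ms to \(segment.upperBound * 10) ms")
}
// prints "speech from 10 ms to 60 ms"
```

The single low-probability frame at index 4 is bridged because it is shorter than `minSilenceFrames`. A streaming version would keep `start` and `silenceRun` as state across model calls instead of processing one array.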
104
+
105
+ ## References
106
+
107
+ - [FireRedVAD (Original Model)](https://huggingface.co/FireRedTeam/FireRedVAD)
108
+ - [FireRedASR2S GitHub](https://github.com/FireRedTeam/FireRedASR2S)
109
+ - [FireRedASR Paper (arXiv:2501.14350)](https://arxiv.org/abs/2501.14350)
110
+ - [DFSMN Paper (arXiv:1803.05030)](https://arxiv.org/abs/1803.05030)
111
+
112
+ ## License
113
+
114
+ Apache 2.0, following the original [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) license.