---
license: mit
tags:
- audio
- voice-activity-detection
- coreml
- silero
- speech
- ios
- macos
- swift
library_name: coreml
pipeline_tag: audio-classification
---

# CoreML Silero VAD

A CoreML implementation of the Silero Voice Activity Detection (VAD) model, optimized for Apple platforms (iOS/macOS). This repository contains pre-converted CoreML models ready for use in Swift applications.

## Model Description

**Developed by:** Silero Team (original), converted by FluidAudio
**Model type:** Voice Activity Detection
**License:** MIT
**Parent Model:** [silero-vad](https://github.com/snakers4/silero-vad)

### Model Details

- **Architecture:** STFT + Encoder + RNN Decoder pipeline
- **Input:** 16kHz mono audio chunks (512 samples / 32ms)
- **Output:** Voice activity probability (0.0-1.0)
- **Memory:** ~2MB total model size
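
Each chunk covers 512 / 16000 = 0.032 s, i.e. 32ms of audio, so longer buffers must be split into fixed 512-sample frames before inference. A minimal sketch of that chunking in plain Swift (no FluidAudio types involved; zero-padding the final partial frame is an illustrative choice, not necessarily what FluidAudio itself does):

```swift
// Split a buffer of 16kHz mono samples into fixed 512-sample frames,
// zero-padding the final partial frame so every frame has the same length.
func chunkSamples(_ samples: [Float], frameSize: Int = 512) -> [[Float]] {
    var frames: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + frameSize, samples.count)
        var frame = Array(samples[start..<end])
        if frame.count < frameSize {
            // Pad the tail with silence to reach exactly frameSize samples.
            frame.append(contentsOf: [Float](repeating: 0, count: frameSize - frame.count))
        }
        frames.append(frame)
        start += frameSize
    }
    return frames
}

// One second of 16kHz audio yields ceil(16000 / 512) = 32 frames.
let frames = chunkSamples([Float](repeating: 0.1, count: 16_000))
print(frames.count)        // 32
print(frames.last!.count)  // 512
```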

## Intended Use

### Primary Use Cases

- Real-time voice activity detection in iOS/macOS applications
- Speech preprocessing for ASR systems
- Audio segmentation and filtering

## How to Use

### Swift Integration

```swift
import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512,   // 512 is the optimal chunk size for this model
    sampleRate: 16000
)

let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process an audio chunk and read the result
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
```
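
Raw per-frame probabilities can flicker at word boundaries. A common post-processing step is hangover smoothing: once voice is detected, keep the frame flagged active for a few extra frames. A minimal sketch in plain Swift, independent of the FluidAudio API (the threshold and hangover length here are illustrative assumptions):

```swift
// Smooth per-frame VAD probabilities: a frame counts as "active" if its
// probability exceeds the threshold, or if voice was detected within the
// last `hangover` frames.
func smoothVAD(_ probabilities: [Float], threshold: Float = 0.3, hangover: Int = 3) -> [Bool] {
    var active: [Bool] = []
    var framesSinceVoice = Int.max  // "never seen voice" sentinel
    for p in probabilities {
        if p >= threshold {
            framesSinceVoice = 0
        } else if framesSinceVoice != Int.max {
            framesSinceVoice += 1
        }
        active.append(framesSinceVoice <= hangover)
    }
    return active
}

let probs: [Float] = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1]
print(smoothVAD(probs))  // [true, true, true, true, true, false, false]
```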

## Installation

Add FluidAudio to your Swift project:

```swift
dependencies: [
    .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
]
```
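
In context, a complete Package.swift manifest might look like the following sketch (the package name, platform minimums, and the `FluidAudio` product name are assumptions for illustration, not taken from the FluidAudioSwift repository):

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyVADApp",                       // placeholder package name
    platforms: [.iOS(.v16), .macOS(.v13)],  // assumed minimum platforms
    dependencies: [
        // The dependency shown in the README above.
        .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyVADApp",
            dependencies: [.product(name: "FluidAudio", package: "FluidAudioSwift")]
        )
    ]
)
```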

## Performance

Benchmarks on Apple Silicon (M1/M2):

| Metric           | Value               |
|------------------|---------------------|
| Latency          | <2ms per 32ms chunk |
| Real-time Factor | 0.02x               |
| Memory Usage     | ~15MB               |
| CPU Usage        | <5% (single core)   |
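
Real-time factor is processing time divided by audio duration, so 0.02x means one second of audio is processed in roughly 20ms. The latency row is a worst-case bound, which implies a higher RTF ceiling; a quick sanity check in plain Swift:

```swift
// Real-time factor (RTF) = time spent processing / duration of audio processed.
let chunkDuration = 512.0 / 16_000.0    // each chunk carries 0.032 s of audio
let worstCaseLatency = 0.002            // <2ms per chunk, from the table above
let worstCaseRTF = worstCaseLatency / chunkDuration  // ≈ 0.0625

// The reported 0.02x RTF therefore implies a typical per-chunk latency
// of about 0.02 * 32ms = 0.64ms, well under the 2ms bound.
let typicalLatency = 0.02 * chunkDuration
print(worstCaseRTF, typicalLatency)
```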

### Accuracy Metrics

Evaluated on common speech datasets:

- Precision: 94.2%
- Recall: 92.8%
- F1-Score: 93.5%
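
F1 is the harmonic mean of precision and recall, F1 = 2·P·R / (P + R), and the three figures above are mutually consistent, as a quick check shows:

```swift
// F1 score = harmonic mean of precision and recall.
let precision = 0.942
let recall = 0.928
let f1 = 2 * precision * recall / (precision + recall)
print(f1)  // ≈ 0.9349, i.e. 93.5% after rounding
```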

## Model Files

This repository contains three CoreML models that work together:

- silero_stft.mlmodel (650KB) - STFT feature extraction
- silero_encoder.mlmodel (254KB) - Feature encoding
- silero_rnn_decoder.mlmodel (527KB) - RNN-based classification

## Training Data

The original Silero VAD model was trained on a diverse dataset including:

- Clean speech audio
- Noisy speech with various background conditions
- Music and non-speech audio for negative samples

## Limitations and Bias

### Known Limitations

- Optimized for a 16kHz sample rate (other rates may reduce accuracy)
- May struggle with very quiet speech (below -30dB SNR)
- Performance varies with microphone quality and recording conditions

## Technical Details

### Model Architecture

```
Audio Input (512 samples, 16kHz)
        ↓
STFT Model (spectral features)
        ↓
Encoder Model (feature compression)
        ↓
RNN Decoder (temporal modeling)
        ↓
Voice Probability Output
```

## Citation

```bibtex
@misc{silero-vad-coreml,
  title={CoreML Silero VAD},
  author={FluidAudio Team},
  year={2024},
  url={https://huggingface.co/alexwengg/coreml-silero-vad}
}

@misc{silero-vad,
  title={Silero VAD},
  author={Silero Team},
  year={2021},
  url={https://github.com/snakers4/silero-vad}
}
```

## Related Models

Check out other CoreML audio models in the [CoreML collection](https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9):

- [coreml_speaker_diarization](https://huggingface.co/alexwengg/coreml_speaker_diarization) - Identify "who spoke when"
- [CoreML collection](https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9) - Speech-to-text for Apple platforms

## Repository and Support

- GitHub: https://github.com/FluidAudio/FluidAudioSwift
- Documentation: https://github.com/FluidAudio/FluidAudioSwift/wiki
- Issues: https://github.com/FluidAudio/FluidAudioSwift/issues
- Community: https://github.com/FluidAudio/FluidAudioSwift/discussions

## License

This project is licensed under the MIT License - see the LICENSE file for details.

The original Silero VAD model is also under the MIT license. See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.