---
license: mit
tags:
- audio
- voice-activity-detection
- coreml
- silero
- speech
- ios
- macos
- swift
library_name: coreml
pipeline_tag: voice-activity-detection
datasets:
- alexwengg/musan_mini50
- alexwengg/musan_mini100
metrics:
- accuracy
- f1
language:
- en
base_model:
- onnx-community/silero-vad
---

# CoreML Silero VAD

A CoreML implementation of the Silero Voice Activity Detection (VAD) model, optimized for Apple platforms (iOS/macOS). This repository contains pre-converted CoreML models ready for use in Swift applications.

## Model Description

**Developed by:** Silero Team (original), converted by FluidAudio

**Model type:** Voice Activity Detection

**License:** MIT

**Parent Model:** [silero-vad](https://github.com/snakers4/silero-vad)

### Model Details

- **Architecture:** STFT + Encoder + RNN Decoder pipeline
- **Input:** 16 kHz mono audio chunks (512 samples / 32 ms)
- **Output:** Voice activity probability (0.0-1.0)
- **Model size:** ~2 MB total

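
The input figures above follow from simple arithmetic: 512 samples at 16 kHz last 512 / 16000 = 0.032 s, or 32 ms. The sketch below shows one way to slice a longer buffer into model-sized chunks. The helper names `chunkDurationMs` and `chunked` are hypothetical and not part of FluidAudio, and zero-padding the final partial chunk is an assumption rather than documented behavior.

```swift
import Foundation

/// Duration of one chunk in milliseconds.
func chunkDurationMs(chunkSize: Int, sampleRate: Int) -> Double {
    Double(chunkSize) / Double(sampleRate) * 1000.0
}

/// Splits `samples` into fixed-size chunks (hypothetical helper).
func chunked(_ samples: [Float], chunkSize: Int = 512) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    return stride(from: 0, to: samples.count, by: chunkSize).map { start in
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        // Assumption: pad the tail with silence so every chunk has chunkSize samples.
        chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        return chunk
    }
}
```

Each returned chunk can then be handed to the detector one at a time, in order.
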
## Intended Use

### Primary Use Cases

- Real-time voice activity detection in iOS/macOS applications
- Speech preprocessing for ASR systems
- Audio segmentation and filtering

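
For the segmentation use case, per-chunk probabilities have to be turned into time ranges. The following is a minimal sketch of one way to do that, not FluidAudio's own logic: `SpeechSegment` and `speechSegments` are hypothetical names, and treating a chunk as speech when its probability is at or above the threshold is an assumption.

```swift
import Foundation

/// A contiguous stretch of detected speech, in seconds (hypothetical type).
struct SpeechSegment: Equatable {
    let start: Double
    let end: Double
}

/// Groups consecutive chunks whose probability is at or above `threshold`.
func speechSegments(
    probabilities: [Float],
    threshold: Float = 0.3,
    chunkDuration: Double = 0.032
) -> [SpeechSegment] {
    var segments: [SpeechSegment] = []
    var startIndex: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        let isSpeech = probability >= threshold
        if isSpeech, startIndex == nil {
            startIndex = index  // speech begins at this chunk
        } else if !isSpeech, let start = startIndex {
            // Speech ended at the previous chunk; close the segment.
            segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                          end: Double(index) * chunkDuration))
            startIndex = nil
        }
    }
    // Close a segment that runs to the end of the input.
    if let start = startIndex {
        segments.append(SpeechSegment(start: Double(start) * chunkDuration,
                                      end: Double(probabilities.count) * chunkDuration))
    }
    return segments
}
```

A production detector would usually add smoothing or a minimum segment length so that a single noisy chunk does not split or create a segment.
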
## How to Use

### Swift Integration

```swift
import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512,      // 512 samples is the recommended chunk size
    sampleRate: 16000
)

let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process one audio chunk (audioChunk holds 512 samples of 16 kHz mono audio)
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
```

### Installation

Add FluidAudio to your Swift project:

```swift
dependencies: [
    .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
]
```

## Performance

### Benchmarks on Apple Silicon (M1/M2)

| Metric           | Value               |
|------------------|---------------------|
| Latency          | <2ms per 32ms chunk |
| Real-time Factor | 0.02x               |
| Memory Usage     | ~15MB               |
| CPU Usage        | <5% (single core)   |

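
The real-time factor is processing time divided by the duration of the audio processed, so 0.02x on a 32 ms chunk corresponds to about 0.64 ms of compute, which sits inside the <2ms latency bound. The sketch below shows how the figure could be reproduced; `realTimeFactor` and `measureSeconds` are hypothetical helpers, not part of FluidAudio.

```swift
import Foundation

/// Real-time factor: processing time divided by audio duration (lower is faster).
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audioSeconds must be positive")
    return processingSeconds / audioSeconds
}

/// Times a closure and returns the elapsed wall-clock seconds.
func measureSeconds(_ work: () -> Void) -> Double {
    let start = Date()
    work()
    return Date().timeIntervalSince(start)
}

// Example: 0.64 ms of processing for one 32 ms chunk.
let rtf = realTimeFactor(processingSeconds: 0.00064, audioSeconds: 0.032)
print(String(format: "RTF: %.2fx", rtf))
```

When measuring on a device, averaging over many chunks gives a steadier number than timing a single call.
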
### Accuracy Metrics

Evaluated on common speech datasets:

- Precision: 94.2%
- Recall: 92.8%
- F1-Score: 93.5%

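
The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check on the figures above, a small sketch (the function name `f1Score` is illustrative):

```swift
import Foundation

/// Harmonic mean of precision and recall; returns 0 when both are 0.
func f1Score(precision: Double, recall: Double) -> Double {
    let sum = precision + recall
    return sum > 0 ? 2 * precision * recall / sum : 0
}

let f1 = f1Score(precision: 0.942, recall: 0.928)
print(String(format: "F1: %.1f%%", f1 * 100))  // prints "F1: 93.5%"
```
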
## Model Files

This repository contains three CoreML models that work together:

- `silero_stft.mlmodel` (650 KB) - STFT feature extraction
- `silero_encoder.mlmodel` (254 KB) - Feature encoding
- `silero_rnn_decoder.mlmodel` (527 KB) - RNN-based classification

## Training Data

The original Silero VAD model was trained on a diverse dataset including:

- Clean speech audio
- Noisy speech with various background conditions
- Music and non-speech audio for negative samples

## Limitations and Bias

### Known Limitations

- Optimized for 16 kHz sample rate (other rates may reduce accuracy)
- May struggle with very quiet speech (below -30 dB SNR)
- Performance varies with microphone quality and recording conditions

## Technical Details

### Model Architecture

```
Audio Input (512 samples, 16kHz)
        ↓
STFT Model (spectral features)
        ↓
Encoder Model (feature compression)
        ↓
RNN Decoder (temporal modeling)
        ↓
Voice Probability Output
```

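
The data flow above can be sketched as three chained stages. This is an illustration of the structure only, not the FluidAudio implementation: `VADPipelineSketch` is a hypothetical type, each closure stands in for one of the CoreML models, and carrying recurrent state from one chunk to the next is an assumption based on the decoder being an RNN.

```swift
import Foundation

/// Hypothetical stand-in for the three-stage pipeline.
struct VADPipelineSketch {
    typealias Features = [Float]
    typealias State = [Float]

    let stft: ([Float]) -> Features
    let encoder: (Features) -> Features
    let decoder: (Features, State) -> (probability: Float, state: State)

    /// Recurrent state carried between chunks (assumed, see lead-in).
    private(set) var state: State = []

    init(
        stft: @escaping ([Float]) -> Features,
        encoder: @escaping (Features) -> Features,
        decoder: @escaping (Features, State) -> (probability: Float, state: State)
    ) {
        self.stft = stft
        self.encoder = encoder
        self.decoder = decoder
    }

    /// Runs one chunk through STFT, encoder, and decoder in order.
    mutating func process(_ chunk: [Float]) -> Float {
        let spectral = stft(chunk)              // spectral features
        let encoded = encoder(spectral)         // feature compression
        let output = decoder(encoded, state)    // temporal modeling
        state = output.state                    // keep state for the next chunk
        return output.probability
    }
}
```

Because the decoder depends on earlier chunks under this assumption, chunks should be processed in order and the state reset between unrelated recordings.
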
## Citation

```bibtex
@misc{silero-vad-coreml,
  title={CoreML Silero VAD},
  author={FluidAudio Team},
  year={2024},
  url={https://huggingface.co/alexwengg/coreml-silero-vad}
}

@misc{silero-vad,
  title={Silero VAD},
  author={Silero Team},
  year={2021},
  url={https://github.com/snakers4/silero-vad}
}
```

## Related Models

Check out other CoreML audio models in the collection at <https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9>:

- https://huggingface.co/alexwengg/coreml_speaker_diarization - Identify "who spoke when"
- https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9 - Speech-to-text for Apple platforms

## Repository and Support

- GitHub: https://github.com/FluidAudio/FluidAudioSwift
- Documentation: https://github.com/FluidAudio/FluidAudioSwift/wiki
- Issues: https://github.com/FluidAudio/FluidAudioSwift/issues
- Community: https://github.com/FluidAudio/FluidAudioSwift/discussions

## License

This project is licensed under the MIT License - see the LICENSE file for details.

The original Silero VAD model is also under the MIT license. See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.