metadata
license: mit
tags:
  - audio
  - voice-activity-detection
  - coreml
  - silero
  - speech
  - ios
  - macos
  - swift
library_name: coreml
pipeline_tag: voice-activity-detection
datasets:
  - alexwengg/musan_mini50
  - alexwengg/musan_mini100
metrics:
  - accuracy
  - f1
language:
  - en
base_model:
  - onnx-community/silero-vad

CoreML Silero VAD

A CoreML implementation of the Silero Voice Activity Detection (VAD) model, optimized for Apple platforms (iOS/macOS). This repository contains pre-converted CoreML models ready for use in Swift applications.

Model Description

Developed by: Silero Team (original), converted by FluidAudio

Model type: Voice Activity Detection

License: MIT

Parent Model: silero-vad

Model Details

  • Architecture: STFT + Encoder + RNN Decoder pipeline
  • Input: 16kHz mono audio chunks (512 samples / 32ms)
  • Output: Voice activity probability (0.0-1.0)
  • Model size: ~2MB total across the three CoreML files
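
The input format above implies that longer recordings must be cut into 512-sample chunks before they reach the model. The following is a minimal sketch of that step in plain Swift. The helper name `makeChunks` is hypothetical, and zero-padding the final short chunk is an assumption, since this README does not say how a trailing partial chunk should be handled.

```swift
/// Splits a mono 16 kHz sample buffer into fixed-size chunks.
/// Assumption: a short final chunk is zero-padded to the full chunk size.
func makeChunks(from samples: [Float], chunkSize: Int = 512) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        if chunk.count < chunkSize {
            // Pad the tail with silence so every chunk has the same length.
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        chunks.append(chunk)
        start += chunkSize
    }
    return chunks
}

// One second of 16 kHz audio: 16000 / 512 = 31.25, so 32 chunks after padding.
let oneSecond = [Float](repeating: 0.1, count: 16_000)
let chunks = makeChunks(from: oneSecond)
print(chunks.count)  // 32
```

Each chunk covers 512 / 16000 = 0.032 s of audio, which matches the 32 ms figure in the list above.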

Intended Use

Primary Use Cases

  • Real-time voice activity detection in iOS/macOS applications
  • Speech preprocessing for ASR systems
  • Audio segmentation and filtering

How to Use

Swift Integration

import FluidAudio

let config = VADConfig(
    threshold: 0.3,
    chunkSize: 512, // 512 samples (32 ms at 16 kHz) works best
    sampleRate: 16000
)

let vadManager = VADManager(config: config)
try await vadManager.initialize()

// Process one chunk (audioChunk holds 512 samples of 16 kHz mono audio)
let result = try await vadManager.processChunk(audioChunk)
print("Voice probability: \(result.probability)")
print("Is voice active: \(result.isVoiceActive)")
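
For segmentation, the per-chunk probabilities can be merged into time ranges. The sketch below is one simple way to do that and does not depend on FluidAudio. The `SpeechSegment` type and `speechSegments` function are hypothetical names, and treating a probability at or above the threshold as speech is an assumption about how the threshold is applied.

```swift
/// A contiguous run of chunks classified as speech, in seconds.
struct SpeechSegment: Equatable {
    let startTime: Double
    let endTime: Double
}

/// Groups consecutive above-threshold chunks into segments.
/// Assumption: a chunk counts as speech when probability >= threshold.
func speechSegments(
    probabilities: [Float],
    threshold: Float = 0.3,
    chunkSize: Int = 512,
    sampleRate: Double = 16_000
) -> [SpeechSegment] {
    let chunkDuration = Double(chunkSize) / sampleRate  // 0.032 s by default
    var segments: [SpeechSegment] = []
    var segmentStart: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        let isSpeech = probability >= threshold
        if isSpeech, segmentStart == nil {
            segmentStart = index  // speech begins
        } else if !isSpeech, let start = segmentStart {
            // Speech ended at the start of this chunk.
            segments.append(SpeechSegment(
                startTime: Double(start) * chunkDuration,
                endTime: Double(index) * chunkDuration))
            segmentStart = nil
        }
    }
    // Close a segment that runs to the end of the input.
    if let start = segmentStart {
        segments.append(SpeechSegment(
            startTime: Double(start) * chunkDuration,
            endTime: Double(probabilities.count) * chunkDuration))
    }
    return segments
}

let segments = speechSegments(probabilities: [0.1, 0.8, 0.9, 0.2, 0.05, 0.7])
print(segments.count)  // 2
```

A production pipeline would likely add minimum-duration and hangover rules to avoid splitting speech on brief dips; those are left out to keep the sketch short.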

Installation

Add FluidAudio to your Swift project:

dependencies: [
    .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
]
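
For context, the dependency entry above might sit in a complete manifest like the following sketch. The package and target name `MyVADApp` is hypothetical, and the product name `FluidAudio` is an assumption based on the `import FluidAudio` line in the usage example; check the package's own manifest for the exact product name.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyVADApp",  // hypothetical package name
    // A `platforms:` entry may also be needed; this README does not
    // state the minimum OS versions the package requires.
    dependencies: [
        .package(url: "https://github.com/FluidAudio/FluidAudioSwift.git", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyVADApp",  // hypothetical target name
            dependencies: [
                // Assumption: the library product shares the module name.
                .product(name: "FluidAudio", package: "FluidAudioSwift")
            ]
        )
    ]
)
```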

Performance

Benchmarks on Apple Silicon (M1/M2)

| Metric           | Value               |
|------------------|---------------------|
| Latency          | <2ms per 32ms chunk |
| Real-time Factor | 0.02x               |
| Memory Usage     | ~15MB               |
| CPU Usage        | <5% (single core)   |
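
The real-time factor in the table relates processing time to audio duration: a value of 0.02x means a 32 ms chunk is processed in roughly 0.64 ms. The sketch below shows the arithmetic; the function name `realTimeFactor` is hypothetical and the timings are illustrative inputs, not new measurements.

```swift
/// Real-time factor = processing time / duration of the audio processed.
/// Values below 1.0 mean faster than real time.
func realTimeFactor(
    processingSeconds: Double,
    chunkSize: Int = 512,
    sampleRate: Double = 16_000
) -> Double {
    let audioSeconds = Double(chunkSize) / sampleRate  // 0.032 s per chunk
    return processingSeconds / audioSeconds
}

// 0.64 ms of processing per 32 ms chunk corresponds to about 0.02x.
let typicalRTF = realTimeFactor(processingSeconds: 0.00064)

// The 2 ms latency bound corresponds to a real-time factor of 0.0625x.
let worstCaseRTF = realTimeFactor(processingSeconds: 0.002)
```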

Accuracy Metrics

Evaluated on common speech datasets:
- Precision: 94.2%
- Recall: 92.8%
- F1-Score: 93.5%
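
The F1 score is the harmonic mean of precision and recall, so the three figures above can be checked against each other. A short sketch, with a hypothetical helper name `f1Score`:

```swift
/// F1 = 2 * precision * recall / (precision + recall)
func f1Score(precision: Double, recall: Double) -> Double {
    let sum = precision + recall
    guard sum > 0 else { return 0 }  // avoid dividing by zero
    return 2 * precision * recall / sum
}

let f1 = f1Score(precision: 0.942, recall: 0.928)
// Round to one decimal place as a percentage.
print((f1 * 1000).rounded() / 10)  // 93.5
```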

Model Files

This repository contains three CoreML models that work together:

- silero_stft.mlmodel (650KB) - STFT feature extraction
- silero_encoder.mlmodel (254KB) - Feature encoding
- silero_rnn_decoder.mlmodel (527KB) - RNN-based classification

Training Data

The original Silero VAD model was trained on a diverse dataset including:
- Clean speech audio
- Noisy speech with various background conditions
- Music and non-speech audio for negative samples

Limitations and Bias

Known Limitations

- Optimized for a 16kHz sample rate (other rates may reduce accuracy)
- May struggle with very quiet speech (<-30dB SNR)
- Performance varies with microphone quality and recording conditions


Technical Details

Model Architecture

Audio Input (512 samples, 16kHz)
    ↓
STFT Model (spectral features)
    ↓
Encoder Model (feature compression)
    ↓
RNN Decoder (temporal modeling)
    ↓
Voice Probability Output


Citation

@misc{silero-vad-coreml,
  title={CoreML Silero VAD},
  author={FluidAudio Team},
  year={2024},
  url={https://huggingface.co/alexwengg/coreml-silero-vad}
}

@misc{silero-vad,
  title={Silero VAD},
  author={Silero Team},
  year={2021},
  url={https://github.com/snakers4/silero-vad}
}

Related Models

Check out other CoreML audio models in the collection at https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9:

- https://huggingface.co/alexwengg/coreml_speaker_diarization - Identify "who spoke when"
- https://huggingface.co/collections/bweng/coreml-685b12fd251f80552c08e2b9 - Speech-to-text for Apple platforms

Repository and Support

- GitHub: https://github.com/FluidAudio/FluidAudioSwift
- Documentation: https://github.com/FluidAudio/FluidAudioSwift/wiki
- Issues: https://github.com/FluidAudio/FluidAudioSwift/issues
- Community: https://github.com/FluidAudio/FluidAudioSwift/discussions

License

This project is licensed under the MIT License. See the LICENSE file for details.

The original Silero VAD model is also under the MIT license. See https://github.com/snakers4/silero-vad/blob/master/LICENSE for details.