---
license: apache-2.0
base_model: FireRedTeam/FireRedVAD
tags:
  - voice-activity-detection
  - vad
  - coreml
  - apple
  - ios
  - macos
  - streaming
  - real-time
  - dfsmn
  - firered
pipeline_tag: voice-activity-detection
library_name: coremltools
language:
  - multilingual
---

# FireRedVAD-CoreML

Core ML conversion of FireRedVAD Stream-VAD for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+). Converted from the original PyTorch model by FireRedTeam/FireRedVAD.

## Model Description

- **Original model:** FireRedVAD by Xiaohongshu (小红书) FireRedTeam
- **Architecture:** DFSMN (Deep Feedforward Sequential Memory Network): 8 DFSMN blocks + 1 DNN layer
- **Variant:** Stream-VAD (causal, lookahead = 0), suitable for real-time streaming
- **Parameters:** ~568K (extremely lightweight)
- **Model size:** 2.2 MB (FP32)
- **Input:** 80-dim log-Mel filterbank features (16 kHz, 25 ms frame, 10 ms shift)
- **Output:** per-frame speech probability in [0, 1]
- **Language support:** 100+ languages, 20+ Chinese dialects
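
The 16 kHz / 25 ms / 10 ms framing above fixes the relationship between samples, frames, and time. A small illustrative sketch (the helper names here are hypothetical, not part of this model's API):

```swift
// Framing constants implied by the model's input spec:
// 16 kHz audio, 25 ms analysis window, 10 ms frame shift.
let sampleRate = 16_000.0
let frameLength = 400 // 25 ms at 16 kHz
let frameShift = 160  // 10 ms at 16 kHz

/// Number of complete 25 ms frames obtainable from `sampleCount` samples.
func frameCount(sampleCount: Int) -> Int {
    guard sampleCount >= frameLength else { return 0 }
    return 1 + (sampleCount - frameLength) / frameShift
}

/// Start time (seconds) of frame `index` in the original audio,
/// useful for turning per-frame probabilities into timestamps.
func frameStartTime(index: Int) -> Double {
    Double(index * frameShift) / sampleRate
}
```

One second of audio therefore yields 98 frames, and frame `k` starts at `k * 10` ms.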

## Performance

Results from the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):

| Metric      | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
|-------------|------------|------------|---------|------------|------------|
| AUC-ROC     | 99.60      | 97.99      | 97.81   | -          | -          |
| F1 Score    | 97.57      | 95.95      | 95.19   | 90.91      | 52.30      |
| False Alarm | 2.69%      | 9.41%      | 15.47%  | 44.03%     | 2.83%      |
| Miss Rate   | 3.62%      | 3.95%      | 2.95%   | 0.42%      | 64.15%     |

## Core ML Model Specification

### Inputs

| Name                  | Shape          | Type    | Description                                     |
|-----------------------|----------------|---------|-------------------------------------------------|
| `feat`                | `[1, 1..512, 80]` | Float32 | Log-Mel filterbank features (dynamic time axis) |
| `cache_0` ~ `cache_7` | `[1, 128, 19]` | Float32 | FSMN lookback cache for each of the 8 layers    |
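
Since the time axis is bounded at 512 frames (an assumption read off the `[1, 1..512, 80]` shape above), longer feature sequences need to be fed in chunks, carrying the caches between calls; because Stream-VAD is causal, chunked inference matches a single long pass. A minimal sketch of the chunking arithmetic:

```swift
// Split `totalFrames` feature frames into consecutive ranges of at most
// `maxChunk` frames each, suitable for feeding the model call by call
// while the 8 FSMN caches carry state across chunk boundaries.
func chunkRanges(totalFrames: Int, maxChunk: Int = 512) -> [Range<Int>] {
    stride(from: 0, to: totalFrames, by: maxChunk).map {
        $0..<min($0 + maxChunk, totalFrames)
    }
}
```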

### Outputs

| Name                          | Type    | Description                          |
|-------------------------------|---------|--------------------------------------|
| `probs`                       | Float32 | Speech probability, shape `[1, T, 1]` |
| `new_cache_0` ~ `new_cache_7` | Float32 | Updated lookback caches              |

- **Minimum deployment target:** iOS 16 / macOS 13
- **Compute units:** CPU + Neural Engine

## Conversion

Converted from PyTorch using coremltools via the export script in FireRedASR2S. The Stream-VAD variant was selected for its causal (no lookahead) property, making it suitable for real-time streaming applications.

## Usage

```swift
import CoreML

// Load model
let model = try FireRedVAD(configuration: .init())

// Initialize caches (8 layers x [1, 128, 19])
var caches = (0..<8).map { _ in
    try! MLMultiArray(shape: [1, 128, 19], dataType: .float32)
}

// Process an audio chunk
let input = FireRedVADInput(
    feat: fbankFeatures,       // [1, T, 80]
    cache_0: caches[0], cache_1: caches[1],
    cache_2: caches[2], cache_3: caches[3],
    cache_4: caches[4], cache_5: caches[5],
    cache_6: caches[6], cache_7: caches[7]
)
let output = try model.prediction(input: input)
let speechProb = output.probs  // [1, T, 1]

// Update caches for the next chunk
caches = [
    output.new_cache_0, output.new_cache_1,
    output.new_cache_2, output.new_cache_3,
    output.new_cache_4, output.new_cache_5,
    output.new_cache_6, output.new_cache_7
]
```

For a complete implementation with feature extraction, CMVN normalization, and speech state machine, see FireRedASRKit.
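
To give a feel for the speech state machine that sits on top of the per-frame probabilities, here is a minimal sketch. The threshold and the onset/hangover lengths are illustrative assumptions, not the values FireRedASRKit actually uses:

```swift
// Minimal speech state machine sketch: declares speech after `onsetFrames`
// consecutive frames at or above `threshold`, and ends it only after
// `hangoverFrames` consecutive frames below it, which suppresses flicker
// on short pauses. All parameter values here are illustrative.
struct SpeechStateMachine {
    var threshold: Float = 0.5
    var onsetFrames = 3       // 30 ms at the model's 10 ms frame shift
    var hangoverFrames = 30   // 300 ms

    private(set) var inSpeech = false
    private var run = 0       // consecutive frames disagreeing with the state

    /// Feed one per-frame speech probability; returns the smoothed state.
    mutating func process(_ prob: Float) -> Bool {
        let active = prob >= threshold
        if active != inSpeech {
            run += 1
            if run >= (inSpeech ? hangoverFrames : onsetFrames) {
                inSpeech.toggle()
                run = 0
            }
        } else {
            run = 0
        }
        return inSpeech
    }
}
```

Feeding the frames of `output.probs` through `process(_:)` in order yields segment boundaries stable enough to hand to a downstream recognizer.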


## License

Apache 2.0, following the original FireRedVAD license.