MarbleNet Frame-VAD — CoreML

NVIDIA MarbleNet multilingual frame-level VAD converted to Apple CoreML (mlprogram, fp32 + int8 weight-quant).

Upstream: https://huggingface.co/nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0

Part of the VAD CoreML collection — CoreML conversions of community VAD models for use in iOS / macOS apps.

Files

| File | Format | Notes |
| --- | --- | --- |
| MarbleNetVAD.mlpackage | CoreML mlprogram (fp32) | |
| MarbleNetVAD_int8.mlpackage | CoreML mlprogram (int8 weight-quant) | smaller, ~same accuracy |

I/O

Inputs:

{
  "processed_signal": [
    1,
    64,
    "T(flex 50-4096)"
  ],
  "length": [
    1
  ]
}
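The flexible time axis means features must arrive as `(1, 64, T)` with `T` between 50 and 4096 frames. The following is a minimal NumPy sketch of fitting a feature matrix into that range. The helper name `fit_to_model_range`, zero padding for short clips, non-overlapping chunking for long ones, the reading of `length` as the count of valid frames, and its `int32` dtype are all assumptions made for illustration; none of them is stated by the model card.

```python
import numpy as np

N_MELS = 64        # mel bins, from the input spec
MIN_FRAMES = 50    # lower bound of the flexible T axis
MAX_FRAMES = 4096  # upper bound of the flexible T axis


def fit_to_model_range(features):
    """Split a (64, T) feature matrix into model-sized inputs.

    Returns a list of (processed_signal, length) pairs, where
    processed_signal has shape (1, 64, T') with 50 <= T' <= 4096 and
    length holds the number of valid (unpadded) frames.
    Zero padding and int32 lengths are assumptions of this sketch.
    """
    features = np.asarray(features, dtype=np.float32)
    if features.ndim != 2 or features.shape[0] != N_MELS:
        raise ValueError(f"expected shape (64, T), got {features.shape}")

    inputs = []
    total = features.shape[1]
    # Walk the time axis in chunks no longer than the model's upper bound.
    for start in range(0, total, MAX_FRAMES):
        chunk = features[:, start:start + MAX_FRAMES]
        valid = chunk.shape[1]
        if valid < MIN_FRAMES:
            # Pad short chunks with zeros on the right of the time axis.
            chunk = np.pad(chunk, ((0, 0), (0, MIN_FRAMES - valid)))
        inputs.append((chunk[np.newaxis, ...], np.array([valid], dtype=np.int32)))
    return inputs
```

With a 10 ms hop, 4096 frames cover roughly 41 seconds of audio, so longer recordings need some form of chunking. Overlapping windows may behave better at chunk boundaries than the hard split used here.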

Outputs:

(single output: speech probability)

Preprocessing (NOT included in the CoreML graph; implement it on-device): AudioToMelSpectrogramPreprocessor(n_mels=64, win=25ms, hop=10ms, per_feature norm).
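The normalization step is the part of the preprocessing that is easiest to get subtly wrong on-device. The sketch below reads "per_feature norm" as zero mean and unit variance for each mel bin, computed over the time axis. The function name `per_feature_normalize`, the unbiased standard deviation, and the `eps` value are assumptions and should be checked against the NeMo reference. The mel spectrogram computation itself is left out because this card does not give the sample rate or FFT size.

```python
import numpy as np


def per_feature_normalize(features, eps=1e-5):
    """Normalize each mel bin of a (64, T) matrix over the time axis.

    Assumes T >= 2 so that the unbiased standard deviation is defined.
    """
    features = np.asarray(features, dtype=np.float32)
    # One mean and one standard deviation per mel bin (row).
    mean = features.mean(axis=1, keepdims=True)
    # ddof=1 (unbiased) and the eps value are assumptions of this sketch.
    std = features.std(axis=1, ddof=1, keepdims=True)
    return (features - mean) / (std + eps)
```

Comparing the output of an on-device port against features dumped from the reference pipeline is a quick way to confirm these choices before running the CoreML model.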

Numerical parity vs reference

| Variant | Max abs diff | Mean abs diff | Pearson |
| --- | --- | --- | --- |
| fp32 | 4.172e-07 | 1.596e-08 | 1.000000 |
| int8 | 3.305e-02 | 2.567e-03 | 0.999893 |

Validated against the NeMo PyTorch reference.
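For anyone re-running the comparison, the three parity metrics can be computed from two output arrays as sketched below. The function name `parity_metrics` and the flattening of both outputs into one vector are assumptions; the card does not say which inputs were used or how the outputs were aggregated.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Compare two model outputs element by element.

    Returns max absolute difference, mean absolute difference, and the
    Pearson correlation between the flattened arrays.
    """
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError("outputs must contain the same number of elements")

    abs_diff = np.abs(ref - cand)
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
    pearson = np.corrcoef(ref, cand)[0, 1]
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "pearson": float(pearson),
    }
```

Here `reference` would hold the NeMo output and `candidate` the CoreML output for the same preprocessed input.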

License

This conversion is released under the same license as the upstream model (cc-by-4.0). Original credits remain with the upstream authors.

Conversion

Converted with coremltools 9.0, mlprogram format, minimum_deployment_target=iOS17.
