Speaker Diarization CoreML Models

CoreML conversions of speaker diarization and speaker embedding models for on-device inference on Apple platforms.

Models

  • sortformer_4spk_v21.mlpackage (from nvidia/diar_streaming_sortformer_4spk-v2.1, 441 MB): Sortformer diarization model, end-to-end neural speaker diarization supporting up to 4 speakers, streaming capable
  • wespeaker_resnet34.mlpackage (from WeSpeaker ResNet34, 25 MB): ResNet34 speaker embedding model, extracts 256-dim speaker embeddings for speaker verification and identification

Format

Both models are in Apple .mlpackage format (FP32). On first load, CoreML compiles them to .mlmodelc and caches the compiled model, so subsequent loads are fast.
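The compile-on-first-load step can also be managed explicitly if you want control over where the compiled model is cached. A minimal sketch using CoreML's `MLModel.compileModel(at:)`; the `compiledModelURL(for:cacheDir:)` helper is hypothetical, not part of CoreML:

```swift
import CoreML
import Foundation

// Hypothetical helper: where the compiled .mlmodelc for a given
// .mlpackage should live inside an app-managed cache directory.
func compiledModelURL(for packageURL: URL, cacheDir: URL) -> URL {
    let name = packageURL.deletingPathExtension().lastPathComponent
    return cacheDir.appendingPathComponent(name + ".mlmodelc")
}

// Compile the .mlpackage once, then reuse the cached .mlmodelc.
func loadModel(packageURL: URL, cacheDir: URL) throws -> MLModel {
    let cached = compiledModelURL(for: packageURL, cacheDir: cacheDir)
    if !FileManager.default.fileExists(atPath: cached.path) {
        // MLModel.compileModel writes the .mlmodelc to a temporary location.
        let compiled = try MLModel.compileModel(at: packageURL)
        try FileManager.default.moveItem(at: compiled, to: cached)
    }
    return try MLModel(contentsOf: cached)
}
```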

  • Sortformer: input mel_features (B, 128, T) -> output speaker_probs (B, T/8, 4), per-frame sigmoid probabilities for each speaker
  • ResNet34: input fbank_features (1, 80, T) -> output embedding (1, 256), a speaker embedding vector
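The T/8 in the Sortformer output shape means one diarization frame covers eight mel frames. A back-of-the-envelope sketch of the shape bookkeeping, assuming 16 kHz audio and a 10 ms mel hop (both values are assumptions for illustration; use whatever the actual feature extractor produces):

```swift
import Foundation

// Assumed constants, not values taken from the models' feature extractors.
let sampleRate = 16_000.0   // assumed input sample rate (Hz)
let hopSeconds = 0.010      // assumed mel hop length (10 ms)

// Number of mel frames T for a given number of audio samples.
func melFrames(forSamples n: Int) -> Int {
    Int(Double(n) / (sampleRate * hopSeconds))
}

// The Sortformer output has T/8 frames, so each output frame
// spans 8 mel hops (~80 ms under the assumptions above).
func diarizationFrames(forSamples n: Int) -> Int {
    melFrames(forSamples: n) / 8
}

let tenSeconds = Int(sampleRate * 10)                 // 160,000 samples
let t = melFrames(forSamples: tenSeconds)             // 1000 mel frames
let out = diarizationFrames(forSamples: tenSeconds)   // 125 output frames
```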

Usage

These models are designed for use with the AxiiDiarization Swift library:

import AxiiDiarization

let pipeline = try DiarizationPipeline(
    sortformerModelPath: "path/to/sortformer_4spk_v21.mlpackage",
    embModelPath: "path/to/wespeaker_resnet34.mlpackage"
)

let result = try pipeline.run(samples: audioSamples)
for segment in result.segments {
    print("\(segment.speaker.label): \(segment.start)s - \(segment.end)s")
}
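The 256-dim embeddings produced by the ResNet34 model are typically compared with cosine similarity for speaker verification. A minimal, self-contained sketch; the decision threshold is a hypothetical placeholder, not a value from this repository, and should be tuned on your own data:

```swift
import Foundation

// Cosine similarity between two embedding vectors.
// Returns a value in [-1, 1]; higher means more similar speakers.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have equal length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Hypothetical decision rule: similarities above a tuned threshold
// count as "same speaker". 0.5 is a placeholder, not a recommendation.
func isSameSpeaker(_ a: [Float], _ b: [Float], threshold: Float = 0.5) -> Bool {
    cosineSimilarity(a, b) > threshold
}
```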

Licenses

The converted models retain the licenses of their original authors; refer to the linked source models for their license terms.

The CoreML conversion code and this repository are MIT licensed.

Acknowledgments

  • NVIDIA NeMo team for the Sortformer diarization model
  • WeSpeaker / WeNet team for the ResNet34 speaker embedding model