license: other
license_name: mixed
license_link: LICENSE.md
tags:
- speaker-diarization
- coreml
- apple
- macos
- ios
- sortformer
- wespeaker
- speaker-embedding
language:
- en
pipeline_tag: audio-classification
Speaker Diarization CoreML Models
CoreML conversions of speaker diarization and speaker embedding models for on-device inference on Apple platforms.
Models
| Model | Original | Size | Description |
|---|---|---|---|
| sortformer_4spk_v21.mlpackage | nvidia/diar_streaming_sortformer_4spk-v2.1 | 441 MB | Sortformer end-to-end neural speaker diarization model; supports up to 4 speakers and is streaming capable |
| wespeaker_resnet34.mlpackage | WeSpeaker ResNet34 | 25 MB | ResNet34 speaker embedding model; extracts 256-dim speaker embeddings for speaker verification and identification |
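Speaker verification with 256-dim embeddings usually comes down to comparing two vectors. The sketch below uses cosine similarity, a common scoring choice for this kind of embedding. The function names and the 0.5 threshold are illustrative assumptions, not values taken from this repository, and a real threshold should be tuned on your own data.

```swift
/// Cosine similarity between two equal-length embedding vectors, in [-1, 1].
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same dimension")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    // Guard against all-zero vectors, for which the angle is undefined.
    return denominator > 0 ? dot / denominator : 0
}

/// Decides whether two embeddings likely belong to the same speaker.
/// The default threshold is a placeholder (assumption), not a tuned value.
func isSameSpeaker(_ a: [Float], _ b: [Float], threshold: Float = 0.5) -> Bool {
    cosineSimilarity(a, b) >= threshold
}
```

For identification, the same score can be computed against each enrolled speaker's embedding, taking the best match above the threshold.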
Format
Both models are in Apple .mlpackage format (FP32). On first load, CoreML compiles them to .mlmodelc and caches the compiled version so that subsequent loads are faster.
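If you load the models directly rather than through a library, the compile-once-then-cache pattern can be sketched as follows. This is a minimal example under stated assumptions: `compiledModelURL` and `loadCachedModel` are hypothetical helper names, and the cache directory is chosen by the caller. It uses the synchronous `MLModel.compileModel(at:)`; an async variant is available on newer OS versions. How AxiiDiarization handles caching internally is not described here.

```swift
import CoreML
import Foundation

/// Location where the compiled model for `package` is cached (hypothetical helper).
func compiledModelURL(for package: URL, in cacheDirectory: URL) -> URL {
    let baseName = package.deletingPathExtension().lastPathComponent
    return cacheDirectory
        .appendingPathComponent(baseName)
        .appendingPathExtension("mlmodelc")
}

/// Loads a model, compiling the .mlpackage only if no cached .mlmodelc exists.
func loadCachedModel(package: URL, cacheDirectory: URL) throws -> MLModel {
    let fileManager = FileManager.default
    let compiledURL = compiledModelURL(for: package, in: cacheDirectory)

    if !fileManager.fileExists(atPath: compiledURL.path) {
        try fileManager.createDirectory(at: cacheDirectory,
                                        withIntermediateDirectories: true)
        // compileModel(at:) writes to a temporary location, so move the
        // result into the cache directory to keep it across launches.
        let temporaryURL = try MLModel.compileModel(at: package)
        try fileManager.moveItem(at: temporaryURL, to: compiledURL)
    }

    return try MLModel(contentsOf: compiledURL,
                       configuration: MLModelConfiguration())
}
```

A caller would typically pass a subdirectory of the app's caches or application support directory as `cacheDirectory`.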
- Sortformer: Input mel_features (B, 128, T) → Output speaker_probs (B, T/8, 4), sigmoid probabilities per speaker per frame
- ResNet34: Input fbank_features (1, 80, T) → Output embedding (1, 256), speaker embedding vector
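To make the Sortformer output concrete, the sketch below turns a (T/8, 4) matrix of per-frame sigmoid probabilities into time spans by thresholding each speaker channel. The `ActiveSpan` type, the 0.5 threshold, and the 0.08 s frame duration are assumptions for illustration: the frame duration presumes a 10 ms mel hop combined with the 8x subsampling shown above, so check it against the original model card before relying on it.

```swift
/// A contiguous span during which one speaker channel is active.
struct ActiveSpan: Equatable {
    let speaker: Int   // output channel index
    let start: Double  // seconds
    let end: Double    // seconds
}

/// Converts per-frame probabilities (frames x speakers) into active spans.
func activeSpans(
    probs: [[Float]],
    threshold: Float = 0.5,        // assumption: not a tuned value
    frameDuration: Double = 0.08   // assumption: 10 ms hop x 8 subsampling
) -> [ActiveSpan] {
    guard let speakerCount = probs.first?.count else { return [] }
    var spans: [ActiveSpan] = []

    for speaker in 0..<speakerCount {
        var startFrame: Int? = nil
        for (frame, row) in probs.enumerated() {
            let isActive = row[speaker] >= threshold
            if isActive, startFrame == nil {
                startFrame = frame
            } else if !isActive, let begin = startFrame {
                spans.append(ActiveSpan(speaker: speaker,
                                        start: Double(begin) * frameDuration,
                                        end: Double(frame) * frameDuration))
                startFrame = nil
            }
        }
        // Close a span that is still open when the audio ends.
        if let begin = startFrame {
            spans.append(ActiveSpan(speaker: speaker,
                                    start: Double(begin) * frameDuration,
                                    end: Double(probs.count) * frameDuration))
        }
    }
    // Order by start time, breaking ties by speaker index.
    return spans.sorted { ($0.start, $0.speaker) < ($1.start, $1.speaker) }
}
```

Because each channel is thresholded independently, overlapping speech naturally yields overlapping spans for different speakers.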
Usage
These models are designed for use with the AxiiDiarization Swift library:
import AxiiDiarization

let pipeline = try DiarizationPipeline(
    sortformerModelPath: "path/to/sortformer_4spk_v21.mlpackage",
    embModelPath: "path/to/wespeaker_resnet34.mlpackage"
)
let result = try pipeline.run(samples: audioSamples)
for segment in result.segments {
    print("\(segment.speaker.label): \(segment.start)s - \(segment.end)s")
}
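A typical next step is to summarize the segments, for example as total speaking time per speaker. The sketch below uses a hypothetical `Segment` struct as a stand-in for the library's own segment type, which is not documented here. It only assumes that each segment exposes a speaker label plus start and end times in seconds, as the loop above suggests.

```swift
/// Hypothetical stand-in for the library's segment type (assumption).
struct Segment {
    let speaker: String
    let start: Double  // seconds
    let end: Double    // seconds
}

/// Sums the duration of all segments for each speaker label.
func speakingTime(of segments: [Segment]) -> [String: Double] {
    var totals: [String: Double] = [:]
    for segment in segments {
        // Ignore malformed segments whose end precedes their start.
        totals[segment.speaker, default: 0] += max(0, segment.end - segment.start)
    }
    return totals
}
```

With the real library, the pipeline's segments would first be mapped into this shape, for instance using `segment.speaker.label` as the speaker string.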
Licenses
The models in this repository have separate licenses from their original authors:
Sortformer v2.1: Licensed by NVIDIA Corporation under the NVIDIA Open Model License. Commercial use permitted. See original model card: nvidia/diar_streaming_sortformer_4spk-v2.1
WeSpeaker ResNet34: Licensed under Apache License 2.0. See original project: wenet-e2e/wespeaker
The CoreML conversion code and this repository are MIT licensed.
Acknowledgments
- NVIDIA NeMo team for the Sortformer diarization model
- WeSpeaker / WeNet team for the ResNet34 speaker embedding model