Sortformer CoreML Models - Gradient Descent Configuration
Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.
Configuration
Gradient Descent - Higher quality, more context:
| Parameter | Value |
|---|---|
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |
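The frame counts in the shape tables of this README appear to follow from these parameters. The following is a minimal sketch, assuming a subsampling factor of 8 between mel frames and encoder frames (an assumption, chosen because it reproduces the 112 and 14 figures listed here). The class name `StreamingConfig` and the constant `SUBSAMPLING_FACTOR` are hypothetical and not taken from the export scripts.

```python
from dataclasses import dataclass

# Assumption: 8 mel frames map to 1 encoder frame (not stated in this README).
SUBSAMPLING_FACTOR = 8


@dataclass(frozen=True)
class StreamingConfig:
    """Hypothetical container mirroring the parameter table above."""
    chunk_len: int = 6
    chunk_right_context: int = 7
    chunk_left_context: int = 1
    fifo_len: int = 40
    spkcache_len: int = 188
    spkcache_update_period: int = 31

    @property
    def chunk_frames(self) -> int:
        # Core chunk plus left and right context, in encoder frames.
        return self.chunk_left_context + self.chunk_len + self.chunk_right_context

    @property
    def mel_frames(self) -> int:
        # Mel frames needed to produce one chunk with context.
        return self.chunk_frames * SUBSAMPLING_FACTOR

    @property
    def total_frames(self) -> int:
        # Speaker cache + FIFO + chunk, the sequence length seen by the head.
        return self.spkcache_len + self.fifo_len + self.chunk_frames


cfg = StreamingConfig()
print(cfg.chunk_frames, cfg.mel_frames, cfg.total_frames)
```

With the values from the table, this yields 14 encoder frames per chunk, 112 mel frames, and 242 total frames, matching the `[1, 14, 512]`, `[1, 112, 128]`, and `[1, 242, 512]` shapes.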
Model Input Shapes
| Model | Input | Shape |
|---|---|---|
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |
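The `audio_signal` length of 18160 samples can be related to the 112 mel frames with a short calculation. This is a sketch under assumptions not stated in this README: a 16 kHz sample rate, a 10 ms hop (160 samples), and a 25 ms analysis window (400 samples) with no centre padding. These values are chosen because they reproduce the listed number; the export script may derive it differently. The function name `samples_for_frames` is hypothetical.

```python
# Assumed framing parameters (not confirmed by this README).
HOP_SAMPLES = 160     # 10 ms at 16 kHz
WINDOW_SAMPLES = 400  # 25 ms at 16 kHz


def samples_for_frames(n_frames: int,
                       hop: int = HOP_SAMPLES,
                       window: int = WINDOW_SAMPLES) -> int:
    """Samples needed to produce n_frames windows without padding."""
    if n_frames <= 0:
        return 0
    # The first frame needs a full window; each later frame adds one hop.
    return (n_frames - 1) * hop + window


print(samples_for_frames(112))
```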
Model Output Shapes
| Model | Output | Shape |
|---|---|---|
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |
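To illustrate how per-frame output such as `speaker_preds` might be turned into speaker segments, here is a minimal sketch. It assumes each value is a per-speaker activity probability in [0, 1], that each frame covers 80 ms (the frame duration used in the latency figure of this README), and a hypothetical threshold of 0.5. The function `frames_to_segments` is illustrative only and is not part of the shipped scripts.

```python
def frames_to_segments(probs, threshold=0.5, frame_ms=80):
    """Convert per-frame speaker probabilities into (speaker, start_ms, end_ms).

    probs: list of frames, each a list of per-speaker probabilities.
    threshold and frame_ms are assumptions, not values from the model.
    """
    segments = []
    n_speakers = len(probs[0]) if probs else 0
    for spk in range(n_speakers):
        start = None  # frame index where the current active run began
        for t, frame in enumerate(probs):
            active = frame[spk] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((spk, start * frame_ms, t * frame_ms))
                start = None
        if start is not None:
            # Close a run that is still active at the last frame.
            segments.append((spk, start * frame_ms, len(probs) * frame_ms))
    return segments


example = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [0.2, 0.6, 0.0, 0.0],
]
print(frames_to_segments(example))
```

Overlapping speech is preserved because each speaker column is thresholded independently.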
Files
Models
Pipeline_Preprocessor.mlpackage / .mlmodelc - Audio to mel features
Pipeline_PreEncoder.mlpackage / .mlmodelc - Mel features + state to embeddings
Pipeline_Head_Fixed.mlpackage / .mlmodelc - Embeddings to speaker predictions
Scripts
export_gradient_descent.py - Export script used to create these models
coreml_wrappers.py - PyTorch wrapper classes for export
streaming_inference.py - Python streaming inference example
mic_inference.py - Real-time microphone demo
Usage with FluidAudio (Swift)
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // Handle the diarization result for this chunk
    }
}
Performance
| Metric | Value |
|---|---|
| Latency | ~1.04 s (7 × 80 ms right context + 6 × 80 ms chunk) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |
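The latency figure can be checked with simple arithmetic from the configuration: 7 right-context frames plus 6 chunk frames at 80 ms each. The sketch below reproduces that sum and shows what an RTFx of about 8.2 implies for processing time. The function names are hypothetical, and the RTFx calculation assumes RTFx means audio duration divided by processing time.

```python
FRAME_MS = 80            # frame duration used in the latency figure
CHUNK_LEN = 6            # chunk_len from the configuration
CHUNK_RIGHT_CONTEXT = 7  # chunk_right_context from the configuration


def algorithmic_latency_ms(chunk_len=CHUNK_LEN,
                           right_context=CHUNK_RIGHT_CONTEXT,
                           frame_ms=FRAME_MS):
    """Audio that must be buffered before a chunk can be processed."""
    return (chunk_len + right_context) * frame_ms


def processing_seconds(audio_seconds, rtfx=8.2):
    """Approximate compute time, assuming RTFx = audio time / compute time."""
    return audio_seconds / rtfx


print(algorithmic_latency_ms())   # milliseconds
print(processing_seconds(60.0))   # seconds of compute for one minute of audio
```

Under these assumptions, one minute of audio would take roughly 7.3 seconds to process.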
Source
Original model: nvidia/diar_streaming_sortformer_4spk-v2.1