
Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

Model Variants

| Variant | File | Latency | Use Case |
|---|---|---|---|
| Default | Sortformer.mlmodelc | ~1.04s | Low-latency streaming |
| NVIDIA Low | SortformerNvidiaLow.mlmodelc | ~1.04s | Low-latency streaming |
| NVIDIA High | SortformerNvidiaHigh.mlmodelc | ~30.4s | Best quality, offline |

Configuration Parameters

| Parameter | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |
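The latency figures above follow directly from these parameters: the model must buffer the chunk plus its right context before emitting predictions. A minimal sketch, assuming each output frame covers 80 ms of audio (8 mel frames at a 10 ms hop — an inference from the tables in this card, not a figure stated by NVIDIA here):

```python
# Streaming latency ~= (chunk_len + chunk_right_context) * frame duration.
# FRAME_SEC = 0.08 is an assumption that reproduces the latencies above.
FRAME_SEC = 0.08

def streaming_latency(chunk_len: int, chunk_right_context: int) -> float:
    """Seconds of audio buffered before a chunk's predictions can be emitted."""
    return (chunk_len + chunk_right_context) * FRAME_SEC

print(streaming_latency(6, 7))     # Default / NVIDIA Low: ~1.04 s
print(streaming_latency(340, 40))  # NVIDIA High: ~30.4 s
```

This is why Default and NVIDIA Low share the same latency despite different FIFO sizes: only chunk_len and chunk_right_context sit on the latency path.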

Model Input/Output Shapes

General:

| Input | Shape | Description |
|---|---|---|
| chunk | [1, 8*(C+L+R), 128] | Mel spectrogram features |
| chunk_lengths | [1] | Actual chunk length |
| spkcache | [1, S, 512] | Speaker cache embeddings |
| spkcache_lengths | [1] | Actual cache length |
| fifo | [1, F, 512] | FIFO queue embeddings |
| fifo_lengths | [1] | Actual FIFO length |

| Output | Shape | Description |
|---|---|---|
| speaker_preds | [1, C+L+R+S+F, 4] | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | [1, C+L+R, 512] | Embeddings for state update |
| chunk_pre_encoder_lengths | [1] | Actual embedding count |
| nest_encoder_embs | [1, C+L+R+S+F, 192] | Embeddings for speaker discrimination |
| nest_encoder_lengths | [1] | Actual speaker embedding count |

Note: C = chunk_len, L = chunk_left_context, R = chunk_right_context, S = spkcache_len, F = fifo_len.

Configuration-Specific Shapes:

| Input | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk | [1, 112, 128] | [1, 112, 128] | [1, 3048, 128] |
| chunk_lengths | [1] | [1] | [1] |
| spkcache | [1, 188, 512] | [1, 188, 512] | [1, 188, 512] |
| spkcache_lengths | [1] | [1] | [1] |
| fifo | [1, 40, 512] | [1, 188, 512] | [1, 40, 512] |
| fifo_lengths | [1] | [1] | [1] |

| Output | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| speaker_preds | [1, 242, 4] | [1, 390, 4] | [1, 609, 4] |
| chunk_pre_encoder_embs | [1, 14, 512] | [1, 14, 512] | [1, 381, 512] |
| chunk_pre_encoder_lengths | [1] | [1] | [1] |
| nest_encoder_embs | [1, 242, 192] | [1, 390, 192] | [1, 609, 192] |
| nest_encoder_lengths | [1] | [1] | [1] |
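All of the configuration-specific shapes above can be derived from the parameter table. A small sketch that does the arithmetic (the 128/512/192 feature widths and the 4-speaker prediction dimension come from the general table; this is an illustration of the shape conventions, not part of the model API):

```python
def model_shapes(C, L, R, S, F):
    """Derive tensor shapes from the configuration parameters.

    C = chunk_len, L = chunk_left_context, R = chunk_right_context,
    S = spkcache_len, F = fifo_len (see the note above).
    """
    total = C + L + R + S + F  # frames covered by predictions
    return {
        "chunk": [1, 8 * (C + L + R), 128],  # 8 mel frames per output frame
        "spkcache": [1, S, 512],
        "fifo": [1, F, 512],
        "speaker_preds": [1, total, 4],      # per-frame probs for 4 speakers
        "chunk_pre_encoder_embs": [1, C + L + R, 512],
        "nest_encoder_embs": [1, total, 192],
    }

default = model_shapes(C=6, L=1, R=7, S=188, F=40)       # chunk: [1, 112, 128]
nvidia_high = model_shapes(C=340, L=1, R=40, S=188, F=40)  # chunk: [1, 3048, 128]
```

For example, the Default prediction length 242 is just 6 + 1 + 7 + 188 + 40.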

Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or file processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```

Performance

| Metric | Default | NVIDIA High |
|---|---|---|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
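RTFx is the real-time factor: seconds of audio processed per second of wall-clock time, so higher is faster. A quick sketch of what the table's figures imply (this just rearranges the numbers above; it is not a new measurement):

```python
def processing_time(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to process `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx

# A 60 s recording at the M4 Max figures from the table:
t_default = processing_time(60.0, 5.7)  # ~10.5 s
t_high = processing_time(60.0, 125.3)   # ~0.48 s
```

Note the trade-off: the high-latency configuration is much faster in aggregate (it processes large chunks at once) but cannot emit results until ~30 s of audio has been buffered, so it suits offline use.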

Files

Models

  • Sortformer.mlpackage / .mlmodelc - Default config (low latency)
  • SortformerNvidiaLow.mlpackage / .mlmodelc - NVIDIA low latency config
  • SortformerNvidiaHigh.mlpackage / .mlmodelc - NVIDIA high latency config

Scripts

  • convert_to_coreml.py - PyTorch to CoreML conversion
  • streaming_inference.py - Python streaming inference example
  • mic_inference.py - Real-time microphone demo

Source

Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of GradientDescent2718 (https://huggingface.co/GradientDescent2718).

Their work was instrumental in:

  • Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
  • Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
  • Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
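The streaming state bookkeeping mentioned above can be pictured roughly as follows: embeddings from each new chunk enter a fixed-length FIFO, and when the FIFO overflows, the oldest embeddings spill into the fixed-length speaker cache. This is only an illustrative sketch under that assumption — the real model compresses the speaker cache with a learned/heuristic scheme rather than simple eviction, and `update_state` is a hypothetical name, not FluidAudio API:

```python
from collections import deque

def update_state(spkcache, fifo, new_embs, fifo_len=40, spkcache_len=188):
    """Illustrative state update: push new chunk embeddings into the FIFO;
    overflow spills into the speaker cache, which drops its oldest entries
    when full (a crude stand-in for the model's cache compression)."""
    for emb in new_embs:
        if len(fifo) == fifo_len:
            spilled = fifo.popleft()
            if len(spkcache) == spkcache_len:
                spkcache.popleft()  # real model compresses instead of dropping
            spkcache.append(spilled)
        fifo.append(emb)

spkcache, fifo = deque(), deque()
update_state(spkcache, fifo, range(50))  # FIFO keeps the last 40 frames; the first 10 spill
```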

This project was built upon the foundational work of the NVIDIA NeMo team.