Sortformer Diarization (CoreML)

CoreML port of NVIDIA Sortformer for end-to-end streaming speaker diarization on Apple Silicon.

Runs on the Neural Engine via CoreML. No separate embedding extraction or clustering — the model directly predicts per-frame speaker activity for up to 4 speakers.

Model Details

  • Architecture: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
  • Base model: nvidia/diar_streaming_sortformer_4spk-v2.1 (117M params)
  • Task: Speaker diarization (up to 4 speakers)
  • Input: 128-dim log-mel features, streamed in chunks
  • Output: Per-frame speaker activity probabilities (sigmoid)
  • Format: CoreML .mlmodelc (compiled pipeline, 2 sub-models)
  • Size: ~230 MB

Streaming Configuration

| Parameter | Value |
|---|---|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6 s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |
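The configuration above can be mirrored in a small Swift struct. This is an illustrative sketch, not part of the speech-swift API; the type and property names are assumptions. It also derives the time span of one model output frame from the hop length and subsampling factor (160 / 16000 s per mel frame, times 8).

```swift
// Hypothetical mirror of the streaming configuration; names are illustrative.
struct SortformerStreamingConfig {
    let sampleRate = 16_000
    let melBins = 128
    let nFFT = 400
    let hopLength = 160
    let leftContextChunks = 1
    let rightContextChunks = 7
    let subsamplingFactor = 8
    let spkcacheLength = 188   // frames
    let fifoLength = 40        // frames
    let maxSpeakers = 4

    // Each mel frame covers hopLength samples; each encoder output frame
    // covers subsamplingFactor mel frames.
    var melFrameSeconds: Double { Double(hopLength) / Double(sampleRate) }          // 0.01 s
    var outputFrameSeconds: Double { melFrameSeconds * Double(subsamplingFactor) }  // 0.08 s
}
```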

Input/Output Shapes

Inputs:

| Tensor | Shape | Description |
|---|---|---|
| chunk | [1, 112, 128] | Mel features for the current chunk |
| chunk_lengths | [1] | Valid frames in the chunk |
| spkcache | [1, 188, 512] | Speaker cache state |
| spkcache_lengths | [1] | Valid entries in the speaker cache |
| fifo | [1, 40, 512] | FIFO buffer state |
| fifo_lengths | [1] | Valid entries in the FIFO |

Outputs:

| Tensor | Shape | Description |
|---|---|---|
| speaker_preds_out | [1, 242, 4] | Speaker activity probabilities |
| chunk_pre_encoder_embs_out | [1, 14, 512] | Chunk embeddings for state update |
| chunk_pre_encoder_lengths_out | [1] | Valid embedding frames |
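To show how speaker_preds_out would typically be consumed, here is a minimal post-processing sketch (not the speech-swift implementation): per-frame sigmoid probabilities are thresholded into contiguous (speaker, start, end) segments. Each output frame spans 80 ms (hop 160 × subsampling 8 at 16 kHz); the Segment type and function name are assumptions.

```swift
// Illustrative: convert [frames][4] sigmoid probabilities into segments.
struct Segment { let speakerId: Int; let startTime: Double; let endTime: Double }

func segments(from preds: [[Double]],
              threshold: Double = 0.5,
              frameSeconds: Double = 0.08) -> [Segment] {
    var out: [Segment] = []
    let numSpeakers = preds.first?.count ?? 0
    for spk in 0..<numSpeakers {
        var start: Int? = nil
        for (f, frame) in preds.enumerated() {
            let active = frame[spk] >= threshold
            if active && start == nil { start = f }       // segment opens
            if !active, let s = start {                   // segment closes
                out.append(Segment(speakerId: spk,
                                   startTime: Double(s) * frameSeconds,
                                   endTime: Double(f) * frameSeconds))
                start = nil
            }
        }
        if let s = start {                                // still active at end
            out.append(Segment(speakerId: spk,
                               startTime: Double(s) * frameSeconds,
                               endTime: Double(preds.count) * frameSeconds))
        }
    }
    return out
}
```

A real pipeline would usually add median filtering and minimum-duration pruning before emitting segments.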

Usage

Used by speech-swift for speaker diarization:

```sh
audio diarize meeting.wav --engine sortformer
```

Or from Swift:

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```

Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

  1. PreEncoder (model0) — Runs pre_encode on the mel chunk, concatenates with speaker cache and FIFO state
  2. Head (model1) — Full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
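As a rough illustration of the Swift-side state handling, the FIFO update can be sketched as a fixed-capacity buffer that absorbs each chunk's pre-encoder embeddings and evicts the oldest frames. This is a simplified assumption of how rotation might work; the type and method names are invented, and the speaker-cache compression step that would consume the evicted frames is omitted.

```swift
// Minimal sketch of a FIFO-buffer update done outside the model.
struct FifoState {
    let capacity: Int           // 40 frames for this model
    var frames: [[Float]] = []  // each entry is one 512-dim embedding frame

    // Append the chunk's pre-encoder embeddings; returns the evicted
    // (oldest) frames, which would feed speaker-cache compression.
    mutating func push(_ newFrames: [[Float]]) -> [[Float]] {
        frames.append(contentsOf: newFrames)
        let overflow = max(0, frames.count - capacity)
        let evicted = Array(frames.prefix(overflow))
        frames.removeFirst(overflow)
        return evicted
    }
}
```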

License

CC-BY-4.0
