
Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

Model Variants

| Variant | File | Latency | Use Case |
|---|---|---|---|
| Default | Sortformer.mlmodelc | ~1.04s | Low-latency streaming |
| NVIDIA Low | SortformerNvidiaLow.mlmodelc | ~1.04s | Low-latency streaming |
| NVIDIA High | SortformerNvidiaHigh.mlmodelc | ~30.4s | Best quality, offline |

Configuration Parameters

| Parameter | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |
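The latency figures above follow directly from these parameters: the model must buffer the chunk plus its right context before emitting predictions. A minimal sketch, assuming each output frame covers 80 ms of audio (8 mel frames at a 10 ms hop — an inference from the tables in this card, not a figure stated by NVIDIA here):

```python
# Streaming latency ~= (chunk_len + chunk_right_context) * frame duration.
# FRAME_SEC = 0.08 is an assumption that reproduces the latencies above.
FRAME_SEC = 0.08

def streaming_latency(chunk_len: int, chunk_right_context: int) -> float:
    """Seconds of audio buffered before a chunk's predictions can be emitted."""
    return (chunk_len + chunk_right_context) * FRAME_SEC

print(streaming_latency(6, 7))     # Default / NVIDIA Low: ~1.04 s
print(streaming_latency(340, 40))  # NVIDIA High: ~30.4 s
```

This is why Default and NVIDIA Low share the same latency despite different FIFO sizes: only chunk_len and chunk_right_context sit on the latency path.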

Model Input/Output Shapes

General:

| Input | Shape | Description |
|---|---|---|
| chunk | [1, 8*(C+L+R), 128] | Mel spectrogram features |
| chunk_lengths | [1] | Actual chunk length |
| spkcache | [1, S, 512] | Speaker cache embeddings |
| spkcache_lengths | [1] | Actual cache length |
| fifo | [1, F, 512] | FIFO queue embeddings |
| fifo_lengths | [1] | Actual FIFO length |

| Output | Shape | Description |
|---|---|---|
| speaker_preds | [1, C+L+R+S+F, 4] | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | [1, C+L+R, 512] | Embeddings for state update |
| chunk_pre_encoder_lengths | [1] | Actual embedding count |
| nest_encoder_embs | [1, C+L+R+S+F, 192] | Embeddings for speaker discrimination |
| nest_encoder_lengths | [1] | Actual speaker embedding count |

Note: C = chunk_len, L = chunk_left_context, R = chunk_right_context, S = spkcache_len, F = fifo_len.

Configuration-Specific Shapes:

| Input | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| chunk | [1, 112, 128] | [1, 112, 128] | [1, 3048, 128] |
| chunk_lengths | [1] | [1] | [1] |
| spkcache | [1, 188, 512] | [1, 188, 512] | [1, 188, 512] |
| spkcache_lengths | [1] | [1] | [1] |
| fifo | [1, 40, 512] | [1, 188, 512] | [1, 40, 512] |
| fifo_lengths | [1] | [1] | [1] |

| Output | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| speaker_preds | [1, 242, 4] | [1, 390, 4] | [1, 609, 4] |
| chunk_pre_encoder_embs | [1, 14, 512] | [1, 14, 512] | [1, 381, 512] |
| chunk_pre_encoder_lengths | [1] | [1] | [1] |
| nest_encoder_embs | [1, 242, 192] | [1, 390, 192] | [1, 609, 192] |
| nest_encoder_lengths | [1] | [1] | [1] |
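All of the configuration-specific shapes above can be derived from the parameter table. A small sketch that does the arithmetic (the 128/512/192 feature widths and the 4-speaker prediction dimension come from the general table; this is an illustration of the shape conventions, not part of the model API):

```python
def model_shapes(C, L, R, S, F):
    """Derive tensor shapes from the configuration parameters.

    C = chunk_len, L = chunk_left_context, R = chunk_right_context,
    S = spkcache_len, F = fifo_len (see the note above).
    """
    total = C + L + R + S + F  # frames covered by predictions
    return {
        "chunk": [1, 8 * (C + L + R), 128],  # 8 mel frames per output frame
        "spkcache": [1, S, 512],
        "fifo": [1, F, 512],
        "speaker_preds": [1, total, 4],      # per-frame probs for 4 speakers
        "chunk_pre_encoder_embs": [1, C + L + R, 512],
        "nest_encoder_embs": [1, total, 192],
    }

default = model_shapes(C=6, L=1, R=7, S=188, F=40)       # chunk: [1, 112, 128]
nvidia_high = model_shapes(C=340, L=1, R=40, S=188, F=40)  # chunk: [1, 3048, 128]
```

For example, the Default prediction length 242 is just 6 + 1 + 7 + 188 + 40.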

Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or file processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```

Performance

| Metric | Default | NVIDIA High |
|---|---|---|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
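RTFx is the real-time factor: seconds of audio processed per second of wall-clock time, so higher is faster. A quick sketch of what the table's figures imply (this just rearranges the numbers above; it is not a new measurement):

```python
def processing_time(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to process `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx

# A 60 s recording at the M4 Max figures from the table:
t_default = processing_time(60.0, 5.7)  # ~10.5 s
t_high = processing_time(60.0, 125.3)   # ~0.48 s
```

Note the trade-off: the high-latency configuration is much faster in aggregate (it processes large chunks at once) but cannot emit results until ~30 s of audio has been buffered, so it suits offline use.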

Files

Models

  • Sortformer.mlpackage / .mlmodelc - Default config (low latency)
  • SortformerNvidiaLow.mlpackage / .mlmodelc - NVIDIA low latency config
  • SortformerNvidiaHigh.mlpackage / .mlmodelc - NVIDIA high latency config

Scripts

  • convert_to_coreml.py - PyTorch to CoreML conversion
  • streaming_inference.py - Python streaming inference example
  • mic_inference.py - Real-time microphone demo

Source

Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of GradientDescent2718 (https://huggingface.co/GradientDescent2718).

Their work was instrumental in:

  • Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
  • Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
  • Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
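The streaming state bookkeeping mentioned above can be pictured roughly as follows: embeddings from each new chunk enter a fixed-length FIFO, and when the FIFO overflows, the oldest embeddings spill into the fixed-length speaker cache. This is only an illustrative sketch under that assumption — the real model compresses the speaker cache with a learned/heuristic scheme rather than simple eviction, and `update_state` is a hypothetical name, not FluidAudio API:

```python
from collections import deque

def update_state(spkcache, fifo, new_embs, fifo_len=40, spkcache_len=188):
    """Illustrative state update: push new chunk embeddings into the FIFO;
    overflow spills into the speaker cache, which drops its oldest entries
    when full (a crude stand-in for the model's cache compression)."""
    for emb in new_embs:
        if len(fifo) == fifo_len:
            spilled = fifo.popleft()
            if len(spkcache) == spkcache_len:
                spkcache.popleft()  # real model compresses instead of dropping
            spkcache.append(spilled)
        fifo.append(emb)

spkcache, fifo = deque(), deque()
update_state(spkcache, fifo, range(50))  # FIFO keeps the last 40 frames; the first 10 spill
```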

This project was built upon the foundational work of the NVIDIA NeMo team.