Sortformer CoreML Models - Gradient Descent Configuration

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.

Configuration

Gradient Descent - Higher quality, more context:

| Parameter | Value |
| --- | --- |
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |
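
The tensor shapes listed in the following sections can be derived from these parameters. Below is a minimal sketch of that arithmetic. It assumes an 8x subsampling factor between mel frames and embedding frames, plus a mel front end with a 400-sample window and a 160-sample hop. These three constants are assumptions, chosen because they reproduce the listed shapes; they are not taken from the export script.

```python
# Configuration values from the table above.
CHUNK_LEN = 6
CHUNK_RIGHT_CONTEXT = 7
CHUNK_LEFT_CONTEXT = 1
FIFO_LEN = 40
SPKCACHE_LEN = 188

# Assumed constants (not stated in this README).
SUBSAMPLING = 8  # mel frames per embedding frame
HOP = 160        # samples between consecutive mel frames
WINDOW = 400     # samples per analysis window


def derived_shapes():
    """Derive the per-step tensor sizes from the streaming configuration."""
    # Embedding frames per step: left context + chunk + right context.
    emb_frames = CHUNK_LEFT_CONTEXT + CHUNK_LEN + CHUNK_RIGHT_CONTEXT
    # Mel frames that the pre-encoder consumes per step.
    mel_frames = emb_frames * SUBSAMPLING
    # Audio samples needed to produce that many mel frames.
    audio_samples = (mel_frames - 1) * HOP + WINDOW
    # Speaker cache + FIFO + current chunk, concatenated along time.
    total_frames = SPKCACHE_LEN + FIFO_LEN + emb_frames
    return {
        "emb_frames": emb_frames,
        "mel_frames": mel_frames,
        "audio_samples": audio_samples,
        "total_frames": total_frames,
    }


print(derived_shapes())
```

Under these assumptions the four values match the 14, 112, 18160, and 242 dimensions that appear in the shape tables.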

Model Input Shapes

| Model | Input | Shape |
| --- | --- | --- |
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |

Model Output Shapes

| Model | Output | Shape |
| --- | --- | --- |
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |
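
The three models are meant to be chained, so each stage's outputs should line up with the next stage's inputs. The sketch below checks that wiring using only the shapes listed above. The pairing of the Preprocessor's `features` and `feature_lengths` with the PreEncoder's `chunk` and `chunk_lengths` is an assumption based on matching shapes. The `spkcache` and `fifo` inputs are left out because they carry streaming state rather than the previous stage's output.

```python
# Shapes copied from the input and output tables above.
PREPROCESSOR_OUT = {"features": (1, 112, 128), "feature_lengths": (1,)}
PRE_ENCODER_IN = {"chunk": (1, 112, 128), "chunk_lengths": (1,)}
PRE_ENCODER_OUT = {
    "pre_encoder_embs": (1, 242, 512),
    "pre_encoder_lengths": (1,),
    "chunk_embs_in": (1, 14, 512),
    "chunk_lens_in": (1,),
}
HEAD_IN = dict(PRE_ENCODER_OUT)  # Head inputs use the same names and shapes.

# Assumed name mapping between Preprocessor outputs and PreEncoder inputs.
PREPROCESSOR_TO_PRE_ENCODER = {
    "features": "chunk",
    "feature_lengths": "chunk_lengths",
}


def shape_mismatches(outputs, inputs, rename=None):
    """Return (output_name, input_name) pairs whose shapes disagree."""
    rename = rename or {}
    bad = []
    for out_name, shape in outputs.items():
        in_name = rename.get(out_name, out_name)
        if inputs.get(in_name) != shape:
            bad.append((out_name, in_name))
    return bad


print(shape_mismatches(PREPROCESSOR_OUT, PRE_ENCODER_IN, PREPROCESSOR_TO_PRE_ENCODER))
print(shape_mismatches(PRE_ENCODER_OUT, HEAD_IN))
```

Both calls return an empty list when the stages are wired consistently.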

Files

Models

  • Pipeline_Preprocessor.mlpackage / .mlmodelc - Audio to mel features
  • Pipeline_PreEncoder.mlpackage / .mlmodelc - Mel features + state to embeddings
  • Pipeline_Head_Fixed.mlpackage / .mlmodelc - Embeddings to speaker predictions

Scripts

  • export_gradient_descent.py - Export script used to create these models
  • coreml_wrappers.py - PyTorch wrapper classes for export
  • streaming_inference.py - Python streaming inference example
  • mic_inference.py - Real-time microphone demo

Usage with FluidAudio (Swift)

```swift
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

// Process audio chunks
while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // result.probabilities - confirmed speaker probabilities
        // result.tentativeProbabilities - preview (may change)
    }
}
```

Performance

| Metric | Value |
| --- | --- |
| Latency | ~1.04 s ((6-frame chunk + 7-frame right context) × 80 ms per frame) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |
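
The latency figure follows from the chunk configuration and the 80 ms frame duration. The sketch below reproduces that arithmetic and also converts RTFx into an estimated processing time, assuming RTFx means audio duration divided by processing time. The function names are illustrative and do not come from the repository's scripts.

```python
FRAME_MS = 80  # duration of one embedding frame, from the latency row above


def latency_ms(chunk_len, right_context, frame_ms=FRAME_MS):
    """Latency: the chunk plus its right context must arrive before output."""
    return (chunk_len + right_context) * frame_ms


def processing_seconds(audio_seconds, rtfx):
    """Estimated compute time, assuming RTFx = audio duration / processing time."""
    return audio_seconds / rtfx


print(latency_ms(chunk_len=6, right_context=7))  # 1040 ms, i.e. ~1.04 s
print(round(processing_seconds(60, 8.2), 1))     # seconds of compute for 60 s of audio
```

At ~8.2x, one minute of audio would take roughly 7.3 seconds to process under this definition.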

Source

Original model: nvidia/diar_streaming_sortformer_4spk-v2.1