# Sortformer CoreML Models - Gradient Descent Configuration

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.

## Configuration

**Gradient Descent** - Higher quality, more context:

| Parameter | Value |
|-----------|-------|
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |

## Model Input Shapes

| Model | Input | Shape |
|-------|-------|-------|
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |

## Model Output Shapes

| Model | Output | Shape |
|-------|--------|-------|
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |

## Files

### Models

- `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
- `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
- `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions

### Scripts

- `export_gradient_descent.py` - Export script used to create these models
- `coreml_wrappers.py` - PyTorch wrapper classes for export
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo

## Usage with FluidAudio (Swift)

```swift
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

// Process audio chunks
while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // result.probabilities - confirmed speaker probabilities
        // result.tentativeProbabilities - preview (may change)
    }
}
```

## Performance

| Metric | Value |
|--------|-------|
| Latency | ~1.04s (7 * 80ms right context + chunk) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |

## Source

Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
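
## Shape Derivation (Sketch)

The fixed shapes and the latency figure listed in this README appear to follow directly from the configuration parameters. The sketch below reproduces them under three assumptions that are not stated in this README but are consistent with its numbers: an 8x subsampling factor between mel frames and encoder frames, a 160-sample hop, and a 400-sample analysis window at 16 kHz. `StreamingConfig`, `derive_shapes`, and `latency_seconds` are hypothetical names used only for illustration; they are not part of the export scripts or of FluidAudio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingConfig:
    # Values taken from the Configuration table.
    chunk_len: int = 6
    chunk_right_context: int = 7
    chunk_left_context: int = 1
    fifo_len: int = 40
    spkcache_len: int = 188
    # Dimensions taken from the shape tables.
    mel_bins: int = 128
    emb_dim: int = 512
    num_speakers: int = 4
    # ASSUMPTIONS: not stated in the README, chosen because they
    # reproduce the listed shapes and the ~1.04 s latency.
    subsampling: int = 8       # mel frames per encoder frame
    hop_samples: int = 160     # 10 ms hop at 16 kHz
    win_samples: int = 400     # 25 ms window at 16 kHz
    sample_rate: int = 16000


def derive_shapes(cfg: StreamingConfig) -> dict:
    """Recompute the fixed tensor shapes from the streaming parameters."""
    # Encoder frames per step: left context + chunk + right context.
    chunk_frames = cfg.chunk_left_context + cfg.chunk_len + cfg.chunk_right_context
    # Mel frames fed to the pre-encoder before subsampling.
    mel_frames = chunk_frames * cfg.subsampling
    # Audio samples needed to produce that many mel frames.
    audio_samples = (mel_frames - 1) * cfg.hop_samples + cfg.win_samples
    # Speaker cache + FIFO + current chunk are concatenated for the head.
    total_frames = cfg.spkcache_len + cfg.fifo_len + chunk_frames
    return {
        "audio_signal": (1, audio_samples),
        "chunk": (1, mel_frames, cfg.mel_bins),
        "chunk_embs_in": (1, chunk_frames, cfg.emb_dim),
        "pre_encoder_embs": (1, total_frames, cfg.emb_dim),
        "speaker_preds": (1, total_frames, cfg.num_speakers),
    }


def latency_seconds(cfg: StreamingConfig) -> float:
    """Algorithmic latency: chunk plus right context, in seconds."""
    frame_seconds = cfg.subsampling * cfg.hop_samples / cfg.sample_rate
    return (cfg.chunk_len + cfg.chunk_right_context) * frame_seconds


if __name__ == "__main__":
    cfg = StreamingConfig()
    for name, shape in derive_shapes(cfg).items():
        print(f"{name}: {list(shape)}")
    print(f"latency: {latency_seconds(cfg):.2f} s")
```

If the assumptions hold, changing any streaming parameter changes the fixed input shapes, so the models would need to be re-exported rather than reused with a different configuration.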