# Sortformer CoreML Models - Gradient Descent Configuration

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.

## Configuration

**Gradient Descent** - Higher quality, more context:

| Parameter | Value |
|-----------|-------|
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |
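
The tensor shapes published for these models appear to follow from the parameters above. The sketch below reproduces that arithmetic. The subsampling factor of 8, the 160-sample hop, and the 400-sample window are assumptions inferred from the published shapes (112 mel frames, 18160 audio samples); they are not stated in this README.

```python
# Configuration values copied from the table above.
CONFIG = {
    "chunk_len": 6,
    "chunk_right_context": 7,
    "chunk_left_context": 1,
    "fifo_len": 40,
    "spkcache_len": 188,
}

# Assumed front-end constants, inferred from the published shapes.
SUBSAMPLING = 8  # mel frames per encoder frame (assumption)
HOP = 160        # samples between consecutive mel frames (assumption)
WIN = 400        # samples per analysis window (assumption)


def derived_sizes(cfg):
    """Derive the per-chunk tensor lengths from the streaming configuration."""
    # Encoder frames per chunk: left context + chunk + right context.
    chunk_frames = (
        cfg["chunk_left_context"] + cfg["chunk_len"] + cfg["chunk_right_context"]
    )
    # Mel frames fed to the PreEncoder before subsampling.
    mel_frames = chunk_frames * SUBSAMPLING
    # Audio samples needed to produce that many mel frames.
    audio_samples = (mel_frames - 1) * HOP + WIN
    # Sequence length seen by the Head: speaker cache + FIFO + current chunk.
    total_frames = cfg["spkcache_len"] + cfg["fifo_len"] + chunk_frames
    return {
        "chunk_frames": chunk_frames,
        "mel_frames": mel_frames,
        "audio_samples": audio_samples,
        "total_frames": total_frames,
    }


sizes = derived_sizes(CONFIG)
print(sizes)
```

Under these assumptions the results match the published shapes: 14 encoder frames per chunk, 112 mel frames, 18160 audio samples, and a combined sequence length of 242.
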
## Model Input Shapes

| Model | Input | Shape |
|-------|-------|-------|
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |
## Model Output Shapes

| Model | Output | Shape |
|-------|--------|-------|
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |
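
When wiring the three models together, it can help to check that each stage's outputs match the next stage's inputs. The sketch below does this with the shapes from the two tables. The `WIRING` list is an assumption based on matching names and shapes (for example, that the Preprocessor's `features` feed the PreEncoder's `chunk`); the export scripts remain the authoritative source for the actual connections.

```python
# Output shapes, copied from the "Model Output Shapes" table.
OUTPUTS = {
    "Preprocessor": {"features": (1, 112, 128), "feature_lengths": (1,)},
    "PreEncoder": {
        "pre_encoder_embs": (1, 242, 512),
        "pre_encoder_lengths": (1,),
        "chunk_embs_in": (1, 14, 512),
        "chunk_lens_in": (1,),
    },
}

# Input shapes, copied from the "Model Input Shapes" table.
INPUTS = {
    "PreEncoder": {"chunk": (1, 112, 128), "chunk_lengths": (1,)},
    "Head": {
        "pre_encoder_embs": (1, 242, 512),
        "pre_encoder_lengths": (1,),
        "chunk_embs_in": (1, 14, 512),
        "chunk_lens_in": (1,),
    },
}

# Assumed wiring: (producer, output name) -> (consumer, input name).
WIRING = [
    (("Preprocessor", "features"), ("PreEncoder", "chunk")),
    (("Preprocessor", "feature_lengths"), ("PreEncoder", "chunk_lengths")),
    (("PreEncoder", "pre_encoder_embs"), ("Head", "pre_encoder_embs")),
    (("PreEncoder", "pre_encoder_lengths"), ("Head", "pre_encoder_lengths")),
    (("PreEncoder", "chunk_embs_in"), ("Head", "chunk_embs_in")),
    (("PreEncoder", "chunk_lens_in"), ("Head", "chunk_lens_in")),
]


def mismatches():
    """Return every assumed connection whose producer and consumer shapes differ."""
    bad = []
    for (src_model, src_name), (dst_model, dst_name) in WIRING:
        if OUTPUTS[src_model][src_name] != INPUTS[dst_model][dst_name]:
            bad.append(((src_model, src_name), (dst_model, dst_name)))
    return bad


print(mismatches())
```

An empty list means every assumed connection is shape-compatible. The PreEncoder's `spkcache` and `fifo` inputs are streaming state and are not produced directly by another model's output in the tables, so they are left out of the check.
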
## Files

### Models

- `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
- `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
- `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions

### Scripts

- `export_gradient_descent.py` - Export script used to create these models
- `coreml_wrappers.py` - PyTorch wrapper classes for export
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo
## Usage with FluidAudio (Swift)

```swift
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

// Process audio chunks
while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // result.probabilities - confirmed speaker probabilities
        // result.tentativeProbabilities - preview (may change)
    }
}
```
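
Per-frame speaker probabilities, such as the Head's `speaker_preds` output with 4 speakers per frame, usually need to be turned into time segments before they are useful. The following is a minimal sketch of one way to do that, not FluidAudio's own post-processing. The function name `frames_to_segments`, the 0.5 threshold, and the nested-list input layout are illustrative assumptions; the 80 ms frame duration is taken from the latency figure in this README.

```python
def frames_to_segments(probs, threshold=0.5, frame_ms=80):
    """Convert per-frame speaker probabilities into (speaker, start_ms, end_ms) tuples.

    probs: list of frames, each a list with one probability per speaker.
    threshold: a speaker counts as active when its probability exceeds this
        value (0.5 is an assumed default).
    frame_ms: duration of one frame in milliseconds.
    """
    segments = []
    num_speakers = len(probs[0]) if probs else 0
    for spk in range(num_speakers):
        start = None  # index of the first frame of the current active run
        for i, frame in enumerate(probs):
            active = frame[spk] > threshold
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments.append((spk, start * frame_ms, i * frame_ms))
                start = None
        # Close a run that is still active at the end of the input.
        if start is not None:
            segments.append((spk, start * frame_ms, len(probs) * frame_ms))
    return segments


example = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.0],
]
print(frames_to_segments(example))
# [(0, 0, 160), (1, 80, 240)]
```

Overlapping speech is preserved because each speaker is thresholded independently: in the example, speakers 0 and 1 are both active between 80 ms and 160 ms.
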
## Performance

| Metric | Value |
|--------|-------|
| Latency | ~1.04 s ((6 chunk + 7 right-context frames) × 80 ms) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |
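
The latency figure can be reproduced from the configuration, and the RTFx figure translates into an expected processing time. The sketch below shows both calculations, assuming RTFx is defined in the usual way as audio duration divided by processing time; the helper name `processing_seconds` is illustrative.

```python
FRAME_MS = 80      # duration of one encoder frame, from the latency row
CHUNK_LEN = 6      # chunk_len from the configuration
RIGHT_CONTEXT = 7  # chunk_right_context from the configuration
RTFX = 8.2         # approximate real-time factor from the table

# The model must wait for the chunk plus its right context before emitting results.
latency_ms = (CHUNK_LEN + RIGHT_CONTEXT) * FRAME_MS


def processing_seconds(audio_seconds, rtfx=RTFX):
    """Estimated compute time for a given audio duration at the stated RTFx."""
    return audio_seconds / rtfx


print(latency_ms)                        # 1040
print(round(processing_seconds(60), 1))  # 7.3
```

At ~8.2x, one minute of audio takes roughly 7.3 seconds of compute, so throughput does not limit real-time use; the ~1.04 s latency comes from buffering the chunk and its right context.
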
## Source

Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)