---
license: cc-by-4.0
library_name: coreml
base_model:
- nvidia/diar_streaming_sortformer_4spk-v2.1
base_model_relation: finetune
tags:
- speaker-diarization
- speech
- audio
- coreml
- apple
- ios
- macos
- sortformer
- streaming
pipeline_tag: automatic-speech-recognition
---

# Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

## Model Variants

| Variant              | File                                 | Latency | Use Case              |
| -------------------- | ------------------------------------ | ------- | --------------------- |
| **Fastest v2**       | `Sortformer_v2.mlmodelc`             | ~1.04s  | Low latency streaming |
| **Fastest v2.1**     | `Sortformer_v2.1.mlmodelc`           | ~1.04s  | Low latency streaming |
| **NVIDIA Low v2**    | `SortformerNvidiaLow_v2.mlmodelc`    | ~1.04s  | Low latency streaming |
| **NVIDIA Low v2.1**  | `SortformerNvidiaLow_v2.1.mlmodelc`  | ~1.04s  | Low latency streaming |
| **NVIDIA High v2**   | `SortformerNvidiaHigh_v2.mlmodelc`   | ~30.4s  | Best quality, offline |
| **NVIDIA High v2.1** | `SortformerNvidiaHigh_v2.1.mlmodelc` | ~30.4s  | Best quality, offline |

The `v2` and `v2.1` suffixes identify the version of the model weights. According to NVIDIA, `v2.1` is more robust in meeting scenarios.

## Configuration Parameters

| Parameter           | Default | NVIDIA Low | NVIDIA High |
| ------------------- | ------- | ---------- | ----------- |
| chunk_len           | 6       | 6          | 340         |
| chunk_right_context | 7       | 7          | 40          |
| chunk_left_context  | 1       | 1          | 1           |
| fifo_len            | 40      | 188        | 40          |
| spkcache_len        | 188     | 188        | 188         |

## Model Input/Output Shapes

**General**:

| Input            | Shape                 | Description              |
| ---------------- | --------------------- | ------------------------ |
| chunk            | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
| chunk_lengths    | `[1]`                 | Actual chunk length      |
| spkcache         | `[1, S, 512]`         | Speaker cache embeddings |
| spkcache_lengths | `[1]`                 | Actual cache length      |
| fifo             | `[1, F, 512]`         | FIFO queue embeddings    |
| fifo_lengths     | `[1]`                 | Actual FIFO length       |

| Output                    | Shape              | Description                           |
| ------------------------- | ------------------ | ------------------------------------- |
| speaker_preds             | `[C+L+R+S+F, 4]`   | Speaker probabilities (4 speakers)    |
| chunk_pre_encoder_embs    | `[C+L+R, 512]`     | Embeddings for state update           |
| chunk_pre_encoder_lengths | `[1]`              | Actual embedding count                |
| nest_encoder_embs         | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
| nest_encoder_lengths      | `[1]`              | Actual speaker embedding count        |

Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.

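
The symbolic shapes above reduce to simple arithmetic on the configuration parameters. The following is a minimal sketch of that calculation, useful for sizing input buffers or sanity-checking a converted model. `SortformerShapeConfig` and its properties are hypothetical names introduced here for illustration; they are not part of FluidAudio or the CoreML models.

```swift
// Hypothetical helper for illustration only; not a FluidAudio type.
struct SortformerShapeConfig {
    let chunkLen: Int           // C
    let chunkLeftContext: Int   // L
    let chunkRightContext: Int  // R
    let fifoLen: Int            // F
    let spkcacheLen: Int        // S

    /// Embedding frames per chunk including context: C + L + R.
    var chunkFrames: Int { chunkLen + chunkLeftContext + chunkRightContext }

    /// Mel frames in the `chunk` input: 8 * (C + L + R).
    var melFrames: Int { 8 * chunkFrames }

    /// Rows in `speaker_preds` and `nest_encoder_embs`: C + L + R + S + F.
    var totalFrames: Int { chunkFrames + spkcacheLen + fifoLen }
}

// Values taken from the Configuration Parameters table.
let defaultConfig = SortformerShapeConfig(
    chunkLen: 6, chunkLeftContext: 1, chunkRightContext: 7,
    fifoLen: 40, spkcacheLen: 188)
let nvidiaLow = SortformerShapeConfig(
    chunkLen: 6, chunkLeftContext: 1, chunkRightContext: 7,
    fifoLen: 188, spkcacheLen: 188)
let nvidiaHigh = SortformerShapeConfig(
    chunkLen: 340, chunkLeftContext: 1, chunkRightContext: 40,
    fifoLen: 40, spkcacheLen: 188)

let configs = [("Default", defaultConfig), ("NVIDIA Low", nvidiaLow), ("NVIDIA High", nvidiaHigh)]
for (name, config) in configs {
    print("\(name): chunk [1, \(config.melFrames), 128], "
        + "chunk_pre_encoder_embs rows \(config.chunkFrames), "
        + "speaker_preds rows \(config.totalFrames)")
}
// Default: 112 mel frames, 14 chunk rows, 242 total rows
// NVIDIA Low: 112 mel frames, 14 chunk rows, 390 total rows
// NVIDIA High: 3048 mel frames, 381 chunk rows, 609 total rows
```

Keeping the derived sizes as computed properties means a change to one parameter (for example `fifo_len`) propagates to every dependent shape without hand-editing constants.
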
**Configuration-Specific Shapes**:

| Input            | Default         | NVIDIA Low      | NVIDIA High      |
| ---------------- | --------------- | --------------- | ---------------- |
| chunk            | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
| chunk_lengths    | `[1]`           | `[1]`           | `[1]`            |
| spkcache         | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]`  |
| spkcache_lengths | `[1]`           | `[1]`           | `[1]`            |
| fifo             | `[1, 40, 512]`  | `[1, 188, 512]` | `[1, 40, 512]`   |
| fifo_lengths     | `[1]`           | `[1]`           | `[1]`            |

| Output                    | Default         | NVIDIA Low      | NVIDIA High     |
| ------------------------- | --------------- | --------------- | --------------- |
| speaker_preds             | `[1, 242, 4]`   | `[1, 390, 4]`   | `[1, 609, 4]`   |
| chunk_pre_encoder_embs    | `[1, 14, 512]`  | `[1, 14, 512]`  | `[1, 381, 512]` |
| chunk_pre_encoder_lengths | `[1]`           | `[1]`           | `[1]`           |
| nest_encoder_embs         | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
| nest_encoder_lengths      | `[1]`           | `[1]`           | `[1]`           |

| Metric        | Default | NVIDIA High |
| ------------- | ------- | ----------- |
| Latency       | ~1.12s  | ~30.4s      |
| RTFx (M4 Max) | ~5.7x   | ~125.3x     |

## Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..