---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/nemotron-speech-streaming-en-0.6b
tags:
- speech-recognition
- automatic-speech-recognition
- streaming-asr
- coreml
- apple
- ios
- macos
- FastConformer
- RNNT
- Parakeet
- ASR
pipeline_tag: automatic-speech-recognition
---
  # Sortformer CoreML Models

  Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

  ## Model Variants

  | Variant | File | Latency | Use Case |
  |---------|------|---------|----------|
  | **Default** | `Sortformer.mlmodelc` | ~1.04s | Low latency streaming |
  | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
  | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |

  ## Configuration Parameters

  | Parameter | Default | NVIDIA Low | NVIDIA High |
  |-----------|---------|------------|-------------|
  | chunk_len | 6 | 6 | 340 |
  | chunk_right_context | 7 | 7 | 40 |
  | chunk_left_context | 1 | 1 | 1 |
  | fifo_len | 40 | 188 | 40 |
  | spkcache_len | 188 | 188 | 188 |

  ## Model Input/Output Shapes

  **General**:

  | Input | Shape | Description |
  |-------|-------|-------------|
  | chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
  | chunk_lengths | `[1]` | Actual chunk length |
  | spkcache | `[1, S, 512]` | Speaker cache embeddings |
  | spkcache_lengths | `[1]` | Actual cache length |
  | fifo | `[1, F, 512]` | FIFO queue embeddings |
  | fifo_lengths | `[1]` | Actual FIFO length |

  | Output | Shape | Description |
  |--------|-------|-------------|
  | speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
  | chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
  | chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
  | nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
  | nest_encoder_lengths | `[1]` | Actual speaker embedding count |

  Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.

  **Configuration-Specific Shapes**:
  
  | Input | Default | NVIDIA Low | NVIDIA High |
  |-------|---------|------------|-------------|
  | chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
  | chunk_lengths | `[1]` | `[1]` | `[1]` |
  | spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
  | spkcache_lengths | `[1]` | `[1]` | `[1]` |
  | fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]`
  | fifo_lengths | `[1]` | `[1]` | `[1]` |
  
  | Output | Default | NVIDIA Low | NVIDIA High |
  |--------|---------|------------|-------------|
  | speaker_preds | `[1, 242, 128]` | `[1, 390, 128]` | `[1, 609, 128]` |
  | chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
  | chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
  | nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
  | nest_encoder_lengths | `[1]` | `[1]` | `[1]` |

  
  | Metric        | Default | NVIDIA High |
  |---------------|---------|-------------|
  | Latency       | ~1.12s  | ~30.4s      |
  | RTFx (M4 Max) | ~5.7x   | ~125.3x     |

  ## Usage with FluidAudio (Swift)

  ```swift
  import FluidAudio

  // Initialize with default config (auto-downloads from HuggingFace)
  let diarizer = SortformerDiarizer(config: .default)
  let models = try await SortformerModels.loadFromHuggingFace(config: .default)
  diarizer.initialize(models: models)

  // Streaming processing
  for audioChunk in audioStream {
      if let result = try diarizer.processSamples(audioChunk) {
          for frame in 0..<result.frameCount {
              for speaker in 0..<4 {
                  let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
              }
          }
      }
  }

  // Or batch processing
  let timeline = try diarizer.processComplete(audioSamples)
  for (speakerIndex, segments) in timeline.segments.enumerated() {
      for segment in segments {
          print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
      }
  }
```
  Performance

https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md

  Files

  Models

  - Sortformer.mlpackage / .mlmodelc - Default config (low latency)
  - SortformerNvidiaLow.mlpackage / .mlmodelc - NVIDIA low latency config
  - SortformerNvidiaHigh.mlpackage / .mlmodelc - NVIDIA high latency config

  Scripts

  - convert_to_coreml.py - PyTorch to CoreML conversion
  - streaming_inference.py - Python streaming inference example
  - mic_inference.py - Real-time microphone demo

  Source

  Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

  Credits & Acknowledgements

  This project would not have been possible without the significant technical contributions of https://huggingface.co/GradientDescent2718.

  Their work was instrumental in:

  - Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
  - Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
  - Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.

  This project was built upon the foundational work of the NVIDIA NeMo team.