---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/diar_streaming_sortformer_4spk-v2.1
tags:
- speech-recognition
- automatic-speech-recognition
- streaming-asr
- coreml
- apple
- ios
- macos
- FastConformer
- RNNT
- Parakeet
- ASR
pipeline_tag: automatic-speech-recognition
---
  # Sortformer CoreML Models

  Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

  ## Model Variants

  | Variant | File | Latency | Use Case |
  |---------|------|---------|----------|
  | **Default** | `Sortformer.mlmodelc` | ~1.04s | Low latency streaming |
  | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
  | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |

  ## Configuration Parameters

  | Parameter | Default | NVIDIA Low | NVIDIA High |
  |-----------|---------|------------|-------------|
  | chunk_len | 6 | 6 | 340 |
  | chunk_right_context | 7 | 7 | 40 |
  | chunk_left_context | 1 | 1 | 1 |
  | fifo_len | 40 | 188 | 40 |
  | spkcache_len | 188 | 188 | 188 |
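
The latency figures quoted for each variant appear to follow from these parameters: a chunk can only be processed once its right context has arrived, so the algorithmic delay is roughly `chunk_len + chunk_right_context` encoder frames. The sketch below assumes an 80 ms encoder frame (a 10 ms mel hop with 8x subsampling, consistent with the factor of 8 in the `chunk` input shape). The frame duration and the helper name are assumptions, not values read from the converted models.

```python
# Assumed encoder frame duration: 10 ms mel hop x 8x subsampling = 80 ms.
FRAME_MS = 80

# (chunk_len, chunk_right_context) per configuration, from the table above.
CONFIGS = {
    "Default": (6, 7),
    "NVIDIA Low": (6, 7),
    "NVIDIA High": (340, 40),
}


def algorithmic_latency_ms(chunk_len: int, right_context: int) -> int:
    """Approximate delay before a chunk can be processed, in milliseconds."""
    return (chunk_len + right_context) * FRAME_MS


for name, (chunk_len, right_context) in CONFIGS.items():
    ms = algorithmic_latency_ms(chunk_len, right_context)
    print(f"{name}: ~{ms / 1000:.2f}s")
```

Under this assumption the low-latency configurations come out at 1040 ms and the high-latency configuration at 30400 ms. Model compute time is not included.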

  ## Model Input/Output Shapes

  **General**:

  | Input | Shape | Description |
  |-------|-------|-------------|
  | chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
  | chunk_lengths | `[1]` | Actual chunk length |
  | spkcache | `[1, S, 512]` | Speaker cache embeddings |
  | spkcache_lengths | `[1]` | Actual cache length |
  | fifo | `[1, F, 512]` | FIFO queue embeddings |
  | fifo_lengths | `[1]` | Actual FIFO length |

  | Output | Shape | Description |
  |--------|-------|-------------|
  | speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
  | chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
  | chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
  | nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
  | nest_encoder_lengths | `[1]` | Actual speaker embedding count |

  Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
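
As a cross-check, the general shape formulas can be evaluated for each configuration. This is a minimal sketch: `SortformerShapeConfig` is a hypothetical helper written for this example and is not part of FluidAudio or the conversion scripts. The feature dimensions (128 mel bins, 512-dimensional embeddings) are taken from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortformerShapeConfig:  # hypothetical helper, for illustration only
    chunk_len: int            # C
    chunk_left_context: int   # L
    chunk_right_context: int  # R
    fifo_len: int             # F
    spkcache_len: int         # S

    @property
    def chunk_frames(self) -> int:
        # C + L + R encoder frames per chunk
        return self.chunk_len + self.chunk_left_context + self.chunk_right_context

    @property
    def total_frames(self) -> int:
        # C + L + R + S + F frames covered by speaker_preds / nest_encoder_embs
        return self.chunk_frames + self.spkcache_len + self.fifo_len

    def input_shapes(self) -> dict:
        return {
            "chunk": (1, 8 * self.chunk_frames, 128),
            "spkcache": (1, self.spkcache_len, 512),
            "fifo": (1, self.fifo_len, 512),
        }


DEFAULT = SortformerShapeConfig(chunk_len=6, chunk_left_context=1,
                                chunk_right_context=7, fifo_len=40, spkcache_len=188)
NVIDIA_LOW = SortformerShapeConfig(chunk_len=6, chunk_left_context=1,
                                   chunk_right_context=7, fifo_len=188, spkcache_len=188)
NVIDIA_HIGH = SortformerShapeConfig(chunk_len=340, chunk_left_context=1,
                                    chunk_right_context=40, fifo_len=40, spkcache_len=188)

for name, cfg in [("Default", DEFAULT), ("NVIDIA Low", NVIDIA_LOW), ("NVIDIA High", NVIDIA_HIGH)]:
    print(name, cfg.input_shapes(), "output frames:", cfg.total_frames)
```

Because the models are exported with static shapes, a mismatch between these computed values and the arrays passed at inference time is a likely cause of shape errors.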

  **Configuration-Specific Shapes**:
  
  | Input | Default | NVIDIA Low | NVIDIA High |
  |-------|---------|------------|-------------|
  | chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
  | chunk_lengths | `[1]` | `[1]` | `[1]` |
  | spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
  | spkcache_lengths | `[1]` | `[1]` | `[1]` |
  | fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
  | fifo_lengths | `[1]` | `[1]` | `[1]` |
  
  | Output | Default | NVIDIA Low | NVIDIA High |
  |--------|---------|------------|-------------|
  | speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
  | chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
  | chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
  | nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
  | nest_encoder_lengths | `[1]` | `[1]` | `[1]` |

  
  | Metric        | Default | NVIDIA High |
  |---------------|---------|-------------|
  | Latency       | ~1.04s  | ~30.4s      |
  | RTFx (M4 Max) | ~5.7x   | ~125.3x     |
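
RTFx is not defined in this card. The sketch below assumes the common definition, audio duration divided by processing time, so a higher value means faster processing. Under that assumption, the figures above translate into wall-clock time as follows; `processing_seconds` is a hypothetical helper.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time needed for `audio_seconds` of audio, assuming
    RTFx = audio duration / processing time."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


one_hour = 3600.0
for name, rtfx in [("Default", 5.7), ("NVIDIA High", 125.3)]:
    print(f"{name}: ~{processing_seconds(one_hour, rtfx):.0f}s per hour of audio")
```

The large gap between the two configurations is in line with the high-latency model processing far more frames per call.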

  ## Usage with FluidAudio (Swift)

  ```swift
  import FluidAudio

  // Initialize with default config (auto-downloads from HuggingFace)
  let diarizer = SortformerDiarizer(config: .default)
  let models = try await SortformerModels.loadFromHuggingFace(config: .default)
  diarizer.initialize(models: models)

  // Streaming processing
  for audioChunk in audioStream {
      if let result = try diarizer.processSamples(audioChunk) {
          for frame in 0..<result.frameCount {
              for speaker in 0..<4 {
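                  // Probability that `speaker` is active in this frame
                  let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
                  print("frame \(frame), speaker \(speaker): \(prob)")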
              }
          }
      }
  }

  // Or batch processing
  let timeline = try diarizer.processComplete(audioSamples)
  for (speakerIndex, segments) in timeline.segments.enumerated() {
      for segment in segments {
          print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
      }
  }
```
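
The streaming loop above yields per-frame speaker probabilities rather than segments. A minimal sketch of one way to turn such probabilities into time segments follows, written in Python like the reference scripts in this repository. The 0.5 threshold and the 80 ms frame duration are assumptions, `probs_to_segments` is a hypothetical helper, and FluidAudio's own post-processing may differ.

```python
FRAME_SECONDS = 0.08  # assumed duration of one prediction frame
THRESHOLD = 0.5       # assumed speaker-activity threshold


def probs_to_segments(probs, threshold=THRESHOLD, frame_seconds=FRAME_SECONDS):
    """Convert per-frame probabilities into segments.

    probs: list of frames, each a list of per-speaker probabilities.
    Returns {speaker_index: [(start_seconds, end_seconds), ...]}.
    """
    num_speakers = len(probs[0]) if probs else 0
    segments = {spk: [] for spk in range(num_speakers)}
    for spk in range(num_speakers):
        start = None  # frame index where the current active run began
        for frame, row in enumerate(probs):
            active = row[spk] >= threshold
            if active and start is None:
                start = frame
            elif not active and start is not None:
                segments[spk].append((start * frame_seconds, frame * frame_seconds))
                start = None
        if start is not None:  # run still open at the end of the input
            segments[spk].append((start * frame_seconds, len(probs) * frame_seconds))
    return segments


example = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
for speaker, spans in probs_to_segments(example).items():
    print(speaker, spans)
```

For batch use, `processComplete` already returns a timeline of segments, so a step like this would only be relevant on the streaming path.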
  ## Performance

  See the FluidAudio benchmarks: https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md

  ## Files

  ### Models

  - `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
  - `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low latency config
  - `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high latency config

  ### Scripts

  - `convert_to_coreml.py` - PyTorch to CoreML conversion
  - `streaming_inference.py` - Python streaming inference example
  - `mic_inference.py` - Real-time microphone demo

  ## Source

  Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

  ## Credits & Acknowledgements

  This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).

  Their work was instrumental in:

  - **Architecture Conversion**: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
  - **Build & Optimization**: Engineering the static shape configurations that allow the model to achieve ~120x RTFx on Apple Silicon.
  - **Logic Implementation**: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.

  This project builds on the foundational work of the NVIDIA NeMo team.