FluidInference/diar-streaming-sortformer-coreml

@@ -1,103 +1,115 @@
-# Sortformer CoreML Models - Gradient Descent Configuration
-Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.
-## Configuration
-**Gradient Descent** - Higher quality, more context:
-| Parameter | Value |
-|-----------|-------|
-| chunk_len | 6 |
-| chunk_right_context | 7 |
-| chunk_left_context | 1 |
-| fifo_len | 40 |
-| spkcache_len | 188 |
-| spkcache_update_period | 31 |
-## Model Input Shapes
-| Model | Input | Shape |
-|-------|-------|-------|
-| Preprocessor | audio_signal | [1, 18160] |
-| Preprocessor | length | [1] |
-| PreEncoder | chunk | [1, 112, 128] |
-| PreEncoder | chunk_lengths | [1] |
-| PreEncoder | spkcache | [1, 188, 512] |
-| PreEncoder | spkcache_lengths | [1] |
-| PreEncoder | fifo | [1, 40, 512] |
-| PreEncoder | fifo_lengths | [1] |
-| Head | pre_encoder_embs | [1, 242, 512] |
-| Head | pre_encoder_lengths | [1] |
-| Head | chunk_embs_in | [1, 14, 512] |
-| Head | chunk_lens_in | [1] |
-## Model Output Shapes
-| Model | Output | Shape |
-|-------|--------|-------|
-| Preprocessor | features | [1, 112, 128] |
-| Preprocessor | feature_lengths | [1] |
-| PreEncoder | pre_encoder_embs | [1, 242, 512] |
-| PreEncoder | pre_encoder_lengths | [1] |
-| PreEncoder | chunk_embs_in | [1, 14, 512] |
-| PreEncoder | chunk_lens_in | [1] |
-| Head | speaker_preds | [1, 242, 4] |
-| Head | chunk_pre_encoder_embs | [1, 14, 512] |
-| Head | chunk_pre_encoder_lengths | [1] |
-## Files
-### Models
-- `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
-- `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
-- `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions
-### Scripts
-- `export_gradient_descent.py` - Export script used to create these models
-- `coreml_wrappers.py` - PyTorch wrapper classes for export
-- `streaming_inference.py` - Python streaming inference example
-- `mic_inference.py` - Real-time microphone demo
-## Usage with FluidAudio (Swift)
-```swift
-let config = SortformerConfig.gradientDescent
-let diarizer = try await SortformerDiarizer(config: config)
-// Process audio chunks
-while let samples = getAudioChunk() {
-    if let result = try diarizer.processChunk(samples) {
-        // result.probabilities - confirmed speaker probabilities
-        // result.tentativeProbabilities - preview (may change)
-    }
-}
 ```
-## Performance
-| Metric | Value |
-|--------|-------|
-| Latency | ~1.04s (7 * 80ms right context + chunk) |
-| DER (AMI) | ~30.8% |
-| RTFx | ~8.2x on Apple Silicon |
-## Source
-Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
-## Credits & Acknowledgements
-This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).
-Their work was instrumental in:
-Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
-Build & Optimization: Engineering the static shape configurations settings that allow the model to achieve ~8.2x RTF on Apple Silicon.
-Logic Implementation: Porting the critical streaming state logic (AOSC and FIFO management) to ensure zero-shot identity consistency in the CoreML wrapper.
-This project was built upon the foundational work of the NVIDIA NeMo team.

+  # Sortformer CoreML Models
+  Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.
+  ## Model Variants
+  | Variant | File | Latency | Use Case |
+  |---------|------|---------|----------|
+  | **Default** | `Sortformer.mlmodelc` | ~1.12s | Low latency streaming |
+  | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
+  | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |
+  ## Configuration Parameters
+  | Parameter | Default | NVIDIA Low | NVIDIA High |
+  |-----------|---------|------------|-------------|
+  | chunk_len | 6 | 6 | 340 |
+  | chunk_right_context | 7 | 7 | 40 |
+  | chunk_left_context | 1 | 1 | 1 |
+  | fifo_len | 40 | 188 | 40 |
+  | spkcache_len | 188 | 188 | 188 |
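The latency figures in the variants table appear to follow from the chunk parameters. The sketch below assumes one output frame covers 80 ms, the figure the earlier version of this README used ("7 * 80ms right context + chunk"); the function name and constant are illustrative, not part of any library.

```python
FRAME_SECONDS = 0.08  # assumption: one output frame spans 80 ms


def streaming_latency(chunk_len: int, chunk_right_context: int) -> float:
    """Seconds of audio that must arrive before a chunk can be emitted.

    Assumes latency = (chunk frames + right-context frames) * frame duration.
    """
    return (chunk_len + chunk_right_context) * FRAME_SECONDS


nvidia_low = streaming_latency(chunk_len=6, chunk_right_context=7)
nvidia_high = streaming_latency(chunk_len=340, chunk_right_context=40)

print(f"NVIDIA Low:  {nvidia_low:.2f}s")
print(f"NVIDIA High: {nvidia_high:.2f}s")
```

Under this assumption the NVIDIA Low and NVIDIA High rows come out at about 1.04s and 30.4s. The ~1.12s listed for the Default variant is not reproduced by the same formula with the parameters shown; it would match 14 frames (for example, if the left-context frame were also counted), but that reading is a guess.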
+  ## Model Input/Output Shapes
+  Combined model (`Sortformer.mlmodelc`, default config):
+  | Input | Shape | Description |
+  |-------|-------|-------------|
+  | chunk | [1, 112, 128] | Mel spectrogram features |
+  | chunk_lengths | [1] | Actual chunk length |
+  | spkcache | [1, 188, 512] | Speaker cache embeddings |
+  | spkcache_lengths | [1] | Actual cache length |
+  | fifo | [1, 40, 512] | FIFO queue embeddings |
+  | fifo_lengths | [1] | Actual FIFO length |
+  | Output | Shape | Description |
+  |--------|-------|-------------|
+  | speaker_preds | [T, 4] | Speaker probabilities (4 speakers) |
+  | chunk_pre_encoder_embs | [T', 512] | Embeddings for state update |
+  | chunk_pre_encoder_lengths | [1] | Actual embedding count |
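The fixed input shapes for the default config can be related back to the configuration parameters. The sketch below is a sanity check under two assumptions not stated in the README: an 8x subsampling ratio from mel frames to embedding frames (inferred from 112 / 14), and a concatenated sequence of speaker cache, FIFO, and chunk frames (the earlier three-model version of this README listed 242 frames for the Head input). The function and key names are hypothetical.

```python
SUBSAMPLING = 8  # assumption: mel frames per embedding frame (112 / 14)


def default_shapes(chunk_len=6, left_context=1, right_context=7,
                   fifo_len=40, spkcache_len=188,
                   mel_bins=128, emb_dim=512):
    """Derive the fixed tensor shapes from the streaming configuration."""
    chunk_frames = chunk_len + left_context + right_context  # embedding frames
    return {
        "chunk": [1, chunk_frames * SUBSAMPLING, mel_bins],
        "spkcache": [1, spkcache_len, emb_dim],
        "fifo": [1, fifo_len, emb_dim],
        # Frames seen by the prediction head if cache, FIFO and chunk are concatenated
        "total_frames": spkcache_len + fifo_len + chunk_frames,
    }


shapes = default_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

If these assumptions hold, `T` in the output table would be 242 for the default config, but the README does not confirm this for the combined model.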
+  ## Usage with FluidAudio (Swift)
+  ```swift
+  import FluidAudio
+  // Create the diarizer with the default config, then load the models (auto-downloaded from HuggingFace)
+  let diarizer = SortformerDiarizer(config: .default)
+  let models = try await SortformerModels.loadFromHuggingFace(config: .default)
+  diarizer.initialize(models: models)
+  // Streaming processing
+  for audioChunk in audioStream {
+      if let result = try diarizer.processSamples(audioChunk) {
+          for frame in 0..<result.frameCount {
+              for speaker in 0..<4 {
+                  let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
+                  print("frame \(frame), speaker \(speaker): \(prob)")
+              }
+          }
+      }
+  }
+  // Or batch processing
+  let timeline = try diarizer.processComplete(audioSamples)
+  for (speakerIndex, segments) in timeline.segments.enumerated() {
+      for segment in segments {
+          print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
+      }
+  }
 ```
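To illustrate how per-frame speaker probabilities relate to the start and end times printed in the batch example, here is a minimal Python sketch that thresholds a `[frames][speakers]` probability grid into segments. It is not FluidAudio's implementation: the 0.5 threshold, the 80 ms frame duration, and the function name are all assumptions made for the example.

```python
from typing import List, Tuple

FRAME_SECONDS = 0.08  # assumption: one prediction frame spans 80 ms
THRESHOLD = 0.5       # assumption: a speaker is "active" at or above this probability


def frames_to_segments(probs: List[List[float]]) -> List[List[Tuple[float, float]]]:
    """Convert probs[frame][speaker] into per-speaker (start, end) times in seconds."""
    if not probs:
        return []
    num_speakers = len(probs[0])
    segments: List[List[Tuple[float, float]]] = [[] for _ in range(num_speakers)]
    for spk in range(num_speakers):
        start = None  # frame index where the current active run began
        for frame, row in enumerate(probs):
            active = row[spk] >= THRESHOLD
            if active and start is None:
                start = frame
            elif not active and start is not None:
                segments[spk].append((start * FRAME_SECONDS, frame * FRAME_SECONDS))
                start = None
        if start is not None:  # close a run that lasts until the final frame
            segments[spk].append((start * FRAME_SECONDS, len(probs) * FRAME_SECONDS))
    return segments


example = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
timeline = frames_to_segments(example)
for speaker_index, speaker_segments in enumerate(timeline):
    for start_time, end_time in speaker_segments:
        print(f"Speaker {speaker_index}: {start_time:.2f}s - {end_time:.2f}s")
```

A production pipeline would typically add smoothing or minimum-duration rules; this sketch only shows the basic thresholding step.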
+  ## Performance
+  | Metric        | Default | NVIDIA High |
+  |---------------|---------|-------------|
+  | Latency       | ~1.12s  | ~30.4s      |
+  | RTFx (M4 Pro) | ~120x   | ~118x       |
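RTFx is read here in its usual sense, as audio duration divided by processing time, so higher is faster. A small sketch of what the table implies for wall-clock time, assuming that definition applies to these measurements (the helper name is illustrative):

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated compute time for a recording, given an inverse real-time factor."""
    return audio_seconds / rtfx


one_hour_default = processing_seconds(3600, 120)  # Default config on M4 Pro
one_hour_high = processing_seconds(3600, 118)     # NVIDIA High config on M4 Pro

print(f"Default:     {one_hour_default:.1f}s per hour of audio")
print(f"NVIDIA High: {one_hour_high:.1f}s per hour of audio")
```

Note that RTFx describes throughput, not the streaming latency in the row above: a model can process audio far faster than real time and still need about a second of lookahead before emitting a chunk.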
+  ## Files
+  ### Models
+  - `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
+  - `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low latency config
+  - `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high latency config
+  ### Scripts
+  - `convert_to_coreml.py` - PyTorch to CoreML conversion
+  - `streaming_inference.py` - Python streaming inference example
+  - `mic_inference.py` - Real-time microphone demo
+  ## Source
+  Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
+  ## Credits & Acknowledgements
+  This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).
+  Their work was instrumental in:
+  - Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
+  - Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTFx on Apple Silicon.
+  - Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
+  This project was built upon the foundational work of the NVIDIA NeMo team.
+  Key changes:
+  1. Describes all 3 model variants (Default, NVIDIA Low, NVIDIA High)
+  2. Updated model file names to match actual repo content
+  3. Fixed Swift API to match current `SortformerDiarizer` implementation
+  4. Updated performance numbers (RTFx ~120x, per the project documentation)
+  5. Simplified input/output shapes table for combined model
+  6. Kept credits section intact