---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/nemotron-speech-streaming-en-0.6b
tags:
- speech-recognition
- automatic-speech-recognition
- streaming-asr
- coreml
- apple
- ios
- macos
- FastConformer
- RNNT
- Parakeet
- ASR
pipeline_tag: automatic-speech-recognition
---
  # Sortformer CoreML Models

  Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

  ## Model Variants

  | Variant | File | Latency | Use Case |
  |---------|------|---------|----------|
  | **Default** | `Sortformer.mlmodelc` | ~1.04s | Low latency streaming |
  | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
  | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |

  ## Configuration Parameters

  | Parameter | Default | NVIDIA Low | NVIDIA High |
  |-----------|---------|------------|-------------|
  | chunk_len | 6 | 6 | 340 |
  | chunk_right_context | 7 | 7 | 40 |
  | chunk_left_context | 1 | 1 | 1 |
  | fifo_len | 40 | 188 | 40 |
  | spkcache_len | 188 | 188 | 188 |
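
The latencies quoted for the low-latency and high-latency variants (~1.04s and ~30.4s) are consistent with waiting for one chunk plus its right context. The sketch below reproduces them under one assumption that this card does not state: each encoder frame covers 80 ms of audio. `StreamingConfig` and `latencySeconds` are hypothetical names used only for illustration, not part of FluidAudio.

```swift
// Hypothetical helper type; not part of FluidAudio.
struct StreamingConfig {
    let chunkLen: Int
    let chunkRightContext: Int
}

// Assumption: one encoder frame spans 80 ms of audio.
let assumedFrameDuration = 0.08

// The model has to wait for the chunk plus its look-ahead (right context).
func latencySeconds(_ config: StreamingConfig) -> Double {
    return Double(config.chunkLen + config.chunkRightContext) * assumedFrameDuration
}

let lowLatency = latencySeconds(StreamingConfig(chunkLen: 6, chunkRightContext: 7))
let highLatency = latencySeconds(StreamingConfig(chunkLen: 340, chunkRightContext: 40))
print(lowLatency, highLatency)  // approximately 1.04 and 30.4 seconds
```

If the frame duration differs in practice, only `assumedFrameDuration` needs to change.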

  ## Model Input/Output Shapes

  **General**:

  | Input | Shape | Description |
  |-------|-------|-------------|
  | chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
  | chunk_lengths | `[1]` | Actual chunk length |
  | spkcache | `[1, S, 512]` | Speaker cache embeddings |
  | spkcache_lengths | `[1]` | Actual cache length |
  | fifo | `[1, F, 512]` | FIFO queue embeddings |
  | fifo_lengths | `[1]` | Actual FIFO length |

  | Output | Shape | Description |
  |--------|-------|-------------|
  | speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
  | chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
  | chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
  | nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
  | nest_encoder_lengths | `[1]` | Actual speaker embedding count |

  Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
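
As a quick check of the shape formulas, the following sketch derives the frame counts from the configuration parameters. `SortformerShapeConfig` is a hypothetical struct written for this example; it is not the FluidAudio configuration type.

```swift
// Hypothetical container for the configuration parameters; illustration only.
struct SortformerShapeConfig {
    let chunkLen: Int           // C
    let chunkLeftContext: Int   // L
    let chunkRightContext: Int  // R
    let spkcacheLen: Int        // S
    let fifoLen: Int            // F

    // Encoder frames in one chunk, including both contexts: C + L + R.
    var chunkFrames: Int { chunkLen + chunkLeftContext + chunkRightContext }

    // Mel frames fed to the model as `chunk`: 8 * (C + L + R).
    var melFrames: Int { 8 * chunkFrames }

    // Frames covered by speaker_preds and nest_encoder_embs: C + L + R + S + F.
    var totalFrames: Int { chunkFrames + spkcacheLen + fifoLen }
}

let defaultShape = SortformerShapeConfig(
    chunkLen: 6, chunkLeftContext: 1, chunkRightContext: 7, spkcacheLen: 188, fifoLen: 40)
let nvidiaLowShape = SortformerShapeConfig(
    chunkLen: 6, chunkLeftContext: 1, chunkRightContext: 7, spkcacheLen: 188, fifoLen: 188)
let nvidiaHighShape = SortformerShapeConfig(
    chunkLen: 340, chunkLeftContext: 1, chunkRightContext: 40, spkcacheLen: 188, fifoLen: 40)

for (name, cfg) in [("Default", defaultShape), ("NVIDIA Low", nvidiaLowShape), ("NVIDIA High", nvidiaHighShape)] {
    print("\(name): mel frames \(cfg.melFrames), chunk frames \(cfg.chunkFrames), total frames \(cfg.totalFrames)")
}
```

For the three variants this gives 112, 112, and 3048 mel frames and 242, 390, and 609 total frames.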

  **Configuration-Specific Shapes**:
  
  | Input | Default | NVIDIA Low | NVIDIA High |
  |-------|---------|------------|-------------|
  | chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
  | chunk_lengths | `[1]` | `[1]` | `[1]` |
  | spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
  | spkcache_lengths | `[1]` | `[1]` | `[1]` |
  | fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
  | fifo_lengths | `[1]` | `[1]` | `[1]` |
  
  | Output | Default | NVIDIA Low | NVIDIA High |
  |--------|---------|------------|-------------|
  | speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
  | chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
  | chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
  | nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
  | nest_encoder_lengths | `[1]` | `[1]` | `[1]` |

  
  | Metric        | Default | NVIDIA High |
  |---------------|---------|-------------|
  | Latency       | ~1.04s  | ~30.4s      |
  | RTFx (M4 Max) | ~5.7x   | ~125.3x     |
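
RTFx is conventionally the inverse real-time factor: seconds of audio processed per second of compute, so higher is faster. Assuming that definition applies to the figures above, the sketch below shows the arithmetic. The function names are hypothetical and the one-hour input is an arbitrary example, not a benchmark result.

```swift
// RTFx = audio duration / processing time (higher means faster than real time).
func computeRTFx(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// Inverse relation: estimated compute time for a given RTFx.
func estimatedProcessingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

// Example: one hour of audio at the two RTFx values quoted above.
let oneHour = 3600.0
let defaultEstimate = estimatedProcessingSeconds(audioSeconds: oneHour, rtfx: 5.7)
let highEstimate = estimatedProcessingSeconds(audioSeconds: oneHour, rtfx: 125.3)
print(defaultEstimate, highEstimate)  // roughly 632 s and 28.7 s
```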

  ## Usage with FluidAudio (Swift)

  ```swift
  import FluidAudio

  // Initialize with default config (auto-downloads from HuggingFace)
  let diarizer = SortformerDiarizer(config: .default)
  let models = try await SortformerModels.loadFromHuggingFace(config: .default)
  diarizer.initialize(models: models)

  // Streaming processing
  for audioChunk in audioStream {
      if let result = try diarizer.processSamples(audioChunk) {
          for frame in 0..<result.frameCount {
              for speaker in 0..<4 {
                  let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
              }
          }
      }
  }

  // Or batch processing
  let timeline = try diarizer.processComplete(audioSamples)
  for (speakerIndex, segments) in timeline.segments.enumerated() {
      for segment in segments {
          print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
      }
  }
```
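
The streaming loop above reads one probability per speaker per frame. A minimal sketch of turning such probabilities into speaker segments follows. It works on plain arrays and does not use the FluidAudio API. `Segment` and `segments(from:)` are hypothetical names, and both the 0.5 threshold and the 80 ms frame duration are assumptions rather than values taken from this card. Overlapping speech is handled naturally because each speaker is thresholded independently.

```swift
// Hypothetical types and names for illustration; not the FluidAudio API.
struct Segment: Equatable {
    let speaker: Int
    let start: Double  // seconds
    let end: Double    // seconds
}

// Converts per-frame probabilities (probs[frame][speaker]) into contiguous
// segments where a speaker's probability stays at or above `threshold`.
func segments(from probs: [[Float]],
              threshold: Float = 0.5,        // assumed decision threshold
              frameDuration: Double = 0.08   // assumed 80 ms per frame
) -> [Segment] {
    var result: [Segment] = []
    let speakerCount = probs.first?.count ?? 0
    for speaker in 0..<speakerCount {
        var startFrame: Int? = nil
        for (frame, row) in probs.enumerated() {
            let active = row[speaker] >= threshold
            if active && startFrame == nil {
                startFrame = frame  // speaker becomes active
            } else if !active, let s = startFrame {
                // Speaker went silent: close the open segment at this frame.
                result.append(Segment(speaker: speaker,
                                      start: Double(s) * frameDuration,
                                      end: Double(frame) * frameDuration))
                startFrame = nil
            }
        }
        // Close a segment that is still open at the end of the input.
        if let s = startFrame {
            result.append(Segment(speaker: speaker,
                                  start: Double(s) * frameDuration,
                                  end: Double(probs.count) * frameDuration))
        }
    }
    return result
}

// Four frames, four speakers: speaker 0 talks first, then speaker 1.
let exampleProbs: [[Float]] = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
let exampleSegments = segments(from: exampleProbs)
for s in exampleSegments {
    print("Speaker \(s.speaker): \(s.start)s - \(s.end)s")
}
```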
  ## Performance

  See the [FluidAudio benchmarks](https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md).

  ## Files

  ### Models

  - `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
  - `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low latency config
  - `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high latency config

  ### Scripts

  - `convert_to_coreml.py` - PyTorch to CoreML conversion
  - `streaming_inference.py` - Python streaming inference example
  - `mic_inference.py` - Real-time microphone demo

  ## Source

  Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)

  ## Credits & Acknowledgements

  This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).

  Their work was instrumental in:

  - **Architecture Conversion:** Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
  - **Build & Optimization:** Engineering the static shape configurations that allow the model to achieve ~120x RTFx on Apple Silicon.
  - **Logic Implementation:** Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.

  This project was built upon the foundational work of the NVIDIA NeMo team.