FluidInference/diar-streaming-sortformer-coreml

@@ -7,7 +7,7 @@
   | Variant | File | Latency | Use Case |
   |---------|------|---------|----------|
-  | **Default** | `Sortformer.mlmodelc` | ~1.12s | Low latency streaming |
   | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
   | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |
@@ -23,22 +23,43 @@
   ## Model Input/Output Shapes
-  Combined model (Sortformer.mlmodelc - default config):
   | Input | Shape | Description |
   |-------|-------|-------------|
-  | chunk | [1, 112, 128] | Mel spectrogram features |
-  | chunk_lengths | [1] | Actual chunk length |
-  | spkcache | [1, 188, 512] | Speaker cache embeddings |
-  | spkcache_lengths | [1] | Actual cache length |
-  | fifo | [1, 40, 512] | FIFO queue embeddings |
-  | fifo_lengths | [1] | Actual FIFO length |
   | Output | Shape | Description |
   |--------|-------|-------------|
-  | speaker_preds | [T, 4] | Speaker probabilities (4 speakers) |
-  | chunk_pre_encoder_embs | [T', 512] | Embeddings for state update |
-  | chunk_pre_encoder_lengths | [1] | Actual embedding count |
   ## Usage with FluidAudio (Swift)
@@ -61,7 +82,7 @@
       }
   }
-  // Or batch processing
   let timeline = try diarizer.processComplete(audioSamples)
   for (speakerIndex, segments) in timeline.segments.enumerated() {
       for segment in segments {
@@ -74,7 +95,7 @@
   | Metric        | Default | NVIDIA High |
   |---------------|---------|-------------|
   | Latency       | ~1.12s  | ~30.4s      |
-  | RTFx (M4 Pro) | ~120x   | ~118x       |
   Files

   | Variant | File | Latency | Use Case |
   |---------|------|---------|----------|
+  | **Default** | `Sortformer.mlmodelc` | ~1.04s | Low latency streaming |
   | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
   | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |
   ## Model Input/Output Shapes
+  **General**:
   | Input | Shape | Description |
   |-------|-------|-------------|
+  | chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
+  | chunk_lengths | `[1]` | Actual chunk length |
+  | spkcache | `[1, S, 512]` | Speaker cache embeddings |
+  | spkcache_lengths | `[1]` | Actual cache length |
+  | fifo | `[1, F, 512]` | FIFO queue embeddings |
+  | fifo_lengths | `[1]` | Actual FIFO length |
   | Output | Shape | Description |
   |--------|-------|-------------|
+  | speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
+  | chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
+  | chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
+  | nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
+  | nest_encoder_lengths | `[1]` | Actual speaker embedding count |
+  Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
+  **Configuration-Specific Shapes**:
+  | Input | Default | NVIDIA Low | NVIDIA High |
+  |-------|---------|------------|-------------|
+  | chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
+  | chunk_lengths | `[1]` | `[1]` | `[1]` |
+  | spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
+  | spkcache_lengths | `[1]` | `[1]` | `[1]` |
+  | fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
+  | fifo_lengths | `[1]` | `[1]` | `[1]` |
+  | Output | Default | NVIDIA Low | NVIDIA High |
+  |--------|---------|------------|-------------|
+  | speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
+  | chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
+  | chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
+  | nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
+  | nest_encoder_lengths | `[1]` | `[1]` | `[1]` |
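The frame counts in the tables above follow from the general formulas, and a small sketch can make the arithmetic explicit. `SortformerShapeConfig` is a hypothetical helper written for this illustration and is not part of FluidAudio. It takes the combined chunk length `C+L+R` (14 or 381, read off the `chunk_pre_encoder_embs` row) rather than the individual `C`, `L`, and `R` values, which this README does not list.

```swift
// Hypothetical helper (not a FluidAudio type): derives tensor shapes from
// the streaming parameters named in the note above.
struct SortformerShapeConfig {
    let chunkFrames: Int   // C + L + R
    let spkcacheLen: Int   // S
    let fifoLen: Int       // F

    // The factor 8 is taken from the `[1, 8*(C+L+R), 128]` chunk formula.
    var chunkShape: [Int] { [1, 8 * chunkFrames, 128] }
    var spkcacheShape: [Int] { [1, spkcacheLen, 512] }
    var fifoShape: [Int] { [1, fifoLen, 512] }
    var chunkPreEncoderEmbsShape: [Int] { [1, chunkFrames, 512] }

    // Frames covered by speaker_preds and nest_encoder_embs: C+L+R+S+F.
    var outputFrames: Int { chunkFrames + spkcacheLen + fifoLen }
    var nestEncoderEmbsShape: [Int] { [1, outputFrames, 192] }
}

let defaultConfig = SortformerShapeConfig(chunkFrames: 14, spkcacheLen: 188, fifoLen: 40)
let nvidiaLow = SortformerShapeConfig(chunkFrames: 14, spkcacheLen: 188, fifoLen: 188)
let nvidiaHigh = SortformerShapeConfig(chunkFrames: 381, spkcacheLen: 188, fifoLen: 40)

for (name, config) in [("Default", defaultConfig), ("NVIDIA Low", nvidiaLow), ("NVIDIA High", nvidiaHigh)] {
    print(name, "chunk:", config.chunkShape, "output frames:", config.outputFrames)
}
```

Checking a model's reported input shapes against values computed this way is one way to catch a mismatched variant before running inference.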
   ## Usage with FluidAudio (Swift)
       }
   }
+  // Or file processing
   let timeline = try diarizer.processComplete(audioSamples)
   for (speakerIndex, segments) in timeline.segments.enumerated() {
       for segment in segments {
   | Metric        | Default | NVIDIA High |
   |---------------|---------|-------------|
   | Latency       | ~1.12s  | ~30.4s      |
+  | RTFx (M4 Max) | ~5.7x   | ~125.3x     |
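To read the RTFx row in practical terms, the sketch below converts an RTFx value into an estimated processing time. It assumes RTFx means audio duration divided by processing time, which is the usual definition but is not stated in this README. `estimatedProcessingSeconds` is a hypothetical helper, not a FluidAudio API.

```swift
// Hypothetical helper: estimated wall-clock seconds needed to process audio
// at a given RTFx. Assumes RTFx = audio duration / processing time.
func estimatedProcessingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

let oneHour = 3600.0
// RTFx values taken from the table above (M4 Max).
let defaultEstimate = estimatedProcessingSeconds(audioSeconds: oneHour, rtfx: 5.7)
let nvidiaHighEstimate = estimatedProcessingSeconds(audioSeconds: oneHour, rtfx: 125.3)
print("Default:", defaultEstimate, "s, NVIDIA High:", nvidiaHighEstimate, "s")
```

Under that assumption, one hour of audio would take roughly ten and a half minutes with the Default variant and under half a minute with NVIDIA High.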
   Files