FluidInference/diar-streaming-sortformer-coreml

Automatic Speech Recognition

speaker-diarization


alexwengg committed on Jan 5

Commit 6b7b7dc (verified) · 1 parent: fb3bc01

Update README.md

Files changed (1)

README.md +31 -10


@@ -23,22 +23,43 @@
   ## Model Input/Output Shapes
-  Combined model (Sortformer.mlmodelc - default config):
   | Input | Shape | Description |
   |-------|-------|-------------|
-  | chunk | [1, 112, 128] | Mel spectrogram features |
-  | chunk_lengths | [1] | Actual chunk length |
-  | spkcache | [1, 188, 512] | Speaker cache embeddings |
-  | spkcache_lengths | [1] | Actual cache length |
-  | fifo | [1, 40, 512] | FIFO queue embeddings |
-  | fifo_lengths | [1] | Actual FIFO length |
   | Output | Shape | Description |
   |--------|-------|-------------|
-  | speaker_preds | [T, 4] | Speaker probabilities (4 speakers) |
-  | chunk_pre_encoder_embs | [T', 512] | Embeddings for state update |
-  | chunk_pre_encoder_lengths | [1] | Actual embedding count |
   ## Usage with FluidAudio (Swift)

   ## Model Input/Output Shapes
+  **General**:
   | Input | Shape | Description |
   |-------|-------|-------------|
+  | chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
+  | chunk_lengths | `[1]` | Actual chunk length |
+  | spkcache | `[1, S, 512]` | Speaker cache embeddings |
+  | spkcache_lengths | `[1]` | Actual cache length |
+  | fifo | `[1, F, 512]` | FIFO queue embeddings |
+  | fifo_lengths | `[1]` | Actual FIFO length |
   | Output | Shape | Description |
   |--------|-------|-------------|
+  | speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
+  | chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
+  | chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
+  | nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
+  | nest_encoder_lengths | `[1]` | Actual speaker embedding count |
+  Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
+  **Configuration-Specific Shapes**:
+  | Input | Default | NVIDIA Low | NVIDIA High |
+  |-------|---------|------------|-------------|
+  | chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
+  | chunk_lengths | `[1]` | `[1]` | `[1]` |
+  | spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
+  | spkcache_lengths | `[1]` | `[1]` | `[1]` |
+  | fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
+  | fifo_lengths | `[1]` | `[1]` | `[1]` |
+  | Output | Default | NVIDIA Low | NVIDIA High |
+  |--------|---------|------------|-------------|
+  | speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
+  | chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
+  | chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
+  | nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
+  | nest_encoder_lengths | `[1]` | `[1]` | `[1]` |
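
The shape formulas in the **General** table can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the model or of FluidAudio: the function name `sortformer_shapes` and its parameter names are hypothetical, and the split of the Default chunk into `C = 6`, `L = 1`, `R = 7` is an assumption (the tables only fix the sum `C + L + R = 14`, since `112 / 8 = 14`). Output shapes follow the General table, which omits the leading batch dimension shown in the configuration-specific tables.

```python
# Feature sizes taken from the General table above.
SUBSAMPLING = 8        # mel frames per encoder frame, from `8*(C+L+R)`
MEL_BINS = 128         # last dimension of `chunk`
PRE_ENCODER_DIM = 512  # last dimension of spkcache / fifo / chunk_pre_encoder_embs
NEST_DIM = 192         # last dimension of nest_encoder_embs
NUM_SPEAKERS = 4       # last dimension of speaker_preds


def sortformer_shapes(chunk_len, left_context, right_context, spkcache_len, fifo_len):
    """Return the tensor shapes implied by the General table (hypothetical helper)."""
    frames = chunk_len + left_context + right_context  # C + L + R
    total = frames + spkcache_len + fifo_len            # C + L + R + S + F
    return {
        "chunk": (1, SUBSAMPLING * frames, MEL_BINS),
        "spkcache": (1, spkcache_len, PRE_ENCODER_DIM),
        "fifo": (1, fifo_len, PRE_ENCODER_DIM),
        "speaker_preds": (total, NUM_SPEAKERS),
        "chunk_pre_encoder_embs": (frames, PRE_ENCODER_DIM),
        "nest_encoder_embs": (total, NEST_DIM),
    }


if __name__ == "__main__":
    # Default configuration: S = 188, F = 40; the 6/1/7 split is assumed.
    for name, shape in sortformer_shapes(6, 1, 7, 188, 40).items():
        print(f"{name}: {list(shape)}")
```

With the Default values this gives a `chunk` of `[1, 112, 128]` and 242 output frames (`14 + 188 + 40`), matching the first dimensions listed for the Default column. Any other split of `C`, `L`, and `R` that sums to 14 produces the same shapes.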
   ## Usage with FluidAudio (Swift)