Update README.md
README.md
CHANGED
@@ -23,22 +23,43 @@
## Model Input/Output Shapes

**General**:

| Input | Shape | Description |
|-------|-------|-------------|
| chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
| chunk_lengths | `[1]` | Actual chunk length |
| spkcache | `[1, S, 512]` | Speaker cache embeddings |
| spkcache_lengths | `[1]` | Actual cache length |
| fifo | `[1, F, 512]` | FIFO queue embeddings |
| fifo_lengths | `[1]` | Actual FIFO length |

| Output | Shape | Description |
|--------|-------|-------------|
| speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
| chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
| nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
| nest_encoder_lengths | `[1]` | Actual speaker embedding count |

Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.

**Configuration-Specific Shapes**:

| Input | Default | NVIDIA Low | NVIDIA High |
|-------|---------|------------|-------------|
| chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
| chunk_lengths | `[1]` | `[1]` | `[1]` |
| spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
| spkcache_lengths | `[1]` | `[1]` | `[1]` |
| fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
| fifo_lengths | `[1]` | `[1]` | `[1]` |

| Output | Default | NVIDIA Low | NVIDIA High |
|--------|---------|------------|-------------|
| speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
| chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
| chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
| nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
| nest_encoder_lengths | `[1]` | `[1]` | `[1]` |
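
The output shapes can be cross-checked against the input shapes using the formulas from the General table. The sketch below is illustrative only: the function name `expected_output_shapes` and the constant names are hypothetical, the factor of 8 is read off the `8*(C+L+R)` chunk shape, the speaker dimension of 4 comes from the General table, and the leading batch dimension of 1 is assumed from the configuration table.

```python
from typing import Dict, Tuple

# Constants read off the shape tables (names are illustrative, not part of any API).
MEL_FRAMES_PER_EMBEDDING = 8   # from the chunk shape [1, 8*(C+L+R), 128]
NUM_SPEAKERS = 4               # from the speaker_preds description
PRE_ENCODER_DIM = 512          # chunk_pre_encoder_embs feature size
NEST_ENCODER_DIM = 192         # nest_encoder_embs feature size


def expected_output_shapes(
    chunk_mel_frames: int, spkcache_len: int, fifo_len: int
) -> Dict[str, Tuple[int, ...]]:
    """Derive output shapes from input sizes using the general formulas."""
    if chunk_mel_frames % MEL_FRAMES_PER_EMBEDDING != 0:
        raise ValueError("chunk length must be a multiple of 8 mel frames")
    chunk_frames = chunk_mel_frames // MEL_FRAMES_PER_EMBEDDING  # C + L + R
    total_frames = chunk_frames + spkcache_len + fifo_len        # C + L + R + S + F
    return {
        # Batch dimension of 1 is an assumption taken from the configuration table.
        "speaker_preds": (1, total_frames, NUM_SPEAKERS),
        "chunk_pre_encoder_embs": (1, chunk_frames, PRE_ENCODER_DIM),
        "nest_encoder_embs": (1, total_frames, NEST_ENCODER_DIM),
    }


# (chunk mel frames, S, F) taken from the input rows of the configuration table.
CONFIGS = {
    "Default": (112, 188, 40),
    "NVIDIA Low": (112, 188, 188),
    "NVIDIA High": (3048, 188, 40),
}

for name, (mel_frames, s, f) in CONFIGS.items():
    print(name, expected_output_shapes(mel_frames, s, f))
```

For the Default configuration this gives `C+L+R = 112 / 8 = 14` and `14 + 188 + 40 = 242` total frames, which matches the frame counts listed for `chunk_pre_encoder_embs` and `nest_encoder_embs`.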
## Usage with FluidAudio (Swift)