soniqo/Pyannote-Segmentation-LiteRT

+---
+license: mit
+language: multilingual
+tags:
+  - speaker-diarization
+  - voice-activity-detection
+  - pyannote
+  - litert
+  - tflite
+  - on-device
+  - android
+base_model: pyannote/segmentation-3.0
+library_name: litert
+pipeline_tag: automatic-speech-recognition
+---
+# Pyannote Segmentation 3.0 — LiteRT (streaming)
+Powerset speaker segmentation (up to 3 local speakers) for Android,
+exported in a streaming 1-second chunk configuration.
+## Model
+| Property | Value |
+|---|---|
+| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
+| Parameters | ~1.5 M |
+| Format | LiteRT (TFLite) |
+| Quantization | float32 |
+| Sample rate | 16 000 Hz |
+| Chunk | 1 second (16 000 samples) |
+| Output frames | 56 per chunk |
+| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |
+## Files
+| File | Size | Description |
+|---|---|---|
+| `pyannote-segmentation.tflite` | 6.93 MiB (7 265 360 bytes) | Full model, FP32 |
+| `config.json` | 1 KB | Signature + usage hints |
+## Why streaming chunks
+At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM
+time steps. litert-torch has no native `aten.lstm` lowering, so it unrolls
+the LSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions).
+On the unrolled graph, the MLIR optimizer either hangs for hours or fails
+on duplicate `jax_lowering_*` symbols from repeated helper functions.
+Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
+produces a valid TFLite model. To cover the full 10-second window, the
+caller runs 10 chunks in sequence and feeds each call's `lstm_state_out`
+back in as the next call's `lstm_state`. Each chunk produces 56 frames of
+powerset posteriors.
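The chunk loop described above could be sketched roughly as follows. This is an illustration, not the SDK's implementation: `runChunk` and `ChunkResult` are hypothetical names standing in for one interpreter invocation, and the sizes are taken from the Signature section.

```kotlin
val chunkSamples = 16_000           // 1 s at 16 kHz
val framesPerChunk = 56
val numClasses = 7
val stateSize = 2 * 8 * 1 * 128     // flattened lstm_state

/** Hypothetical result of one model call: flattened [56, 7] scores plus the next LSTM state. */
class ChunkResult(val posteriors: FloatArray, val nextState: FloatArray)

/**
 * Runs consecutive 1-second chunks over `window`, carrying the LSTM state
 * from each call into the next. `runChunk` is a hypothetical wrapper around
 * a single interpreter invocation. Returns the scores flattened as
 * [numChunks * 56, 7].
 */
fun segmentWindow(
    window: FloatArray,
    runChunk: (audio: FloatArray, state: FloatArray) -> ChunkResult,
): FloatArray {
    require(window.size % chunkSamples == 0) { "window must be a whole number of chunks" }
    val numChunks = window.size / chunkSamples
    val blockSize = framesPerChunk * numClasses
    val all = FloatArray(numChunks * blockSize)
    var state = FloatArray(stateSize)   // zeros for the first chunk
    for (k in 0 until numChunks) {
        val audio = window.copyOfRange(k * chunkSamples, (k + 1) * chunkSamples)
        val result = runChunk(audio, state)
        result.posteriors.copyInto(all, destinationOffset = k * blockSize)
        state = result.nextState        // lstm_state_out becomes the next lstm_state
    }
    return all
}
```

For a 10-second window (160 000 samples) this calls `runChunk` 10 times and returns 560 × 7 values. Resetting the state to zeros at the start of each window is a choice made for this sketch.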
+The SincNet frontend has small per-chunk edge effects: 10 × 56 = 560
+frames versus 589 in the original model. Overlap chunks by ~500 ms on
+boundaries where high-precision stitching is required.
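The frame arithmetic above can be made concrete with a small sketch. It assumes frames are evenly spaced within each chunk, which is an approximation adopted here for illustration and not a statement about the model's exact receptive field. Both function names are hypothetical.

```kotlin
/**
 * Approximate centre time (in seconds) of frame `frame` within chunk `chunk`,
 * assuming uniform frame spacing inside each chunk.
 */
fun frameCenterSeconds(
    chunk: Int,
    frame: Int,
    framesPerChunk: Int = 56,
    chunkSeconds: Double = 1.0,
): Double {
    require(frame in 0 until framesPerChunk) { "frame index out of range" }
    return (chunk + (frame + 0.5) / framesPerChunk) * chunkSeconds
}

/** Number of frames the streaming export yields fewer than the original window. */
fun frameDeficit(chunks: Int = 10, framesPerChunk: Int = 56, originalFrames: Int = 589): Int =
    originalFrames - chunks * framesPerChunk
```

With the defaults, `frameDeficit()` is 589 - 560 = 29 frames per 10-second window. Under the uniform-spacing assumption, a streamed frame covers about 1000 / 56 ≈ 17.9 ms, compared with about 10 000 / 589 ≈ 17.0 ms per frame in the original model.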
+## Signature
+```
+Inputs:
+  audio         [1, 1, 16000]     float32   1 s of audio @ 16 kHz
+  lstm_state    [2, 8, 1, 128]    float32   (h, c), zeros on first chunk
+Outputs:
+  posteriors    [1, 56, 7]        float32   powerset posteriors
+  lstm_state_out [2, 8, 1, 128]   float32   next-chunk state
+```
+Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}` — up to 3 local
+speakers, no triple-overlap class.
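One way to turn the 7 powerset classes into per-speaker activity is an argmax per frame followed by a table lookup. The sketch below assumes the class indices follow the order listed above (0 = ∅, 1 = s1, ..., 6 = s2∪s3); verify that against the exported model before relying on it. `powersetSpeakers` and `decodePowerset` are hypothetical names.

```kotlin
// Speaker indices (0-based) active in each powerset class, in the listed order.
val powersetSpeakers = arrayOf(
    intArrayOf(),       // ∅
    intArrayOf(0),      // s1
    intArrayOf(1),      // s2
    intArrayOf(2),      // s3
    intArrayOf(0, 1),   // s1∪s2
    intArrayOf(0, 2),   // s1∪s3
    intArrayOf(1, 2),   // s2∪s3
)

/**
 * Converts flattened [frames, 7] scores into per-frame speaker activity
 * [frames][3] by taking the highest-scoring class in each frame.
 */
fun decodePowerset(
    scores: FloatArray,
    numClasses: Int = 7,
    numSpeakers: Int = 3,
): Array<BooleanArray> {
    require(scores.size % numClasses == 0) { "scores must hold whole frames" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(numSpeakers).also { active ->
            powersetSpeakers[best].forEach { active[it] = true }
        }
    }
}
```

Because argmax is unchanged by a monotonic transform, this works the same whether the scores are probabilities or log-probabilities.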
+## Usage
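+```kotlin
+import java.nio.FloatBuffer
+import org.tensorflow.lite.Interpreter
+
+// loadModelFile is an app-side helper (not shown), assumed to return the model as a ByteBuffer.
+val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
+var state = FloatArray(2 * 8 * 1 * 128) // zeros on first call
+
+fun segment(chunk: FloatArray): FloatArray {
+    require(chunk.size == 16_000) { "expected 1 s of audio at 16 kHz" }
+    val out = FloatArray(1 * 56 * 7)
+    val nextState = FloatArray(state.size)
+    // Inputs are passed as an array and outputs as an index-keyed map.
+    // Flat arrays are wrapped in FloatBuffers so they are copied by element
+    // count rather than matched against the tensors' multi-dimensional shapes.
+    // Indices assume the tensor order shown in the Signature section.
+    model.runForMultipleInputsOutputs(
+        arrayOf<Any>(FloatBuffer.wrap(chunk), FloatBuffer.wrap(state)),
+        mapOf<Int, Any>(0 to FloatBuffer.wrap(out), 1 to FloatBuffer.wrap(nextState)),
+    )
+    state = nextState
+    return out // [56, 7] log-probs
+}
+```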
+## Source
+Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
+(MIT, gated — accept the license on the upstream page).
+## Links
+- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
+- [soniqo.audio](https://soniqo.audio) — website
+- [blog](https://soniqo.audio/blog) — blog

config.json ADDED

	@@ -0,0 +1,61 @@

+{
+  "model": "pyannote-segmentation-3.0",
+  "format": "tflite",
+  "mode": "streaming",
+  "sample_rate": 16000,
+  "chunk_duration": 1.0,
+  "full_window_duration": 10.0,
+  "full_window_step": 5.0,
+  "num_chunks_per_window": 10,
+  "num_powerset_classes": 7,
+  "max_local_speakers": 3,
+  "frames_per_chunk": 56,
+  "frames_per_window": 560,
+  "lstm_state_shape": [
+    2,
+    8,
+    1,
+    128
+  ],
+  "inputs": {
+    "audio": {
+      "shape": [
+        1,
+        1,
+        16000
+      ],
+      "dtype": "float32"
+    },
+    "lstm_state": {
+      "shape": [
+        2,
+        8,
+        1,
+        128
+      ],
+      "dtype": "float32",
+      "note": "Pass zeros on first chunk. Carry forward between chunks."
+    }
+  },
+  "outputs": {
+    "posteriors": {
+      "shape": [
+        1,
+        56,
+        7
+      ],
+      "dtype": "float32"
+    },
+    "lstm_state_out": {
+      "shape": [
+        2,
+        8,
+        1,
+        128
+      ],
+      "dtype": "float32",
+      "note": "Feed back as lstm_state for the next chunk."
+    }
+  },
+  "usage": "Run 10 consecutive 1-second chunks with state carried between calls to reconstruct a full 10-second segmentation window. Initialize lstm_state to zeros for the first chunk."
+}

pyannote-segmentation.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0232d4098c5069d012b92cb4b5d8cf148807777aa214203e4706a282e640f259
+size 7265360