aufklarer committed on
Commit 97d0228 · verified · 1 Parent(s): 87a67d4

Initial LiteRT upload

Files changed (3)
  1. README.md +101 -0
  2. config.json +61 -0
  3. pyannote-segmentation.tflite +3 -0
README.md ADDED
@@ -0,0 +1,101 @@
1
+ ---
2
+ license: mit
3
+ language: multilingual
4
+ tags:
5
+ - speaker-diarization
6
+ - voice-activity-detection
7
+ - pyannote
8
+ - litert
9
+ - tflite
10
+ - on-device
11
+ - android
12
+ base_model: pyannote/segmentation-3.0
13
+ library_name: litert
14
+ pipeline_tag: automatic-speech-recognition
15
+ ---
16
+
17
+ # Pyannote Segmentation 3.0 — LiteRT (streaming)
18
+
19
+ Powerset speaker segmentation (up to 3 local speakers) for Android,
20
+ exported in a streaming 1-second chunk configuration.
21
+
22
+ ## Model
23
+
24
+ | Property | Value |
25
+ |---|---|
26
+ | Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
27
+ | Parameters | ~1.5 M |
28
+ | Format | LiteRT (TFLite) |
29
+ | Quantization | float32 |
30
+ | Sample rate | 16 000 Hz |
31
+ | Chunk | 1 second (16 000 samples) |
32
+ | Output frames | 56 per chunk |
33
+ | LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |
34
+
35
+ ## Files
36
+
37
+ | File | Size | Description |
38
+ |---|---|---|
39
+ | `pyannote-segmentation.tflite` | 6.93 MiB (7 265 360 bytes) | Full model, FP32 |
40
+ | `config.json` | 1 KB | Signature + usage hints |
41
+
42
+ ## Why streaming chunks
43
+
44
+ pyannote/segmentation-3.0 at its trained 10-second window has 589 BiLSTM
45
+ time steps. litert-torch has no native `aten.lstm` lowering and unrolls
46
+ it into ~4700 cell operations. The MLIR optimizer then either hangs
47
+ for hours or fails on duplicate `jax_lowering_*` symbols from repeated
48
+ helper functions.
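The ~4700 figure can be reproduced from the layer count in the Model table. A quick check, assuming one cell operation per time step, layer, and direction (`unrolledCells` is a name made up for this illustration, not part of litert-torch):

```kotlin
// Hypothetical helper: counts LSTM cell invocations once the recurrence is unrolled.
// Assumes one cell operation per (time step, layer, direction).
fun unrolledCells(timeSteps: Int, layers: Int = 4, directions: Int = 2): Int =
    timeSteps * layers * directions
```

`unrolledCells(589)` gives 4712 for the 10-second window, while `unrolledCells(56)` gives 448 for a 1-second chunk, roughly a tenfold smaller graph.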
49
+
50
+ Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
51
+ produces a valid TFLite model. The caller runs 10 chunks in sequence, passing
52
+ `lstm_state_out → lstm_state` between calls, to cover the full 10-second
53
+ window. Each chunk produces 56 frames of powerset posteriors.
54
+
55
+ The SincNet frontend has small per-chunk edge effects: 10 × 56 = 560
56
+ frames versus 589 in the original model. Overlap chunks by ~500 ms on
57
+ boundaries where high-precision stitching is required.
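To place chunked output on a timeline, each frame needs a timestamp. A minimal sketch, assuming the 56 frames are evenly spaced across each 1-second chunk; the README does not state the exact receptive-field alignment, so the result is approximate. `frameCenterSeconds` is a hypothetical helper:

```kotlin
// Hypothetical helper: approximate centre time (in seconds) of one output frame.
// Assumes frames are evenly spaced within a chunk, which the README does not confirm.
fun frameCenterSeconds(
    chunkIndex: Int,
    frameIndex: Int,
    framesPerChunk: Int = 56,
    chunkSeconds: Double = 1.0,
): Double {
    require(chunkIndex >= 0) { "chunkIndex must be non-negative" }
    require(frameIndex in 0 until framesPerChunk) { "frameIndex out of range" }
    return (chunkIndex + (frameIndex + 0.5) / framesPerChunk) * chunkSeconds
}
```

Under this assumption the frame spacing is 1/56 s, about 17.9 ms, versus about 17.0 ms (10 s / 589) in the original 10-second model.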
58
+
59
+ ## Signature
60
+
61
+ ```
62
+ Inputs:
63
+ audio [1, 1, 16000] float32 1 s of audio @ 16 kHz
64
+ lstm_state [2, 8, 1, 128] float32 (h, c), zeros on first chunk
65
+
66
+ Outputs:
67
+ posteriors [1, 56, 7] float32 powerset posteriors
68
+ lstm_state_out [2, 8, 1, 128] float32 next-chunk state
69
+ ```
70
+
71
+ Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}` — up to 3 local
72
+ speakers, no triple-overlap class.
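Turning the 7 powerset scores into per-speaker activity is a per-frame argmax followed by a table lookup. The sketch below assumes the class indices follow the order listed above and that the output is a flattened `[frames, 7]` array. `POWERSET_CLASSES` and `decodePowerset` are illustrative names, not part of any SDK:

```kotlin
// Speaker indices (0-based) active in each powerset class, in the listed order:
// {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}. The index order is an assumption.
val POWERSET_CLASSES: List<IntArray> = listOf(
    intArrayOf(), intArrayOf(0), intArrayOf(1), intArrayOf(2),
    intArrayOf(0, 1), intArrayOf(0, 2), intArrayOf(1, 2),
)

// Hypothetical helper: per-frame argmax over the 7 classes, expanded to 3 speaker flags.
fun decodePowerset(scores: FloatArray, numFrames: Int = 56): Array<BooleanArray> {
    val numClasses = POWERSET_CLASSES.size
    require(scores.size == numFrames * numClasses) { "expected $numFrames x $numClasses scores" }
    return Array(numFrames) { frame ->
        val base = frame * numClasses
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[base + c] > scores[base + best]) best = c
        }
        // Mark every speaker that belongs to the winning class.
        BooleanArray(3).also { active -> POWERSET_CLASSES[best].forEach { active[it] = true } }
    }
}
```

Because argmax is unchanged by a monotonic transform, the same code applies whether the model emits probabilities or log-probabilities.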
73
+
74
+ ## Usage
75
+
76
+ ```kotlin
77
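+ import java.nio.ByteBuffer
+ import java.nio.ByteOrder
+ import java.nio.FloatBuffer
+ import org.tensorflow.lite.Interpreter
+
+ // loadModelFile() is app-provided and returns the model as a MappedByteBuffer.
+ val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
+ var state = FloatArray(2 * 8 * 1 * 128) // zero on first call
+
+ // Copies the floats into a native-order direct buffer for the interpreter.
+ fun FloatArray.toDirectBuffer(): ByteBuffer =
+     ByteBuffer.allocateDirect(size * 4).order(ByteOrder.nativeOrder())
+         .also { it.asFloatBuffer().put(this) }
+
+ fun segment(chunk: FloatArray): FloatArray {
+     val out = FloatBuffer.allocate(1 * 56 * 7)
+     val nextState = FloatBuffer.allocate(state.size)
+     // Index-keyed I/O goes through runForMultipleInputsOutputs; runSignature takes name keys.
+     model.runForMultipleInputsOutputs(
+         arrayOf<Any>(chunk.toDirectBuffer(), state.toDirectBuffer()),
+         mapOf<Int, Any>(0 to out, 1 to nextState),
+     )
+     state = nextState.array()
+     return out.array() // [56, 7] log-probs, flattened
+ }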
90
+ ```
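The `segment` function in the Usage snippet handles a single chunk. A full window could be assembled roughly as follows. This is a sketch: `segmentWindow` and its constants are illustrative names, and the per-chunk inference is passed in as a lambda so the loop stays independent of the interpreter.

```kotlin
const val CHUNK_SAMPLES = 16_000   // 1 second at 16 kHz
const val FRAMES_PER_CHUNK = 56
const val NUM_CLASSES = 7

// Hypothetical helper: splits audio into 1-second chunks, runs them in order,
// and concatenates the flattened [56, 7] outputs. runChunk is expected to carry
// the LSTM state between calls itself, as segment() does in the Usage snippet.
fun segmentWindow(audio: FloatArray, runChunk: (FloatArray) -> FloatArray): FloatArray {
    require(audio.size % CHUNK_SAMPLES == 0) { "audio must be a whole number of 1-second chunks" }
    val numChunks = audio.size / CHUNK_SAMPLES
    val chunkScores = FRAMES_PER_CHUNK * NUM_CLASSES
    val window = FloatArray(numChunks * chunkScores)
    for (k in 0 until numChunks) {
        val chunk = audio.copyOfRange(k * CHUNK_SAMPLES, (k + 1) * CHUNK_SAMPLES)
        val scores = runChunk(chunk)
        require(scores.size == chunkScores) { "chunk $k returned ${scores.size} scores" }
        scores.copyInto(window, destinationOffset = k * chunkScores)
    }
    return window
}
```

With the Usage snippet in scope, `segmentWindow(tenSecondsOfAudio, ::segment)` would return 560 × 7 scores. Reset `state` to zeros before each new window if windows are meant to be independent.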
91
+
92
+ ## Source
93
+
94
+ Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
95
+ (MIT, gated — accept the license on the upstream page).
96
+
97
+ ## Links
98
+
99
+ - [speech-android](https://github.com/soniqo/speech-android) — Android SDK
100
+ - [soniqo.audio](https://soniqo.audio) — website
101
+ - [blog](https://soniqo.audio/blog) — blog
config.json ADDED
@@ -0,0 +1,61 @@
1
+ {
2
+ "model": "pyannote-segmentation-3.0",
3
+ "format": "tflite",
4
+ "mode": "streaming",
5
+ "sample_rate": 16000,
6
+ "chunk_duration": 1.0,
7
+ "full_window_duration": 10.0,
8
+ "full_window_step": 5.0,
9
+ "num_chunks_per_window": 10,
10
+ "num_powerset_classes": 7,
11
+ "max_local_speakers": 3,
12
+ "frames_per_chunk": 56,
13
+ "frames_per_window": 560,
14
+ "lstm_state_shape": [
15
+ 2,
16
+ 8,
17
+ 1,
18
+ 128
19
+ ],
20
+ "inputs": {
21
+ "audio": {
22
+ "shape": [
23
+ 1,
24
+ 1,
25
+ 16000
26
+ ],
27
+ "dtype": "float32"
28
+ },
29
+ "lstm_state": {
30
+ "shape": [
31
+ 2,
32
+ 8,
33
+ 1,
34
+ 128
35
+ ],
36
+ "dtype": "float32",
37
+ "note": "Pass zeros on first chunk. Carry forward between chunks."
38
+ }
39
+ },
40
+ "outputs": {
41
+ "posteriors": {
42
+ "shape": [
43
+ 1,
44
+ 56,
45
+ 7
46
+ ],
47
+ "dtype": "float32"
48
+ },
49
+ "lstm_state_out": {
50
+ "shape": [
51
+ 2,
52
+ 8,
53
+ 1,
54
+ 128
55
+ ],
56
+ "dtype": "float32",
57
+ "note": "Feed back as lstm_state for the next chunk."
58
+ }
59
+ },
60
+ "usage": "Run 10 consecutive 1-second chunks with state carried between calls to reconstruct a full 10-second segmentation window. Initialize lstm_state to zeros for the first chunk."
61
+ }
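The shapes in config.json determine the buffer sizes a caller has to allocate. A small sketch of that arithmetic, with the shapes copied by hand from the file (no JSON parsing) and with illustrative helper names:

```kotlin
// Hypothetical helper: number of elements in a tensor of the given shape.
fun elementCount(shape: List<Int>): Int = shape.fold(1) { acc, dim -> acc * dim }

// float32 tensors use 4 bytes per element.
fun byteSize(shape: List<Int>): Int = elementCount(shape) * 4

// Shapes copied from config.json.
val audioShape = listOf(1, 1, 16000)
val lstmStateShape = listOf(2, 8, 1, 128)
val posteriorsShape = listOf(1, 56, 7)
```

This gives 16 000 floats (64 000 bytes) for `audio`, 2 048 floats (8 192 bytes) for `lstm_state`, and 392 floats (1 568 bytes) for `posteriors`. It also confirms `frames_per_window`: 10 chunks × 56 frames = 560.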
pyannote-segmentation.tflite ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0232d4098c5069d012b92cb4b5d8cf148807777aa214203e4706a282e640f259
3
+ size 7265360
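The three lines above are a Git LFS pointer, not the model itself. A minimal sketch of reading such a pointer so that a downloaded file's size can be checked against it; `LfsPointer`, `parseLfsPointer`, and `mebibytes` are names invented for this example:

```kotlin
data class LfsPointer(val version: String, val oid: String, val size: Long)

// Hypothetical helper: parses the "key value" lines of a Git LFS pointer file.
fun parseLfsPointer(text: String): LfsPointer {
    val fields = text.lines()
        .filter { it.isNotBlank() }
        .associate { line ->
            val (key, value) = line.trim().split(" ", limit = 2)
            key to value
        }
    return LfsPointer(
        version = fields.getValue("version"),
        oid = fields.getValue("oid").removePrefix("sha256:"),
        size = fields.getValue("size").toLong(),
    )
}

// Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
fun mebibytes(bytes: Long): Double = bytes / (1024.0 * 1024.0)
```

For this pointer the size field is 7 265 360 bytes, which is about 6.93 MiB and matches the size quoted in the README's Files table.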