---
license: mit
language:
  - multilingual
tags:
  - speaker-diarization
  - voice-activity-detection
  - pyannote
  - diarization
  - litert
  - tflite
  - on-device
  - soniqo
  - speech-cloud
  - speech-core
base_model: pyannote/segmentation-3.0
library_name: litert
pipeline_tag: voice-activity-detection
---

# Pyannote Segmentation 3.0 - LiteRT

Speaker-aware segmentation for diarization pipelines. 16 kHz input, 1-second streaming chunks.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Meeting transcription](https://soniqo.audio/transcription/)
- [Long-form transcription](https://soniqo.audio/long-form-speech/)

Powerset speaker segmentation (up to 3 local speakers) for Android,
exported in a streaming 1-second chunk configuration.

## Model

| Property | Value |
|---|---|
| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
| Parameters | ~1.5 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Chunk | 1 second (16 000 samples) |
| Output frames | 56 per chunk |
| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |

## Files

| File | Size | Description |
|---|---|---|
| `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 |
| `config.json` | 1 KB | Signature + usage hints |

## Why streaming chunks

At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM
time steps. litert-torch has no native `aten.lstm` lowering, so it unrolls
the LSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions).
The MLIR optimizer then either hangs for hours or fails on duplicate
`jax_lowering_*` symbols from repeated helper functions.

Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
produces a valid TFLite model. The caller runs 10 chunks in sequence, passing
`lstm_state_out → lstm_state` between calls, to cover the full 10-second
window. Each chunk produces 56 frames of powerset posteriors.
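
The chunk loop described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: the `step` parameter is a hypothetical stand-in for one model invocation, and zero-padding a trailing partial chunk is an assumption made here for simplicity.

```kotlin
val chunkSamples = 16_000          // 1 s at 16 kHz
val stateSize = 2 * 8 * 1 * 128    // flattened [2, 8, 1, 128]

// Runs consecutive 1-second chunks, feeding each call's output state into the next.
// `step` is a hypothetical wrapper: (audio chunk, state) -> (posteriors, next state).
// Returns one flattened [56, 7] array per chunk.
fun runWindow(
    audio: FloatArray,
    step: (FloatArray, FloatArray) -> Pair<FloatArray, FloatArray>,
): List<FloatArray> {
    var state = FloatArray(stateSize)   // zeros on the first chunk
    val results = mutableListOf<FloatArray>()
    var offset = 0
    while (offset < audio.size) {
        val end = minOf(offset + chunkSamples, audio.size)
        // copyOf pads a short final chunk with zeros (an assumption of this sketch).
        val chunk = audio.copyOfRange(offset, end).copyOf(chunkSamples)
        val (posteriors, nextState) = step(chunk, state)
        results += posteriors
        state = nextState
        offset += chunkSamples
    }
    return results
}
```

Passing ten seconds of audio yields ten result arrays, which together cover the upstream model's 10-second window.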

The SincNet frontend has small per-chunk edge effects: 10 × 56 = 560
frames versus 589 in the original model. Overlap chunks by ~500 ms on
boundaries where high-precision stitching is required.
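
To place chunked output on a timeline, one simple option is to treat the 56 frames as evenly spaced across each 1-second chunk. The helper below is a rough sketch under that assumption; the uniform spacing is an approximation introduced here (it ignores the edge effects mentioned above), and the function name is hypothetical.

```kotlin
val framesPerChunk = 56
val chunkSeconds = 1.0

// Approximate centre time (seconds) of frame `frame` within chunk `chunkIndex`,
// assuming the 56 frames are spread uniformly over the 1-second chunk.
fun frameCenterSeconds(chunkIndex: Int, frame: Int): Double {
    require(chunkIndex >= 0) { "chunkIndex must be non-negative" }
    require(frame in 0 until framesPerChunk) { "frame out of range" }
    return chunkIndex * chunkSeconds + (frame + 0.5) * chunkSeconds / framesPerChunk
}
```

Under this approximation each frame spans about 17.9 ms (1 s / 56).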

## Signature

```
Inputs:
  audio         [1, 1, 16000]     float32   1 s of audio @ 16 kHz
  lstm_state    [2, 8, 1, 128]    float32   (h, c), zeros on first chunk

Outputs:
  posteriors    [1, 56, 7]        float32   powerset posteriors
  lstm_state_out [2, 8, 1, 128]   float32   next-chunk state
```

Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}`, covering up to 3 local
speakers with no triple-overlap class.
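
A powerset frame can be turned into per-speaker activity flags by picking the highest-scoring class and looking up which speakers it contains. The sketch below assumes the class order in the tensor matches the order listed above; `powersetSpeakers` and `decodePowerset` are hypothetical names, not part of any library.

```kotlin
// Speaker sets per powerset class, in the order listed above
// (assumed to match the layout of the last output dimension).
val powersetSpeakers: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2),
    setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

// Converts flattened [frames, 7] scores into per-frame activity flags for 3 speakers
// by taking the highest-scoring class in each frame.
fun decodePowerset(scores: FloatArray, numClasses: Int = 7): List<BooleanArray> {
    require(scores.size % numClasses == 0) { "scores must be [frames, $numClasses]" }
    return List(scores.size / numClasses) { frame ->
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[frame * numClasses + c] > scores[frame * numClasses + best]) best = c
        }
        BooleanArray(3) { speaker -> speaker in powersetSpeakers[best] }
    }
}
```

Because only the arg-max is used, the same code works whether the scores are probabilities or log-probabilities.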

## Usage

```kotlin
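import java.nio.ByteBuffer
import java.nio.ByteOrder
import org.tensorflow.lite.Interpreter

// loadModelFile is an app-side helper (not shown) returning the model as a ByteBuffer.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))

fun floats(n: Int): ByteBuffer = ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder())

var state = floats(2 * 8 * 1 * 128) // zero on first call

fun segment(chunk: FloatArray): FloatArray {
    val audio = floats(chunk.size).apply { asFloatBuffer().put(chunk) }
    val out = floats(1 * 56 * 7)
    val nextState = floats(2 * 8 * 1 * 128)
    // Tensor indices are assumed to follow the order in the Signature section.
    model.runForMultipleInputsOutputs(
        arrayOf<Any>(audio, state),
        mapOf<Int, Any>(0 to out, 1 to nextState),
    )
    state = nextState.apply { rewind() }
    out.rewind()
    return FloatArray(56 * 7).also { out.asFloatBuffer().get(it) } // [56, 7] log-probs
}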
```

## Source

Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
(MIT, gated; accept the license on the upstream page).

## Links

- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**MIT**). See the
linked `base_model` repository for the full terms.