---
license: mit
language:
- multilingual
tags:
- speaker-diarization
- voice-activity-detection
- pyannote
- diarization
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: pyannote/segmentation-3.0
library_name: litert
pipeline_tag: voice-activity-detection
---
# Pyannote Segmentation 3.0 - LiteRT
Speaker-aware segmentation for diarization pipelines. 16 kHz audio, processed in 1-second streaming chunks.
> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).
## Use cases on soniqo.audio
- [Meeting transcription](https://soniqo.audio/transcription/)
- [Long-form transcription](https://soniqo.audio/long-form-speech/)
Powerset speaker segmentation (up to 3 local speakers) for Android,
exported in a streaming 1-second chunk configuration.
## Model
| Property | Value |
|---|---|
| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
| Parameters | ~1.5 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Chunk | 1 second (16 000 samples) |
| Output frames | 56 per chunk |
| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |
## Files
| File | Size | Description |
|---|---|---|
| `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 |
| `config.json` | 1 KB | Signature + usage hints |
## Why streaming chunks
At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM
time steps. litert-torch has no native `aten.lstm` lowering, so it unrolls
the LSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions).
On the unrolled graph, the MLIR optimizer either hangs for hours or fails
on duplicate `jax_lowering_*` symbols from repeated helper functions.
Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
produces a valid TFLite. The caller runs 10 chunks in sequence, passing
`lstm_state_out → lstm_state` between calls, to cover the full 10-second
window. Each chunk produces 56 frames of powerset posteriors.
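The chunk loop described above can be sketched as follows. This is a minimal illustration, not the SDK's code: `runChunk` and `ChunkResult` are hypothetical stand-ins for one model invocation (audio and state in, posteriors and next state out), and the sizes are taken from the Signature section.

```kotlin
val CHUNK_SAMPLES = 16_000            // 1 s at 16 kHz
val FRAMES_PER_CHUNK = 56
val NUM_CLASSES = 7
val STATE_SIZE = 2 * 8 * 1 * 128      // flattened lstm_state

/** Result of one model call: flattened [56, 7] posteriors plus the next LSTM state. */
class ChunkResult(val posteriors: FloatArray, val nextState: FloatArray)

/**
 * Runs consecutive 1-second chunks over `window`, threading the LSTM state
 * from each call into the next. `runChunk` is a hypothetical wrapper around
 * the interpreter call. Trailing samples that do not fill a chunk are ignored.
 */
fun segmentWindow(
    window: FloatArray,
    runChunk: (audio: FloatArray, state: FloatArray) -> ChunkResult,
): FloatArray {
    val numChunks = window.size / CHUNK_SAMPLES
    val all = FloatArray(numChunks * FRAMES_PER_CHUNK * NUM_CLASSES)
    var state = FloatArray(STATE_SIZE)    // zeros on the first chunk
    for (i in 0 until numChunks) {
        val audio = window.copyOfRange(i * CHUNK_SAMPLES, (i + 1) * CHUNK_SAMPLES)
        val result = runChunk(audio, state)
        result.posteriors.copyInto(all, i * FRAMES_PER_CHUNK * NUM_CLASSES)
        state = result.nextState          // lstm_state_out -> lstm_state
    }
    return all                            // [numChunks * 56, 7], row-major
}
```

Passing a 160 000-sample window produces 10 calls and a `[560, 7]` result; resetting the state to zeros happens only at the start of each window.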
The SincNet frontend has small per-chunk edge effects: 10 × 56 = 560
frames versus 589 in the original model. Overlap chunks by ~500 ms on
boundaries where high-precision stitching is required.
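To illustrate the overlap suggestion, the helper below lists chunk start offsets for a given hop. It is only a sketch: `chunkStarts` is a hypothetical name, the 8 000-sample default hop corresponds to the ~500 ms overlap mentioned above, and how LSTM state and overlapping frames are reconciled is left open because this card does not specify it.

```kotlin
/**
 * Start offsets (in samples) of fixed-size chunks over a signal, advancing
 * by `hop` samples. hop == chunk gives back-to-back chunks; hop < chunk
 * gives overlapping ones. Only full chunks are returned.
 */
fun chunkStarts(totalSamples: Int, chunk: Int = 16_000, hop: Int = 8_000): List<Int> {
    require(chunk > 0 && hop > 0) { "chunk and hop must be positive" }
    if (totalSamples < chunk) return emptyList()
    return (0..totalSamples - chunk step hop).toList()
}
```

For a 10-second window this yields 19 chunks at an 8 000-sample hop, versus 10 when chunks are back to back.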
## Signature
```
Inputs:
audio [1, 1, 16000] float32 1 s of audio @ 16 kHz
lstm_state [2, 8, 1, 128] float32 (h, c), zeros on first chunk
Outputs:
posteriors [1, 56, 7] float32 powerset posteriors
lstm_state_out [2, 8, 1, 128] float32 next-chunk state
```
Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}`: up to 3 local
speakers, no triple-overlap class.
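One way to turn the powerset output into per-speaker activity is to take the best-scoring class in each frame and expand it into its speaker set. The sketch below assumes the class order listed above and a flattened `[frames, 7]` array; `POWERSET` and `decodePowerset` are hypothetical names, not part of the bundle.

```kotlin
// Class order as listed above: ∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3.
// Each entry is the set of 0-based speaker indices active for that class.
val POWERSET: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2),
    setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

/**
 * Converts flattened [frames, 7] powerset scores into a [frames][3] activity
 * matrix by picking the best-scoring class per frame. Argmax gives the same
 * result for probabilities and log-probabilities.
 */
fun decodePowerset(scores: FloatArray, numClasses: Int = 7): Array<BooleanArray> {
    require(scores.size % numClasses == 0) { "scores must be [frames, $numClasses]" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(3) { speaker -> speaker in POWERSET[best] }
    }
}
```

Speaker indices are local to the window being decoded, so a later clustering step is still needed to link them across windows.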
## Usage
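```kotlin
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.FloatBuffer

// loadModelFile() is app-specific; it should return the model as a File or ByteBuffer.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
// Direct, native-order float buffer; zero-filled on allocation.
fun floats(n: Int): FloatBuffer =
    ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder()).asFloatBuffer()

var state = floats(2 * 8 * 1 * 128) // zeros on first call

fun segment(chunk: FloatArray): FloatArray { // chunk: 16 000 samples
    val audio = floats(chunk.size).apply { put(chunk); rewind() }
    val out = floats(1 * 56 * 7)
    val nextState = floats(2 * 8 * 1 * 128)
    // Keys are assumed to match the tensor names in the Signature section.
    model.runSignature(
        mapOf("audio" to audio, "lstm_state" to state),
        mapOf("posteriors" to out, "lstm_state_out" to nextState),
    )
    state = nextState.apply { rewind() }
    return FloatArray(56 * 7).also { out.rewind(); out.get(it) } // [56, 7] log-probs
}
```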
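As a possible post-processing step, the per-frame output can be collapsed into speech intervals. The sketch below assumes class 0 is the no-speaker class (per the powerset order in the Signature section) and that the 56 frames evenly span each 1-second chunk; `speechSegments` is a hypothetical helper, not part of the bundle.

```kotlin
/**
 * Merges consecutive frames whose best class is not 0 (no speaker) into
 * (startSec, endSec) intervals. `scores` is flattened [frames, 7].
 */
fun speechSegments(
    scores: FloatArray,
    numClasses: Int = 7,
    frameSec: Double = 1.0 / 56,   // assumption: frames evenly cover the chunk
): List<Pair<Double, Double>> {
    val frames = scores.size / numClasses
    val segments = mutableListOf<Pair<Double, Double>>()
    var start = -1                 // -1 means "not inside a segment"
    for (f in 0 until frames) {
        val row = scores.copyOfRange(f * numClasses, (f + 1) * numClasses)
        val best = row.indices.maxByOrNull { row[it] }!!
        val active = best != 0
        if (active && start < 0) start = f
        if (!active && start >= 0) {
            segments += (start * frameSec) to (f * frameSec)
            start = -1
        }
    }
    // Close a segment that runs to the end of the input.
    if (start >= 0) segments += (start * frameSec) to (frames * frameSec)
    return segments
}
```

The times are relative to the start of the array passed in, so the caller adds the chunk or window offset to get absolute timestamps.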
## Source
Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
(MIT, gated; accept the license on the upstream page).
## Links
- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog
## Ecosystem
- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.
## Other LiteRT models in this collection
**ASR / Transcription**
- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)
**VAD / Diarization**
- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)
**TTS / Voice Cloning**
- [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)
## License
This bundle inherits the upstream model license (**MIT**). See the
linked `base_model` repository for the full terms.