---
license: mit
language:
- multilingual
tags:
- speaker-diarization
- voice-activity-detection
- pyannote
- diarization
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: pyannote/segmentation-3.0
library_name: litert
pipeline_tag: voice-activity-detection
---

# Pyannote Segmentation 3.0 — LiteRT

Speaker-aware segmentation for diarization pipelines. 16 kHz input; trained on 10-second windows, exported here in 1-second streaming chunks.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Meeting transcription](https://soniqo.audio/transcription/)
- [Long-form transcription](https://soniqo.audio/long-form-speech/)

Powerset speaker segmentation (up to 3 local speakers) for Android, exported in a streaming 1-second chunk configuration.

## Model

| Property | Value |
|---|---|
| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
| Parameters | ~1.5 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Chunk | 1 second (16 000 samples) |
| Output frames | 56 per chunk |
| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |

## Files

| File | Size | Description |
|---|---|---|
| `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 |
| `config.json` | 1 KB | Signature + usage hints |

## Why streaming chunks

At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM time steps. Because litert-torch has no native `aten.lstm` lowering, it unrolls the LSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions).
On the resulting graph, the MLIR optimizer either hangs for hours or fails on duplicate `jax_lowering_*` symbols from repeated helper functions.

Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and produces a valid TFLite model. The caller runs 10 chunks in sequence, passing `lstm_state_out → lstm_state` between calls, to cover the full 10-second window.

Each chunk produces 56 frames of powerset posteriors. The SincNet frontend has small per-chunk edge effects, so ten chunks yield 10 × 56 = 560 frames versus 589 in the original model. Overlap chunks by ~500 ms on boundaries where high-precision stitching is required.

## Signature

```
Inputs:
  audio           [1, 1, 16000]    float32   1 s of audio @ 16 kHz
  lstm_state      [2, 8, 1, 128]   float32   (h, c), zeros on first chunk
Outputs:
  posteriors      [1, 56, 7]       float32   powerset posteriors
  lstm_state_out  [2, 8, 1, 128]   float32   next-chunk state
```

Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}` (up to 3 local speakers, no triple-overlap class).

## Usage

```kotlin
import org.tensorflow.lite.Interpreter
import java.nio.FloatBuffer

// loadModelFile() is an app-side helper (not shown) that returns the model buffer.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
var state = FloatArray(2 * 8 * 1 * 128)  // zeros on the first chunk

fun segment(chunk: FloatArray): FloatArray {
    require(chunk.size == 16_000) { "expected 1 s of 16 kHz audio" }
    // Buffers are checked by byte size only, so flat storage fits
    // the [1, 56, 7] and [2, 8, 1, 128] output tensors.
    val out = FloatBuffer.allocate(1 * 56 * 7)
    val nextState = FloatBuffer.allocate(state.size)
    // runSignature keys tensors by name (see Signature), not by index.
    model.runSignature(
        mapOf("audio" to FloatBuffer.wrap(chunk), "lstm_state" to FloatBuffer.wrap(state)),
        mapOf("posteriors" to out, "lstm_state_out" to nextState),
    )
    state = nextState.array()  // carry the LSTM state into the next chunk
    return out.array()         // [56, 7] log-probs, row-major
}
```

## Source

Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) (MIT, gated: accept the license on the upstream page).

## Links

- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents.
  Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder — LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**mit**). See the linked `base_model` repository for the full terms.
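
## Example: decoding powerset frames

The Signature section lists seven powerset classes per frame. As a minimal sketch, the flat `[56, 7]` output of one chunk could be hard-decoded into per-speaker activity as below. The class order is assumed to match the listing `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}`, and the names `POWERSET` and `decodePowerset` are illustrative, not part of any soniqo API.

```kotlin
// Class order assumed from the Signature section:
// {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}, with speakers indexed 0..2.
val POWERSET: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2),
    setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

val FRAMES = 56    // output frames per 1-second chunk
val CLASSES = 7    // powerset classes per frame
val SPEAKERS = 3   // local speaker slots

/**
 * Hard-decodes one chunk of row-major [56, 7] scores.
 * result[frame][speaker] is true when that local speaker is active.
 */
fun decodePowerset(scores: FloatArray): Array<BooleanArray> {
    require(scores.size == FRAMES * CLASSES) { "expected $FRAMES x $CLASSES scores" }
    return Array(FRAMES) { f ->
        // Argmax over the 7 classes of frame f; ties keep the lowest index.
        var best = 0
        for (c in 1 until CLASSES) {
            if (scores[f * CLASSES + c] > scores[f * CLASSES + best]) best = c
        }
        BooleanArray(SPEAKERS) { s -> s in POWERSET[best] }
    }
}
```

Because only the argmax is used, the result is the same whether the model emits probabilities or log-probabilities. The three indices are local speaker slots; mapping them to global identities across a recording is a separate step that this sketch does not cover.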