---
license: mit
language:
- multilingual
tags:
- speaker-diarization
- voice-activity-detection
- pyannote
- diarization
- litert
- tflite
- on-device
- soniqo
- speech-cloud
- speech-core
base_model: pyannote/segmentation-3.0
library_name: litert
pipeline_tag: voice-activity-detection
---
# Pyannote Segmentation 3.0 - LiteRT
Speaker-aware segmentation for diarization pipelines. 16 kHz input, exported as 1-second streaming chunks that cover the model's trained 10-second window.
> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit,
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).
## Use cases on soniqo.audio
- [Meeting transcription](https://soniqo.audio/transcription/)
- [Long-form transcription](https://soniqo.audio/long-form-speech/)
Powerset speaker segmentation (up to 3 local speakers) for Android,
exported in a streaming 1-second chunk configuration.
## Model
| Property | Value |
|---|---|
| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
| Parameters | ~1.5 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Chunk | 1 second (16 000 samples) |
| Output frames | 56 per chunk |
| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |
## Files
| File | Size | Description |
|---|---|---|
| `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 |
| `config.json` | 1 KB | Signature + usage hints |
## Why streaming chunks
pyannote/segmentation-3.0 at its trained 10-second window has 589 BiLSTM
time steps. litert-torch has no native `aten.lstm` lowering, so it unrolls
the LSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions).
On a graph that size, the MLIR optimizer either hangs for hours or fails on
duplicate `jax_lowering_*` symbols from repeated helper functions.
Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
produces a valid TFLite model. The caller runs 10 chunks in sequence, passing
`lstm_state_out → lstm_state` between calls, to cover the full 10-second
window. Each chunk produces 56 frames of powerset posteriors.
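The chunk-by-chunk loop described above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `segmentWindow` and the constant names are hypothetical, and the `segment` callback is assumed to run one 1-second chunk through the model while carrying the LSTM state between calls itself.

```kotlin
// Values taken from the Model table; the names are illustrative.
val SAMPLE_RATE = 16_000        // samples in one 1 s chunk
val FRAMES_PER_CHUNK = 56
val NUM_CLASSES = 7

/**
 * Splits a window into 1 s chunks, runs `segment` on each chunk in order,
 * and concatenates the flattened [56, 7] outputs into [numChunks * 56, 7].
 */
fun segmentWindow(window: FloatArray, segment: (FloatArray) -> FloatArray): FloatArray {
    require(window.size % SAMPLE_RATE == 0) { "window must be a whole number of 1 s chunks" }
    val numChunks = window.size / SAMPLE_RATE
    val chunkOutSize = FRAMES_PER_CHUNK * NUM_CLASSES
    val result = FloatArray(numChunks * chunkOutSize)
    for (i in 0 until numChunks) {
        val chunk = window.copyOfRange(i * SAMPLE_RATE, (i + 1) * SAMPLE_RATE)
        val out = segment(chunk)    // state is carried inside `segment`
        check(out.size == chunkOutSize) { "unexpected output size ${out.size}" }
        out.copyInto(result, destinationOffset = i * chunkOutSize)
    }
    return result
}
```

For a 10-second window (160 000 samples) this calls `segment` ten times and returns 560 × 7 values.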
The SincNet frontend has small per-chunk edge effects: 10 × 56 = 560
frames versus 589 in the original model. Overlap chunks by ~500 ms on
boundaries where high-precision stitching is required.
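One way to lay out overlapping chunks is sketched below. It only computes chunk start offsets and approximate frame times; the card does not say how LSTM state should be carried across overlapping chunks, so that part is left out. The function names are hypothetical, and the frame timing assumes the 56 frames are evenly spaced across each 1-second chunk.

```kotlin
val SAMPLE_RATE = 16_000
val CHUNK_SAMPLES = 16_000      // 1 s chunk
val FRAMES_PER_CHUNK = 56

/** Start offsets (in samples) of 1 s chunks that overlap by `overlapMs`. */
fun chunkStarts(totalSamples: Int, overlapMs: Int = 500): List<Int> {
    val hop = CHUNK_SAMPLES - overlapMs * SAMPLE_RATE / 1000
    require(hop > 0) { "overlap must be shorter than the chunk" }
    val starts = mutableListOf<Int>()
    var start = 0
    while (start + CHUNK_SAMPLES <= totalSamples) {
        starts += start
        start += hop
    }
    return starts
}

/** Approximate start time (seconds) of `frame` in the chunk beginning at `chunkStart`. */
fun frameTime(chunkStart: Int, frame: Int): Double =
    chunkStart.toDouble() / SAMPLE_RATE + frame.toDouble() / FRAMES_PER_CHUNK
```

With a 500 ms overlap the hop is 8 000 samples, so a 10-second window yields 19 chunk starts instead of 10.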
## Signature
```
Inputs:
audio [1, 1, 16000] float32 1 s of audio @ 16 kHz
lstm_state [2, 8, 1, 128] float32 (h, c), zeros on first chunk
Outputs:
posteriors [1, 56, 7] float32 powerset posteriors
lstm_state_out [2, 8, 1, 128] float32 next-chunk state
```
Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}`: up to 3 local
speakers, no triple-overlap class.
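To turn powerset scores into per-speaker activity, a simple approach is to take the highest-scoring class in each frame and look up its speaker set. The sketch below assumes the class indices follow the order listed above; `POWERSET` and `decodePowerset` are illustrative names, not part of any SDK. Because it uses an argmax, it works whether the scores are probabilities or log-probabilities.

```kotlin
// Speaker set for each class index, in the order
// {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}. Speakers are 0-indexed here.
val POWERSET: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2),
    setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

/**
 * Converts flattened [frames, 7] powerset scores into per-frame speaker
 * activity [frames][3] by picking the best class in each frame.
 */
fun decodePowerset(scores: FloatArray, numClasses: Int = 7, numSpeakers: Int = 3): Array<BooleanArray> {
    require(scores.size % numClasses == 0) { "scores must be [frames, $numClasses]" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(numSpeakers) { s -> s in POWERSET[best] }
    }
}
```

A frame whose best class is index 5 (`s1∪s3`) decodes to `[true, false, true]`.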
## Usage
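```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import org.tensorflow.lite.Interpreter

// loadModelFile() is an app-side helper that returns the model as a ByteBuffer.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))

// Zero-filled direct float buffer in native byte order.
fun floats(n: Int) = ByteBuffer.allocateDirect(4 * n).order(ByteOrder.nativeOrder()).asFloatBuffer()

var state = floats(2 * 8 * 1 * 128) // zeros on the first chunk

fun segment(chunk: FloatArray): FloatArray {
    val audio = floats(16_000).put(chunk)
    val out = floats(1 * 56 * 7)
    val nextState = floats(state.capacity())
    audio.rewind(); state.rewind()
    // runSignature is keyed by tensor name. The names below follow the
    // Signature section; check config.json if the exported names differ.
    model.runSignature(
        mapOf("audio" to audio, "lstm_state" to state),
        mapOf("posteriors" to out, "lstm_state_out" to nextState),
    )
    state = nextState
    out.rewind()
    return FloatArray(out.capacity()).also { out.get(it) } // [56, 7] log-probs
}
```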
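For plain voice-activity detection, the per-frame output can be reduced to speech spans. The sketch below rests on two assumptions: that the values are log-probabilities (as the usage comment says) and that class 0 is the empty set `∅`. `Span` and `speechSpans` are hypothetical names, and frame times assume evenly spaced frames.

```kotlin
import kotlin.math.exp

/** A half-open time span in seconds. */
data class Span(val start: Double, val end: Double)

/**
 * Turns flattened [frames, 7] log-probabilities into speech spans.
 * A frame counts as speech when 1 - P(∅) exceeds `threshold`.
 */
fun speechSpans(logProbs: FloatArray, threshold: Double = 0.5, framesPerSecond: Int = 56): List<Span> {
    val numClasses = 7
    val frames = logProbs.size / numClasses
    val spans = mutableListOf<Span>()
    var open = -1   // first frame of the current span, or -1 if none is open
    // Iterate one step past the last frame so a trailing span gets closed.
    for (f in 0..frames) {
        val speech = f < frames && 1.0 - exp(logProbs[f * numClasses].toDouble()) > threshold
        if (speech && open < 0) open = f
        if (!speech && open >= 0) {
            spans += Span(open.toDouble() / framesPerSecond, f.toDouble() / framesPerSecond)
            open = -1
        }
    }
    return spans
}
```

The input can be a single chunk or several concatenated chunks; span times are relative to the first frame passed in.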
## Source
Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
(MIT, gated; accept the license on the upstream page).
## Links
- [speech-android](https://github.com/soniqo/speech-android) - Android SDK
- [soniqo.audio](https://soniqo.audio) - website
- [blog](https://soniqo.audio/blog) - blog
## Ecosystem
- [**soniqo.audio**](https://soniqo.audio) - use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core) - C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift) - Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android) - Android SDK consuming on-device LiteRT bundles.
## Other LiteRT models in this collection
**ASR / Transcription**
- [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)
**VAD / Diarization**
- [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)
**TTS / Voice Cloning**
- [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)
## License
This bundle inherits the upstream model license (**MIT**). See the
linked `base_model` repository for the full terms.