| license: mit | |
| language: | |
| - multilingual | |
| tags: | |
| - speaker-diarization | |
| - voice-activity-detection | |
| - pyannote | |
| - diarization | |
| - litert | |
| - tflite | |
| - on-device | |
| - soniqo | |
| - speech-cloud | |
| - speech-core | |
| base_model: pyannote/segmentation-3.0 | |
| library_name: litert | |
| pipeline_tag: voice-activity-detection | |
| # Pyannote Segmentation 3.0 - LiteRT | |
| Speaker-aware segmentation for diarization pipelines. 16 kHz input, processed in 1-second streaming chunks. | |
| > Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit, | |
| > an open, runtime-portable stack for speech AI. This bundle is the | |
| > **LiteRT** export, designed to plug into the abstract interfaces in | |
| > [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent | |
| > orchestration library). Browse all LiteRT bundles in the | |
| > [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b). | |
| ## Use cases on soniqo.audio | |
| - [Meeting transcription](https://soniqo.audio/transcription/) | |
| - [Long-form transcription](https://soniqo.audio/long-form-speech/) | |
| Powerset speaker segmentation (up to 3 local speakers) for Android, | |
| exported in a streaming 1-second chunk configuration. | |
| ## Model | |
| | Property | Value | | |
| |---|---| | |
| | Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head | | |
| | Parameters | ~1.5 M | | |
| | Format | LiteRT (TFLite) | | |
| | Quantization | float32 | | |
| | Sample rate | 16 000 Hz | | |
| | Chunk | 1 second (16 000 samples) | | |
| | Output frames | 56 per chunk | | |
| | LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) | | |
| ## Files | |
| | File | Size | Description | | |
| |---|---|---| | |
| | `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 | | |
| | `config.json` | 1 KB | Signature + usage hints | | |
| ## Why streaming chunks | |
| pyannote/segmentation-3.0 at its trained 10-second window has 589 BiLSTM | |
| time steps. litert-torch has no native `aten.lstm` lowering, so it unrolls | |
| the recurrence into ~4700 cell operations (589 steps × 4 layers × 2 | |
| directions). On that graph the MLIR optimizer either hangs for hours or | |
| fails on duplicate `jax_lowering_*` symbols emitted by the repeated helper | |
| functions. | |
| Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and | |
| produces a valid TFLite. The caller runs 10 chunks in sequence, passing | |
| `lstm_state_out → lstm_state` between calls, to cover the full 10-second | |
| window. Each chunk produces 56 frames of powerset posteriors. | |
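The chunk-by-chunk loop described here can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `segmentWindow` and its `step` parameter are hypothetical names, with `step` standing in for one model call so the sketch stays independent of any runtime. Sizes are taken from the Signature section.

```kotlin
// Hypothetical helper for illustration; not part of any soniqo or LiteRT API.
val chunkSamples = 16_000          // 1 s at 16 kHz
val framesPerChunk = 56
val numClasses = 7
val stateSize = 2 * 8 * 1 * 128    // flattened (h, c) state

/**
 * Runs consecutive 1-second chunks over `audio`, carrying the LSTM state forward.
 * `step` is one model call: (chunk, state) -> (posteriors [56 * 7], next state).
 */
fun segmentWindow(
    audio: FloatArray,
    step: (FloatArray, FloatArray) -> Pair<FloatArray, FloatArray>,
): FloatArray {
    require(audio.size % chunkSamples == 0) { "audio must be a whole number of 1-second chunks" }
    val numChunks = audio.size / chunkSamples
    val chunkOutSize = framesPerChunk * numClasses
    val result = FloatArray(numChunks * chunkOutSize)
    var state = FloatArray(stateSize)   // zeros on the first chunk
    for (i in 0 until numChunks) {
        val chunk = audio.copyOfRange(i * chunkSamples, (i + 1) * chunkSamples)
        val (posteriors, nextState) = step(chunk, state)
        posteriors.copyInto(result, destinationOffset = i * chunkOutSize)
        state = nextState               // lstm_state_out feeds the next lstm_state
    }
    return result
}
```

For a 10-second window (160 000 samples) the result holds 10 × 56 × 7 values, laid out frame by frame.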
| The SincNet frontend has small per-chunk edge effects, so ten chunks yield | |
| 10 × 56 = 560 frames versus 589 in the original model. Overlap adjacent | |
| chunks by ~500 ms where high-precision stitching at chunk boundaries is required. | |
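One possible reading of the overlap suggestion is to advance the chunk start by less than a full chunk. The sketch below only computes start offsets, assuming a 500 ms overlap (8 000 samples at 16 kHz). `chunkStarts` is a hypothetical helper. The source does not specify how overlapping frames are merged or how the LSTM state is handled across overlapping chunks, so both are left open.

```kotlin
/** Start offsets (in samples) of fixed-size chunks that overlap by `overlapSamples`. */
fun chunkStarts(
    totalSamples: Int,
    chunkSamples: Int = 16_000,     // 1 s at 16 kHz
    overlapSamples: Int = 8_000,    // assumed ~500 ms overlap
): List<Int> {
    require(overlapSamples in 0 until chunkSamples) { "overlap must be smaller than the chunk" }
    require(totalSamples >= chunkSamples) { "need at least one full chunk" }
    val hop = chunkSamples - overlapSamples
    // Keep only chunks that fit entirely inside the signal.
    return (0..(totalSamples - chunkSamples) step hop).toList()
}
```

With these defaults a 10-second signal is covered by 19 chunks instead of 10, so the overlap roughly doubles the inference cost.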
| ## Signature | |
| ``` | |
| Inputs: | |
| audio [1, 1, 16000] float32 1 s of audio @ 16 kHz | |
| lstm_state [2, 8, 1, 128] float32 (h, c), zeros on first chunk | |
| Outputs: | |
| posteriors [1, 56, 7] float32 powerset posteriors | |
| lstm_state_out [2, 8, 1, 128] float32 next-chunk state | |
| ``` | |
| Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}`, i.e. up to 3 local | |
| speakers, no triple-overlap class. | |
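To turn the powerset output into per-speaker activity, one option is a per-frame argmax followed by a table lookup. The sketch below assumes the class indices follow the order listed above; `powersetClasses` and `decodePowerset` are hypothetical names, not part of any published API.

```kotlin
// Speaker sets per powerset class, assumed to follow the listed order:
// {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}. Speakers are indexed 0..2.
val powersetClasses: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2), setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

/** Converts flattened [frames, 7] scores into per-frame speaker activity [frames][3]. */
fun decodePowerset(
    scores: FloatArray,
    numClasses: Int = 7,
    numSpeakers: Int = 3,
): Array<BooleanArray> {
    require(scores.size % numClasses == 0) { "scores must be a whole number of frames" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        // Argmax over the classes; works for probabilities and log-probabilities alike.
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(numSpeakers) { s -> s in powersetClasses[best] }
    }
}
```

A frame counts as speech whenever its winning class is not `∅`, which also gives a simple voice-activity decision.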
| ## Usage | |
| ```kotlin | |
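| import java.nio.FloatBuffer | |
| import org.tensorflow.lite.Interpreter | |
| // loadModelFile is an app-provided helper that memory-maps the model file. | |
| val model = Interpreter(loadModelFile("pyannote-segmentation.tflite")) | |
| var state: FloatBuffer = FloatBuffer.allocate(2 * 8 * 1 * 128) // zeros on first call | |
| fun segment(chunk: FloatArray): FloatArray { | |
|     val out = FloatBuffer.allocate(1 * 56 * 7) | |
|     val nextState = FloatBuffer.allocate(state.capacity()) | |
|     state.rewind() // read the carried state from its start | |
|     // Tensor indices are assumed to follow the order in the Signature section. | |
|     model.runForMultipleInputsOutputs( | |
|         arrayOf<Any>(FloatBuffer.wrap(chunk), state), | |
|         mapOf<Int, Any>(0 to out, 1 to nextState), | |
|     ) | |
|     state = nextState | |
|     return out.array() // [56, 7] log-probs | |
| } | |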
| ``` | |
| ## Source | |
| Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) | |
| (MIT, gated; accept the license on the upstream page). | |
| ## Links | |
| - [speech-android](https://github.com/soniqo/speech-android): Android SDK | |
| - [soniqo.audio](https://soniqo.audio): website | |
| - [blog](https://soniqo.audio/blog): blog | |
| ## Ecosystem | |
| - [**soniqo.audio**](https://soniqo.audio): use-case explorer (transcription, voice cloning, live ASR, voice agents). | |
| - [**speech-core**](https://github.com/soniqo/speech-core): C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces. | |
| - [**speech-swift**](https://github.com/soniqo/speech-swift): Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable). | |
| - [**speech-android**](https://github.com/soniqo/speech-android): Android SDK consuming on-device LiteRT bundles. | |
| ## Other LiteRT models in this collection | |
| **ASR / Transcription** | |
| - [Parakeet TDT 0.6B v3 - LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8) | |
| - [Nemotron Speech Streaming 0.6B - LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT) | |
| - [Omnilingual ASR CTC 300M - LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT) | |
| - [Omnilingual ASR CTC 300M - LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8) | |
| - [Qwen3 ASR 0.6B Encoder - LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8) | |
| **VAD / Diarization** | |
| - [Silero VAD v5 - LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT) | |
| - [WeSpeaker ResNet34-LM - LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT) | |
| **TTS / Voice Cloning** | |
| - [VoxCPM2 - LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8) | |
| ## License | |
| This bundle inherits the upstream model license (**MIT**). See the | |
| linked `base_model` repository for the full terms. | |