---
license: mit
language:
  - multilingual
tags:
  - speaker-diarization
  - voice-activity-detection
  - pyannote
  - diarization
  - litert
  - tflite
  - on-device
  - soniqo
  - speech-cloud
  - speech-core
base_model: pyannote/segmentation-3.0
library_name: litert
pipeline_tag: voice-activity-detection
---

# Pyannote Segmentation 3.0 — LiteRT

Speaker-aware segmentation for diarization pipelines. 16 kHz input, processed in 1-second streaming chunks.

> Part of the [**soniqo.audio**](https://soniqo.audio) speech toolkit —
> an open, runtime-portable stack for speech AI. This bundle is the
> **LiteRT** export, designed to plug into the abstract interfaces in
> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
> orchestration library). Browse all LiteRT bundles in the
> [**soniqo LiteRT collection**](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

## Use cases on soniqo.audio

- [Meeting transcription](https://soniqo.audio/transcription/)
- [Long-form transcription](https://soniqo.audio/long-form-speech/)

Powerset speaker segmentation (up to 3 local speakers) for Android,
exported in a streaming 1-second chunk configuration.

## Model

| Property | Value |
|---|---|
| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
| Parameters | ~1.5 M |
| Format | LiteRT (TFLite) |
| Quantization | float32 |
| Sample rate | 16 000 Hz |
| Chunk | 1 second (16 000 samples) |
| Output frames | 56 per chunk |
| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |

## Files

| File | Size | Description |
|---|---|---|
| `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 |
| `config.json` | 1 KB | Signature + usage hints |

## Why streaming chunks

At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM
time steps. litert-torch has no native `aten.lstm` lowering, so it unrolls
the LSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions).
On a graph of that size the MLIR optimizer either hangs for hours or fails
on duplicate `jax_lowering_*` symbols from repeated helper functions.

Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
produces a valid TFLite model. To cover the full 10-second window, the
caller runs 10 chunks in sequence and feeds each call's `lstm_state_out`
into the next call's `lstm_state`. Each chunk produces 56 frames of
powerset posteriors.
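
The state hand-off described above can be sketched as a plain loop. This is a minimal illustration, not the SDK's implementation: `runWindow` and the `step` callback are hypothetical names, and `step` stands in for one model invocation that takes `(audio, lstm_state)` and returns `(posteriors, lstm_state_out)` as flattened arrays.

```kotlin
val chunkSamples = 16_000          // 1 s at 16 kHz
val stateSize = 2 * 8 * 1 * 128    // flattened lstm_state
val frameValues = 56 * 7           // flattened posteriors per chunk

// `step` is a hypothetical wrapper around one model call:
// (audio chunk, state) -> (posteriors, next state).
fun runWindow(
    audio: FloatArray,
    step: (FloatArray, FloatArray) -> Pair<FloatArray, FloatArray>,
): FloatArray {
    require(audio.size % chunkSamples == 0) { "audio must be a whole number of 1 s chunks" }
    val chunks = audio.size / chunkSamples
    val out = FloatArray(chunks * frameValues)
    var state = FloatArray(stateSize)  // zeros on the first chunk
    for (i in 0 until chunks) {
        val chunk = audio.copyOfRange(i * chunkSamples, (i + 1) * chunkSamples)
        val (posteriors, nextState) = step(chunk, state)
        posteriors.copyInto(out, destinationOffset = i * frameValues)
        state = nextState              // lstm_state_out -> lstm_state
    }
    return out
}
```

For a 10-second buffer (160 000 samples) this returns 10 × 56 × 7 values, one block of posteriors per chunk in order.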

The SincNet frontend has small per-chunk edge effects, so ten chunks yield
10 × 56 = 560 frames versus 589 for the original 10-second window. Where
high-precision stitching is required, overlap adjacent chunks by ~500 ms.
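
As a rough sketch of the overlap idea, the helper below computes chunk start offsets for a given overlap. `chunkStarts` is a hypothetical name, and this card does not specify how to merge the overlapping frames or how to carry the LSTM state across overlapped chunks, so treat this as one possible starting point only.

```kotlin
// Start offsets (in samples) of 1 s chunks that overlap their neighbour by overlapMs.
fun chunkStarts(totalSamples: Int, overlapMs: Int = 500, sampleRate: Int = 16_000): List<Int> {
    val chunk = sampleRate                              // 1 s chunk
    val hop = chunk - overlapMs * sampleRate / 1000     // samples between chunk starts
    require(hop > 0) { "overlap must be shorter than the chunk" }
    if (totalSamples < chunk) return emptyList()
    return (0..totalSamples - chunk step hop).toList()
}
```

With the default 500 ms overlap the hop is 8 000 samples, so a 10-second buffer needs 19 chunks instead of 10.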

## Signature

```
Inputs:
  audio           [1, 1, 16000]    float32   1 s of audio @ 16 kHz
  lstm_state      [2, 8, 1, 128]   float32   (h, c), zeros on first chunk

Outputs:
  posteriors      [1, 56, 7]       float32   powerset posteriors
  lstm_state_out  [2, 8, 1, 128]   float32   next-chunk state
```

Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}` — up to 3 local
speakers, no triple-overlap class.
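
To turn powerset scores into per-frame speaker sets, one option is an argmax over the 7 classes followed by a lookup. The sketch below assumes the class order listed above and a flattened, row-major `[frames, 7]` array; `powerset` and `decodePowerset` are illustrative names, and speaker indices 0..2 correspond to s1..s3.

```kotlin
// Class order as listed above: ∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3.
val powerset: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2), setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

// posteriors: flattened [frames, 7] scores; returns the active speakers per frame.
fun decodePowerset(posteriors: FloatArray, classes: Int = 7): List<Set<Int>> {
    require(posteriors.size % classes == 0) { "size must be a multiple of $classes" }
    return List(posteriors.size / classes) { f ->
        var best = 0
        for (c in 1 until classes) {
            if (posteriors[f * classes + c] > posteriors[f * classes + best]) best = c
        }
        powerset[best]
    }
}
```

Because argmax is unchanged by a monotonic transform, this works whether the scores are probabilities or log-probabilities.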

## Usage

```kotlin
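import java.nio.FloatBuffer
import org.tensorflow.lite.Interpreter

// loadModelFile is an app-side helper (not shipped here) that returns the model file.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
var state = FloatArray(2 * 8 * 1 * 128) // zeros on the first call

fun segment(chunk: FloatArray): FloatArray {
    require(chunk.size == 16_000) { "expected 1 s of 16 kHz audio" }
    val out = FloatArray(1 * 56 * 7)
    val nextState = FloatArray(state.size)
    // runSignature keys tensors by name. The names below follow the Signature
    // section; confirm them with model.getSignatureInputs(...) if the call fails.
    model.runSignature(
        mapOf("audio" to FloatBuffer.wrap(chunk), "lstm_state" to FloatBuffer.wrap(state)),
        mapOf("posteriors" to FloatBuffer.wrap(out), "lstm_state_out" to FloatBuffer.wrap(nextState)),
    )
    state = nextState
    return out // [56, 7] log-probs, row-major
}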
```
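
For voice-activity use, per-frame decisions can be merged into time segments. The sketch below is an assumption-laden illustration: `speechSegments` is a hypothetical helper, frames are assumed to be evenly spaced at 56 per second within a chunk, and a frame counts as active when its best powerset class is anything other than ∅ (class 0).

```kotlin
// Merge runs of consecutive active frames into (startSeconds, endSeconds) pairs.
fun speechSegments(active: List<Boolean>, framesPerSecond: Int = 56): List<Pair<Double, Double>> {
    val segments = mutableListOf<Pair<Double, Double>>()
    var start = -1
    // Iterate one past the end so a run that reaches the last frame is closed.
    for (i in 0..active.size) {
        val on = i < active.size && active[i]
        if (on && start < 0) start = i
        if (!on && start >= 0) {
            segments += start.toDouble() / framesPerSecond to i.toDouble() / framesPerSecond
            start = -1
        }
    }
    return segments
}
```

Times are relative to the start of the frame sequence passed in, so add the chunk's offset when processing a longer stream.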

## Source

Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
(MIT, gated — accept the license on the upstream page).

## Links

- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
- [soniqo.audio](https://soniqo.audio) — website
- [blog](https://soniqo.audio/blog) — blog

## Ecosystem

- [**soniqo.audio**](https://soniqo.audio) — use-case explorer (transcription, voice cloning, live ASR, voice agents).
- [**speech-core**](https://github.com/soniqo/speech-core) — C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
- [**speech-swift**](https://github.com/soniqo/speech-swift) — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
- [**speech-android**](https://github.com/soniqo/speech-android) — Android SDK consuming on-device LiteRT bundles.

## Other LiteRT models in this collection

**ASR / Transcription**

- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
- [Qwen3 ASR 0.6B Encoder — LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)

**VAD / Diarization**

- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

**TTS / Voice Cloning**

- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

## License

This bundle inherits the upstream model license (**MIT**). See the
linked `base_model` repository for the full terms.