Pyannote Segmentation 3.0 — LiteRT

Speaker-aware segmentation for diarization pipelines. 16 kHz, 1-second streaming chunks (10-second trained window).

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Powerset speaker segmentation (up to 3 local speakers) for Android, exported in a streaming 1-second chunk configuration.

Model

Property	Value
Architecture	SincNet frontend + 4-layer BiLSTM + linear + powerset head
Parameters	~1.5 M
Format	LiteRT (TFLite)
Quantization	float32
Sample rate	16 000 Hz
Chunk	1 second (16 000 samples)
Output frames	56 per chunk
LSTM state	explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions)

Files

File	Size	Description
`pyannote-segmentation.tflite`	6.93 MB	Full model, FP32
`config.json`	1 KB	Signature + usage hints

Why streaming chunks

At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM time steps. litert-torch has no native aten.lstm lowering, so it unrolls the BiLSTM into ~4700 cell operations (589 steps × 4 layers × 2 directions). On the resulting graph, the MLIR optimizer either hangs for hours or fails on duplicate jax_lowering_* symbols emitted for the repeated helper functions.

Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and produces a valid TFLite model. To cover the full 10-second window, the caller runs 10 chunks in sequence, passing lstm_state_out → lstm_state between calls. Each chunk produces 56 frames of powerset posteriors.
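The chunk loop described above can be sketched as follows. This is a minimal illustration, not the bundle's reference implementation: `runChunk`, `ChunkResult`, and `runWindow` are hypothetical names, and `runChunk` stands in for whatever performs the actual interpreter call.

```kotlin
const val CHUNK_SAMPLES = 16_000        // 1 s at 16 kHz
const val FRAMES_PER_CHUNK = 56
const val NUM_CLASSES = 7
const val STATE_SIZE = 2 * 8 * 1 * 128  // flattened [2, 8, 1, 128]

/** Result of one model call: flattened [56, 7] scores plus the next LSTM state. */
class ChunkResult(val posteriors: FloatArray, val nextState: FloatArray)

/**
 * Runs consecutive 1-second chunks over `audio`, threading the LSTM state through.
 * `runChunk` is a hypothetical stand-in for the real interpreter call.
 */
fun runWindow(
    audio: FloatArray,
    runChunk: (chunk: FloatArray, state: FloatArray) -> ChunkResult,
): FloatArray {
    require(audio.size % CHUNK_SAMPLES == 0) { "audio must be a whole number of 1 s chunks" }
    val numChunks = audio.size / CHUNK_SAMPLES
    val frameScores = FloatArray(numChunks * FRAMES_PER_CHUNK * NUM_CLASSES)
    var state = FloatArray(STATE_SIZE) // zeros on the first chunk
    for (i in 0 until numChunks) {
        val chunk = audio.copyOfRange(i * CHUNK_SAMPLES, (i + 1) * CHUNK_SAMPLES)
        val result = runChunk(chunk, state)
        result.posteriors.copyInto(
            frameScores,
            destinationOffset = i * FRAMES_PER_CHUNK * NUM_CLASSES,
        )
        state = result.nextState // lstm_state_out becomes the next lstm_state
    }
    return frameScores // flattened [numChunks * 56, 7]
}
```

For a 10-second window (160 000 samples) this returns 560 × 7 values. Starting a new, unrelated window would presumably call for a fresh zero state, which this sketch does by allocating the state inside the function.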

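Because the SincNet frontend processes each chunk independently, a few frames are lost at every chunk boundary: ten chunks yield 10 × 56 = 560 frames, versus 589 for a single 10-second pass through the original model. Where high-precision stitching is required, overlap adjacent chunks by ~500 ms.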
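As a rough sketch of the overlap suggestion, the helper below computes the start offsets of overlapping chunks. `chunkStarts` is a hypothetical name, and the defaults assume 1-second chunks with a 500 ms overlap. How overlapping frames are merged, and how the LSTM state should be handled across overlapped chunks, is not specified here and is left to the caller.

```kotlin
/**
 * Start offsets (in samples) of fixed-length chunks that overlap by `overlapSamples`.
 * Defaults assume 1 s chunks at 16 kHz with a 500 ms overlap.
 */
fun chunkStarts(
    totalSamples: Int,
    chunkSamples: Int = 16_000,
    overlapSamples: Int = 8_000,
): List<Int> {
    require(overlapSamples in 0 until chunkSamples) { "overlap must be shorter than a chunk" }
    val hop = chunkSamples - overlapSamples
    val starts = mutableListOf<Int>()
    var start = 0
    // Only full chunks are emitted; a trailing partial chunk is dropped.
    while (start + chunkSamples <= totalSamples) {
        starts.add(start)
        start += hop
    }
    return starts
}
```

With the defaults, a 10-second buffer (160 000 samples) yields 19 chunk starts, from 0 to 144 000 in steps of 8 000 samples. With `overlapSamples = 0` it yields the 10 back-to-back chunks used in the non-overlapping case.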

Signature

Inputs:
  audio         [1, 1, 16000]     float32   1 s of audio @ 16 kHz
  lstm_state    [2, 8, 1, 128]    float32   (h, c), zeros on first chunk

Outputs:
  posteriors    [1, 56, 7]        float32   powerset posteriors
  lstm_state_out [2, 8, 1, 128]   float32   next-chunk state

Powerset classes (7): {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3} — up to 3 local speakers, no triple-overlap class.
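To turn powerset scores into per-speaker activity, one common approach is to take the arg-max class of each frame and expand it to the speakers it contains. The sketch below assumes the class order in the output tensor matches the order listed above. `POWERSET_TO_SPEAKERS` and `decodePowerset` are illustrative names, not part of the bundle.

```kotlin
// Powerset class index -> active speaker indices (0-based), assuming the order
// {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}.
val POWERSET_TO_SPEAKERS: List<List<Int>> = listOf(
    emptyList(), listOf(0), listOf(1), listOf(2),
    listOf(0, 1), listOf(0, 2), listOf(1, 2),
)

/**
 * Converts flattened [frames, 7] powerset scores into per-frame speaker activity
 * by picking the highest-scoring class of each frame.
 * Returns an array of shape [frames][3].
 */
fun decodePowerset(
    scores: FloatArray,
    numClasses: Int = 7,
    numSpeakers: Int = 3,
): Array<BooleanArray> {
    require(scores.size % numClasses == 0) { "scores must be a flattened [frames, $numClasses] array" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(numSpeakers).also { active ->
            for (s in POWERSET_TO_SPEAKERS[best]) active[s] = true
        }
    }
}
```

Because only the arg-max is used, the result is the same whether the scores are probabilities or log-probabilities.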

Usage

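import org.tensorflow.lite.Interpreter
import java.nio.FloatBuffer

// loadModelFile() and toDirectBuffer() are app-side helpers, not part of the LiteRT API.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
var state = FloatArray(2 * 8 * 1 * 128) // (h, c) LSTM state; zeros on the first call

// Runs one 1-second chunk (16 000 samples) and carries the LSTM state to the next call.
fun segment(chunk: FloatArray): FloatArray {
    // Output buffers must hold the full tensors: [1, 56, 7] and [2, 8, 1, 128].
    val out = FloatBuffer.allocate(1 * 56 * 7)
    val nextState = FloatBuffer.allocate(state.size)
    // runSignature() keys tensors by signature name (see Signature), not by index.
    model.runSignature(
        mapOf("audio" to chunk.toDirectBuffer(), "lstm_state" to state.toDirectBuffer()),
        mapOf("posteriors" to out, "lstm_state_out" to nextState),
    )
    state = nextState.array()
    return out.array() // flattened [56, 7] log-probs
}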
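The snippet above calls `toDirectBuffer()` without defining it. One plausible definition, offered as an assumption rather than the bundle's own helper, copies the array into a direct, native-order `ByteBuffer`, which is a form the interpreter accepts for float32 inputs:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Hypothetical helper: copies a FloatArray into a direct, native-order ByteBuffer.
 * The returned buffer's position stays at 0 because the copy goes through a view.
 */
fun FloatArray.toDirectBuffer(): ByteBuffer =
    ByteBuffer.allocateDirect(size * Float.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
        .apply { asFloatBuffer().put(this@toDirectBuffer) }
```

This allocates a new direct buffer on every call. An app that processes audio continuously may prefer to allocate the buffers once and refill them for each chunk.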

Source

Upstream: pyannote/segmentation-3.0 (MIT, gated — accept the license on the upstream page).

Ecosystem

soniqo.audio — use-case explorer (transcription, voice cloning, live ASR, voice agents).
speech-core — C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
speech-swift — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
speech-android — Android SDK consuming on-device LiteRT bundles.

Other LiteRT models in this collection

TTS / Voice Cloning

VoxCPM2 — LiteRT (INT8)

License

This bundle inherits the upstream model license (MIT). See the linked base_model repository for the full terms.
