Pyannote Segmentation 3.0 - LiteRT

Speaker-aware segmentation for diarization pipelines. 16 kHz, 5-second windows.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Powerset speaker segmentation (up to 3 local speakers) for Android, exported in a streaming 1-second chunk configuration.

Model

Property        Value
Architecture    SincNet frontend + 4-layer BiLSTM + linear + powerset head
Parameters      ~1.5 M
Format          LiteRT (TFLite)
Quantization    float32
Sample rate     16 000 Hz
Chunk           1 second (16 000 samples)
Output frames   56 per chunk
LSTM state      explicit I/O, [2, 8, 1, 128] (h+c, 4 layers × 2 directions)

Files

File                            Size      Description
pyannote-segmentation.tflite    6.93 MB   Full model, FP32
config.json                     1 KB      Signature + usage hints

Why streaming chunks

At its trained 10-second window, pyannote/segmentation-3.0 runs 589 BiLSTM time steps. litert-torch has no native aten.lstm lowering, so it unrolls the recurrence into ~4700 cell operations (589 steps × 4 layers × 2 directions). On a graph of that size, the MLIR optimizer either hangs for hours or fails on duplicate jax_lowering_* symbols from repeated helper functions.

Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and produces a valid TFLite model. To cover the full 10-second window, the caller runs 10 chunks in sequence, feeding each call's lstm_state_out into the next call's lstm_state. Each chunk produces 56 frames of powerset posteriors.

The SincNet frontend has small per-chunk edge effects: 10 × 56 = 560 frames versus 589 in the original model. Where high-precision stitching is required, overlap chunks by ~500 ms at the boundaries.
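
The sequential pass described above can be sketched as follows. This is a minimal illustration, not the bundle's own driver code: ChunkFn is a hypothetical stand-in for one model invocation, and the constants simply restate the shapes from this card. Overlapped stitching is not covered.

```kotlin
// Hypothetical stand-in for one model call: (audio, state) -> (posteriors, nextState).
typealias ChunkFn = (audio: FloatArray, state: FloatArray) -> Pair<FloatArray, FloatArray>

const val CHUNK_SAMPLES = 16_000          // 1 s at 16 kHz
const val FRAMES_PER_CHUNK = 56
const val NUM_CLASSES = 7
const val STATE_SIZE = 2 * 8 * 1 * 128    // flattened lstm_state

/** Runs consecutive 1-second chunks, carrying the LSTM state from one call to the next. */
fun runWindow(audio: FloatArray, runChunk: ChunkFn): FloatArray {
    require(audio.size % CHUNK_SAMPLES == 0) { "audio must be a whole number of 1 s chunks" }
    val numChunks = audio.size / CHUNK_SAMPLES
    val chunkOut = FRAMES_PER_CHUNK * NUM_CLASSES
    val all = FloatArray(numChunks * chunkOut)
    var state = FloatArray(STATE_SIZE)    // zeros on the first chunk
    for (i in 0 until numChunks) {
        val chunk = audio.copyOfRange(i * CHUNK_SAMPLES, (i + 1) * CHUNK_SAMPLES)
        val (posteriors, nextState) = runChunk(chunk, state)
        posteriors.copyInto(all, destinationOffset = i * chunkOut)
        state = nextState                 // lstm_state_out feeds the next lstm_state
    }
    return all                            // flattened [numChunks * 56, 7]
}

fun main() {
    // Dummy model: counts calls in state[0] so the carry-over is visible.
    val dummy: ChunkFn = { _, state ->
        val next = state.copyOf().also { it[0] += 1f }
        Pair(FloatArray(FRAMES_PER_CHUNK * NUM_CLASSES) { state[0] }, next)
    }
    val out = runWindow(FloatArray(10 * CHUNK_SAMPLES), dummy)
    println(out.size / NUM_CLASSES)       // 560 frames for a 10-second window
}
```

In a real app, runChunk would wrap the interpreter call; keeping it as a function parameter makes the state hand-off testable without a model file.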

Signature

Inputs:
  audio           [1, 1, 16000]     float32   1 s of audio @ 16 kHz
  lstm_state      [2, 8, 1, 128]    float32   (h, c), zeros on first chunk

Outputs:
  posteriors      [1, 56, 7]        float32   powerset posteriors
  lstm_state_out  [2, 8, 1, 128]    float32   next-chunk state

Powerset classes (7): {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}. Up to 3 local speakers, no triple-overlap class.
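
To make the class list concrete, here is a minimal sketch that turns per-frame powerset scores into per-speaker activity flags by taking the argmax over the 7 classes. It assumes the output's class axis follows the order listed above, which should be checked against config.json; POWERSET and decodePowerset are illustrative names, not part of the bundle.

```kotlin
// Speaker sets per powerset class, in the order listed above:
// {∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}, with speakers indexed 0..2.
// Assumption: the model's class axis uses this same order.
val POWERSET: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2),
    setOf(0, 1), setOf(0, 2), setOf(1, 2),
)

/** Converts flattened [frames, 7] scores into per-frame, per-speaker activity flags. */
fun decodePowerset(
    scores: FloatArray,
    numClasses: Int = 7,
    numSpeakers: Int = 3,
): Array<BooleanArray> {
    require(scores.size % numClasses == 0) { "expected a flattened [frames, $numClasses] array" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        // Argmax over the classes; the result is the same for probabilities and log-probabilities.
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(numSpeakers) { s -> s in POWERSET[best] }
    }
}

fun main() {
    // Two frames: the first favors class 4 (s1∪s2), the second class 0 (∅).
    val scores = floatArrayOf(
        -5f, -4f, -4f, -6f, -0.1f, -7f, -7f,
        -0.2f, -3f, -4f, -5f, -6f, -6f, -6f,
    )
    decodePowerset(scores).forEach { println(it.joinToString()) }
    // prints "true, true, false" then "false, false, false"
}
```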

Usage

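import org.tensorflow.lite.Interpreter // package name assumed; adjust to your LiteRT artifact
import java.nio.FloatBuffer

// loadModelFile is an app-provided helper that returns the model as a File or MappedByteBuffer.
val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
var state = FloatArray(2 * 8 * 1 * 128) // zero on first call

fun segment(chunk: FloatArray): FloatArray {
    require(chunk.size == 16_000) { "expected 1 s of 16 kHz audio" }
    val out = FloatArray(1 * 56 * 7)
    val nextState = FloatArray(state.size)
    // runSignature addresses tensors by signature name (see Signature), not by index.
    // If the names differ in your export, list them with model.getSignatureInputs(...).
    model.runSignature(
        mapOf("audio" to FloatBuffer.wrap(chunk), "lstm_state" to FloatBuffer.wrap(state)),
        mapOf("posteriors" to FloatBuffer.wrap(out), "lstm_state_out" to FloatBuffer.wrap(nextState)),
    )
    state = nextState
    return out // flattened [56, 7] log-probs
}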
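
Once the output of segment has been reduced to per-frame speaker activity, the frames can be merged into time segments. The sketch below assumes uniform frame spacing of 1/56 s within each chunk, an approximation that ignores the frontend's edge effects. Segment and toSegments are hypothetical names used only for illustration.

```kotlin
data class Segment(val speaker: Int, val startSec: Double, val endSec: Double)

/** Merges runs of consecutive active frames into one time segment per run, per speaker. */
fun toSegments(activity: Array<BooleanArray>, framesPerSecond: Double = 56.0): List<Segment> {
    val segments = mutableListOf<Segment>()
    val numSpeakers = activity.firstOrNull()?.size ?: return segments
    for (s in 0 until numSpeakers) {
        var start = -1                                   // -1 means "not inside a segment"
        for (f in 0..activity.size) {                    // one extra step closes a trailing run
            val active = f < activity.size && activity[f][s]
            if (active && start < 0) start = f
            if (!active && start >= 0) {
                segments += Segment(s, start / framesPerSecond, f / framesPerSecond)
                start = -1
            }
        }
    }
    return segments.sortedBy { it.startSec }
}

fun main() {
    // One 56-frame chunk: speaker 0 talks for frames 0..27, speaker 1 for frames 14..55.
    val activity = Array(56) { f -> booleanArrayOf(f < 28, f >= 14, false) }
    toSegments(activity).forEach(::println)
    // prints:
    // Segment(speaker=0, startSec=0.0, endSec=0.5)
    // Segment(speaker=1, startSec=0.25, endSec=1.0)
}
```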

Source

Upstream: pyannote/segmentation-3.0 (MIT, gated; accept the license on the upstream page).

Links

Ecosystem

  • soniqo.audio: use-case explorer (transcription, voice cloning, live ASR, voice agents).
  • speech-core: C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
  • speech-swift: Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
  • speech-android: Android SDK consuming on-device LiteRT bundles.

Other LiteRT models in this collection

ASR / Transcription

VAD / Diarization

TTS / Voice Cloning

License

This bundle inherits the upstream model license (MIT). See the linked base_model repository for the full terms.
