	---
	license: mit
	language:
	- multilingual
	tags:
	- speaker-diarization
	- voice-activity-detection
	- pyannote
	- diarization
	- litert
	- tflite
	- on-device
	- soniqo
	- speech-cloud
	- speech-core
	base_model: pyannote/segmentation-3.0
	library_name: litert
	pipeline_tag: voice-activity-detection
	---

	# Pyannote Segmentation 3.0 — LiteRT

	Speaker-aware segmentation for diarization pipelines. 16 kHz input, processed as 1-second streaming chunks.

	> Part of the [soniqo.audio](https://soniqo.audio) speech toolkit —
	> an open, runtime-portable stack for speech AI. This bundle is the
	> LiteRT export, designed to plug into the abstract interfaces in
	> [`speech-core`](https://github.com/soniqo/speech-core) (C++ voice-agent
	> orchestration library). Browse all LiteRT bundles in the
	> [soniqo LiteRT collection](https://huggingface.co/collections/soniqo/litert-6a08268e11d5a47d7aacc02b).

	## Use cases on soniqo.audio

	- [Meeting transcription](https://soniqo.audio/transcription/)
	- [Long-form transcription](https://soniqo.audio/long-form-speech/)

	Powerset speaker segmentation (up to 3 local speakers) for Android,
	exported in a streaming 1-second chunk configuration.

	## Model

	| Property | Value |
	|---|---|
	| Architecture | SincNet frontend + 4-layer BiLSTM + linear + powerset head |
	| Parameters | ~1.5 M |
	| Format | LiteRT (TFLite) |
	| Quantization | float32 |
	| Sample rate | 16 000 Hz |
	| Chunk | 1 second (16 000 samples) |
	| Output frames | 56 per chunk |
	| LSTM state | explicit I/O, `[2, 8, 1, 128]` (h+c, 4 layers × 2 directions) |

	## Files

	| File | Size | Description |
	|---|---|---|
	| `pyannote-segmentation.tflite` | 6.93 MB | Full model, FP32 |
	| `config.json` | 1 KB | Signature + usage hints |

	## Why streaming chunks

	pyannote/segmentation-3.0 runs 589 BiLSTM time steps over its trained
	10-second window. litert-torch has no native `aten.lstm` lowering, so it
	unrolls the recurrence into ~4700 cell operations (589 steps × 4 layers ×
	2 directions). The MLIR optimizer then either hangs for hours or fails on
	duplicate `jax_lowering_*` symbols emitted by the repeated helper functions.

	Exporting at 1-second chunks (56 time steps) compiles in ~2 minutes and
	produces a valid TFLite model. To cover the full 10-second window, the
	caller runs 10 chunks in sequence, feeding each call's `lstm_state_out`
	back in as the next call's `lstm_state`. Each chunk produces 56 frames of
	powerset posteriors.
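
The chunk-by-chunk pass described above can be sketched as follows. This is only an illustration, not the SDK's API: `SegmentStep` and `segmentWindow` are hypothetical names, the step function stands in for one interpreter call, and zero-padding a trailing partial chunk is an assumption rather than documented behavior.

```kotlin
const val CHUNK_SAMPLES = 16_000        // 1 s at 16 kHz
const val STATE_SIZE = 2 * 8 * 1 * 128  // flattened [2, 8, 1, 128] LSTM state

// Hypothetical stand-in for one model call: (chunk, state) -> (posteriors, nextState).
typealias SegmentStep = (FloatArray, FloatArray) -> Pair<FloatArray, FloatArray>

// Runs consecutive 1-second chunks, threading the LSTM state from each call
// into the next. Returns one posterior array per chunk.
fun segmentWindow(audio: FloatArray, step: SegmentStep): List<FloatArray> {
    var state = FloatArray(STATE_SIZE)  // zeros on the first chunk
    val results = mutableListOf<FloatArray>()
    var offset = 0
    while (offset < audio.size) {
        val end = minOf(offset + CHUNK_SAMPLES, audio.size)
        // copyOf pads a short final chunk with 0f (assumption, see above).
        val chunk = audio.copyOfRange(offset, end).copyOf(CHUNK_SAMPLES)
        val (posteriors, nextState) = step(chunk, state)
        results += posteriors
        state = nextState
        offset += CHUNK_SAMPLES
    }
    return results
}
```

Passing the step as a function keeps the loop independent of the interpreter, so it can be exercised with a stub before a model is wired in.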

	The SincNet frontend has small per-chunk edge effects: ten chunks yield
	10 × 56 = 560 frames, versus 589 when the original model processes the
	full window in one pass. Where high-precision stitching is required,
	overlap adjacent chunks by ~500 ms at the boundaries.
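
The 56 and 589 frame counts follow from the frontend's unpadded convolutions and pooling. The sketch below reproduces that arithmetic. The layer hyperparameters (a 251-tap sinc convolution with stride 10, then two 5-tap convolutions, each stage followed by max-pooling of 3) are an assumption recalled from the upstream pyannote SincNet implementation, not something this card states, and `outLen` and `sincNetFrames` are hypothetical helper names.

```kotlin
// Output length of an unpadded 1-D convolution or pooling layer.
fun outLen(n: Int, kernel: Int, stride: Int): Int = (n - kernel) / stride + 1

// Frame count after the assumed SincNet stack:
// conv(251, stride 10) -> pool(3) -> conv(5) -> pool(3) -> conv(5) -> pool(3).
fun sincNetFrames(samples: Int): Int {
    var n = outLen(samples, kernel = 251, stride = 10)
    n = outLen(n, kernel = 3, stride = 3)
    n = outLen(n, kernel = 5, stride = 1)
    n = outLen(n, kernel = 3, stride = 3)
    n = outLen(n, kernel = 5, stride = 1)
    n = outLen(n, kernel = 3, stride = 3)
    return n
}
```

Under those assumptions `sincNetFrames(16_000)` gives 56 and `sincNetFrames(160_000)` gives 589, so ten independent chunks produce 29 fewer frames than a single 10-second pass.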

	## Signature

	```
	Inputs:
	  audio           [1, 1, 16000]    float32   1 s of audio @ 16 kHz
	  lstm_state      [2, 8, 1, 128]   float32   (h, c), zeros on first chunk

	Outputs:
	  posteriors      [1, 56, 7]       float32   powerset posteriors
	  lstm_state_out  [2, 8, 1, 128]   float32   next-chunk state
	```

	Powerset classes (7): `{∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3}` — up to 3 local
	speakers, no triple-overlap class.
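
To turn powerset output into per-speaker activity, one option is to take the best class per frame and expand it into its speaker set. A minimal sketch, assuming the class order listed above and a flattened row-major `[frames, 7]` array; `POWERSET` and `decodeFrames` are hypothetical names. The argmax is the same whether the scores are probabilities or log-probabilities.

```kotlin
// Speaker indices (0-based) active in each powerset class, in the order
// listed above: ∅, s1, s2, s3, s1∪s2, s1∪s3, s2∪s3.
val POWERSET: List<Set<Int>> = listOf(
    emptySet(), setOf(0), setOf(1), setOf(2),
    setOf(0, 1), setOf(0, 2), setOf(1, 2)
)
const val NUM_SPEAKERS = 3

// Converts flattened [frames, 7] scores into a [frames][3] speaker-activity mask.
fun decodeFrames(scores: FloatArray): Array<BooleanArray> {
    val numClasses = POWERSET.size
    require(scores.size % numClasses == 0) { "scores length must be a multiple of $numClasses" }
    val frames = scores.size / numClasses
    return Array(frames) { f ->
        // Argmax over the 7 classes of frame f (ties keep the lowest index).
        var best = 0
        for (c in 1 until numClasses) {
            if (scores[f * numClasses + c] > scores[f * numClasses + best]) best = c
        }
        BooleanArray(NUM_SPEAKERS) { s -> s in POWERSET[best] }
    }
}
```

The speaker indices are local to the window being decoded; matching them to global identities across windows is left to the downstream diarization stage.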

	## Usage

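	```kotlin
	import java.nio.FloatBuffer
	import org.tensorflow.lite.Interpreter

	// loadModelFile is an app-provided helper returning the model as a File or MappedByteBuffer.
	val model = Interpreter(loadModelFile("pyannote-segmentation.tflite"))
	var state = FloatArray(2 * 8 * 1 * 128) // zeros on the first call

	fun segment(chunk: FloatArray): FloatArray {
	    require(chunk.size == 16_000) { "expected 1 s of 16 kHz audio" }
	    val out = FloatArray(1 * 56 * 7)
	    val nextState = FloatArray(state.size)
	    // runSignature is keyed by tensor name, not index. The names below follow the
	    // Signature section; check config.json if the export uses different ones.
	    model.runSignature(
	        mapOf<String, Any>("audio" to FloatBuffer.wrap(chunk), "lstm_state" to FloatBuffer.wrap(state)),
	        mapOf<String, Any>("posteriors" to FloatBuffer.wrap(out), "lstm_state_out" to FloatBuffer.wrap(nextState)),
	    )
	    state = nextState // carry the LSTM state into the next chunk
	    return out // [56, 7] log-probs
	}
	```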

	## Source

	Upstream: [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
	(MIT, gated — accept the license on the upstream page).

	## Links

	- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
	- [soniqo.audio](https://soniqo.audio) — website
	- [blog](https://soniqo.audio/blog) — blog

	## Ecosystem

	- [soniqo.audio](https://soniqo.audio) — use-case explorer (transcription, voice cloning, live ASR, voice agents).
	- [speech-core](https://github.com/soniqo/speech-core) — C++ orchestration library for voice agents. Abstract `STTInterface` / `TTSInterface` / `VADInterface` / `EnhancerInterface`; LiteRT implementations plug straight into the interfaces.
	- [speech-swift](https://github.com/soniqo/speech-swift) — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
	- [speech-android](https://github.com/soniqo/speech-android) — Android SDK consuming on-device LiteRT bundles.

	## Other LiteRT models in this collection

	ASR / Transcription

	- [Parakeet TDT 0.6B v3 — LiteRT (INT8)](https://huggingface.co/soniqo/Parakeet-TDT-0.6B-v3-LiteRT-INT8)
	- [Nemotron Speech Streaming 0.6B — LiteRT](https://huggingface.co/soniqo/Nemotron-Speech-Streaming-LiteRT)
	- [Omnilingual ASR CTC 300M — LiteRT](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT)
	- [Omnilingual ASR CTC 300M — LiteRT (INT8)](https://huggingface.co/soniqo/Omnilingual-ASR-CTC-300M-LiteRT-INT8)
	- [Qwen3 ASR 0.6B Encoder — LiteRT (INT8)](https://huggingface.co/soniqo/Qwen3-ASR-0.6B-Encoder-LiteRT-INT8)

	VAD / Diarization

	- [Silero VAD v5 — LiteRT](https://huggingface.co/soniqo/Silero-VAD-v5-LiteRT)
	- [WeSpeaker ResNet34-LM — LiteRT](https://huggingface.co/soniqo/WeSpeaker-ResNet34-LM-LiteRT)

	TTS / Voice Cloning

	- [VoxCPM2 — LiteRT (INT8)](https://huggingface.co/soniqo/VoxCPM2-LiteRT-INT8)

	## License

	This bundle inherits the upstream model license (MIT). See the
	linked `base_model` repository for the full terms.