---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---
# Sortformer Diarization (CoreML)
CoreML port of NVIDIA Sortformer for end-to-end streaming speaker diarization on Apple Silicon.
Runs on the Neural Engine via CoreML. No separate embedding extraction or clustering step — the model directly predicts per-frame speaker activity for up to 4 speakers.
## Model Details
- Architecture: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- Base model: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- Task: Speaker diarization (up to 4 speakers)
- Input: 128-dim log-mel features, streamed in chunks
- Output: Per-frame speaker activity probabilities (sigmoid)
- Format: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- Size: ~230 MB
## Streaming Configuration
| Parameter | Value |
|---|---|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |
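The table above implies the model's time resolution: each mel frame covers hop / sample-rate = 10 ms, and 8× subsampling makes each diarization output frame 80 ms. A minimal sketch of that arithmetic (constant names are illustrative, not part of any API):

```swift
// Illustrative constants mirroring the streaming configuration above.
let sampleRate: Double = 16_000
let hopLength: Double = 160
let subsampling: Double = 8

// Each mel frame spans hop / rate seconds.
let melFrameSeconds = hopLength / sampleRate          // 10 ms

// After 8x subsampling, each prediction frame spans 80 ms.
let predFrameSeconds = melFrameSeconds * subsampling  // 80 ms
```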
## Input/Output Shapes

Inputs:

| Tensor | Shape | Description |
|---|---|---|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |
Outputs:

| Tensor | Shape | Description |
|---|---|---|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |
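The output length follows from the shapes above: the model scores the concatenation of the speaker cache (188 frames), the FIFO (40 frames), and the new chunk embeddings (112 mel frames / 8 = 14 frames), giving the 242 prediction frames in `speaker_preds_out`. A sketch of that bookkeeping (names are illustrative):

```swift
// Illustrative constants taken from the shape tables above.
let spkcacheLen = 188   // speaker cache frames
let fifoLen = 40        // FIFO buffer frames
let chunkEmbLen = 112 / 8  // pre-encoder embeddings per chunk = 14

// Activity is predicted over the concatenated sequence.
let predFrames = spkcacheLen + fifoLen + chunkEmbLen  // 242
```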
## Usage

Used by speech-swift for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```
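If you work from the raw per-frame probabilities instead, turning them into segments is a simple thresholding pass. The sketch below is a hypothetical helper, not part of the speech-swift API; the 0.5 threshold and 80 ms frame duration are assumptions:

```swift
// Convert per-frame speaker probabilities (shape [frames][speakers])
// into contiguous segments per speaker. Hypothetical helper.
struct Segment { let speakerId: Int; let startTime: Double; let endTime: Double }

func segments(from probs: [[Double]],
              threshold: Double = 0.5,       // assumed activity threshold
              frameSeconds: Double = 0.08)   // assumed 80 ms per frame
    -> [Segment] {
    var result: [Segment] = []
    let numSpeakers = probs.first?.count ?? 0
    for spk in 0..<numSpeakers {
        var start: Int? = nil
        for (i, frame) in probs.enumerated() {
            let active = frame[spk] >= threshold
            if active && start == nil { start = i }
            if !active, let s = start {
                result.append(Segment(speakerId: spk,
                                      startTime: Double(s) * frameSeconds,
                                      endTime: Double(i) * frameSeconds))
                start = nil
            }
        }
        if let s = start {  // close a segment that runs to the end
            result.append(Segment(speakerId: spk,
                                  startTime: Double(s) * frameSeconds,
                                  endTime: Double(probs.count) * frameSeconds))
        }
    }
    return result
}
```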
## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

- PreEncoder (model0) — runs `pre_encode` on the mel chunk and concatenates the result with the speaker cache and FIFO state
- Head (model1) — full FastConformer encoder + Transformer + sigmoid speaker heads
State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
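A sketch of what the Swift-side FIFO rotation could look like; the 40-frame capacity comes from the configuration above, while the function name and the compression step (omitted here) are illustrative:

```swift
// Illustrative FIFO update: append the chunk's new embeddings
// (each a 512-dim frame), then spill the oldest frames once the
// 40-frame capacity is exceeded. Spilled frames would feed the
// speaker-cache compression step, which is not shown.
let fifoCapacity = 40

func updateFIFO(fifo: inout [[Float]],
                newEmbeddings: [[Float]]) -> [[Float]] {
    fifo.append(contentsOf: newEmbeddings)
    let overflow = max(0, fifo.count - fifoCapacity)
    let spilled = Array(fifo.prefix(overflow))
    fifo.removeFirst(overflow)
    return spilled
}
```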
## License
CC-BY-4.0