---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---

# Sortformer Diarization (CoreML)
|
|
CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon.

Runs on the Neural Engine via CoreML. No separate embedding extraction or clustering — the model directly predicts per-frame speaker activity for up to 4 speakers.
|
|
## Model Details

- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- **Size**: ~230 MB
|
|
## Streaming Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6 s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |
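These parameters fix the model's time resolution: with a 16 kHz sample rate, a 160-sample hop, and 8× subsampling, each output frame covers 8 × 160 / 16000 = 0.08 s of audio. A minimal sketch of the frame-to-time conversion (helper names are illustrative, not part of the model API):

```swift
// Streaming parameters from the configuration table above.
let sampleRate = 16_000.0
let hopLength = 160.0
let subsamplingFactor = 8.0

// Each post-subsampling output frame covers 8 * 160 / 16000 = 0.08 s of audio.
let frameDuration = subsamplingFactor * hopLength / sampleRate

// Map an output frame index to its start time in seconds.
func frameToSeconds(_ frame: Int) -> Double {
    Double(frame) * frameDuration
}

// Map a timestamp to the nearest output frame index.
func secondsToFrame(_ seconds: Double) -> Int {
    Int((seconds / frameDuration).rounded())
}
```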

## Input/Output Shapes

**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |
|
|
**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |
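The shapes are consistent with the streaming configuration: 112 mel frames collapse to 14 embedding frames after 8× subsampling, and the 242 prediction frames cover speaker cache + FIFO + current chunk. The bookkeeping, spelled out (variable names are illustrative):

```swift
// Shape bookkeeping for one streaming step, using the table values above.
let melFramesPerChunk = 112    // mel frames in the `chunk` input
let subsamplingFactor = 8      // FastConformer pre-encode subsampling
let spkcacheFrames = 188       // speaker cache capacity
let fifoFrames = 40            // FIFO buffer capacity

// 112 mel frames become 14 embedding frames after 8x subsampling,
// matching `chunk_pre_encoder_embs_out`.
let chunkEmbFrames = melFramesPerChunk / subsamplingFactor

// The head predicts over cache + FIFO + chunk, hence 242 output frames,
// matching `speaker_preds_out`.
let predictionFrames = spkcacheFrames + fifoFrames + chunkEmbFrames
```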
|
|
## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```
|
|
```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```
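The segments come from thresholding the per-frame sigmoid outputs into contiguous runs per speaker. A simplified sketch of that post-processing (the `Segment` type, 0.5 threshold, and helper name are assumptions, not the library's actual implementation):

```swift
struct Segment {
    let speakerId: Int
    let startTime: Double
    let endTime: Double
}

// Collapse per-frame activity probabilities [frames][speakers] into
// contiguous segments per speaker. Each frame spans 0.08 s
// (hop 160 x subsampling 8 at 16 kHz); the 0.5 threshold is illustrative.
func segmentsFrom(preds: [[Double]], threshold: Double = 0.5,
                  frameDuration: Double = 0.08) -> [Segment] {
    var segments: [Segment] = []
    let numSpeakers = preds.first?.count ?? 0
    for spk in 0..<numSpeakers {
        var start: Int? = nil
        for (t, frame) in preds.enumerated() {
            let active = frame[spk] >= threshold
            if active && start == nil {
                start = t                      // segment opens here
            } else if !active, let s = start {
                segments.append(Segment(speakerId: spk,
                                        startTime: Double(s) * frameDuration,
                                        endTime: Double(t) * frameDuration))
                start = nil
            }
        }
        if let s = start {                     // close a segment still open at the end
            segments.append(Segment(speakerId: spk,
                                    startTime: Double(s) * frameDuration,
                                    endTime: Double(preds.count) * frameDuration))
        }
    }
    return segments
}
```

A real implementation would typically also merge segments separated by very short gaps and drop segments below a minimum duration.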
|
|
## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

1. **PreEncoder** (model0) — Runs `pre_encode` on the mel chunk, concatenates with speaker cache and FIFO state
2. **Head** (model1) — Full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
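That host-side state update can be sketched roughly as follows. This is a pure-Swift illustration under stated assumptions: the type names are invented, and the cache here simply drops its oldest frames when full, whereas the actual Sortformer speaker cache is compressed using speaker-activity scores:

```swift
// Simplified streaming state: one 512-dim embedding per post-subsampling frame.
typealias Frame = [Float]

struct StreamingState {
    var spkcache: [Frame] = []   // long-term state, up to 188 frames
    var fifo: [Frame] = []       // short-term buffer, up to 40 frames
    let spkcacheCapacity = 188
    let fifoCapacity = 40

    // Append the chunk's pre-encoder embeddings to the FIFO; frames that
    // overflow the FIFO rotate into the speaker cache. When the cache is
    // full, drop its oldest frames (illustrative only — the real model
    // compresses the cache by speaker-activity scores instead).
    mutating func update(with chunkEmbs: [Frame]) {
        fifo.append(contentsOf: chunkEmbs)
        if fifo.count > fifoCapacity {
            let overflow = fifo.count - fifoCapacity
            spkcache.append(contentsOf: fifo.prefix(overflow))
            fifo.removeFirst(overflow)
        }
        if spkcache.count > spkcacheCapacity {
            spkcache.removeFirst(spkcache.count - spkcacheCapacity)
        }
    }
}
```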
|
|
## License

CC-BY-4.0
|
|