---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---
# Sortformer Diarization (CoreML)
CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon.
Runs on the Neural Engine via CoreML. No separate embedding extraction or clustering — the model directly predicts per-frame speaker activity for up to 4 speakers.
## Model Details
- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- **Size**: ~230 MB
## Streaming Configuration
| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |
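Under these settings, one encoder frame after 8× subsampling covers 160 × 8 / 16000 = 80 ms of audio, so the 188-frame speaker cache spans roughly 15 s and the 40-frame FIFO roughly 3.2 s. A minimal sketch of that arithmetic (variable names are illustrative, not part of any API; only the numeric values come from the table above):

```swift
// Derived timing for the streaming configuration above.
let sampleRate = 16_000.0
let hopLength = 160.0
let subsampling = 8.0

// One encoder frame after 8x subsampling: 160 * 8 / 16000 = 0.08 s.
let secondsPerFrame = hopLength * subsampling / sampleRate

let spkcacheSeconds = 188 * secondsPerFrame  // ~15 s of speaker-cache history
let fifoSeconds = 40 * secondsPerFrame       // ~3.2 s of FIFO history
```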
## Input/Output Shapes
**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |
**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |
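These shapes are internally consistent: the 112 mel frames of a chunk subsample by a factor of 8 to 14 embedding frames, and the 242 prediction frames cover the concatenated speaker cache (188), FIFO (40), and current chunk (14). A quick bookkeeping check (illustrative only, values taken from the tables above):

```swift
// Shape bookkeeping for one streaming step.
let chunkMelFrames = 112
let subsamplingFactor = 8
let spkcacheFrames = 188
let fifoFrames = 40

let chunkEncFrames = chunkMelFrames / subsamplingFactor        // 14, matches chunk_pre_encoder_embs_out
let predFrames = spkcacheFrames + fifoFrames + chunkEncFrames  // 242, matches speaker_preds_out
assert(chunkEncFrames == 14 && predFrames == 242)
```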
## Usage
Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization:
```bash
audio diarize meeting.wav --engine sortformer
```
Or from Swift:
```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```
## Pipeline Architecture
The model is a CoreML pipeline with two sub-models:
1. **PreEncoder** (model0) — Runs `pre_encode` on the mel chunk, concatenates with speaker cache and FIFO state
2. **Head** (model1) — Full FastConformer encoder + Transformer + sigmoid speaker heads
State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
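The model itself is stateless between calls: the host feeds updated `spkcache` and `fifo` tensors on every step. A hedged sketch of the FIFO side of that loop, assuming each step appends the 14 new chunk embeddings and evicts overflowing frames toward the speaker cache (the actual speech-swift implementation, and the real score-based compression, may differ):

```swift
// Illustrative host-side state update between streaming steps;
// not the speech-swift API. Each embedding frame is a [512]-dim vector.
struct StreamingState {
    var spkcache: [[Float]] = []  // up to 188 frames
    var fifo: [[Float]] = []      // up to 40 frames

    mutating func push(chunkEmbeddings: [[Float]]) {
        fifo.append(contentsOf: chunkEmbeddings)
        // Rotate: frames overflowing the 40-frame FIFO move toward the
        // speaker cache. The real compression step scores and merges
        // frames; a plain append is used here as a placeholder.
        while fifo.count > 40 {
            let evicted = fifo.removeFirst()
            if spkcache.count < 188 {
                spkcache.append(evicted)
            }
            // else: speaker-cache compression would reclaim space here.
        }
    }
}
```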
## License
CC-BY-4.0