---
license: cc-by-4.0
tags:
  - speaker-diarization
  - coreml
  - apple-silicon
  - neural-engine
  - sortformer
datasets:
  - voxconverse
language:
  - en
pipeline_tag: audio-classification
---

# Sortformer Diarization (CoreML)

CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon.

Runs on the Apple Neural Engine via CoreML. There is no separate embedding-extraction or clustering stage: the model directly predicts per-frame speaker activity for up to 4 speakers.

## Model Details

- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- **Size**: ~230 MB
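Because the output is per-frame sigmoid probabilities rather than hard labels, downstream code typically binarizes them with a threshold. A minimal sketch (the 0.5 threshold is an illustrative choice, not a value shipped with this repo):

```python
import numpy as np

def binarize_speaker_preds(preds: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn [1, T, 4] sigmoid probabilities into per-frame speaker activity.

    `threshold` is an illustrative default; tune it on held-out data.
    Returns a boolean array of the same shape: True where a speaker is active.
    """
    return preds >= threshold

# One frame, 4 speakers; two speakers over threshold means overlapped speech.
preds = np.array([[[0.9, 0.1, 0.6, 0.0]]], dtype=np.float32)
active = binarize_speaker_preds(preds)
```

Frames where several speakers clear the threshold at once correspond to overlapped speech, which Sortformer handles natively.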

## Streaming Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |

## Input/Output Shapes

**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |

**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |
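The shapes above are internally consistent: the 112-frame mel chunk subsamples (÷8) to the 14 pre-encoder embeddings, and the 242 prediction frames match the Head operating on the concatenated speaker cache, FIFO, and chunk sequence described under Pipeline Architecture. A quick check:

```python
CHUNK_MEL_FRAMES = 112
SUBSAMPLING = 8
SPKCACHE_LEN = 188
FIFO_LEN = 40

chunk_embs = CHUNK_MEL_FRAMES // SUBSAMPLING          # 14, matches chunk_pre_encoder_embs_out
total_frames = SPKCACHE_LEN + FIFO_LEN + chunk_embs   # 242, matches speaker_preds_out
```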

## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```
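The compiled pipeline can also be driven directly, for example from Python with `coremltools`. The streaming states start zero-initialized with the shapes from the input table above. A sketch; the `CompiledMLModel` load, the model filename, and the input dtypes are assumptions, and the `predict` call requires macOS with the model downloaded:

```python
import numpy as np

def initial_states() -> dict:
    """Zero-initialized streaming state, matching the input table above."""
    return {
        "spkcache": np.zeros((1, 188, 512), dtype=np.float32),
        "spkcache_lengths": np.zeros((1,), dtype=np.int32),
        "fifo": np.zeros((1, 40, 512), dtype=np.float32),
        "fifo_lengths": np.zeros((1,), dtype=np.int32),
    }

# On macOS (illustrative only; path and dtypes are assumptions):
# import coremltools as ct
# model = ct.models.CompiledMLModel("SortformerDiarizer.mlmodelc")
# out = model.predict({
#     "chunk": mel_chunk,                                  # [1, 112, 128] float32
#     "chunk_lengths": np.array([112], dtype=np.int32),
#     **initial_states(),
# })
```

After each call, the caller feeds `chunk_pre_encoder_embs_out` back into the FIFO/speaker-cache update before the next chunk.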

## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

1. **PreEncoder** (model0) — Runs `pre_encode` on the mel chunk, concatenates with speaker cache and FIFO state
2. **Head** (model1) — Full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
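As a rough illustration of that host-side state update (not the actual speech-swift code; in particular, the speaker-cache compression step is simplified here to keeping the most recent frames):

```python
import numpy as np

def update_states(fifo, fifo_len, spkcache, spkcache_len, chunk_embs, chunk_len):
    """One streaming state update (simplified sketch, not the speech-swift code).

    Appends the chunk's pre-encoder embeddings to the FIFO; frames that no
    longer fit spill into the speaker cache. The real implementation compresses
    the cache (selecting informative frames) rather than truncating it.
    """
    max_fifo, max_cache = fifo.shape[1], spkcache.shape[1]

    # Append new chunk embeddings to the valid part of the FIFO.
    buf = np.concatenate([fifo[:, :fifo_len], chunk_embs[:, :chunk_len]], axis=1)

    # Oldest frames that overflow the FIFO spill into the speaker cache.
    overflow = max(0, buf.shape[1] - max_fifo)
    spilled, kept = buf[:, :overflow], buf[:, overflow:]
    cache = np.concatenate([spkcache[:, :spkcache_len], spilled], axis=1)[:, -max_cache:]

    def pad(x, n):  # re-pad to the model's fixed state shapes
        out = np.zeros((1, n, x.shape[2]), dtype=np.float32)
        out[:, : x.shape[1]] = x
        return out

    return pad(kept, max_fifo), kept.shape[1], pad(cache, max_cache), cache.shape[1]
```

With a 40-frame FIFO and 14 new embeddings per chunk, the FIFO fills after three chunks and begins spilling into the speaker cache.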

## License

CC-BY-4.0