---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---

# Sortformer Diarization (CoreML)

CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon.

Runs on the Neural Engine via CoreML. There is no separate embedding extraction or clustering stage — the model directly predicts per-frame speaker activity for up to 4 speakers.

## Model Details

- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline of 2 sub-models)
- **Size**: ~230 MB

## Streaming Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6 s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |

## Input/Output Shapes

**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for the current chunk |
| `chunk_lengths` | `[1]` | Valid frames in the chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in the speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in the FIFO |

**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for the state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |

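The shapes above are mutually consistent, which is handy when wiring up the streaming loop: the 112 input mel frames reduce to the 14 chunk embeddings after 8× subsampling, and the 242 prediction frames match the speaker cache, FIFO, and chunk concatenated. A quick arithmetic check derived from the tables:

```python
# Shape relationships implied by the configuration and I/O tables.
CHUNK_MEL_FRAMES = 112   # `chunk` input: [1, 112, 128]
SUBSAMPLING = 8          # FastConformer subsampling factor
SPKCACHE_LEN = 188       # `spkcache`: [1, 188, 512]
FIFO_LEN = 40            # `fifo`: [1, 40, 512]

chunk_embs = CHUNK_MEL_FRAMES // SUBSAMPLING
assert chunk_embs == 14                 # `chunk_pre_encoder_embs_out`: [1, 14, 512]

pred_frames = SPKCACHE_LEN + FIFO_LEN + chunk_embs
assert pred_frames == 242               # `speaker_preds_out`: [1, 242, 4]

# Each prediction frame spans hop_length * subsampling / sample_rate seconds:
print(160 * SUBSAMPLING / 16000)        # 0.08 -> 80 ms per prediction frame
```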
## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization.

From the CLI:

```bash
audio diarize meeting.wav --engine sortformer
```

From Swift:

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```

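The Swift API above returns merged segments, but the raw model output is per-frame probabilities, so a thin post-processing step has to threshold frames and merge contiguous runs. A minimal Python sketch of that step (the `frames_to_segments` helper, the 0.5 threshold, and the 80 ms frame duration are illustrative assumptions, not part of the published API):

```python
FRAME_SEC = 0.08   # assumed frame duration: hop 160 * subsampling 8 / 16 kHz
THRESHOLD = 0.5    # illustrative activity threshold
MAX_SPEAKERS = 4

def frames_to_segments(probs, frame_sec=FRAME_SEC, threshold=THRESHOLD):
    """probs: rows of [p_spk0, ..., p_spk3], one row per prediction frame."""
    segments = []                      # (speaker, start_sec, end_sec)
    start = [None] * MAX_SPEAKERS      # open-segment start frame per speaker
    for t in range(len(probs) + 1):    # one extra step flushes open segments
        for spk in range(MAX_SPEAKERS):
            active = t < len(probs) and probs[t][spk] > threshold
            if active and start[spk] is None:
                start[spk] = t         # segment opens
            elif not active and start[spk] is not None:
                segments.append((spk, start[spk] * frame_sec, t * frame_sec))
                start[spk] = None      # segment closes

    return segments

# Two frames of speaker 0 overlapping one frame of speaker 1:
segs = frames_to_segments([[0.9, 0.1, 0, 0], [0.8, 0.6, 0, 0], [0.1, 0.9, 0, 0]])
print(segs)
```

Because activity is tracked per speaker, overlapping speech simply yields overlapping segments, which is the point of an end-to-end diarizer.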
## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

1. **PreEncoder** (model0) — runs `pre_encode` on the mel chunk and concatenates the result with the speaker cache and FIFO state
2. **Head** (model1) — full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift, outside the model.

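A rough Python sketch of that outer state loop, with the speaker-cache compression step replaced by naive truncation (the names and the truncation policy are illustrative stand-ins, not the actual speech-swift implementation):

```python
SPKCACHE_LEN, FIFO_LEN, CHUNK_EMBS = 188, 40, 14

def update_state(spkcache, fifo, chunk_embs):
    """Push new chunk embeddings into the FIFO; spill overflow into the cache."""
    fifo = fifo + chunk_embs
    if len(fifo) > FIFO_LEN:
        spill, fifo = fifo[:-FIFO_LEN], fifo[-FIFO_LEN:]
        # Stand-in for the real speaker-cache compression: just truncate.
        spkcache = (spkcache + spill)[:SPKCACHE_LEN]
    return spkcache, fifo

spkcache, fifo = [], []               # per-frame embedding placeholders
for chunk_idx in range(5):            # five 14-embedding chunks
    embs = [chunk_idx] * CHUNK_EMBS   # placeholders for 512-dim embeddings
    spkcache, fifo = update_state(spkcache, fifo, embs)

print(len(spkcache), len(fifo))       # 30 40
```

The model itself only ever sees fixed-size `spkcache` and `fifo` tensors; `spkcache_lengths` and `fifo_lengths` tell it how many entries are valid.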
## License

CC-BY-4.0