Commit da643fc (verified, parent 070f513) by aufklarer: Upload README.md with huggingface_hub
---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---

# Sortformer Diarization (CoreML)

CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon.

Runs on the Neural Engine via CoreML. There is no separate embedding-extraction or clustering stage: the model directly predicts per-frame speaker activity for up to 4 speakers.

## Model Details

- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- **Size**: ~230 MB

## Streaming Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6 s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |

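As a quick sanity check on these numbers (a derivation, not part of the original card): a hop of 160 samples at 16 kHz yields 100 mel frames per second, and the 8× subsampling reduces that to 12.5 prediction frames per second, i.e. 80 ms per frame.

```python
# Frame-rate arithmetic implied by the streaming configuration above.
sample_rate = 16_000  # Hz
hop_length = 160      # samples per mel frame
subsampling = 8       # encoder subsampling factor

mel_fps = sample_rate / hop_length  # mel frames per second
out_fps = mel_fps / subsampling     # prediction frames per second
frame_ms = 1000 / out_fps           # duration of one prediction frame

print(mel_fps, out_fps, frame_ms)  # 100.0 12.5 80.0
```
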
## Input/Output Shapes

**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |

**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |

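These shapes are internally consistent, which is worth verifying when wiring up the buffers: the 112 mel frames of a chunk reduce to 14 pre-encoder frames under 8× subsampling, and the prediction length of 242 is the concatenation of speaker cache (188), FIFO (40), and chunk (14) frames.

```python
# Consistency check on the I/O shapes above.
mel_frames = 112   # chunk input frames
subsampling = 8
spkcache_len = 188
fifo_len = 40

chunk_frames = mel_frames // subsampling              # pre-encoder frames per chunk
pred_frames = spkcache_len + fifo_len + chunk_frames  # speaker_preds_out length

print(chunk_frames, pred_frames)  # 14 242
```
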
## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```

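The model's per-frame sigmoid outputs must be post-processed into time-stamped segments. A minimal sketch of that step (in Python for illustration; the threshold of 0.5, the absence of smoothing, and the 80 ms frame duration are assumptions, not the library's actual logic):

```python
# Hypothetical post-processing: threshold per-frame sigmoid outputs and
# merge contiguous active frames into per-speaker segments.
FRAME_MS = 80  # one prediction frame at 12.5 fps (hop 160, 8x subsampling)

def probs_to_segments(probs, threshold=0.5):
    """probs: [num_frames][num_speakers] activity probabilities.
    Returns (speaker, start_ms, end_ms) tuples. Simplified: no smoothing
    or minimum-duration filtering."""
    num_frames = len(probs)
    num_speakers = len(probs[0]) if num_frames else 0
    segments = []
    for spk in range(num_speakers):
        start = None
        for t in range(num_frames):
            active = probs[t][spk] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((spk, start * FRAME_MS, t * FRAME_MS))
                start = None
        if start is not None:
            segments.append((spk, start * FRAME_MS, num_frames * FRAME_MS))
    return segments

# Two speakers over five frames: speaker 0 speaks first, then speaker 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.2, 0.9], [0.1, 0.8]]
print(probs_to_segments(probs))  # [(0, 0, 240), (1, 240, 400)]
```
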
## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

1. **PreEncoder** (model0): runs `pre_encode` on the mel chunk and concatenates the result with the speaker cache and FIFO state
2. **Head** (model1): full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.

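The host-side update is not specified in detail here, but the buffer mechanics can be sketched as follows (a hypothetical Python sketch: appending the chunk's 14 embeddings to the FIFO and spilling overflow into the speaker cache are assumptions, and oldest-entry eviction stands in for the model's real cache compression):

```python
# Sketch of host-side streaming state management (hypothetical; the real
# implementation lives in Swift and uses learned speaker-cache compression).
SPKCACHE_LEN = 188  # speaker cache capacity (frames)
FIFO_LEN = 40       # FIFO buffer capacity (frames)

def update_state(spkcache, fifo, chunk_embs):
    """Append the chunk's pre-encoder embeddings to the FIFO and move any
    overflow into the speaker cache. Oldest-entry eviction here is only a
    placeholder for the model's actual cache compression."""
    fifo = fifo + chunk_embs
    overflow = fifo[:-FIFO_LEN] if len(fifo) > FIFO_LEN else []
    fifo = fifo[-FIFO_LEN:]
    spkcache = (spkcache + overflow)[-SPKCACHE_LEN:]
    return spkcache, fifo

# Three 14-frame chunks: the FIFO fills to 40 and 2 frames spill to the cache.
spkcache, fifo = [], []
for chunk_id in range(3):
    spkcache, fifo = update_state(spkcache, fifo, [chunk_id] * 14)
print(len(fifo), len(spkcache))  # 40 2
```
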
## License

CC-BY-4.0