alexwengg committed
Commit 435fb20 · verified · 1 Parent(s): 2d00bb3

Upload 25 files

Files changed (1)
  1. README.md +75 -125
README.md CHANGED
@@ -1,137 +1,87 @@
- # Streaming Sortformer CoreML
-
- CoreML conversion of NVIDIA's Streaming Sortformer 4-Speaker Diarization model for Apple Silicon.
-
- ## Original Model
-
- - **Source**: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
- - **Paper**: [Sortformer: Seamless Integration of Speaker Diarization and ASR](https://arxiv.org/abs/2409.06656)
- - **Benchmark**: 20.57% DER on AMI SDM (NVIDIA reported)
-
- ## Models
-
- | Model | Description | Input | Output |
- |-------|-------------|-------|--------|
- | `Pipeline_Preprocessor.mlpackage` | Mel spectrogram extraction | Audio waveform | 128-dim mel features |
- | `Pipeline_PreEncoder.mlpackage` | FastConformer encoder + Transformer | Mel features + state | Encoded embeddings |
- | `Pipeline_Head_Fixed.mlpackage` | Speaker prediction head | Embeddings | 4-speaker probabilities |
 
  ## Configuration
 
- ```python
- CONFIG = {
-     "chunk_len": 6,             # Core chunk length (encoder frames)
-     "chunk_left_context": 1,    # Left context frames
-     "chunk_right_context": 7,   # Right context frames
-     "fifo_len": 188,            # FIFO buffer length
-     "spkcache_len": 188,        # Speaker cache length
-     "subsampling_factor": 8,    # 8x subsampling (80ms per encoder frame)
-     "sample_rate": 16000,
-     "mel_features": 128,
-     "n_speakers": 4,
- }
- ```
-
- ## Usage
-
- ### Python (coremltools)
-
- ```python
- import coremltools as ct
- import numpy as np
-
- # Load models
- pre_encoder = ct.models.MLModel("Pipeline_PreEncoder.mlpackage",
-                                 compute_units=ct.ComputeUnit.CPU_ONLY)
- head = ct.models.MLModel("Pipeline_Head_Fixed.mlpackage",
-                          compute_units=ct.ComputeUnit.CPU_ONLY)
-
- # Initialize state
- spkcache = np.zeros((1, 188, 512), dtype=np.float32)
- fifo = np.zeros((1, 188, 512), dtype=np.float32)
-
- # Process chunk (mel_features: [1, 112, 128])
- pre_out = pre_encoder.predict({
-     "chunk": mel_features,
-     "chunk_lengths": np.array([actual_length], dtype=np.int32),
-     "spkcache": spkcache,
-     "spkcache_lengths": np.array([0], dtype=np.int32),
-     "fifo": fifo,
-     "fifo_lengths": np.array([0], dtype=np.int32)
- })
-
- head_out = head.predict({
-     "pre_encoder_embs": pre_out["pre_encoder_embs"],
-     "pre_encoder_lengths": pre_out["pre_encoder_lengths"],
-     "chunk_embs_in": pre_out["chunk_embs_in"],
-     "chunk_lens_in": pre_out["chunk_lens_in"]
- })
-
- predictions = head_out["speaker_preds"]  # [1, T, 4]
- ```
-
- ### Swift (Core ML)
 
  ```swift
- import CoreML
-
- let preEncoder = try MLModel(contentsOf: preEncoderURL)
- let head = try MLModel(contentsOf: headURL)
-
- // Create input with MLMultiArray for chunk, spkcache, fifo
- let preEncoderInput = try preEncoder.prediction(from: inputProvider)
- let headInput = try head.prediction(from: preEncoderInput)
-
- let predictions = headInput.featureValue(for: "speaker_preds")
- ```
-
- ## Mel Spectrogram Settings
-
- For compatibility with the original NeMo model:
-
- ```python
- mel_config = {
-     "sample_rate": 16000,
-     "n_fft": 512,
-     "win_length": 400,   # 25ms
-     "hop_length": 160,   # 10ms
-     "n_mels": 128,
-     "preemph": 0.97,
-     "log_zero_guard_value": 2**-24,
-     "normalize": "per_feature",
 }
 ```
 
- ## Streaming Pipeline
-
- 1. **Chunk audio** into ~480ms windows (48 mel frames core + context)
- 2. **Compute mel spectrogram** for each chunk
- 3. **Run PreEncoder** with current state (spkcache + fifo)
- 4. **Run Head** to get 4-speaker probabilities
- 5. **Update state** (spkcache/fifo buffers)
- 6. **Threshold predictions** (default: 0.5) for binary speaker activity
-
- ## Accuracy
-
- Verified within 0.12% of original NeMo PyTorch model on chunk-level predictions.
-
- ## Requirements
-
- - macOS 12+ or iOS 15+
- - Apple Silicon (M1/M2/M3) recommended
- - Python: `coremltools`, `numpy`, `torch`, `torchaudio`
-
- ## License
-
- Apache 2.0 (following NVIDIA NeMo licensing)
-
- ## Citation
-
- ```bibtex
- @article{park2024sortformer,
-   title={Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens},
-   author={Park, Taejin and Huang, He and Koluguri, Nithin and Georgiou, Panagiotis and Watanabe, Shinji and Ginsburg, Boris},
-   journal={arXiv preprint arXiv:2409.06656},
-   year={2024}
- }
- ```
+ # Sortformer CoreML Models - Gradient Descent Configuration
+
+ Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.
 
  ## Configuration
 
+ **Gradient Descent** - Higher quality, more context:
+
+ | Parameter | Value |
+ |-----------|-------|
+ | chunk_len | 6 |
+ | chunk_right_context | 7 |
+ | chunk_left_context | 1 |
+ | fifo_len | 40 |
+ | spkcache_len | 188 |
+ | spkcache_update_period | 31 |
+
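The timings these parameters imply follow from the 80 ms encoder frame rate (8x subsampling of 10 ms mel hops, per the original model). A quick sketch of the arithmetic:

```python
# Timing implied by the Gradient Descent configuration.
# Assumes 80 ms per encoder frame (8x subsampling of 10 ms mel hops).
FRAME_MS = 80

chunk_len = 6            # core frames of new audio per chunk
chunk_right_context = 7  # lookahead frames

core_ms = chunk_len * FRAME_MS                  # 480 ms of fresh audio per step
lookahead_ms = chunk_right_context * FRAME_MS   # 560 ms of future audio required
latency_ms = core_ms + lookahead_ms             # ~1040 ms algorithmic latency

print(core_ms, lookahead_ms, latency_ms)
```

This matches the ~1.04 s latency reported in the Performance section below.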
+ ## Model Input Shapes
+
+ | Model | Input | Shape |
+ |-------|-------|-------|
+ | Preprocessor | audio_signal | [1, 18160] |
+ | Preprocessor | length | [1] |
+ | PreEncoder | chunk | [1, 112, 128] |
+ | PreEncoder | chunk_lengths | [1] |
+ | PreEncoder | spkcache | [1, 188, 512] |
+ | PreEncoder | spkcache_lengths | [1] |
+ | PreEncoder | fifo | [1, 40, 512] |
+ | PreEncoder | fifo_lengths | [1] |
+ | Head | pre_encoder_embs | [1, 242, 512] |
+ | Head | pre_encoder_lengths | [1] |
+ | Head | chunk_embs_in | [1, 14, 512] |
+ | Head | chunk_lens_in | [1] |
+
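The PreEncoder inputs above can be mocked with zero-filled arrays to sanity-check shapes before wiring up the real models (dtypes here are assumptions based on the original export; actual inference needs `coremltools` and the `.mlpackage` files):

```python
import numpy as np

# Illustrative zero-filled PreEncoder inputs matching the table above.
inputs = {
    "chunk": np.zeros((1, 112, 128), dtype=np.float32),     # mel features
    "chunk_lengths": np.array([112], dtype=np.int32),
    "spkcache": np.zeros((1, 188, 512), dtype=np.float32),  # speaker cache state
    "spkcache_lengths": np.array([0], dtype=np.int32),
    "fifo": np.zeros((1, 40, 512), dtype=np.float32),       # FIFO state
    "fifo_lengths": np.array([0], dtype=np.int32),
}

# The 18160-sample Preprocessor input is consistent with 112 STFT frames
# at a 160-sample hop and a 400-sample window: (112 - 1) * 160 + 400 = 18160.
assert (112 - 1) * 160 + 400 == 18160
```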
+ ## Model Output Shapes
+
+ | Model | Output | Shape |
+ |-------|--------|-------|
+ | Preprocessor | features | [1, 112, 128] |
+ | Preprocessor | feature_lengths | [1] |
+ | PreEncoder | pre_encoder_embs | [1, 242, 512] |
+ | PreEncoder | pre_encoder_lengths | [1] |
+ | PreEncoder | chunk_embs_in | [1, 14, 512] |
+ | PreEncoder | chunk_lens_in | [1] |
+ | Head | speaker_preds | [1, 242, 4] |
+ | Head | chunk_pre_encoder_embs | [1, 14, 512] |
+ | Head | chunk_pre_encoder_lengths | [1] |
+
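`speaker_preds` holds per-frame probabilities for the four speaker slots; thresholding at 0.5 (the default in the pre-conversion pipeline) yields binary speaker activity. A sketch with dummy data standing in for a real model output:

```python
import numpy as np

# Hypothetical speaker_preds output: [1, 242, 4] frame-level probabilities.
speaker_preds = np.random.default_rng(0).random((1, 242, 4)).astype(np.float32)

# Binary speaker activity per 80 ms frame, thresholded at 0.5.
active = speaker_preds > 0.5  # [1, 242, 4], bool

# Frame indices where speaker slot 0 is active.
spk0_frames = np.flatnonzero(active[0, :, 0])
```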
+ ## Files
+
+ ### Models
+ - `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
+ - `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
+ - `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions
+
+ ### Scripts
+ - `export_gradient_descent.py` - Export script used to create these models
+ - `coreml_wrappers.py` - PyTorch wrapper classes for export
+ - `streaming_inference.py` - Python streaming inference example
+ - `mic_inference.py` - Real-time microphone demo
+
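For the Python scripts, each Preprocessor call consumes an 18160-sample window (1.135 s at 16 kHz), while the stream advances by only the 6-frame core, i.e. 7680 samples (480 ms). A sketch of that windowing (the exact tail padding used by `streaming_inference.py` may differ):

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = 18160       # Preprocessor input length (see Model Input Shapes)
HOP = 6 * 8 * 160    # 6 core frames * 8x subsampling * 160-sample mel hop = 7680

def iter_chunks(audio: np.ndarray):
    """Yield overlapping [1, 18160] analysis windows; zero-pad the tail."""
    for start in range(0, len(audio), HOP):
        window = audio[start:start + WINDOW]
        if len(window) < WINDOW:
            window = np.pad(window, (0, WINDOW - len(window)))
        yield window.reshape(1, WINDOW).astype(np.float32)

audio = np.zeros(SAMPLE_RATE * 2, dtype=np.float32)  # 2 s of silence
chunks = list(iter_chunks(audio))
```

Each yielded window feeds one Preprocessor → PreEncoder → Head pass, with the spkcache/fifo state carried between passes.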
+ ## Usage with FluidAudio (Swift)
 
 ```swift
+ let config = SortformerConfig.gradientDescent
+ let diarizer = try await SortformerDiarizer(config: config)
+
+ // Process audio chunks
+ while let samples = getAudioChunk() {
+     if let result = try diarizer.processChunk(samples) {
+         // result.probabilities - confirmed speaker probabilities
+         // result.tentativeProbabilities - preview (may change)
+     }
 }
 ```
 
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | Latency | ~1.04s (7 * 80ms right context + 6 * 80ms chunk) |
+ | DER (AMI) | ~30.8% |
+ | RTFx | ~8.2x on Apple Silicon |
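Assuming RTFx means audio duration divided by processing time (a common definition, not stated here), the table implies comfortable real-time headroom per chunk:

```python
# Rough compute cost per chunk implied by the reported throughput.
# Assumes RTFx = audio duration / processing time.
rtfx = 8.2
chunk_ms = 480                    # 6 core frames * 80 ms of new audio per step
compute_ms = chunk_ms / rtfx      # ~58.5 ms of compute per 480 ms chunk
headroom_ms = chunk_ms - compute_ms
```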
 
+ ## Source
 
+ Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)