alexwengg committed on
Commit ca84a2d · verified · 1 Parent(s): 5774ae7

Update README.md

Files changed (1): README.md (+101 −89)
README.md CHANGED
@@ -1,103 +1,115 @@
- # Sortformer CoreML Models - Gradient Descent Configuration
-
- Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.
-
- ## Configuration
-
- **Gradient Descent** - Higher quality, more context:
-
- | Parameter | Value |
- |-----------|-------|
- | chunk_len | 6 |
- | chunk_right_context | 7 |
- | chunk_left_context | 1 |
- | fifo_len | 40 |
- | spkcache_len | 188 |
- | spkcache_update_period | 31 |
-
- ## Model Input Shapes
-
- | Model | Input | Shape |
- |-------|-------|-------|
- | Preprocessor | audio_signal | [1, 18160] |
- | Preprocessor | length | [1] |
- | PreEncoder | chunk | [1, 112, 128] |
- | PreEncoder | chunk_lengths | [1] |
- | PreEncoder | spkcache | [1, 188, 512] |
- | PreEncoder | spkcache_lengths | [1] |
- | PreEncoder | fifo | [1, 40, 512] |
- | PreEncoder | fifo_lengths | [1] |
- | Head | pre_encoder_embs | [1, 242, 512] |
- | Head | pre_encoder_lengths | [1] |
- | Head | chunk_embs_in | [1, 14, 512] |
- | Head | chunk_lens_in | [1] |
-
- ## Model Output Shapes
-
- | Model | Output | Shape |
- |-------|--------|-------|
- | Preprocessor | features | [1, 112, 128] |
- | Preprocessor | feature_lengths | [1] |
- | PreEncoder | pre_encoder_embs | [1, 242, 512] |
- | PreEncoder | pre_encoder_lengths | [1] |
- | PreEncoder | chunk_embs_in | [1, 14, 512] |
- | PreEncoder | chunk_lens_in | [1] |
- | Head | speaker_preds | [1, 242, 4] |
- | Head | chunk_pre_encoder_embs | [1, 14, 512] |
- | Head | chunk_pre_encoder_lengths | [1] |
-
- ## Files
-
- ### Models
- - `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
- - `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
- - `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions
-
- ### Scripts
- - `export_gradient_descent.py` - Export script used to create these models
- - `coreml_wrappers.py` - PyTorch wrapper classes for export
- - `streaming_inference.py` - Python streaming inference example
- - `mic_inference.py` - Real-time microphone demo
-
- ## Usage with FluidAudio (Swift)
-
- ```swift
- let config = SortformerConfig.gradientDescent
- let diarizer = try await SortformerDiarizer(config: config)
-
- // Process audio chunks
- while let samples = getAudioChunk() {
-     if let result = try diarizer.processChunk(samples) {
-         // result.probabilities - confirmed speaker probabilities
-         // result.tentativeProbabilities - preview (may change)
-     }
- }
  ```
-
- ## Performance
-
- | Metric | Value |
- |--------|-------|
- | Latency | ~1.04s (7 * 80ms right context + chunk) |
- | DER (AMI) | ~30.8% |
- | RTFx | ~8.2x on Apple Silicon |
-
- ## Source
-
- Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
-
- ## Credits & Acknowledgements
-
- This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).
-
- Their work was instrumental in:
-
- Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
- Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~8.2x RTF on Apple Silicon.
- Logic Implementation: Porting the critical streaming state logic (AOSC and FIFO management) to ensure zero-shot identity consistency in the CoreML wrapper.
-
- This project was built upon the foundational work of the NVIDIA NeMo team.
+
+ # Sortformer CoreML Models
+
+ Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.
+
+ ## Model Variants
+
+ | Variant | File | Latency | Use Case |
+ |---------|------|---------|----------|
+ | **Default** | `Sortformer.mlmodelc` | ~1.12s | Low latency streaming |
+ | **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
+ | **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |
+
+ ## Configuration Parameters
+
+ | Parameter | Default | NVIDIA Low | NVIDIA High |
+ |-----------|---------|------------|-------------|
+ | chunk_len | 6 | 6 | 340 |
+ | chunk_right_context | 7 | 7 | 40 |
+ | chunk_left_context | 1 | 1 | 1 |
+ | fifo_len | 40 | 188 | 40 |
+ | spkcache_len | 188 | 188 | 188 |
+
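The latency column above can be roughly reproduced from these parameters. Assuming an 80 ms hop per encoder frame (the hop implied by the earlier "7 * 80ms right context" latency note, not stated in this table), the model must buffer the chunk plus its right context before emitting predictions. This sketch is illustrative; the Default variant's ~1.12s suggests additional lead-in is counted there:

```python
# Rough latency estimate from streaming config parameters.
FRAME_HOP_S = 0.08  # assumption: 80 ms per encoder frame

def estimated_latency_s(chunk_len: int, chunk_right_context: int) -> float:
    """Frames that must arrive before a chunk can be scored, times the hop."""
    return (chunk_len + chunk_right_context) * FRAME_HOP_S

print(estimated_latency_s(6, 7))     # NVIDIA Low: ~1.04 s
print(estimated_latency_s(340, 40))  # NVIDIA High: ~30.4 s
```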
+ ## Model Input/Output Shapes
+
+ Combined model (`Sortformer.mlmodelc` - default config):
+
+ | Input | Shape | Description |
+ |-------|-------|-------------|
+ | chunk | [1, 112, 128] | Mel spectrogram features |
+ | chunk_lengths | [1] | Actual chunk length |
+ | spkcache | [1, 188, 512] | Speaker cache embeddings |
+ | spkcache_lengths | [1] | Actual cache length |
+ | fifo | [1, 40, 512] | FIFO queue embeddings |
+ | fifo_lengths | [1] | Actual FIFO length |
+
+ | Output | Shape | Description |
+ |--------|-------|-------------|
+ | speaker_preds | [T, 4] | Speaker probabilities (4 speakers) |
+ | chunk_pre_encoder_embs | [T', 512] | Embeddings for state update |
+ | chunk_pre_encoder_lengths | [1] | Actual embedding count |
+
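The `spkcache` and `fifo` inputs carry streaming state between calls: each chunk's embeddings enter a bounded FIFO, and frames that overflow it are compacted into the speaker cache. As a toy illustration only (the real update logic lives in the FluidAudio / NeMo wrappers and scores frames rather than simply truncating; the class and method names here are hypothetical):

```python
from collections import deque

FIFO_LEN = 40       # fifo capacity in frames (default config)
SPKCACHE_LEN = 188  # speaker-cache capacity in frames

class StreamingState:
    """Sketch of Sortformer-style state bookkeeping (hypothetical API)."""

    def __init__(self):
        self.fifo = deque()
        self.spkcache = []

    def push_chunk(self, chunk_embs):
        for emb in chunk_embs:
            self.fifo.append(emb)
            if len(self.fifo) > FIFO_LEN:
                # Oldest FIFO frame graduates into the speaker cache.
                self.spkcache.append(self.fifo.popleft())
                if len(self.spkcache) > SPKCACHE_LEN:
                    self.spkcache.pop(0)  # stand-in for smarter compaction

state = StreamingState()
for step in range(20):
    # ~14 embeddings per chunk, matching chunk_pre_encoder_embs above
    state.push_chunk([f"emb{step}:{i}" for i in range(14)])
print(len(state.fifo), len(state.spkcache))  # capacities respected: 40 188
```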
+ ## Usage with FluidAudio (Swift)
+
+ ```swift
+ import FluidAudio
+
+ // Initialize with default config (auto-downloads from HuggingFace)
+ let diarizer = SortformerDiarizer(config: .default)
+ let models = try await SortformerModels.loadFromHuggingFace(config: .default)
+ diarizer.initialize(models: models)
+
+ // Streaming processing
+ for audioChunk in audioStream {
+     if let result = try diarizer.processSamples(audioChunk) {
+         for frame in 0..<result.frameCount {
+             for speaker in 0..<4 {
+                 let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
+             }
+         }
+     }
+ }
+
+ // Or batch processing
+ let timeline = try diarizer.processComplete(audioSamples)
+ for (speakerIndex, segments) in timeline.segments.enumerated() {
+     for segment in segments {
+         print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
+     }
+ }
+ ```
+
+ ## Performance
+
+ | Metric | Default | NVIDIA High |
+ |---------------|---------|-------------|
+ | Latency | ~1.12s | ~30.4s |
+ | RTFx (M4 Pro) | ~120x | ~118x |
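RTFx here is the usual real-time factor: audio duration divided by wall-clock processing time. A quick sketch of what the table's ~120x implies for batch runs (the helper function is illustrative, not part of any shipped script):

```python
def processing_time_s(audio_duration_s: float, rtfx: float) -> float:
    """Wall-clock time implied by a real-time factor (duration / time)."""
    return audio_duration_s / rtfx

# One hour of audio at the Default variant's ~120x RTFx:
print(processing_time_s(3600, 120))  # 30.0 seconds
```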
+
+ ## Files
+
+ ### Models
+
+ - `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
+ - `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low latency config
+ - `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high latency config
+
+ ### Scripts
+
+ - `convert_to_coreml.py` - PyTorch to CoreML conversion
+ - `streaming_inference.py` - Python streaming inference example
+ - `mic_inference.py` - Real-time microphone demo
+
+ ## Source
+
+ Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
+
+ ## Credits & Acknowledgements
+
+ This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).
+
+ Their work was instrumental in:
+
+ - Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
+ - Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
+ - Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
+
+ This project was built upon the foundational work of the NVIDIA NeMo team.