# Sortformer CoreML Models - Gradient Descent Configuration

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.

## Configuration

**Gradient Descent** - Higher quality, more context:

| Parameter | Value |
|-----------|-------|
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |
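
To get a feel for how much audio each setting covers, the values can be converted to seconds. This is a minimal sketch that assumes every length in the table is counted in 80 ms frames (the frame duration quoted in the Performance section); `FRAME_SEC`, `CONFIG`, and `config_durations` are illustrative names, not part of the export scripts.

```python
# Assumption: every length is counted in 80 ms frames.
FRAME_SEC = 0.08

# Values copied from the configuration table above.
CONFIG = {
    "chunk_len": 6,
    "chunk_right_context": 7,
    "chunk_left_context": 1,
    "fifo_len": 40,
    "spkcache_len": 188,
    "spkcache_update_period": 31,
}


def config_durations(config, frame_sec=FRAME_SEC):
    """Convert each frame count to seconds of audio, rounded to 10 ms."""
    return {name: round(frames * frame_sec, 2) for name, frames in config.items()}


if __name__ == "__main__":
    for name, seconds in config_durations(CONFIG).items():
        print(f"{name}: {seconds:.2f} s")
```

Under that assumption the speaker cache spans roughly 15 s of audio and the FIFO about 3.2 s.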

## Model Input Shapes

| Model | Input | Shape |
|-------|-------|-------|
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |
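
The fixed shapes above appear to follow from the configuration values. The sketch below reproduces them under three assumptions that this README does not state: an 8x subsampling factor between mel frames and embedding frames (inferred from 112 / 14), a 160-sample hop, and a 400-sample analysis window. `StreamConfig` and `derived_shapes` are hypothetical helpers, not part of `export_gradient_descent.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    # Values from the configuration table.
    chunk_len: int = 6
    chunk_right_context: int = 7
    chunk_left_context: int = 1
    fifo_len: int = 40
    spkcache_len: int = 188
    # Feature sizes from the shape tables.
    mel_dim: int = 128
    emb_dim: int = 512
    # Assumptions, chosen because they reproduce the table values.
    subsampling: int = 8  # mel frames per embedding frame
    hop: int = 160        # samples between consecutive mel frames
    window: int = 400     # samples per analysis window


def derived_shapes(cfg):
    """Derive the fixed tensor shapes from the streaming configuration."""
    emb_frames = cfg.chunk_left_context + cfg.chunk_len + cfg.chunk_right_context
    mel_frames = emb_frames * cfg.subsampling
    audio_samples = (mel_frames - 1) * cfg.hop + cfg.window
    total_frames = cfg.spkcache_len + cfg.fifo_len + emb_frames
    return {
        "audio_signal": [1, audio_samples],
        "chunk": [1, mel_frames, cfg.mel_dim],
        "chunk_embs_in": [1, emb_frames, cfg.emb_dim],
        "pre_encoder_embs": [1, total_frames, cfg.emb_dim],
    }


if __name__ == "__main__":
    for name, shape in derived_shapes(StreamConfig()).items():
        print(name, shape)
```

If a different configuration is exported, recomputing the shapes this way is a quick sanity check before feeding buffers to the models.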

## Model Output Shapes

| Model | Output | Shape |
|-------|--------|-------|
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |

## Files

### Models
- `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
- `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
- `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions

### Scripts
- `export_gradient_descent.py` - Export script used to create these models
- `coreml_wrappers.py` - PyTorch wrapper classes for export
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo

## Usage with FluidAudio (Swift)

```swift
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

// Process audio chunks
while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // result.probabilities - confirmed speaker probabilities
        // result.tentativeProbabilities - preview (may change)
    }
}
```

## Performance

| Metric | Value |
|--------|-------|
| Latency | ~1.04 s ((7 right-context + 6 chunk frames) × 80 ms) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |
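
The latency and RTFx figures can be checked with a little arithmetic. This sketch assumes that RTFx means audio duration divided by processing time, and that latency is the time needed to buffer one chunk plus its right context. The function names are illustrative only.

```python
FRAME_SEC = 0.08  # frame duration quoted in the latency row


def algorithmic_latency(chunk_len=6, right_context=7, frame_sec=FRAME_SEC):
    """Seconds of audio that must arrive before a chunk can be processed."""
    return (chunk_len + right_context) * frame_sec


def processing_time(audio_sec, rtfx=8.2):
    """Approximate compute time for a clip, assuming RTFx = audio / compute."""
    return audio_sec / rtfx


if __name__ == "__main__":
    print(f"latency: {algorithmic_latency():.2f} s")
    print(f"60 s clip: {processing_time(60.0):.1f} s of compute")
```

At ~8.2x, a 60 s recording would take about 7.3 s to process, on top of the buffering latency.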

## Source

Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)