# Sortformer CoreML Models - Gradient Descent Configuration
Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.
## Configuration
**Gradient Descent** - Higher quality, more context:
| Parameter | Value |
|-----------|-------|
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |
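
The tensor shapes listed in this README appear to follow from these parameters. The sketch below shows that arithmetic. `SortformerStreamingConfig` and `derived_shapes` are hypothetical names used only for illustration, and the front-end constants (8x subsampling, 160-sample hop, 400-sample window at 16 kHz) are assumptions chosen because they reproduce the documented shapes; they are not stated in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortformerStreamingConfig:
    """Hypothetical container for the parameters in the table above."""
    chunk_len: int = 6
    chunk_right_context: int = 7
    chunk_left_context: int = 1
    fifo_len: int = 40
    spkcache_len: int = 188
    spkcache_update_period: int = 31


# Assumed front-end constants (not stated in this README).
SUBSAMPLING = 8  # mel frames per encoder frame (112 / 14)
HOP = 160        # samples between mel frames (10 ms at 16 kHz)
WIN = 400        # samples per analysis window (25 ms at 16 kHz)


def derived_shapes(cfg: SortformerStreamingConfig) -> dict:
    """Derive the sequence lengths that show up in the model I/O shapes."""
    # Encoder frames per step: left context + chunk + right context.
    chunk_frames = cfg.chunk_left_context + cfg.chunk_len + cfg.chunk_right_context
    # Mel frames fed to the PreEncoder for one step.
    mel_frames = chunk_frames * SUBSAMPLING
    # Audio samples needed to produce that many mel frames.
    audio_samples = (mel_frames - 1) * HOP + WIN
    # Speaker cache + FIFO + current chunk, concatenated along time.
    total_frames = cfg.spkcache_len + cfg.fifo_len + chunk_frames
    return {
        "chunk_frames": chunk_frames,
        "mel_frames": mel_frames,
        "audio_samples": audio_samples,
        "total_frames": total_frames,
    }


if __name__ == "__main__":
    print(derived_shapes(SortformerStreamingConfig()))
```

Under these assumptions the results line up with `chunk_embs_in` (14 frames), `chunk` (112 mel frames), `audio_signal` (18160 samples), and `pre_encoder_embs` (242 frames).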
## Model Input Shapes
| Model | Input | Shape |
|-------|-------|-------|
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |
## Model Output Shapes
| Model | Output | Shape |
|-------|--------|-------|
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |
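
The two tables above can be cross-checked mechanically: each stage's outputs should match the shapes of the next stage's inputs. The following is a minimal sketch of that check using only the shapes listed in this README. The wiring from `features` to `chunk` (and `feature_lengths` to `chunk_lengths`) is inferred from the matching shapes, not confirmed by the export scripts, and `mismatches` is a hypothetical helper.

```python
# Shapes copied from the input and output tables above.
INPUTS = {
    "PreEncoder": {
        "chunk": (1, 112, 128),
        "chunk_lengths": (1,),
        "spkcache": (1, 188, 512),
        "spkcache_lengths": (1,),
        "fifo": (1, 40, 512),
        "fifo_lengths": (1,),
    },
    "Head": {
        "pre_encoder_embs": (1, 242, 512),
        "pre_encoder_lengths": (1,),
        "chunk_embs_in": (1, 14, 512),
        "chunk_lens_in": (1,),
    },
}

OUTPUTS = {
    "Preprocessor": {
        "features": (1, 112, 128),
        "feature_lengths": (1,),
    },
    "PreEncoder": {
        "pre_encoder_embs": (1, 242, 512),
        "pre_encoder_lengths": (1,),
        "chunk_embs_in": (1, 14, 512),
        "chunk_lens_in": (1,),
    },
}


def mismatches(producer: str, consumer: str, wiring: dict) -> list:
    """Return (output, input) pairs whose shapes disagree between two stages."""
    return [
        (out_name, in_name)
        for out_name, in_name in wiring.items()
        if OUTPUTS[producer][out_name] != INPUTS[consumer][in_name]
    ]


# Assumed wiring: Preprocessor outputs feed the PreEncoder's chunk inputs.
PRE_TO_ENC = {"features": "chunk", "feature_lengths": "chunk_lengths"}
# PreEncoder outputs and Head inputs share names in the tables.
ENC_TO_HEAD = {name: name for name in OUTPUTS["PreEncoder"]}

if __name__ == "__main__":
    print(mismatches("Preprocessor", "PreEncoder", PRE_TO_ENC))
    print(mismatches("PreEncoder", "Head", ENC_TO_HEAD))
```

An empty list from both calls means the documented shapes are consistent across the three stages.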
## Files
### Models
- `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
- `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
- `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions
### Scripts
- `export_gradient_descent.py` - Export script used to create these models
- `coreml_wrappers.py` - PyTorch wrapper classes for export
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo
## Usage with FluidAudio (Swift)
```swift
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

// Process audio chunks
while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // result.probabilities - confirmed speaker probabilities
        // result.tentativeProbabilities - preview (may change)
    }
}
```
## Performance
| Metric | Value |
|--------|-------|
| Latency | ~1.04 s ((6-frame chunk + 7-frame right context) * 80 ms per frame) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |
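
The latency figure can be reproduced from the configuration: the model waits for a full chunk plus its right context before emitting confirmed predictions, at 80 ms per frame. The sketch below shows that arithmetic, plus what the RTFx figure implies for processing time, assuming RTFx means audio duration divided by processing time. Both function names are hypothetical.

```python
FRAME_SEC = 0.08  # 80 ms per encoder frame, as in the latency row above


def algorithmic_latency(chunk_len: int, right_context: int,
                        frame_sec: float = FRAME_SEC) -> float:
    """Seconds of audio needed before a chunk's predictions are confirmed."""
    return (chunk_len + right_context) * frame_sec


def processing_seconds(audio_sec: float, rtfx: float) -> float:
    """Compute time for audio_sec of audio, assuming RTFx = audio time / compute time."""
    return audio_sec / rtfx


if __name__ == "__main__":
    print(f"latency: {algorithmic_latency(6, 7):.2f} s")
    print(f"60 s of audio at 8.2x: {processing_seconds(60.0, 8.2):.1f} s")
```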
## Source
Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)