alexwengg committed
Commit 435fb20 · verified · 1 Parent(s): 2d00bb3

Upload 25 files

Files changed (1)
  1. README.md +75 -125
README.md CHANGED
@@ -1,137 +1,87 @@
- # Streaming Sortformer CoreML
-
- CoreML conversion of NVIDIA's Streaming Sortformer 4-Speaker Diarization model for Apple Silicon.
-
- ## Original Model
-
- - **Source**: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
- - **Paper**: [Sortformer: Seamless Integration of Speaker Diarization and ASR](https://arxiv.org/abs/2409.06656)
- - **Benchmark**: 20.57% DER on AMI SDM (NVIDIA reported)
-
- ## Models
-
- | Model | Description | Input | Output |
- |-------|-------------|-------|--------|
- | `Pipeline_Preprocessor.mlpackage` | Mel spectrogram extraction | Audio waveform | 128-dim mel features |
- | `Pipeline_PreEncoder.mlpackage` | FastConformer encoder + Transformer | Mel features + state | Encoded embeddings |
- | `Pipeline_Head_Fixed.mlpackage` | Speaker prediction head | Embeddings | 4-speaker probabilities |
 
  ## Configuration
 
- ```python
- CONFIG = {
-     "chunk_len": 6,             # Core chunk length (encoder frames)
-     "chunk_left_context": 1,    # Left context frames
-     "chunk_right_context": 7,   # Right context frames
-     "fifo_len": 188,            # FIFO buffer length
-     "spkcache_len": 188,        # Speaker cache length
-     "subsampling_factor": 8,    # 8x subsampling (80ms per encoder frame)
-     "sample_rate": 16000,
-     "mel_features": 128,
-     "n_speakers": 4,
- }
- ```
-
- ## Usage
-
- ### Python (coremltools)
-
- ```python
- import coremltools as ct
- import numpy as np
-
- # Load models
- pre_encoder = ct.models.MLModel("Pipeline_PreEncoder.mlpackage",
-                                 compute_units=ct.ComputeUnit.CPU_ONLY)
- head = ct.models.MLModel("Pipeline_Head_Fixed.mlpackage",
-                          compute_units=ct.ComputeUnit.CPU_ONLY)
-
- # Initialize state
- spkcache = np.zeros((1, 188, 512), dtype=np.float32)
- fifo = np.zeros((1, 188, 512), dtype=np.float32)
-
- # Process chunk (mel_features: [1, 112, 128])
- pre_out = pre_encoder.predict({
-     "chunk": mel_features,
-     "chunk_lengths": np.array([actual_length], dtype=np.int32),
-     "spkcache": spkcache,
-     "spkcache_lengths": np.array([0], dtype=np.int32),
-     "fifo": fifo,
-     "fifo_lengths": np.array([0], dtype=np.int32)
- })
-
- head_out = head.predict({
-     "pre_encoder_embs": pre_out["pre_encoder_embs"],
-     "pre_encoder_lengths": pre_out["pre_encoder_lengths"],
-     "chunk_embs_in": pre_out["chunk_embs_in"],
-     "chunk_lens_in": pre_out["chunk_lens_in"]
- })
-
- predictions = head_out["speaker_preds"]  # [1, T, 4]
- ```
-
- ### Swift (Core ML)
 
  ```swift
- import CoreML
-
- let preEncoder = try MLModel(contentsOf: preEncoderURL)
- let head = try MLModel(contentsOf: headURL)
-
- // Create input with MLMultiArray for chunk, spkcache, fifo
- let preEncoderInput = try preEncoder.prediction(from: inputProvider)
- let headInput = try head.prediction(from: preEncoderInput)
-
- let predictions = headInput.featureValue(for: "speaker_preds")
- ```
-
- ## Mel Spectrogram Settings
-
- For compatibility with the original NeMo model:
-
- ```python
- mel_config = {
-     "sample_rate": 16000,
-     "n_fft": 512,
-     "win_length": 400,   # 25ms
-     "hop_length": 160,   # 10ms
-     "n_mels": 128,
-     "preemph": 0.97,
-     "log_zero_guard_value": 2**-24,
-     "normalize": "per_feature",
 }
 ```
 
- ## Streaming Pipeline
-
- 1. **Chunk audio** into ~480ms windows (48 mel frames core + context)
- 2. **Compute mel spectrogram** for each chunk
- 3. **Run PreEncoder** with current state (spkcache + fifo)
- 4. **Run Head** to get 4-speaker probabilities
- 5. **Update state** (spkcache/fifo buffers)
- 6. **Threshold predictions** (default: 0.5) for binary speaker activity
-
- ## Accuracy
-
- Verified within 0.12% of original NeMo PyTorch model on chunk-level predictions.
-
- ## Requirements
-
- - macOS 12+ or iOS 15+
- - Apple Silicon (M1/M2/M3) recommended
- - Python: `coremltools`, `numpy`, `torch`, `torchaudio`
-
- ## License
-
- Apache 2.0 (following NVIDIA NeMo licensing)
-
- ## Citation
-
- ```bibtex
- @article{park2024sortformer,
-   title={Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens},
-   author={Park, Taejin and Huang, He and Koluguri, Nithin and Georgiou, Panagiotis and Watanabe, Shinji and Ginsburg, Boris},
-   journal={arXiv preprint arXiv:2409.06656},
-   year={2024}
- }
- ```
+ # Sortformer CoreML Models - Gradient Descent Configuration
+
+ Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.
 
  ## Configuration
 
+ **Gradient Descent** - Higher quality, more context:
+
+ | Parameter | Value |
+ |-----------|-------|
+ | chunk_len | 6 |
+ | chunk_right_context | 7 |
+ | chunk_left_context | 1 |
+ | fifo_len | 40 |
+ | spkcache_len | 188 |
+ | spkcache_update_period | 31 |
+
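The timings these parameters imply follow from the 80 ms encoder frame rate (8x subsampling of 10 ms mel hops, per the original model). A quick sketch of the arithmetic:

```python
# Timing implied by the Gradient Descent configuration.
# Assumes 80 ms per encoder frame (8x subsampling of 10 ms mel hops).
FRAME_MS = 80

chunk_len = 6            # core frames of new audio per chunk
chunk_right_context = 7  # lookahead frames

core_ms = chunk_len * FRAME_MS                  # 480 ms of fresh audio per step
lookahead_ms = chunk_right_context * FRAME_MS   # 560 ms of future audio required
latency_ms = core_ms + lookahead_ms             # ~1040 ms algorithmic latency

print(core_ms, lookahead_ms, latency_ms)
```

This matches the ~1.04 s latency reported in the Performance section below.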
+ ## Model Input Shapes
+
+ | Model | Input | Shape |
+ |-------|-------|-------|
+ | Preprocessor | audio_signal | [1, 18160] |
+ | Preprocessor | length | [1] |
+ | PreEncoder | chunk | [1, 112, 128] |
+ | PreEncoder | chunk_lengths | [1] |
+ | PreEncoder | spkcache | [1, 188, 512] |
+ | PreEncoder | spkcache_lengths | [1] |
+ | PreEncoder | fifo | [1, 40, 512] |
+ | PreEncoder | fifo_lengths | [1] |
+ | Head | pre_encoder_embs | [1, 242, 512] |
+ | Head | pre_encoder_lengths | [1] |
+ | Head | chunk_embs_in | [1, 14, 512] |
+ | Head | chunk_lens_in | [1] |
+
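The PreEncoder inputs above can be mocked with zero-filled arrays to sanity-check shapes before wiring up the real models (dtypes here are assumptions based on the original export; actual inference needs `coremltools` and the `.mlpackage` files):

```python
import numpy as np

# Illustrative zero-filled PreEncoder inputs matching the table above.
inputs = {
    "chunk": np.zeros((1, 112, 128), dtype=np.float32),     # mel features
    "chunk_lengths": np.array([112], dtype=np.int32),
    "spkcache": np.zeros((1, 188, 512), dtype=np.float32),  # speaker cache state
    "spkcache_lengths": np.array([0], dtype=np.int32),
    "fifo": np.zeros((1, 40, 512), dtype=np.float32),       # FIFO state
    "fifo_lengths": np.array([0], dtype=np.int32),
}

# The 18160-sample Preprocessor input is consistent with 112 STFT frames
# at a 160-sample hop and a 400-sample window: (112 - 1) * 160 + 400 = 18160.
assert (112 - 1) * 160 + 400 == 18160
```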
+ ## Model Output Shapes
+
+ | Model | Output | Shape |
+ |-------|--------|-------|
+ | Preprocessor | features | [1, 112, 128] |
+ | Preprocessor | feature_lengths | [1] |
+ | PreEncoder | pre_encoder_embs | [1, 242, 512] |
+ | PreEncoder | pre_encoder_lengths | [1] |
+ | PreEncoder | chunk_embs_in | [1, 14, 512] |
+ | PreEncoder | chunk_lens_in | [1] |
+ | Head | speaker_preds | [1, 242, 4] |
+ | Head | chunk_pre_encoder_embs | [1, 14, 512] |
+ | Head | chunk_pre_encoder_lengths | [1] |
+
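`speaker_preds` holds per-frame probabilities for the four speaker slots; thresholding at 0.5 (the default in the pre-conversion pipeline) yields binary speaker activity. A sketch with dummy data standing in for a real model output:

```python
import numpy as np

# Hypothetical speaker_preds output: [1, 242, 4] frame-level probabilities.
speaker_preds = np.random.default_rng(0).random((1, 242, 4)).astype(np.float32)

# Binary speaker activity per 80 ms frame, thresholded at 0.5.
active = speaker_preds > 0.5  # [1, 242, 4], bool

# Frame indices where speaker slot 0 is active.
spk0_frames = np.flatnonzero(active[0, :, 0])
```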
+ ## Files
+
+ ### Models
+ - `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
+ - `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
+ - `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions
+
+ ### Scripts
+ - `export_gradient_descent.py` - Export script used to create these models
+ - `coreml_wrappers.py` - PyTorch wrapper classes for export
+ - `streaming_inference.py` - Python streaming inference example
+ - `mic_inference.py` - Real-time microphone demo
+
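For the Python scripts, each Preprocessor call consumes an 18160-sample window (1.135 s at 16 kHz), while the stream advances by only the 6-frame core, i.e. 7680 samples (480 ms). A sketch of that windowing (the exact tail padding used by `streaming_inference.py` may differ):

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = 18160       # Preprocessor input length (see Model Input Shapes)
HOP = 6 * 8 * 160    # 6 core frames * 8x subsampling * 160-sample mel hop = 7680

def iter_chunks(audio: np.ndarray):
    """Yield overlapping [1, 18160] analysis windows; zero-pad the tail."""
    for start in range(0, len(audio), HOP):
        window = audio[start:start + WINDOW]
        if len(window) < WINDOW:
            window = np.pad(window, (0, WINDOW - len(window)))
        yield window.reshape(1, WINDOW).astype(np.float32)

audio = np.zeros(SAMPLE_RATE * 2, dtype=np.float32)  # 2 s of silence
chunks = list(iter_chunks(audio))
```

Each yielded window feeds one Preprocessor → PreEncoder → Head pass, with the spkcache/fifo state carried between passes.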
+ ## Usage with FluidAudio (Swift)
 
 ```swift
+ let config = SortformerConfig.gradientDescent
+ let diarizer = try await SortformerDiarizer(config: config)
+
+ // Process audio chunks
+ while let samples = getAudioChunk() {
+     if let result = try diarizer.processChunk(samples) {
+         // result.probabilities - confirmed speaker probabilities
+         // result.tentativeProbabilities - preview (may change)
+     }
 }
 ```
 
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | Latency | ~1.04s (7 * 80ms right context + 6 * 80ms chunk) |
+ | DER (AMI) | ~30.8% |
+ | RTFx | ~8.2x on Apple Silicon |
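Assuming RTFx means audio duration divided by processing time (a common definition, not stated here), the table implies comfortable real-time headroom per chunk:

```python
# Rough compute cost per chunk implied by the reported throughput.
# Assumes RTFx = audio duration / processing time.
rtfx = 8.2
chunk_ms = 480                    # 6 core frames * 80 ms of new audio per step
compute_ms = chunk_ms / rtfx      # ~58.5 ms of compute per 480 ms chunk
headroom_ms = chunk_ms - compute_ms
```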
 
+ ## Source
 
+ Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)