|
|
--- |
|
|
license: cc-by-4.0
library_name: coreml
base_model: nvidia/diar_streaming_sortformer_4spk-v2.1
tags:
- speaker-diarization
- streaming
- coreml
- apple
- ios
- macos
- FastConformer
- Sortformer
|
|
--- |
|
|
# Sortformer CoreML Models |
|
|
|
|
|
Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon. |
|
|
|
|
|
## Model Variants |
|
|
|
|
|
| Variant | File | Latency | Use Case |
|---------|------|---------|----------|
| **Default** | `Sortformer.mlmodelc` | ~1.04s | Low-latency streaming |
| **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low-latency streaming |
| **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |
|
|
|
|
|
## Configuration Parameters |
|
|
|
|
|
| Parameter | Default | NVIDIA Low | NVIDIA High |
|-----------|---------|------------|-------------|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |
|
|
|
|
|
## Model Input/Output Shapes |
|
|
|
|
|
**General**: |
|
|
|
|
|
| Input | Shape | Description |
|-------|-------|-------------|
| chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
| chunk_lengths | `[1]` | Actual chunk length |
| spkcache | `[1, S, 512]` | Speaker cache embeddings |
| spkcache_lengths | `[1]` | Actual cache length |
| fifo | `[1, F, 512]` | FIFO queue embeddings |
| fifo_lengths | `[1]` | Actual FIFO length |
|
|
|
|
|
| Output | Shape | Description |
|--------|-------|-------------|
| speaker_preds | `[1, C+L+R+S+F, 4]` | Per-frame speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | `[1, C+L+R, 512]` | Embeddings for state update |
| chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
| nest_encoder_embs | `[1, C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
| nest_encoder_lengths | `[1]` | Actual speaker embedding count |
|
|
|
|
|
Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`. |
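These shapes follow directly from the configuration parameters: the pre-encoder subsamples mel frames by 8x, so the raw chunk input carries `8*(C+L+R)` frames while the embedding outputs carry `C+L+R`. A small sanity-check sketch (the `shapes` helper is illustrative, not part of the repo):

```python
# Shape arithmetic for the three Sortformer configurations.
CONFIGS = {
    #              C,   L,  R,   F,   S
    "default":     (6,   1,  7,  40, 188),
    "nvidia_low":  (6,   1,  7, 188, 188),
    "nvidia_high": (340, 1, 40,  40, 188),
}

def shapes(chunk_len, left, right, fifo_len, spkcache_len):
    window = chunk_len + left + right          # C + L + R
    total = window + spkcache_len + fifo_len   # C + L + R + S + F
    return {
        "chunk": (1, 8 * window, 128),         # raw mel frames, pre-subsampling
        "spkcache": (1, spkcache_len, 512),
        "fifo": (1, fifo_len, 512),
        "chunk_pre_encoder_embs": (1, window, 512),
        "nest_encoder_embs": (1, total, 192),
        "speaker_preds": (1, total, 4),        # 4 supported speakers
    }

for name, cfg in CONFIGS.items():
    print(name, shapes(*cfg))
```

For example, the default config gives a `[1, 112, 128]` chunk (8 × 14 frames) and 242 = 14 + 188 + 40 prediction frames, matching the tables below.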
|
|
|
|
|
**Configuration-Specific Shapes**: |
|
|
|
|
|
| Input | Default | NVIDIA Low | NVIDIA High |
|-------|---------|------------|-------------|
| chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
| chunk_lengths | `[1]` | `[1]` | `[1]` |
| spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
| spkcache_lengths | `[1]` | `[1]` | `[1]` |
| fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
| fifo_lengths | `[1]` | `[1]` | `[1]` |
|
|
|
|
|
| Output | Default | NVIDIA Low | NVIDIA High |
|--------|---------|------------|-------------|
| speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
| chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
| chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
| nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
| nest_encoder_lengths | `[1]` | `[1]` | `[1]` |
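The tables above map directly onto a CoreML prediction call. A minimal Python sketch using `coremltools` with zero-initialized state for the default configuration (the file name and feature names are taken from this card; lengths start at 0 because the cache and FIFO are empty at stream start -- the integer dtype is an assumption, adjust to your compiled model's spec):

```python
import os
import numpy as np

# Default configuration: 112 mel frames in, 188-slot speaker cache, 40-slot FIFO.
inputs = {
    "chunk": np.zeros((1, 112, 128), dtype=np.float32),
    "chunk_lengths": np.array([112], dtype=np.int32),
    "spkcache": np.zeros((1, 188, 512), dtype=np.float32),
    "spkcache_lengths": np.array([0], dtype=np.int32),  # empty cache at start
    "fifo": np.zeros((1, 40, 512), dtype=np.float32),
    "fifo_lengths": np.array([0], dtype=np.int32),      # empty FIFO at start
}

if os.path.exists("Sortformer.mlpackage"):
    import coremltools as ct

    model = ct.models.MLModel("Sortformer.mlpackage")
    out = model.predict(inputs)
    # out["speaker_preds"]: one activity probability per frame per speaker slot
```

Between chunks, the returned `chunk_pre_encoder_embs` are fed back through the FIFO and speaker cache to carry speaker identity across the stream.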
|
|
|
|
|
|
|
|
**Performance**:

| Metric | Default | NVIDIA High |
|---------------|---------|-------------|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
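RTFx is audio duration divided by wall-clock processing time, so higher is faster. A quick sketch of what these factors mean for real clips (`processing_time` is a hypothetical helper, not part of the repo):

```python
def processing_time(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to process a clip at a given real-time factor."""
    return audio_seconds / rtfx

# A 10-minute recording with NVIDIA High (~125.3x on M4 Max):
print(f"{processing_time(600, 125.3):.1f}s")  # → 4.8s
# The same clip with the default streaming config (~5.7x):
print(f"{processing_time(600, 5.7):.1f}s")    # → 105.3s
```

The default config trades batch throughput for its ~1s streaming latency; NVIDIA High processes large windows at once and is the better fit for offline transcripts.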
|
|
|
|
|
## Usage with FluidAudio (Swift) |
|
|
|
|
|
```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                // Per-frame activity probability for this speaker slot
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```
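FluidAudio manages the speaker cache and FIFO state internally. Purely as an illustration of that bookkeeping (this is not FluidAudio's actual implementation, and the cache-eviction step here is a simple placeholder for the model's per-speaker compaction):

```python
from collections import deque

FIFO_LEN, SPKCACHE_LEN = 40, 188  # default configuration


class StreamingState:
    """Illustrative FIFO + speaker-cache bookkeeping (embeddings as opaque items)."""

    def __init__(self):
        self.fifo = deque()
        self.spkcache = deque()

    def push_chunk(self, chunk_embs):
        for emb in chunk_embs:
            self.fifo.append(emb)
            if len(self.fifo) > FIFO_LEN:
                # Oldest FIFO entry graduates into the speaker cache.
                self.spkcache.append(self.fifo.popleft())
                if len(self.spkcache) > SPKCACHE_LEN:
                    # Placeholder eviction; the real model compacts per speaker.
                    self.spkcache.popleft()


state = StreamingState()
for _ in range(10):
    state.push_chunk(range(14))  # 14 embeddings per default chunk (C+L+R)
print(len(state.fifo), len(state.spkcache))  # → 40 100
```

Keeping recent context in the FIFO while distilling long-term context into a bounded speaker cache is what lets the model track speaker identity over arbitrarily long streams with fixed-size inputs.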
|
|
## Benchmarks

Detailed performance numbers: https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md
|
|
|
|
|
## Files

### Models

- `Sortformer.mlpackage` / `.mlmodelc` - default config (low latency)
- `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low-latency config
- `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high-latency config

### Scripts

- `convert_to_coreml.py` - PyTorch-to-CoreML conversion
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - real-time microphone demo
|
|
|
|
|
## Source

Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1
|
|
|
|
|
## Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718), whose work was instrumental in:

- **Architecture conversion**: developing the PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
- **Build & optimization**: engineering the static shape configurations that let the model reach ~120x RTFx on Apple Silicon.
- **Logic implementation**: porting the streaming state logic (speaker cache and FIFO management) to keep speaker identity tracking consistent.

This project builds upon the foundational work of the NVIDIA NeMo team.