---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/diar_streaming_sortformer_4spk-v2.1
tags:
- speaker-diarization
- streaming
- coreml
- apple
- ios
- macos
- FastConformer
- Sortformer
pipeline_tag: voice-activity-detection
---
# Sortformer CoreML Models
Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.
## Model Variants
| Variant | File | Latency | Use Case |
|---------|------|---------|----------|
| **Default** | `Sortformer.mlmodelc` | ~1.04s | Low latency streaming (40-frame FIFO) |
| **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming (NVIDIA reference config, 188-frame FIFO) |
| **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |
## Configuration Parameters
| Parameter | Default | NVIDIA Low | NVIDIA High |
|-----------|---------|------------|-------------|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |
## Model Input/Output Shapes
**General**:
| Input | Shape | Description |
|-------|-------|-------------|
| chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
| chunk_lengths | `[1]` | Actual chunk length |
| spkcache | `[1, S, 512]` | Speaker cache embeddings |
| spkcache_lengths | `[1]` | Actual cache length |
| fifo | `[1, F, 512]` | FIFO queue embeddings |
| fifo_lengths | `[1]` | Actual FIFO length |

| Output | Shape | Description |
|--------|-------|-------------|
| speaker_preds | `[1, C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | `[1, C+L+R, 512]` | Embeddings for state update |
| chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
| nest_encoder_embs | `[1, C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
| nest_encoder_lengths | `[1]` | Actual speaker embedding count |
Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
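The configuration-specific shapes follow directly from these symbols: the chunk input carries `8*(C+L+R)` mel frames (8× subsampling, so one encoder step consumes 8 mel frames), and the prediction length is `C+L+R+S+F`. As a sanity check, the table values can be recomputed with plain arithmetic (this sketch is illustrative and independent of the models themselves):

```swift
// Recompute the tensor lengths in the tables from the config parameters.
struct SortformerConfig {
    let chunkLen: Int, leftContext: Int, rightContext: Int
    let fifoLen: Int, spkcacheLen: Int

    // Chunk input length: 8 mel frames per encoder step (8x subsampling).
    var melFrames: Int { 8 * (chunkLen + leftContext + rightContext) }
    // speaker_preds / nest_encoder_embs sequence length.
    var predLen: Int { chunkLen + leftContext + rightContext + spkcacheLen + fifoLen }
}

let defaults = SortformerConfig(chunkLen: 6, leftContext: 1, rightContext: 7,
                                fifoLen: 40, spkcacheLen: 188)
let nvidiaHigh = SortformerConfig(chunkLen: 340, leftContext: 1, rightContext: 40,
                                  fifoLen: 40, spkcacheLen: 188)

print(defaults.melFrames, defaults.predLen)     // 112 242
print(nvidiaHigh.melFrames, nvidiaHigh.predLen) // 3048 609
```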
**Configuration-Specific Shapes**:
| Input | Default | NVIDIA Low | NVIDIA High |
|-------|---------|------------|-------------|
| chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
| chunk_lengths | `[1]` | `[1]` | `[1]` |
| spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
| spkcache_lengths | `[1]` | `[1]` | `[1]` |
| fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
| fifo_lengths | `[1]` | `[1]` | `[1]` |

| Output | Default | NVIDIA Low | NVIDIA High |
|--------|---------|------------|-------------|
| speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
| chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
| chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
| nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
| nest_encoder_lengths | `[1]` | `[1]` | `[1]` |

**Benchmarks**:

| Metric | Default | NVIDIA High |
|---------------|---------|-------------|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
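RTFx is the inverse real-time factor: audio duration divided by wall-clock processing time, so higher is faster. At ~5.7x, a 60-second clip takes roughly 10.5 seconds to process; at ~125.3x, under half a second. A trivial sketch of the relationship:

```swift
// Estimated wall-clock processing time from audio duration and RTFx.
func processingTime(audioSeconds: Double, rtfx: Double) -> Double {
    audioSeconds / rtfx
}

print(processingTime(audioSeconds: 60, rtfx: 5.7))   // ~10.5 s (Default, M4 Max)
print(processingTime(audioSeconds: 60, rtfx: 125.3)) // ~0.48 s (NVIDIA High, M4 Max)
```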
## Usage with FluidAudio (Swift)
```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```
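One way to consume the per-frame probabilities from the streaming loop is to threshold them into per-speaker activity spans. This is an illustrative sketch, not part of the FluidAudio API; the 0.5 threshold and the `activeSpans` helper are assumptions for the example:

```swift
// Hypothetical post-processing: threshold per-frame speaker probabilities
// (probs[frame][speaker]) into (speaker, frame range) activity spans.
func activeSpans(probs: [[Float]], threshold: Float = 0.5) -> [(speaker: Int, start: Int, end: Int)] {
    var spans: [(speaker: Int, start: Int, end: Int)] = []
    guard let numSpeakers = probs.first?.count else { return spans }
    for speaker in 0..<numSpeakers {
        var start: Int? = nil
        for frame in 0...probs.count { // one step past the end flushes an open span
            let active = frame < probs.count && probs[frame][speaker] >= threshold
            if active, start == nil { start = frame }
            if !active, let s = start {
                spans.append((speaker, s, frame))
                start = nil
            }
        }
    }
    return spans
}

// Frames 0-1: speaker 0 active; frames 2-3: speaker 1 active.
let probs: [[Float]] = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
let spans = activeSpans(probs: probs) // speaker 0 over 0..<2, speaker 1 over 2..<4
```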
## Performance

See the FluidAudio benchmarks: https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md

## Files

**Models**:
- `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
- `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low latency config
- `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high latency config

**Scripts**:
- `convert_to_coreml.py` - PyTorch to CoreML conversion
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo
## Source

Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

## Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718), whose work was instrumental in:
- Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
- Build & Optimization: Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
- Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
This project was built upon the foundational work of the NVIDIA NeMo team.