---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/nemotron-speech-streaming-en-0.6b
tags:
- speech-recognition
- automatic-speech-recognition
- streaming-asr
- coreml
- apple
- ios
- macos
- FastConformer
- RNNT
- Parakeet
- ASR
pipeline_tag: automatic-speech-recognition
---

# Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

## Model Variants

| Variant | File | Latency | Use Case |
|---------|------|---------|----------|
| **Default** | `Sortformer.mlmodelc` | ~1.04s | Low latency streaming |
| **NVIDIA Low** | `SortformerNvidiaLow.mlmodelc` | ~1.04s | Low latency streaming |
| **NVIDIA High** | `SortformerNvidiaHigh.mlmodelc` | ~30.4s | Best quality, offline |

## Configuration Parameters

| Parameter | Default | NVIDIA Low | NVIDIA High |
|-----------|---------|------------|-------------|
| chunk_len | 6 | 6 | 340 |
| chunk_right_context | 7 | 7 | 40 |
| chunk_left_context | 1 | 1 | 1 |
| fifo_len | 40 | 188 | 40 |
| spkcache_len | 188 | 188 | 188 |

## Model Input/Output Shapes

**General**:

| Input | Shape | Description |
|-------|-------|-------------|
| chunk | `[1, 8*(C+L+R), 128]` | Mel spectrogram features |
| chunk_lengths | `[1]` | Actual chunk length |
| spkcache | `[1, S, 512]` | Speaker cache embeddings |
| spkcache_lengths | `[1]` | Actual cache length |
| fifo | `[1, F, 512]` | FIFO queue embeddings |
| fifo_lengths | `[1]` | Actual FIFO length |

| Output | Shape | Description |
|--------|-------|-------------|
| speaker_preds | `[C+L+R+S+F, 4]` | Speaker probabilities (4 speakers) |
| chunk_pre_encoder_embs | `[C+L+R, 512]` | Embeddings for state update |
| chunk_pre_encoder_lengths | `[1]` | Actual embedding count |
| nest_encoder_embs | `[C+L+R+S+F, 192]` | Embeddings for speaker discrimination |
| nest_encoder_lengths | `[1]` | Actual speaker embedding count |

Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F =
fifo_len`.

**Configuration-Specific Shapes**:

| Input | Default | NVIDIA Low | NVIDIA High |
|-------|---------|------------|-------------|
| chunk | `[1, 112, 128]` | `[1, 112, 128]` | `[1, 3048, 128]` |
| chunk_lengths | `[1]` | `[1]` | `[1]` |
| spkcache | `[1, 188, 512]` | `[1, 188, 512]` | `[1, 188, 512]` |
| spkcache_lengths | `[1]` | `[1]` | `[1]` |
| fifo | `[1, 40, 512]` | `[1, 188, 512]` | `[1, 40, 512]` |
| fifo_lengths | `[1]` | `[1]` | `[1]` |

| Output | Default | NVIDIA Low | NVIDIA High |
|--------|---------|------------|-------------|
| speaker_preds | `[1, 242, 4]` | `[1, 390, 4]` | `[1, 609, 4]` |
| chunk_pre_encoder_embs | `[1, 14, 512]` | `[1, 14, 512]` | `[1, 381, 512]` |
| chunk_pre_encoder_lengths | `[1]` | `[1]` | `[1]` |
| nest_encoder_embs | `[1, 242, 192]` | `[1, 390, 192]` | `[1, 609, 192]` |
| nest_encoder_lengths | `[1]` | `[1]` | `[1]` |

| Metric | Default | NVIDIA High |
|---------------|---------|-------------|
| Latency | ~1.12s | ~30.4s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |

## Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..