---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/diar_streaming_sortformer_4spk-v2.1
tags:
  - speaker-diarization
  - streaming
  - coreml
  - apple
  - ios
  - macos
  - FastConformer
  - Sortformer
---

# Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

## Model Variants

| Variant | File | Latency | Use Case |
|---|---|---|---|
| Default | `Sortformer.mlmodelc` | ~1.04 s | Low-latency streaming |
| NVIDIA Low | `SortformerNvidiaLow.mlmodelc` | ~1.04 s | Low-latency streaming |
| NVIDIA High | `SortformerNvidiaHigh.mlmodelc` | ~30.4 s | Best quality, offline |
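The latency figures follow from the chunk geometry in the configuration table below: a streaming step cannot emit a frame until that frame's right context has been buffered. Assuming the usual 10 ms mel hop and the 8x subsampling implied by the chunk input shape (80 ms per output frame — an assumption, not stated on this card), the numbers line up:

```python
FRAME_SEC = 0.080  # assumed: 10 ms mel hop x 8x subsampling = 80 ms per output frame

def streaming_latency(chunk_len: int, right_context: int) -> float:
    # A chunk's frames are only emitted once its right context has arrived.
    return (chunk_len + right_context) * FRAME_SEC

print(f"Default:     ~{streaming_latency(6, 7):.2f} s")    # matches the ~1.04 s above
print(f"NVIDIA High: ~{streaming_latency(340, 40):.1f} s") # matches the ~30.4 s above
```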

## Configuration Parameters

| Parameter | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| `chunk_len` | 6 | 6 | 340 |
| `chunk_right_context` | 7 | 7 | 40 |
| `chunk_left_context` | 1 | 1 | 1 |
| `fifo_len` | 40 | 188 | 40 |
| `spkcache_len` | 188 | 188 | 188 |

## Model Input/Output Shapes

**General:**

| Input | Shape | Description |
|---|---|---|
| `chunk` | [1, 8*(C+L+R), 128] | Mel spectrogram features |
| `chunk_lengths` | [1] | Actual chunk length |
| `spkcache` | [1, S, 512] | Speaker cache embeddings |
| `spkcache_lengths` | [1] | Actual cache length |
| `fifo` | [1, F, 512] | FIFO queue embeddings |
| `fifo_lengths` | [1] | Actual FIFO length |

| Output | Shape | Description |
|---|---|---|
| `speaker_preds` | [1, C+L+R+S+F, 4] | Speaker probabilities (4 speakers) |
| `chunk_pre_encoder_embs` | [1, C+L+R, 512] | Embeddings for state update |
| `chunk_pre_encoder_lengths` | [1] | Actual embedding count |
| `nest_encoder_embs` | [1, C+L+R+S+F, 192] | Embeddings for speaker discrimination |
| `nest_encoder_lengths` | [1] | Actual speaker embedding count |

*Note:* C = `chunk_len`, L = `chunk_left_context`, R = `chunk_right_context`, S = `spkcache_len`, F = `fifo_len`.
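The configuration-specific shapes below are all derived from these symbols. A quick sketch that reproduces them from the configuration table:

```python
# Derive per-configuration tensor sizes from the configuration parameters.
CONFIGS = {
    "Default":     dict(C=6,   L=1, R=7,  S=188, F=40),
    "NVIDIA Low":  dict(C=6,   L=1, R=7,  S=188, F=188),
    "NVIDIA High": dict(C=340, L=1, R=40, S=188, F=40),
}

for name, p in CONFIGS.items():
    chunk_frames = 8 * (p["C"] + p["L"] + p["R"])              # mel frames per step
    total_frames = p["C"] + p["L"] + p["R"] + p["S"] + p["F"]  # frames scored per step
    print(name, {"chunk": [1, chunk_frames, 128],
                 "speaker_preds": [1, total_frames, 4]})
```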

**Configuration-Specific Shapes:**

| Input | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| `chunk` | [1, 112, 128] | [1, 112, 128] | [1, 3048, 128] |
| `chunk_lengths` | [1] | [1] | [1] |
| `spkcache` | [1, 188, 512] | [1, 188, 512] | [1, 188, 512] |
| `spkcache_lengths` | [1] | [1] | [1] |
| `fifo` | [1, 40, 512] | [1, 188, 512] | [1, 40, 512] |
| `fifo_lengths` | [1] | [1] | [1] |

| Output | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| `speaker_preds` | [1, 242, 4] | [1, 390, 4] | [1, 609, 4] |
| `chunk_pre_encoder_embs` | [1, 14, 512] | [1, 14, 512] | [1, 381, 512] |
| `chunk_pre_encoder_lengths` | [1] | [1] | [1] |
| `nest_encoder_embs` | [1, 242, 192] | [1, 390, 192] | [1, 609, 192] |
| `nest_encoder_lengths` | [1] | [1] | [1] |

| Metric | Default | NVIDIA High |
|---|---|---|
| Latency | ~1.12 s | ~30.4 s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
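RTFx is the real-time factor: audio duration divided by wall-clock processing time, so higher is faster. A quick sense of scale:

```python
def processing_time(audio_sec: float, rtfx: float) -> float:
    # RTFx = audio duration / wall-clock processing time
    return audio_sec / rtfx

print(f"{processing_time(3600, 125.3):.0f} s")  # one hour of audio at ~125.3x -> ~29 s
print(f"{processing_time(3600, 5.7):.0f} s")    # one hour of audio at ~5.7x  -> ~632 s
```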

## Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```
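Outside Swift, the models can also be driven from Python with `coremltools` (macOS only), e.g. for quick experiments. A minimal sketch for the Default configuration: the input names and shapes come from the tables above, but the integer dtype of the `*_lengths` inputs and the local model path are assumptions — check the model's input description if they differ.

```python
from pathlib import Path

import numpy as np

# Input tensors for the Default configuration (shapes from the tables above).
# The int32 dtype for *_lengths is an assumption.
inputs = {
    "chunk":            np.zeros((1, 112, 128), dtype=np.float32),
    "chunk_lengths":    np.array([112], dtype=np.int32),
    "spkcache":         np.zeros((1, 188, 512), dtype=np.float32),
    "spkcache_lengths": np.array([0], dtype=np.int32),
    "fifo":             np.zeros((1, 40, 512), dtype=np.float32),
    "fifo_lengths":     np.array([0], dtype=np.int32),
}

model_path = Path("Sortformer.mlpackage")  # wherever you downloaded the model
if model_path.exists():
    import coremltools as ct  # requires macOS
    model = ct.models.MLModel(str(model_path))
    out = model.predict(inputs)
    print(out["speaker_preds"].shape)  # per-frame probabilities for 4 speakers
```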

## Performance

See the [FluidAudio benchmarks](https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md) for detailed measurements.

## Files

### Models

- `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
- `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low-latency config
- `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high-latency config

### Scripts

- `convert_to_coreml.py` - PyTorch-to-CoreML conversion
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo

## Source

Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)

## Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).

Their work was instrumental in:

- **Architecture Conversion:** Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
- **Build & Optimization:** Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
- **Logic Implementation:** Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
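That streaming state logic can be pictured roughly as follows. This is a hypothetical sketch, not FluidAudio's actual implementation: each step's chunk embeddings are appended to the FIFO, and frames that overflow `fifo_len` spill into the speaker cache (the real update is more involved, e.g. the cache is scored and compressed rather than simply truncated):

```python
import numpy as np

FIFO_LEN, SPKCACHE_LEN, EMB_DIM = 40, 188, 512  # Default configuration

def update_state(fifo, spkcache, chunk_embs):
    """Illustrative only: append new chunk embeddings to the FIFO and spill
    the oldest frames into the speaker cache when the FIFO overflows."""
    fifo = np.concatenate([fifo, chunk_embs], axis=0)
    if len(fifo) > FIFO_LEN:
        overflow, fifo = fifo[:-FIFO_LEN], fifo[-FIFO_LEN:]
        spkcache = np.concatenate([spkcache, overflow], axis=0)[-SPKCACHE_LEN:]
    return fifo, spkcache

fifo = np.zeros((0, EMB_DIM), dtype=np.float32)
spkcache = np.zeros((0, EMB_DIM), dtype=np.float32)
for _ in range(10):  # ten streaming steps of 14 frames each (C+L+R for Default)
    fifo, spkcache = update_state(fifo, spkcache, np.zeros((14, EMB_DIM), np.float32))
print(len(fifo), len(spkcache))  # FIFO capped at 40; older frames spilled to the cache
```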

This project was built upon the foundational work of the NVIDIA NeMo team.