---
license: cc-by-4.0
library_name: coreml
base_model: nvidia/diar_streaming_sortformer_4spk-v2.1
tags:
  - speaker-diarization
  - streaming
  - coreml
  - apple
  - ios
  - macos
  - FastConformer
  - Sortformer
---

# Sortformer CoreML Models

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

## Model Variants

| Variant | File | Latency | Use Case |
|---|---|---|---|
| Default | `Sortformer.mlmodelc` | ~1.04 s | Low-latency streaming |
| NVIDIA Low | `SortformerNvidiaLow.mlmodelc` | ~1.04 s | Low-latency streaming |
| NVIDIA High | `SortformerNvidiaHigh.mlmodelc` | ~30.4 s | Best quality, offline |
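The latency figures follow from the chunk geometry in the configuration table below: a streaming step cannot emit a frame until that frame's right context has been buffered. Assuming the usual 10 ms mel hop and the 8x subsampling implied by the chunk input shape (80 ms per output frame — an assumption, not stated on this card), the numbers line up:

```python
FRAME_SEC = 0.080  # assumed: 10 ms mel hop x 8x subsampling = 80 ms per output frame

def streaming_latency(chunk_len: int, right_context: int) -> float:
    # A chunk's frames are only emitted once its right context has arrived.
    return (chunk_len + right_context) * FRAME_SEC

print(f"Default:     ~{streaming_latency(6, 7):.2f} s")    # matches the ~1.04 s above
print(f"NVIDIA High: ~{streaming_latency(340, 40):.1f} s") # matches the ~30.4 s above
```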

## Configuration Parameters

| Parameter | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| `chunk_len` | 6 | 6 | 340 |
| `chunk_right_context` | 7 | 7 | 40 |
| `chunk_left_context` | 1 | 1 | 1 |
| `fifo_len` | 40 | 188 | 40 |
| `spkcache_len` | 188 | 188 | 188 |

## Model Input/Output Shapes

**General:**

| Input | Shape | Description |
|---|---|---|
| `chunk` | [1, 8*(C+L+R), 128] | Mel spectrogram features |
| `chunk_lengths` | [1] | Actual chunk length |
| `spkcache` | [1, S, 512] | Speaker cache embeddings |
| `spkcache_lengths` | [1] | Actual cache length |
| `fifo` | [1, F, 512] | FIFO queue embeddings |
| `fifo_lengths` | [1] | Actual FIFO length |

| Output | Shape | Description |
|---|---|---|
| `speaker_preds` | [1, C+L+R+S+F, 4] | Speaker probabilities (4 speakers) |
| `chunk_pre_encoder_embs` | [1, C+L+R, 512] | Embeddings for state update |
| `chunk_pre_encoder_lengths` | [1] | Actual embedding count |
| `nest_encoder_embs` | [1, C+L+R+S+F, 192] | Embeddings for speaker discrimination |
| `nest_encoder_lengths` | [1] | Actual speaker embedding count |

*Note:* C = `chunk_len`, L = `chunk_left_context`, R = `chunk_right_context`, S = `spkcache_len`, F = `fifo_len`.
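The configuration-specific shapes below are all derived from these symbols. A quick sketch that reproduces them from the configuration table:

```python
# Derive per-configuration tensor sizes from the configuration parameters.
CONFIGS = {
    "Default":     dict(C=6,   L=1, R=7,  S=188, F=40),
    "NVIDIA Low":  dict(C=6,   L=1, R=7,  S=188, F=188),
    "NVIDIA High": dict(C=340, L=1, R=40, S=188, F=40),
}

for name, p in CONFIGS.items():
    chunk_frames = 8 * (p["C"] + p["L"] + p["R"])              # mel frames per step
    total_frames = p["C"] + p["L"] + p["R"] + p["S"] + p["F"]  # frames scored per step
    print(name, {"chunk": [1, chunk_frames, 128],
                 "speaker_preds": [1, total_frames, 4]})
```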

**Configuration-Specific Shapes:**

| Input | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| `chunk` | [1, 112, 128] | [1, 112, 128] | [1, 3048, 128] |
| `chunk_lengths` | [1] | [1] | [1] |
| `spkcache` | [1, 188, 512] | [1, 188, 512] | [1, 188, 512] |
| `spkcache_lengths` | [1] | [1] | [1] |
| `fifo` | [1, 40, 512] | [1, 188, 512] | [1, 40, 512] |
| `fifo_lengths` | [1] | [1] | [1] |

| Output | Default | NVIDIA Low | NVIDIA High |
|---|---|---|---|
| `speaker_preds` | [1, 242, 4] | [1, 390, 4] | [1, 609, 4] |
| `chunk_pre_encoder_embs` | [1, 14, 512] | [1, 14, 512] | [1, 381, 512] |
| `chunk_pre_encoder_lengths` | [1] | [1] | [1] |
| `nest_encoder_embs` | [1, 242, 192] | [1, 390, 192] | [1, 609, 192] |
| `nest_encoder_lengths` | [1] | [1] | [1] |

| Metric | Default | NVIDIA High |
|---|---|---|
| Latency | ~1.12 s | ~30.4 s |
| RTFx (M4 Max) | ~5.7x | ~125.3x |
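RTFx is the real-time factor: audio duration divided by wall-clock processing time, so higher is faster. A quick sense of scale:

```python
def processing_time(audio_sec: float, rtfx: float) -> float:
    # RTFx = audio duration / wall-clock processing time
    return audio_sec / rtfx

print(f"{processing_time(3600, 125.3):.0f} s")  # one hour of audio at ~125.3x -> ~29 s
print(f"{processing_time(3600, 5.7):.0f} s")    # one hour of audio at ~5.7x  -> ~632 s
```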

## Usage with FluidAudio (Swift)

```swift
import FluidAudio

// Initialize with default config (auto-downloads from HuggingFace)
let diarizer = SortformerDiarizer(config: .default)
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

// Streaming processing
for audioChunk in audioStream {
    if let result = try diarizer.processSamples(audioChunk) {
        for frame in 0..<result.frameCount {
            for speaker in 0..<4 {
                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
            }
        }
    }
}

// Or batch processing
let timeline = try diarizer.processComplete(audioSamples)
for (speakerIndex, segments) in timeline.segments.enumerated() {
    for segment in segments {
        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
    }
}
```
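Outside Swift, the models can also be driven from Python with `coremltools` (macOS only), e.g. for quick experiments. A minimal sketch for the Default configuration: the input names and shapes come from the tables above, but the integer dtype of the `*_lengths` inputs and the local model path are assumptions — check the model's input description if they differ.

```python
from pathlib import Path

import numpy as np

# Input tensors for the Default configuration (shapes from the tables above).
# The int32 dtype for *_lengths is an assumption.
inputs = {
    "chunk":            np.zeros((1, 112, 128), dtype=np.float32),
    "chunk_lengths":    np.array([112], dtype=np.int32),
    "spkcache":         np.zeros((1, 188, 512), dtype=np.float32),
    "spkcache_lengths": np.array([0], dtype=np.int32),
    "fifo":             np.zeros((1, 40, 512), dtype=np.float32),
    "fifo_lengths":     np.array([0], dtype=np.int32),
}

model_path = Path("Sortformer.mlpackage")  # wherever you downloaded the model
if model_path.exists():
    import coremltools as ct  # requires macOS
    model = ct.models.MLModel(str(model_path))
    out = model.predict(inputs)
    print(out["speaker_preds"].shape)  # per-frame probabilities for 4 speakers
```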

## Performance

See the [FluidAudio benchmarks](https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Benchmarks.md) for detailed measurements.

## Files

### Models

- `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
- `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low-latency config
- `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high-latency config

### Scripts

- `convert_to_coreml.py` - PyTorch-to-CoreML conversion
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo

## Source

Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)

## Credits & Acknowledgements

This project would not have been possible without the significant technical contributions of [GradientDescent2718](https://huggingface.co/GradientDescent2718).

Their work was instrumental in:

- **Architecture Conversion:** Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
- **Build & Optimization:** Engineering the static shape configurations that allow the model to achieve ~120x RTF on Apple Silicon.
- **Logic Implementation:** Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.
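That streaming state logic can be pictured roughly as follows. This is a hypothetical sketch, not FluidAudio's actual implementation: each step's chunk embeddings are appended to the FIFO, and frames that overflow `fifo_len` spill into the speaker cache (the real update is more involved, e.g. the cache is scored and compressed rather than simply truncated):

```python
import numpy as np

FIFO_LEN, SPKCACHE_LEN, EMB_DIM = 40, 188, 512  # Default configuration

def update_state(fifo, spkcache, chunk_embs):
    """Illustrative only: append new chunk embeddings to the FIFO and spill
    the oldest frames into the speaker cache when the FIFO overflows."""
    fifo = np.concatenate([fifo, chunk_embs], axis=0)
    if len(fifo) > FIFO_LEN:
        overflow, fifo = fifo[:-FIFO_LEN], fifo[-FIFO_LEN:]
        spkcache = np.concatenate([spkcache, overflow], axis=0)[-SPKCACHE_LEN:]
    return fifo, spkcache

fifo = np.zeros((0, EMB_DIM), dtype=np.float32)
spkcache = np.zeros((0, EMB_DIM), dtype=np.float32)
for _ in range(10):  # ten streaming steps of 14 frames each (C+L+R for Default)
    fifo, spkcache = update_state(fifo, spkcache, np.zeros((14, EMB_DIM), np.float32))
print(len(fifo), len(spkcache))  # FIFO capped at 40; older frames spilled to the cache
```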

This project was built upon the foundational work of the NVIDIA NeMo team.