	# Sortformer CoreML Models

	Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML for Apple Silicon.

	## Model Variants

	\| Variant \| File \| Latency \| Use Case \|
	\|---------\|------\|---------\|----------\|
	\| Default \| `Sortformer.mlmodelc` \| ~1.04s \| Low latency streaming \|
	\| NVIDIA Low \| `SortformerNvidiaLow.mlmodelc` \| ~1.04s \| Low latency streaming \|
	\| NVIDIA High \| `SortformerNvidiaHigh.mlmodelc` \| ~30.4s \| Best quality, offline \|

	## Configuration Parameters

	\| Parameter \| Default \| NVIDIA Low \| NVIDIA High \|
	\|-----------\|---------\|------------\|-------------\|
	\| chunk_len \| 6 \| 6 \| 340 \|
	\| chunk_right_context \| 7 \| 7 \| 40 \|
	\| chunk_left_context \| 1 \| 1 \| 1 \|
	\| fifo_len \| 40 \| 188 \| 40 \|
	\| spkcache_len \| 188 \| 188 \| 188 \|
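The latency figures in the variants table are consistent with buffering `chunk_len + chunk_right_context` frames before each inference step. Below is a minimal sketch of that arithmetic. It assumes each frame spans 80 ms: a 10 ms mel hop (an assumption, not stated in this README) times the 8x subsampling implied by the `8*(C+L+R)` input shape. `StreamingConfig` and `latencySeconds` are illustrative names, not part of the FluidAudio API.

```swift
import Foundation

// Assumption: one encoder frame covers 80 ms of audio (10 ms mel hop x 8 subsampling).
let frameDuration = 0.08

/// Streaming parameters copied from the configuration table (hypothetical helper type).
struct StreamingConfig {
    let name: String
    let chunkLen: Int
    let chunkRightContext: Int

    /// Audio that must be buffered before a chunk can be processed:
    /// the chunk itself plus its right (look-ahead) context.
    var latencySeconds: Double {
        Double(chunkLen + chunkRightContext) * frameDuration
    }
}

let configs = [
    StreamingConfig(name: "Default", chunkLen: 6, chunkRightContext: 7),
    StreamingConfig(name: "NVIDIA Low", chunkLen: 6, chunkRightContext: 7),
    StreamingConfig(name: "NVIDIA High", chunkLen: 340, chunkRightContext: 40),
]

for config in configs {
    let formatted = String(format: "%.2f", config.latencySeconds)
    print("\(config.name): ~\(formatted)s")
}
```

Under these assumptions the Default and NVIDIA Low configurations come out at about 1.04 s and NVIDIA High at about 30.4 s.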

	## Model Input/Output Shapes

	General:

	\| Input \| Shape \| Description \|
	\|-------\|-------\|-------------\|
	\| chunk \| `[1, 8*(C+L+R), 128]` \| Mel spectrogram features \|
	\| chunk_lengths \| `[1]` \| Actual chunk length \|
	\| spkcache \| `[1, S, 512]` \| Speaker cache embeddings \|
	\| spkcache_lengths \| `[1]` \| Actual cache length \|
	\| fifo \| `[1, F, 512]` \| FIFO queue embeddings \|
	\| fifo_lengths \| `[1]` \| Actual FIFO length \|

	\| Output \| Shape \| Description \|
	\|--------\|-------\|-------------\|
	\| speaker_preds \| `[C+L+R+S+F, 4]` \| Speaker probabilities (4 speakers) \|
	\| chunk_pre_encoder_embs \| `[C+L+R, 512]` \| Embeddings for state update \|
	\| chunk_pre_encoder_lengths \| `[1]` \| Actual embedding count \|
	\| nest_encoder_embs \| `[C+L+R+S+F, 192]` \| Embeddings for speaker discrimination \|
	\| nest_encoder_lengths \| `[1]` \| Actual speaker embedding count \|

	Note: `C = chunk_len`, `L = chunk_left_context`, `R = chunk_right_context`, `S = spkcache_len`, `F = fifo_len`.
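The general shapes above can be evaluated for each configuration. This sketch only does the frame-count arithmetic from the formulas and the parameter table. `SortformerShapeConfig` is a hypothetical helper type for illustration, not something shipped with the models or FluidAudio.

```swift
/// Frame counts taken from the configuration table (hypothetical helper type).
struct SortformerShapeConfig {
    let name: String
    let chunkLen: Int      // C
    let leftContext: Int   // L
    let rightContext: Int  // R
    let spkcacheLen: Int   // S
    let fifoLen: Int       // F

    /// Frames in one chunk including both contexts: C + L + R.
    var chunkFrames: Int { chunkLen + leftContext + rightContext }
    /// Mel frames fed to the model: 8 * (C + L + R).
    var melFrames: Int { 8 * chunkFrames }
    /// Frames covered by the per-frame outputs: C + L + R + S + F.
    var totalFrames: Int { chunkFrames + spkcacheLen + fifoLen }
}

let shapeConfigs = [
    SortformerShapeConfig(name: "Default", chunkLen: 6, leftContext: 1,
                          rightContext: 7, spkcacheLen: 188, fifoLen: 40),
    SortformerShapeConfig(name: "NVIDIA Low", chunkLen: 6, leftContext: 1,
                          rightContext: 7, spkcacheLen: 188, fifoLen: 188),
    SortformerShapeConfig(name: "NVIDIA High", chunkLen: 340, leftContext: 1,
                          rightContext: 40, spkcacheLen: 188, fifoLen: 40),
]

for cfg in shapeConfigs {
    print("\(cfg.name): chunk [1, \(cfg.melFrames), 128], "
        + "chunk embeddings \(cfg.chunkFrames) frames, "
        + "per-frame outputs \(cfg.totalFrames) frames")
}
```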

	Configuration-Specific Shapes:

	\| Input \| Default \| NVIDIA Low \| NVIDIA High \|
	\|-------\|---------\|------------\|-------------\|
	\| chunk \| `[1, 112, 128]` \| `[1, 112, 128]` \| `[1, 3048, 128]` \|
	\| chunk_lengths \| `[1]` \| `[1]` \| `[1]` \|
	\| spkcache \| `[1, 188, 512]` \| `[1, 188, 512]` \| `[1, 188, 512]` \|
	\| spkcache_lengths \| `[1]` \| `[1]` \| `[1]` \|
	\| fifo \| `[1, 40, 512]` \| `[1, 188, 512]` \| `[1, 40, 512]` \|
	\| fifo_lengths \| `[1]` \| `[1]` \| `[1]` \|

	\| Output \| Default \| NVIDIA Low \| NVIDIA High \|
	\|--------\|---------\|------------\|-------------\|
	\| speaker_preds \| `[1, 242, 4]` \| `[1, 390, 4]` \| `[1, 609, 4]` \|
	\| chunk_pre_encoder_embs \| `[1, 14, 512]` \| `[1, 14, 512]` \| `[1, 381, 512]` \|
	\| chunk_pre_encoder_lengths \| `[1]` \| `[1]` \| `[1]` \|
	\| nest_encoder_embs \| `[1, 242, 192]` \| `[1, 390, 192]` \| `[1, 609, 192]` \|
	\| nest_encoder_lengths \| `[1]` \| `[1]` \| `[1]` \|

	## Usage with FluidAudio (Swift)

	```swift
	import FluidAudio

	// Initialize with default config (auto-downloads from HuggingFace)
	let diarizer = SortformerDiarizer(config: .default)
	let models = try await SortformerModels.loadFromHuggingFace(config: .default)
	diarizer.initialize(models: models)

	// Streaming processing
	for audioChunk in audioStream {
	    if let result = try diarizer.processSamples(audioChunk) {
	        for frame in 0..<result.frameCount {
	            for speaker in 0..<4 {
	                let prob = result.getSpeakerPrediction(speaker: speaker, frame: frame)
	            }
	        }
	    }
	}

	// Or file processing
	let timeline = try diarizer.processComplete(audioSamples)
	for (speakerIndex, segments) in timeline.segments.enumerated() {
	    for segment in segments {
	        print("Speaker \(speakerIndex): \(segment.startTime)s - \(segment.endTime)s")
	    }
	}
	```
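
The per-frame probabilities read in the streaming loop can be turned into speech segments by thresholding. The sketch below is one simple way to do that and is not FluidAudio's own post-processing. The `probs[frame][speaker]` layout, the 0.5 threshold, the 80 ms frame duration, and the `SpeechSegment` and `segments(from:)` names are all assumptions made for illustration.

```swift
/// A contiguous stretch of speech for one speaker, in seconds (hypothetical type).
struct SpeechSegment: Equatable {
    let speaker: Int
    let start: Double
    let end: Double
}

/// Thresholds per-frame probabilities and merges consecutive active frames.
/// `probs[frame][speaker]` is an assumed layout used only for this sketch.
func segments(from probs: [[Float]],
              threshold: Float = 0.5,
              frameDuration: Double = 0.08) -> [SpeechSegment] {
    var result: [SpeechSegment] = []
    let speakerCount = probs.first?.count ?? 0
    for speaker in 0..<speakerCount {
        var startFrame: Int? = nil
        for (frame, row) in probs.enumerated() {
            let active = row[speaker] > threshold
            if active && startFrame == nil {
                // Speaker became active: open a new segment.
                startFrame = frame
            } else if !active, let opened = startFrame {
                // Speaker went silent: close the open segment.
                result.append(SpeechSegment(speaker: speaker,
                                            start: Double(opened) * frameDuration,
                                            end: Double(frame) * frameDuration))
                startFrame = nil
            }
        }
        // Close a segment that is still open at the end of the buffer.
        if let opened = startFrame {
            result.append(SpeechSegment(speaker: speaker,
                                        start: Double(opened) * frameDuration,
                                        end: Double(probs.count) * frameDuration))
        }
    }
    return result
}

// Four frames: speaker 0 talks in the first two, speaker 1 in the last two.
let demoProbs: [[Float]] = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
let demoSegments = segments(from: demoProbs)
for segment in demoSegments {
    print("Speaker \(segment.speaker): \(segment.start)s - \(segment.end)s")
}
```

Because each speaker is thresholded independently, overlapping speech shows up as overlapping segments for different speakers.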
	## Performance

	\| Metric \| Default \| NVIDIA High \|
	\|---------------\|---------\|-------------\|
	\| Latency \| ~1.04s \| ~30.4s \|
	\| RTFx (M4 Max) \| ~5.7x \| ~125.3x \|
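
RTFx is the inverse real-time factor: seconds of audio processed per second of compute. As a rough worked example, the sketch below estimates how long a recording would take to process at the throughput figures in the table. `estimatedProcessingTime` is an illustrative helper, and real timings will vary with hardware and load.

```swift
/// Estimated compute time for `audioSeconds` of audio at a given RTFx,
/// where RTFx = audio duration / processing time.
func estimatedProcessingTime(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

let oneHour = 3600.0
let defaultTime = estimatedProcessingTime(audioSeconds: oneHour, rtfx: 5.7)
let highTime = estimatedProcessingTime(audioSeconds: oneHour, rtfx: 125.3)

print("Default: about \(Int(defaultTime.rounded())) s for one hour of audio")
print("NVIDIA High: about \(Int(highTime.rounded())) s for one hour of audio")
```

At the table's figures, one hour of audio works out to roughly ten and a half minutes of compute for the Default configuration and under half a minute for NVIDIA High.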

	## Files

	### Models

	- `Sortformer.mlpackage` / `.mlmodelc` - Default config (low latency)
	- `SortformerNvidiaLow.mlpackage` / `.mlmodelc` - NVIDIA low latency config
	- `SortformerNvidiaHigh.mlpackage` / `.mlmodelc` - NVIDIA high latency config

	### Scripts

	- `convert_to_coreml.py` - PyTorch to CoreML conversion
	- `streaming_inference.py` - Python streaming inference example
	- `mic_inference.py` - Real-time microphone demo

	## Source

	Original model: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

	## Credits & Acknowledgements

	This project would not have been possible without the significant technical contributions of https://huggingface.co/GradientDescent2718.

	Their work was instrumental in:

	- Architecture Conversion: Developing the complex PyTorch-to-CoreML conversion pipeline for the 17-layer Fast-Conformer and 18-layer Transformer heads.
	- Build & Optimization: Engineering the static shape configurations that allow the model to reach ~120x real-time throughput (RTFx) on Apple Silicon.
	- Logic Implementation: Porting the critical streaming state logic (speaker cache and FIFO management) to ensure consistent speaker identity tracking.

	This project was built upon the foundational work of the NVIDIA NeMo team.

