# Sortformer CoreML Models - Gradient Descent Configuration

Streaming speaker diarization models converted from NVIDIA's Sortformer to CoreML.

## Configuration

Gradient Descent - Higher quality, more context:

| Parameter | Value |
|-----------|-------|
| chunk_len | 6 |
| chunk_right_context | 7 |
| chunk_left_context | 1 |
| fifo_len | 40 |
| spkcache_len | 188 |
| spkcache_update_period | 31 |

## Model Input Shapes

| Model | Input | Shape |
|-------|-------|-------|
| Preprocessor | audio_signal | [1, 18160] |
| Preprocessor | length | [1] |
| PreEncoder | chunk | [1, 112, 128] |
| PreEncoder | chunk_lengths | [1] |
| PreEncoder | spkcache | [1, 188, 512] |
| PreEncoder | spkcache_lengths | [1] |
| PreEncoder | fifo | [1, 40, 512] |
| PreEncoder | fifo_lengths | [1] |
| Head | pre_encoder_embs | [1, 242, 512] |
| Head | pre_encoder_lengths | [1] |
| Head | chunk_embs_in | [1, 14, 512] |
| Head | chunk_lens_in | [1] |
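
The fixed input shapes appear to follow from the streaming parameters in the Configuration table. The sketch below reproduces them under a few assumptions that this README does not state: a subsampling factor of 8 mel frames per encoder frame, a 160-sample hop, and a 400-sample analysis window (typical for 16 kHz audio). The feature sizes 128 and 512 are read off the shape table. The helper name `input_shapes` is hypothetical.

```python
# Streaming parameters from the Configuration table.
chunk_len = 6
chunk_right_context = 7
chunk_left_context = 1
fifo_len = 40
spkcache_len = 188

# Assumed front-end constants (NOT stated in this README).
SUBSAMPLING = 8   # mel frames per encoder frame (assumption)
HOP = 160         # samples between mel frames (assumption)
WINDOW = 400      # samples per analysis window (assumption)

# Feature sizes read off the shape table.
N_MELS = 128
D_MODEL = 512


def input_shapes():
    """Derive the fixed input shapes from the streaming parameters."""
    # Encoder frames per step: left context + chunk + right context.
    chunk_frames = chunk_left_context + chunk_len + chunk_right_context
    # Mel frames covering those encoder frames.
    mel_frames = chunk_frames * SUBSAMPLING
    # Audio samples needed to produce that many mel frames.
    audio_samples = (mel_frames - 1) * HOP + WINDOW
    # Speaker cache + FIFO + current chunk, as seen by the Head.
    total_frames = spkcache_len + fifo_len + chunk_frames
    return {
        "audio_signal": (1, audio_samples),
        "chunk": (1, mel_frames, N_MELS),
        "spkcache": (1, spkcache_len, D_MODEL),
        "fifo": (1, fifo_len, D_MODEL),
        "pre_encoder_embs": (1, total_frames, D_MODEL),
        "chunk_embs_in": (1, chunk_frames, D_MODEL),
    }


for name, shape in input_shapes().items():
    print(f"{name}: {list(shape)}")
```

Under these assumptions the derived values match the table, which suggests that a different set of streaming parameters would need re-exported models with different fixed shapes.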

## Model Output Shapes

| Model | Output | Shape |
|-------|--------|-------|
| Preprocessor | features | [1, 112, 128] |
| Preprocessor | feature_lengths | [1] |
| PreEncoder | pre_encoder_embs | [1, 242, 512] |
| PreEncoder | pre_encoder_lengths | [1] |
| PreEncoder | chunk_embs_in | [1, 14, 512] |
| PreEncoder | chunk_lens_in | [1] |
| Head | speaker_preds | [1, 242, 4] |
| Head | chunk_pre_encoder_embs | [1, 14, 512] |
| Head | chunk_pre_encoder_lengths | [1] |
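
Read together, the input and output tables suggest a three-stage chain in which each model's outputs feed the next model's inputs. The sketch below checks that the shapes line up. The shapes are copied from the tables; the wiring list itself is an assumption inferred from matching names and shapes, and `mismatched_links` is a hypothetical helper.

```python
# Shapes copied from the Model Input Shapes and Model Output Shapes tables.
INPUTS = {
    "Preprocessor": {"audio_signal": (1, 18160), "length": (1,)},
    "PreEncoder": {
        "chunk": (1, 112, 128), "chunk_lengths": (1,),
        "spkcache": (1, 188, 512), "spkcache_lengths": (1,),
        "fifo": (1, 40, 512), "fifo_lengths": (1,),
    },
    "Head": {
        "pre_encoder_embs": (1, 242, 512), "pre_encoder_lengths": (1,),
        "chunk_embs_in": (1, 14, 512), "chunk_lens_in": (1,),
    },
}
OUTPUTS = {
    "Preprocessor": {"features": (1, 112, 128), "feature_lengths": (1,)},
    "PreEncoder": {
        "pre_encoder_embs": (1, 242, 512), "pre_encoder_lengths": (1,),
        "chunk_embs_in": (1, 14, 512), "chunk_lens_in": (1,),
    },
    "Head": {
        "speaker_preds": (1, 242, 4),
        "chunk_pre_encoder_embs": (1, 14, 512),
        "chunk_pre_encoder_lengths": (1,),
    },
}

# Assumed wiring: (producer, output name, consumer, input name).
WIRING = [
    ("Preprocessor", "features", "PreEncoder", "chunk"),
    ("Preprocessor", "feature_lengths", "PreEncoder", "chunk_lengths"),
    ("PreEncoder", "pre_encoder_embs", "Head", "pre_encoder_embs"),
    ("PreEncoder", "pre_encoder_lengths", "Head", "pre_encoder_lengths"),
    ("PreEncoder", "chunk_embs_in", "Head", "chunk_embs_in"),
    ("PreEncoder", "chunk_lens_in", "Head", "chunk_lens_in"),
]


def mismatched_links():
    """Return every wiring entry whose producer and consumer shapes differ."""
    return [
        link for link in WIRING
        if OUTPUTS[link[0]][link[1]] != INPUTS[link[2]][link[3]]
    ]


print(mismatched_links())
```

The `spkcache` and `fifo` inputs have no producer in this chain, so they are presumably state that the caller keeps between chunks; the README does not describe how that state is updated.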

## Files

### Models
- `Pipeline_Preprocessor.mlpackage` / `.mlmodelc` - Audio to mel features
- `Pipeline_PreEncoder.mlpackage` / `.mlmodelc` - Mel features + state to embeddings
- `Pipeline_Head_Fixed.mlpackage` / `.mlmodelc` - Embeddings to speaker predictions

### Scripts
- `export_gradient_descent.py` - Export script used to create these models
- `coreml_wrappers.py` - PyTorch wrapper classes for export
- `streaming_inference.py` - Python streaming inference example
- `mic_inference.py` - Real-time microphone demo

## Usage with FluidAudio (Swift)

```swift
let config = SortformerConfig.gradientDescent
let diarizer = try await SortformerDiarizer(config: config)

// Process audio chunks (getAudioChunk() stands in for your own audio source)
while let samples = getAudioChunk() {
    if let result = try diarizer.processChunk(samples) {
        // result.probabilities - confirmed speaker probabilities
        // result.tentativeProbabilities - preview (may change)
    }
}
```

## Performance

| Metric | Value |
|--------|-------|
| Latency | ~1.04 s (7 * 80 ms right context + 6 * 80 ms chunk) |
| DER (AMI) | ~30.8% |
| RTFx | ~8.2x on Apple Silicon |
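
The latency figure can be reproduced from the streaming parameters and the 80 ms frame duration given in the Latency row. The sketch below also converts the RTFx value into a processing time, assuming RTFx means audio duration divided by processing time (the usual definition, though this README does not define it). Both function names are hypothetical.

```python
FRAME_MS = 80  # frame duration, taken from the Latency row


def latency_ms(chunk_len, right_context, frame_ms=FRAME_MS):
    """Wait for a full chunk plus its right context before results are final."""
    return (chunk_len + right_context) * frame_ms


def processing_seconds(audio_seconds, rtfx):
    """Time needed to process the audio, assuming RTFx = audio time / compute time."""
    return audio_seconds / rtfx


# chunk_len = 6 and chunk_right_context = 7 from the Configuration table.
print(latency_ms(6, 7))                         # 1040 ms, i.e. ~1.04 s
print(round(processing_seconds(60.0, 8.2), 2))  # seconds to process one minute of audio
```

At ~8.2x, one minute of audio would take roughly 7.3 seconds to process, which leaves ample headroom for real-time use.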

## Source

Original model: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
