---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---

# Sortformer Diarization (CoreML)
|
|
CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon.

Runs on the Neural Engine via CoreML. No separate embedding extraction or clustering — the model directly predicts per-frame speaker activity for up to 4 speakers.
|
|
## Model Details

- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- **Size**: ~230 MB
|
|
## Streaming Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6 s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |
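These parameters fix the model's time resolution: with a 16 kHz sample rate, a 160-sample hop, and 8× subsampling, each output frame covers 8 × 160 / 16000 = 0.08 s of audio. A minimal sketch of the frame-to-time conversion (helper names are illustrative, not part of the model API):

```swift
// Streaming parameters from the configuration table above.
let sampleRate = 16_000.0
let hopLength = 160.0
let subsamplingFactor = 8.0

// Each post-subsampling output frame covers 8 * 160 / 16000 = 0.08 s of audio.
let frameDuration = subsamplingFactor * hopLength / sampleRate

// Map an output frame index to its start time in seconds.
func frameToSeconds(_ frame: Int) -> Double {
    Double(frame) * frameDuration
}

// Map a timestamp to the nearest output frame index.
func secondsToFrame(_ seconds: Double) -> Int {
    Int((seconds / frameDuration).rounded())
}
```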

## Input/Output Shapes

**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |
|
|
**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |
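The shapes are consistent with the streaming configuration: 112 mel frames collapse to 14 embedding frames after 8× subsampling, and the 242 prediction frames cover speaker cache + FIFO + current chunk. The bookkeeping, spelled out (variable names are illustrative):

```swift
// Shape bookkeeping for one streaming step, using the table values above.
let melFramesPerChunk = 112    // mel frames in the `chunk` input
let subsamplingFactor = 8      // FastConformer pre-encode subsampling
let spkcacheFrames = 188       // speaker cache capacity
let fifoFrames = 40            // FIFO buffer capacity

// 112 mel frames become 14 embedding frames after 8x subsampling,
// matching `chunk_pre_encoder_embs_out`.
let chunkEmbFrames = melFramesPerChunk / subsamplingFactor

// The head predicts over cache + FIFO + chunk, hence 242 output frames,
// matching `speaker_preds_out`.
let predictionFrames = spkcacheFrames + fifoFrames + chunkEmbFrames
```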
|
|
## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```
|
|
```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```
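The segments come from thresholding the per-frame sigmoid outputs into contiguous runs per speaker. A simplified sketch of that post-processing (the `Segment` type, 0.5 threshold, and helper name are assumptions, not the library's actual implementation):

```swift
struct Segment {
    let speakerId: Int
    let startTime: Double
    let endTime: Double
}

// Collapse per-frame activity probabilities [frames][speakers] into
// contiguous segments per speaker. Each frame spans 0.08 s
// (hop 160 x subsampling 8 at 16 kHz); the 0.5 threshold is illustrative.
func segmentsFrom(preds: [[Double]], threshold: Double = 0.5,
                  frameDuration: Double = 0.08) -> [Segment] {
    var segments: [Segment] = []
    let numSpeakers = preds.first?.count ?? 0
    for spk in 0..<numSpeakers {
        var start: Int? = nil
        for (t, frame) in preds.enumerated() {
            let active = frame[spk] >= threshold
            if active && start == nil {
                start = t                      // segment opens here
            } else if !active, let s = start {
                segments.append(Segment(speakerId: spk,
                                        startTime: Double(s) * frameDuration,
                                        endTime: Double(t) * frameDuration))
                start = nil
            }
        }
        if let s = start {                     // close a segment still open at the end
            segments.append(Segment(speakerId: spk,
                                    startTime: Double(s) * frameDuration,
                                    endTime: Double(preds.count) * frameDuration))
        }
    }
    return segments
}
```

A real implementation would typically also merge segments separated by very short gaps and drop segments below a minimum duration.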
|
|
## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

1. **PreEncoder** (model0) — Runs `pre_encode` on the mel chunk, concatenates with speaker cache and FIFO state
2. **Head** (model1) — Full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
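That host-side state update can be sketched roughly as follows. This is a pure-Swift illustration under stated assumptions: the type names are invented, and the cache here simply drops its oldest frames when full, whereas the actual Sortformer speaker cache is compressed using speaker-activity scores:

```swift
// Simplified streaming state: one 512-dim embedding per post-subsampling frame.
typealias Frame = [Float]

struct StreamingState {
    var spkcache: [Frame] = []   // long-term state, up to 188 frames
    var fifo: [Frame] = []       // short-term buffer, up to 40 frames
    let spkcacheCapacity = 188
    let fifoCapacity = 40

    // Append the chunk's pre-encoder embeddings to the FIFO; frames that
    // overflow the FIFO rotate into the speaker cache. When the cache is
    // full, drop its oldest frames (illustrative only — the real model
    // compresses the cache by speaker-activity scores instead).
    mutating func update(with chunkEmbs: [Frame]) {
        fifo.append(contentsOf: chunkEmbs)
        if fifo.count > fifoCapacity {
            let overflow = fifo.count - fifoCapacity
            spkcache.append(contentsOf: fifo.prefix(overflow))
            fifo.removeFirst(overflow)
        }
        if spkcache.count > spkcacheCapacity {
            spkcache.removeFirst(spkcache.count - spkcacheCapacity)
        }
    }
}
```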
|
|
## License

CC-BY-4.0
|
|