---
license: cc-by-4.0
tags:
- speaker-diarization
- coreml
- apple-silicon
- neural-engine
- sortformer
datasets:
- voxconverse
language:
- en
pipeline_tag: audio-classification
---

# Sortformer Diarization (CoreML)

CoreML port of [NVIDIA Sortformer](https://arxiv.org/abs/2409.06656) for end-to-end streaming speaker diarization on Apple Silicon, running on the Neural Engine. There is no separate embedding extraction or clustering step — the model directly predicts per-frame speaker activity for up to 4 speakers.

## Model Details

- **Architecture**: Sortformer (Sort Loss + 17-layer FastConformer encoder + 18-layer Transformer)
- **Base model**: `nvidia/diar_streaming_sortformer_4spk-v2.1` (117M params)
- **Task**: Speaker diarization (up to 4 speakers)
- **Input**: 128-dim log-mel features, streamed in chunks
- **Output**: Per-frame speaker activity probabilities (sigmoid)
- **Format**: CoreML `.mlmodelc` (compiled pipeline, 2 sub-models)
- **Size**: ~230 MB

## Streaming Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6 s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |

## Input/Output Shapes

**Inputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `chunk` | `[1, 112, 128]` | Mel features for current chunk |
| `chunk_lengths` | `[1]` | Valid frames in chunk |
| `spkcache` | `[1, 188, 512]` | Speaker cache state |
| `spkcache_lengths` | `[1]` | Valid entries in speaker cache |
| `fifo` | `[1, 40, 512]` | FIFO buffer state |
| `fifo_lengths` | `[1]` | Valid entries in FIFO |

**Outputs:**

| Tensor | Shape | Description |
|--------|-------|-------------|
| `speaker_preds_out` | `[1, 242, 4]` | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | `[1, 14, 512]` | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | `[1]` | Valid embedding frames |

## Usage

Used by [speech-swift](https://github.com/soniqo/speech-swift) for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```

## Pipeline Architecture

The model is a CoreML pipeline with two sub-models:

1. **PreEncoder** (model0) — runs `pre_encode` on the mel chunk and concatenates the result with the speaker cache and FIFO state
2. **Head** (model1) — full FastConformer encoder + Transformer + sigmoid speaker heads

State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.

## License

CC-BY-4.0
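
## Notes on Output Shape

The `[1, 242, 4]` shape of `speaker_preds_out` follows from the state sizes above: 188 speaker-cache frames + 40 FIFO frames + 14 chunk frames (112 mel frames divided by the subsampling factor of 8). A minimal Swift sketch of this bookkeeping — slicing out the newest chunk's predictions and converting a frame index to seconds. The constants come from the tables above; the function name `chunkPredictions` is illustrative, not part of the speech-swift API:

```swift
// Frame-layout arithmetic for speaker_preds_out [1, 242, 4].
// Constants are taken from the model card; names here are illustrative only.
let spkCacheFrames = 188               // speaker cache length
let fifoFrames = 40                    // FIFO length
let chunkFrames = 112 / 8              // 112 mel frames / subsampling factor 8 = 14
let totalFrames = spkCacheFrames + fifoFrames + chunkFrames  // 242

// One pre-encoder frame covers (subsampling * hop / sampleRate) seconds of audio.
let frameDuration = Double(8 * 160) / 16_000.0               // 0.08 s

// Slice the newest chunk's rows out of a flattened [242][4] prediction grid.
func chunkPredictions(_ preds: [[Float]]) -> [[Float]] {
    Array(preds[(spkCacheFrames + fifoFrames)...])
}
```

The same arithmetic gives segment timestamps: a speaker-active run of pre-encoder frames `[i, j)` within the current chunk spans `Double(i) * frameDuration` to `Double(j) * frameDuration` seconds, offset by the chunk's position in the stream.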