# CoreML Speech Models

Part of a collection of speech AI models (ASR, TTS, VAD, diarization) ported to CoreML for the Apple Neural Engine on iOS/macOS.
CoreML port of NVIDIA Sortformer for end-to-end streaming speaker diarization on Apple Silicon.
Runs on the Neural Engine via CoreML. There is no separate embedding extraction or clustering step: the model directly predicts per-frame speaker activity for up to 4 speakers.
**Source model:** nvidia/diar_streaming_sortformer_4spk-v2.1 (117M params)
**Format:** `.mlmodelc` (compiled pipeline, 2 sub-models)

| Parameter | Value |
|---|---|
| Sample rate | 16 kHz |
| Mel bins | 128 |
| n_fft | 400 |
| Hop length | 160 |
| Chunk length | 6s |
| Left context | 1 chunk |
| Right context | 7 chunks |
| Subsampling factor | 8 |
| Speaker cache length | 188 frames |
| FIFO length | 40 frames |
| Max speakers | 4 |
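As a sanity check, the streaming dimensions in the table compose as follows (a small sketch; the variable names are illustrative, not part of the model's API):

```swift
// Frame bookkeeping implied by the parameters above (names are illustrative).
let spkcacheFrames = 188          // speaker cache length
let fifoFrames = 40               // FIFO length
let chunkMelFrames = 112          // mel frames per chunk input
let subsamplingFactor = 8

// The pre-encoder subsamples the chunk's mel frames by 8x:
let chunkEmbFrames = chunkMelFrames / subsamplingFactor        // 14

// Predictions cover speaker cache + FIFO + current chunk embeddings:
let predFrames = spkcacheFrames + fifoFrames + chunkEmbFrames  // 242
print(chunkEmbFrames, predFrames)
```

These totals match the `[1, 14, 512]` chunk embedding and `[1, 242, 4]` prediction shapes listed below.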
Inputs:

| Tensor | Shape | Description |
|---|---|---|
| `chunk` | [1, 112, 128] | Mel features for current chunk |
| `chunk_lengths` | [1] | Valid frames in chunk |
| `spkcache` | [1, 188, 512] | Speaker cache state |
| `spkcache_lengths` | [1] | Valid entries in speaker cache |
| `fifo` | [1, 40, 512] | FIFO buffer state |
| `fifo_lengths` | [1] | Valid entries in FIFO |
Outputs:

| Tensor | Shape | Description |
|---|---|---|
| `speaker_preds_out` | [1, 242, 4] | Speaker activity probabilities |
| `chunk_pre_encoder_embs_out` | [1, 14, 512] | Chunk embeddings for state update |
| `chunk_pre_encoder_lengths_out` | [1] | Valid embedding frames |
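A hypothetical post-processing sketch (the helper name and layout are assumptions, not part of the model card's API): per-frame speaker probabilities from `speaker_preds_out` can be turned into active-speaker indices by thresholding.

```swift
// Hypothetical helper: convert per-frame speaker probabilities into
// active speaker indices by thresholding. `preds` would be the
// speaker_preds_out tensor reshaped to [frames][4].
func activeSpeakers(preds: [[Float]], threshold: Float = 0.5) -> [[Int]] {
    preds.map { frame in
        frame.enumerated()
            .filter { $0.element >= threshold }
            .map { $0.offset }
    }
}

let frames: [[Float]] = [[0.9, 0.1, 0.0, 0.0],   // speaker 0 active
                         [0.6, 0.7, 0.0, 0.0]]   // overlap: speakers 0 and 1
print(activeSpeakers(preds: frames))
```

Because the prediction window spans cache, FIFO, and the new chunk, only the final 14 frames of `speaker_preds_out` correspond to newly processed audio.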
Used by speech-swift for speaker diarization:

```bash
audio diarize meeting.wav --engine sortformer
```

```swift
let diarizer = try await SortformerDiarizer.fromPretrained()
let result = diarizer.diarize(audio: samples, sampleRate: 16000)
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTime)s - \(segment.endTime)s")
}
```
The model is a CoreML pipeline with two sub-models. The first runs `pre_encode` on the mel chunk and concatenates the result with the speaker cache and FIFO state. State management (FIFO rotation, speaker cache compression) is handled in Swift outside the model.
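The Swift-side FIFO rotation could look roughly like this (an illustrative sketch only; the real speech-swift logic, including how evicted frames are compressed into the speaker cache, may differ):

```swift
// Illustrative FIFO rotation: append new chunk embeddings, and return any
// frames that overflow the 40-frame capacity so the caller can fold them
// into the speaker cache (the compression step itself is not shown).
struct StreamingState {
    var fifo: [[Float]] = []
    let fifoCapacity = 40

    mutating func pushChunk(_ embeddings: [[Float]]) -> [[Float]] {
        fifo.append(contentsOf: embeddings)
        let overflow = max(0, fifo.count - fifoCapacity)
        let evicted = Array(fifo.prefix(overflow))
        fifo.removeFirst(overflow)
        return evicted
    }
}

var state = StreamingState()
// Each chunk contributes 14 pre-encoder embedding frames of width 512.
let chunkEmbs = Array(repeating: [Float](repeating: 0, count: 512), count: 14)
for _ in 0..<3 { _ = state.pushChunk(chunkEmbs) }
let evicted = state.pushChunk(chunkEmbs)
print(state.fifo.count, evicted.count)
```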
License: CC-BY-4.0