Prepping for new version of the model
README.md CHANGED
@@ -19,106 +19,73 @@ pipeline_tag: voice-activity-detection
[](https://discord.gg/WNsvaCtmDe)
[](https://github.com/FluidInference/FluidAudio)
-
-## Model Description
-
-This repository contains CoreML-optimized speaker diarization models specifically converted and optimized for Apple devices (macOS 13.0+, iOS 16.0+). These models enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.
## Usage
See the SDK for more details [https://github.com/FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)
-###
-
-```swift
-dependencies: [
-    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
-],
-```
-
-```swift
-import FluidAudio
-
-Task {
-    let diarizer = DiarizerManager()
-    try await diarizer.initialize()
-
-    let audioSamples: [Float] = // your 16kHz audio
-    let result = try await diarizer.performCompleteDiarization(
-        audioSamples,
-        sampleRate: 16000
-    )
-
-    for segment in result.segments {
-        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
-    }
-}
-```
-
-```swift
-import CoreML
-
-let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())
-
-let input = SpeakerDiarizationModelInput(audioSamples: audioArray)
-
-let output = try! model.prediction(input: input)
-```
-
-## Acknowledgments
-
-These CoreML models are based on excellent work from:
-
-wespeaker - Speaker embedding techniques
-
-### Key Features
-
-- **Apple Neural Engine Optimized**: Zero performance trade-offs with maximum efficiency
-- **Real-time Processing**: RTF of 0.02x (50x faster than real-time)
-- **Research-Competitive**: DER of 17.7% on AMI benchmark
-- **Power Efficient**: Designed for maximum performance per watt
-- **Privacy-First**: All processing happens on-device
-
-- Optimized for 16kHz audio input
-- Best performance with clear audio (no heavy background noise)
-- May struggle with heavily overlapping speech
-- Requires Apple devices with CoreML support
-
-## Training Data
-
-- Multi-speaker conversation datasets
-- Various acoustic conditions
-- Multiple languages and accents
[](https://discord.gg/WNsvaCtmDe)
[](https://github.com/FluidInference/FluidAudio)
+Speaker diarization based on [pyannote](https://github.com/pyannote) models optimized for the Apple Neural Engine.
+
+The models are trained on acoustic signatures, so they support any language.
## Usage
See the SDK for more details [https://github.com/FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)
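For orientation, here is a minimal sketch of what calling the SDK looks like, based on the example previously in this README (names such as `DiarizerManager` and `performCompleteDiarization` are taken from that earlier example and may differ in the current SDK):

```swift
import FluidAudio

// Sketch based on the earlier README example; check the FluidAudio
// repository for the current API surface before relying on it.
Task {
    let diarizer = DiarizerManager()
    try await diarizer.initialize()

    // Stand-in input: 1 second of silence. Use real 16 kHz mono samples.
    let audioSamples = [Float](repeating: 0, count: 16_000)

    let result = try await diarizer.performCompleteDiarization(
        audioSamples,
        sampleRate: 16000
    )

    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
```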
+### Technical Specifications
+
+- **Input**: 16kHz mono audio
+- **Output**: Speaker segments with timestamps and IDs
+- **Framework**: CoreML (converted from PyTorch)
+- **Optimization**: Apple Neural Engine (ANE) optimized operations
+- **Precision**: FP32 on CPU/GPU, FP16 on ANE
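As a sketch of how these specifications surface in code, this is one way to load a compiled model from this repository with an explicit compute-unit preference (the `.mlmodelc` path below is a placeholder, not an actual file name from this repo):

```swift
import CoreML
import Foundation

// Let CoreML schedule work on the ANE (FP16) where possible,
// falling back to GPU/CPU (FP32) otherwise.
let config = MLModelConfiguration()
config.computeUnits = .all

// Placeholder path; substitute one of the compiled models in this repo.
let modelURL = URL(fileURLWithPath: "path/to/Model.mlmodelc")

do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    print("Loaded model: \(model.modelDescription)")
} catch {
    print("Failed to load model: \(error)")
}
```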
+
+## Performance
+
+See the [original model](https://huggingface.co/pyannote/speaker-diarization-community-1) for detailed DER benchmarks; for our conversion, we tried to match the original model as closely as possible:
+
+The CoreML models exhibit a ~10x speedup on CPU and a ~20x speedup on GPU.
+
+
+
+Due to the different precisions, there are minor differences in the generated values. The differences are mostly negligible, though they do account for some errors that need to be adjusted for during clustering:
+
+
+
+We see this when running the end-to-end pipeline with the PyTorch model versus the Core ML model (we patched the pyannote pipeline to run the Core ML model instead):
+
+
+
+## Citations (from original model)
+
+1. Speaker segmentation model
+
+```bibtex
+@inproceedings{Plaquet23,
+  author={Alexis Plaquet and Hervé Bredin},
+  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+  year={2023},
+  booktitle={Proc. INTERSPEECH 2023},
+}
+```
+
+2. Speaker embedding model
+
+```bibtex
+@inproceedings{Wang2023,
+  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
+  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
+  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  pages={1--5},
+  year={2023},
+  organization={IEEE}
+}
+```
+
+3. Speaker clustering
+
+```bibtex
+@article{Landini2022,
+  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
+  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
+  year={2022},
+  journal={Computer Speech \& Language},
+}
+```
|