---
license: apache-2.0
tags:
  - speech
  - audio
  - voice
  - speaker-diarization
  - speaker-change-detection
  - coreml
  - speaker-segmentation
base_model:
  - pyannote/speaker-diarization-3.1
  - pyannote/wespeaker-voxceleb-resnet34-LM
---

# Speaker Diarization CoreML Models

State-of-the-art speaker diarization models optimized for the Apple Neural Engine, powering real-time on-device speaker separation with research-competitive performance. The models work with any language, since they are trained on acoustic signatures rather than linguistic content.

## Model Description

This repository contains CoreML speaker diarization models converted and optimized for Apple devices (macOS 13.0+, iOS 16.0+). They enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.

## Usage

See the FluidAudio SDK for full documentation: [https://github.com/FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)

### With FluidAudio SDK (Recommended)

Add FluidAudio to your project using Swift Package Manager:

```
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
],
```

```swift
import FluidAudio

Task {
    let diarizer = DiarizerManager()
    try await diarizer.initialize()

    let audioSamples: [Float] = [] // replace with your 16 kHz mono audio
    let result = try await diarizer.performCompleteDiarization(
        audioSamples,
        sampleRate: 16000
    )

    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
```

### Direct CoreML Usage

```swift
import CoreML

// Load the model
let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())

// Prepare input (16 kHz audio)
let input = SpeakerDiarizationModelInput(audioSamples: audioArray)

// Run inference
let output = try! model.prediction(input: input)
```

## Acknowledgments

These CoreML models are based on excellent work from:

- **sherpa-onnx**: foundational diarization algorithms
- **pyannote-audio**: state-of-the-art diarization research
- **wespeaker**: speaker embedding techniques

## Key Features

- **Apple Neural Engine Optimized**: converted with ANE-friendly operations for maximum efficiency
- **Real-Time Processing**: real-time factor (RTF) of 0.02x, roughly 50x faster than real time
- **Research-Competitive**: 17.7% DER on the AMI benchmark
- **Power Efficient**: designed for maximum performance per watt
- **Privacy-First**: all processing happens on-device

## Intended Uses & Limitations

### Intended Uses

- **Meeting Transcription**: real-time speaker identification in meetings
- **Voice Assistants**: multi-speaker conversation understanding
- **Media Production**: automated speaker labeling for podcasts and interviews
- **Research**: academic research in speaker diarization
- **Privacy-Focused Applications**: on-device processing without cloud dependencies

### Limitations

- Optimized for 16 kHz audio input
- Performs best on clear audio without heavy background noise
- May struggle with heavily overlapping speech
- Requires Apple devices with CoreML support

## Technical Specifications

- **Input**: 16 kHz mono audio (see the resampling sketch below)
- **Output**: speaker segments with timestamps and speaker IDs
- **Framework**: CoreML (converted from PyTorch)
- **Optimization**: Apple Neural Engine (ANE) optimized operations
- **Precision**: FP32 on CPU/GPU, FP16 on ANE
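The models expect a plain `[Float]` buffer of 16 kHz mono samples, so audio captured or stored at other rates has to be resampled first. Below is a minimal sketch using AVFoundation's `AVAudioConverter`; the `load16kMonoSamples` helper is a hypothetical name for illustration, not part of FluidAudio or this model package.

```swift
import AVFoundation

struct AudioConversionError: Error {}

/// Hypothetical helper (not part of the SDK): load an audio file and
/// resample it to the 16 kHz mono Float32 samples the models expect.
func load16kMonoSamples(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                     sampleRate: 16_000,
                                     channels: 1,
                                     interleaved: false)!

    // Read the whole file into a buffer in its native format.
    guard let inBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                          frameCapacity: AVAudioFrameCount(file.length)) else {
        throw AudioConversionError()
    }
    try file.read(into: inBuffer)

    // Convert sample rate and channel count in one pass.
    guard let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat) else {
        throw AudioConversionError()
    }
    let ratio = targetFormat.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(inBuffer.frameLength) * ratio) + 1
    guard let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else {
        throw AudioConversionError()
    }

    // Feed the input buffer once, then signal end of stream.
    var delivered = false
    let status = converter.convert(to: outBuffer, error: nil) { _, inputStatus in
        if delivered {
            inputStatus.pointee = .endOfStream
            return nil
        }
        delivered = true
        inputStatus.pointee = .haveData
        return inBuffer
    }
    guard status != .error else { throw AudioConversionError() }

    // Copy the single mono channel out as a plain [Float].
    let channel = outBuffer.floatChannelData![0]
    return Array(UnsafeBufferPointer(start: channel, count: Int(outBuffer.frameLength)))
}
```

The resulting array can then be passed as the `audioSamples` argument in the SDK example above.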
## Training Data

These models are converted from open-source variants trained on diverse speaker diarization datasets. The original models were trained on:

- Multi-speaker conversation datasets
- Various acoustic conditions
- Multiple languages and accents

*Note: Specific training data details depend on the original open-source model variant.*