bweng committed on
Commit f63da86 · verified · 1 Parent(s): 9891012

Prepping for new version of the model

Files changed (1)
  1. README.md +45 -78
README.md CHANGED
@@ -19,106 +19,73 @@ pipeline_tag: voice-activity-detection
  [![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da.svg)](https://discord.gg/WNsvaCtmDe)
  [![GitHub Repo stars](https://img.shields.io/github/stars/FluidInference/FluidAudio?style=flat&logo=github)](https://github.com/FluidInference/FluidAudio)

- State-of-the-art speaker diarization models optimized for Apple Neural Engine, powering real-time on-device speaker separation with research-competitive performance.

- Supports any language; the models are trained on acoustic signatures.
-
- ## Model Description
-
- This repository contains CoreML-optimized speaker diarization models specifically converted and optimized for Apple devices (macOS 13.0+, iOS 16.0+). These models enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.

  ## Usage

  See the SDK for more details: [https://github.com/FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)

- ### With FluidAudio SDK (Recommended)
-
- **Installation**
- Add FluidAudio to your project using Swift Package Manager:
-
- ```swift
- dependencies: [
-     .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
- ],
- ```

- ```swift
- import FluidAudio
-
- Task {
-     let diarizer = DiarizerManager()
-     try await diarizer.initialize()
-
-     let audioSamples: [Float] = [] // your 16kHz audio samples
-     let result = try await diarizer.performCompleteDiarization(
-         audioSamples,
-         sampleRate: 16000
-     )
-
-     for segment in result.segments {
-         print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
-     }
- }
- ```

- ### Direct CoreML Usage
-
- ```swift
- import CoreML
-
- // Load the model (Xcode-generated class for the .mlmodel)
- let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())
-
- // Prepare input (16kHz audio samples)
- let input = SpeakerDiarizationModelInput(audioSamples: audioArray)
-
- // Run inference
- let output = try! model.prediction(input: input)
- ```

- ## Acknowledgments
- These CoreML models are based on excellent work from:
-
- - sherpa-onnx: foundational diarization algorithms
- - pyannote-audio: state-of-the-art diarization research
- - wespeaker: speaker embedding techniques

- ### Key Features
- - **Apple Neural Engine Optimized**: Zero performance trade-offs with maximum efficiency
- - **Real-time Processing**: RTF of 0.02x (50x faster than real-time)
- - **Research-Competitive**: DER of 17.7% on the AMI benchmark
- - **Power Efficient**: Designed for maximum performance per watt
- - **Privacy-First**: All processing happens on-device

- ## Intended Uses & Limitations
-
- ### Intended Uses
- - **Meeting Transcription**: Real-time speaker identification in meetings
- - **Voice Assistants**: Multi-speaker conversation understanding
- - **Media Production**: Automated speaker labeling for podcasts/interviews
- - **Research**: Academic research in speaker diarization
- - **Privacy-Focused Applications**: On-device processing without cloud dependencies
-
- ### Limitations
- - Optimized for 16kHz audio input
- - Best performance with clear audio (no heavy background noise)
- - May struggle with heavily overlapping speech
- - Requires Apple devices with CoreML support
-
- ### Technical Specifications
- - **Input**: 16kHz mono audio
- - **Output**: Speaker segments with timestamps and IDs
- - **Framework**: CoreML (converted from PyTorch)
- - **Optimization**: Apple Neural Engine (ANE) optimized operations
- - **Precision**: FP32 on CPU/GPU, FP16 on ANE
-
- ## Training Data
-
- These models are converted from open-source variants trained on diverse speaker diarization datasets. The original models were trained on:
- - Multi-speaker conversation datasets
- - Various acoustic conditions
- - Multiple languages and accents
-
- *Note: Specific training data details depend on the original open-source model variant.*

  [![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da.svg)](https://discord.gg/WNsvaCtmDe)
  [![GitHub Repo stars](https://img.shields.io/github/stars/FluidInference/FluidAudio?style=flat&logo=github)](https://github.com/FluidInference/FluidAudio)

+ Speaker diarization based on [pyannote](https://github.com/pyannote) models, optimized for the Apple Neural Engine.

+ The models are trained on acoustic signatures, so they support any language.

  ## Usage

  See the SDK for more details: [https://github.com/FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)
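+ As a quick orientation, the sketch below follows the usage example from the previous version of this README; the current SDK surface may differ, so treat `DiarizerManager`, `performCompleteDiarization`, and the result fields as illustrative rather than authoritative:

+ ```swift
+ import FluidAudio
+
+ Task {
+     // API names follow the earlier README example and may have changed.
+     let diarizer = DiarizerManager()
+     try await diarizer.initialize()
+
+     let audioSamples: [Float] = [] // your 16kHz mono samples
+     let result = try await diarizer.performCompleteDiarization(
+         audioSamples,
+         sampleRate: 16000
+     )
+
+     for segment in result.segments {
+         print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
+     }
+ }
+ ```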

+ ### Technical Specifications
+ - **Input**: 16kHz mono audio
+ - **Output**: Speaker segments with timestamps and IDs
+ - **Framework**: CoreML (converted from PyTorch)
+ - **Optimization**: Apple Neural Engine (ANE) optimized operations
+ - **Precision**: FP32 on CPU/GPU, FP16 on ANE (see the sketch below)
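
+ For direct Core ML loading, the compute units (and hence the FP16 ANE path versus the FP32 CPU/GPU paths) can be requested via `MLModelConfiguration`. A minimal sketch, assuming a hypothetical compiled model named `SpeakerEmbedding.mlmodelc` (substitute the actual file shipped in this repo):

+ ```swift
+ import CoreML
+
+ // .all lets Core ML schedule work on the ANE (FP16) and fall back to
+ // GPU/CPU (FP32); use .cpuOnly or .cpuAndGPU to force the FP32 paths.
+ let config = MLModelConfiguration()
+ config.computeUnits = .all
+
+ // "SpeakerEmbedding.mlmodelc" is a placeholder name for illustration.
+ if let url = Bundle.main.url(forResource: "SpeakerEmbedding", withExtension: "mlmodelc") {
+     do {
+         let model = try MLModel(contentsOf: url, configuration: config)
+         print("Loaded model:", model.modelDescription)
+     } catch {
+         print("Failed to load model:", error)
+     }
+ }
+ ```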
 
 
 
 

+ ## Performance

+ See the [original model](https://huggingface.co/pyannote/speaker-diarization-community-1) for detailed DER benchmarks. For our conversion, we tried to match the original model as closely as possible.

+ The CoreML models exhibit a ~10x speedup on CPU and a ~20x speedup on GPU.

+ ![plots/pipeline_timing.png](plots/pipeline_timing.png)

+ Due to the different precisions, there are minor differences in the generated values. These differences are mostly negligible, though they do account for some errors that need to be adjusted for during clustering:

+ ![plots/metrics_timeseries.png](plots/metrics_timeseries.png)

+ We see this when running the end-to-end pipeline with the PyTorch model versus the Core ML model (the pyannote pipeline was patched to run the Core ML model instead):
+ ![plots/pipeline_overview.png](plots/pipeline_overview.png)

+ ## Citations (from the original model)

+ 1. Speaker segmentation model

+ ```bibtex
+ @inproceedings{Plaquet23,
+   author={Alexis Plaquet and Hervé Bredin},
+   title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+   year={2023},
+   booktitle={Proc. INTERSPEECH 2023},
+ }
+ ```

+ 2. Speaker embedding model

+ ```bibtex
+ @inproceedings{Wang2023,
+   title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
+   author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
+   booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+   pages={1--5},
+   year={2023},
+   organization={IEEE}
+ }
+ ```

+ 3. Speaker clustering

+ ```bibtex
+ @article{Landini2022,
+   author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
+   title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
+   year={2022},
+   journal={Computer Speech \& Language},
+ }
+ ```