GradientDescent2718 committed (verified)
Commit 3c5e325 · Parent(s): 58f1fb3

Added model card info

Files changed (1): README.md (+179, −3)
README.md CHANGED
@@ -1,6 +1,182 @@
 ---
-license: mit
 language:
 - en
-pipeline_tag: voice-activity-detection
----
---
language:
- en
license: other
library_name: coreml
pipeline_tag: audio-classification
tags:
- speaker-diarization
- diarization
- coreml
- apple
- streaming
- audio
- ls-eend
- eend
pretty_name: LS-EEND CoreML Models
model-index:
- name: LS-EEND CoreML Models
  results: []
---

# LS-EEND CoreML Models

CoreML exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.

This repository contains non-quantized CoreML step models for four LS-EEND variants:

- `AMI`
- `CALLHOME`
- `DIHARD II`
- `DIHARD III`

These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.
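
As a rough illustration of this contract, a driver loop threads the state tensors through successive step calls. This is a hypothetical Python sketch with a placeholder `model_step` callable and invented shapes; the real invocation goes through coremltools (Python) or the CoreML framework (Swift), and `probs` stands in for the model's actual output name.

```python
import numpy as np

# Hypothetical sketch of the stateful step contract. "model_step", "probs",
# and all shapes here are placeholders, not the exported interface.
STATE_KEYS = [
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
]

def init_state(shapes):
    """Zero-initialized recurrent/cache tensors, one per state key."""
    return {k: np.zeros(shapes[k], dtype=np.float32) for k in STATE_KEYS}

def run_stream(model_step, features, shapes):
    """Run one step per feature frame, threading the state dict through."""
    state = init_state(shapes)
    outputs = []
    for frame in features:                       # features: (T, feat_dim)
        out = model_step(frame, state)           # one CoreML step call
        state = {k: out[k] for k in STATE_KEYS}  # carry state to next call
        outputs.append(out["probs"])             # per-slot speaker activity
    return np.stack(outputs)                     # (T, n_slots)
```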

## Included files

Each variant directory contains:

- `*.mlpackage`: the CoreML model package
- `*.json`: metadata needed by the runtime
- `*.mlmodelc`: a compiled CoreML bundle generated locally for convenience

Variant directories:

- `AMI/`
- `CALLHOME/`
- `DIHARD II/`
- `DIHARD III/`

## Variants

| Variant | Package | Configured max speakers | Model output capacity |
| --- | --- | ---: | ---: |
| AMI | `AMI/ls_eend_ami_step.mlpackage` | 4 | 6 |
| CALLHOME | `CALLHOME/ls_eend_callhome_step.mlpackage` | 7 | 9 |
| DIHARD II | `DIHARD II/ls_eend_dih2_step.mlpackage` | 10 | 12 |
| DIHARD III | `DIHARD III/ls_eend_dih3_step.mlpackage` | 10 | 12 |

The metadata JSON distinguishes between:

- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
- `max_nspks`: the exported model's full decode/output capacity

## Frontend and runtime assumptions

All four non-quantized exports in this repo use the same frontend settings:

- sample rate: `8000 Hz`
- window length: `200` samples
- hop length: `80` samples
- FFT size: `1024`
- mel bins: `23`
- context receptive field: `7`
- subsampling: `10`
- feature type: `logmel23_cummn`
- output frame rate: `10 Hz`
- compute precision: `float32`
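
The timing implied by these settings can be checked with a little arithmetic: at 8 kHz, an 80-sample hop gives 100 STFT frames per second, and subsampling by 10 yields the 10 Hz output rate.

```python
# Timing implied by the frontend settings above (pure arithmetic).
sample_rate = 8000      # Hz
win_length = 200        # samples -> 25 ms analysis window
hop_length = 80         # samples -> 10 ms between STFT frames
subsampling = 10        # model-side frame subsampling

stft_frame_hz = sample_rate / hop_length        # 100.0 STFT frames/s
output_frame_hz = stft_frame_hz / subsampling   # 10.0 model steps/s
step_duration_ms = 1000.0 / output_frame_hz     # 100.0 ms of audio per step

print(stft_frame_hz, output_frame_hz, step_duration_ms)  # 100.0 10.0 100.0
```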

These are step-wise streaming models. A runtime must maintain and feed the recurrent state tensors between calls:

- `enc_ret_kv`
- `enc_ret_scale`
- `enc_conv_cache`
- `dec_ret_kv`
- `dec_ret_scale`
- `top_buffer`

The CoreML inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.

## Intended usage

Use these packages with a runtime that:

1. Resamples audio to mono `8 kHz`
2. Extracts LS-EEND features with the settings above
3. Preserves model state across step calls
4. Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
5. Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the CoreML graph
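
Step 5 might look roughly like the following NumPy sketch. The 0.5 threshold, 11-frame median width, and the particular RTTM field layout are illustrative defaults, not values mandated by the exported models.

```python
import numpy as np

# Illustrative postprocessing for raw per-slot outputs; threshold and filter
# width are hypothetical defaults, not values from the exported models.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def median_filter_1d(x, width=11):
    """Odd-width running median along the time axis of a (T, n) array."""
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([np.median(padded[t:t + width], axis=0)
                     for t in range(x.shape[0])])

def to_rttm(active, frame_hz=10.0, uri="audio"):
    """Convert a binary (T, n_spk) activity matrix to RTTM SPEAKER lines."""
    lines = []
    for spk in range(active.shape[1]):
        on = None
        col = np.append(active[:, spk], 0)  # sentinel closes a trailing run
        for t, v in enumerate(col):
            if v and on is None:
                on = t
            elif not v and on is not None:
                start, dur = on / frame_hz, (t - on) / frame_hz
                lines.append(f"SPEAKER {uri} 1 {start:.2f} {dur:.2f} "
                             f"<NA> <NA> spk{spk} <NA> <NA>")
                on = None
    return lines

logits = np.random.randn(100, 4)        # (T, slots) raw model output
probs = median_filter_1d(sigmoid(logits))
active = probs > 0.5                    # per-frame speaker activity
```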

This repository is not a drop-in replacement for generic Hugging Face `transformers` inference. It is meant for custom CoreML runtimes, such as:

- the Python LS-EEND CoreML runtime from the FS-EEND project
- the Swift/macOS runtime used for the LS-EEND CoreML microphone demo

## Minimal metadata example

Each variant ships a sidecar JSON with fields like:

```json
{
  "sample_rate": 8000,
  "win_length": 200,
  "hop_length": 80,
  "n_fft": 1024,
  "n_mels": 23,
  "context_recp": 7,
  "subsampling": 10,
  "feat_type": "logmel23_cummn",
  "frame_hz": 10.0,
  "max_speakers": 10,
  "max_nspks": 12
}
```

Check the variant-specific `*.json` file for the exact state tensor shapes and output dimensions.
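
A runtime could validate the sidecar before wiring up the model. This sketch assumes only the fields shown in the example above; the consistency checks are our own additions, not requirements imposed by the exporter.

```python
import json

# Example sidecar payload (the fields from the snippet above).
EXAMPLE = """{
  "sample_rate": 8000, "win_length": 200, "hop_length": 80, "n_fft": 1024,
  "n_mels": 23, "context_recp": 7, "subsampling": 10,
  "feat_type": "logmel23_cummn", "frame_hz": 10.0,
  "max_speakers": 10, "max_nspks": 12
}"""

REQUIRED = [
    "sample_rate", "win_length", "hop_length", "n_fft", "n_mels",
    "context_recp", "subsampling", "feat_type", "frame_hz",
    "max_speakers", "max_nspks",
]

def validate_metadata(text):
    """Parse sidecar JSON and check the fields a streaming runtime needs."""
    meta = json.loads(text)
    missing = [k for k in REQUIRED if k not in meta]
    if missing:
        raise ValueError(f"metadata missing fields: {missing}")
    if meta["max_nspks"] < meta["max_speakers"]:
        raise ValueError("output capacity below configured speaker count")
    return meta

meta = validate_metadata(EXAMPLE)
```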

## Source project

These CoreML exports were produced from the LS-EEND code in the FS-EEND repository:

- GitHub: [Audio-WestlakeU/FS-EEND](https://github.com/Audio-WestlakeU/FS-EEND)

The export path is based on the LS-EEND CoreML step exporter and variant batch exporter in that project.

## Training and evaluation context

From the source project, the reported real-world diarization error rates are:

| Dataset | DER (%) |
| --- | ---: |
| CALLHOME | 12.11 |
| DIHARD II | 27.58 |
| DIHARD III | 19.61 |
| AMI Dev | 20.97 |
| AMI Eval | 20.76 |

These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.

## Limitations

- These models are exported for Apple CoreML runtimes, not for PyTorch or ONNX consumers.
- They are stateful streaming step models, so they require a custom driver loop.
- They assume an 8 kHz LS-EEND frontend and will not produce matching results if you use a different spectrogram pipeline.
- Speaker identities are output as activity tracks/slots and still require downstream diarization postprocessing and speaker-slot alignment where appropriate.

## License and dataset constraints

Please verify the licensing and access conditions for:

- the upstream FS-EEND / LS-EEND code and weights
- AMI
- CALLHOME
- DIHARD II
- DIHARD III

This repository only redistributes CoreML exports of the LS-EEND model variants. Dataset usage rights remain governed by the original datasets and their terms.

## Citation

If you use LS-EEND, cite the original paper:

```bibtex
@ARTICLE{11122273,
  author={Liang, Di and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
  year={2025},
  volume={33},
  pages={3568-3581},
  doi={10.1109/TASLPRO.2025.3597446}
}
```