---
language:
- en
license: mit
library_name: coreml
pipeline_tag: audio-classification
tags:
- speaker-diarization
- diarization
- coreml
- apple
- streaming
- audio
- ls-eend
- eend
pretty_name: LS-EEND CoreML Models
model-index:
- name: LS-EEND CoreML Models
results: []
---
# LS-EEND CoreML Models
CoreML exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.
This repository contains non-quantized CoreML step models for four LS-EEND variants:
- `AMI`
- `CALLHOME`
- `DIHARD II`
- `DIHARD III`
These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.
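Before building a driver, it can help to confirm the step model's interface. A minimal sketch with `coremltools` (assuming a macOS Python environment; the AMI path comes from the variants table below):

```python
# Load one step package and print its input/output names and types.
import coremltools as ct

model = ct.models.MLModel("AMI/ls_eend_ami_step.mlpackage")
spec = model.get_spec()

print("Inputs:")
for inp in spec.description.input:
    print(" ", inp.name, inp.type.WhichOneof("Type"))
print("Outputs:")
for out in spec.description.output:
    print(" ", out.name, out.type.WhichOneof("Type"))
```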
## Included files
Each variant directory contains:
- `*.mlpackage`: the CoreML model package
- `*.json`: metadata needed by the runtime
- `*.mlmodelc`: a compiled CoreML bundle generated locally for convenience
Variant directories:
- `AMI/`
- `CALLHOME/`
- `DIHARD II/`
- `DIHARD III/`
## Variants
| Variant | Package | Configured max speakers | Model output capacity |
| --- | --- | ---: | ---: |
| AMI | `AMI/ls_eend_ami_step.mlpackage` | 4 | 6 |
| CALLHOME | `CALLHOME/ls_eend_callhome_step.mlpackage` | 7 | 9 |
| DIHARD II | `DIHARD II/ls_eend_dih2_step.mlpackage` | 10 | 12 |
| DIHARD III | `DIHARD III/ls_eend_dih3_step.mlpackage` | 10 | 12 |
The metadata JSON distinguishes between:
- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
- `max_nspks`: the exported model's full decode/output capacity
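In practice a runtime decodes all `max_nspks` slots but would typically keep only the configured `max_speakers` before postprocessing. A sketch (the array name and shape here are illustrative, not the model's actual output names):

```python
import numpy as np

max_speakers = 10   # dataset/config setting from the metadata JSON
max_nspks = 12      # exported decode/output capacity

# Illustrative step output of shape (frames, max_nspks); drop the surplus
# slots so only the configured speakers are reported downstream.
step_logits = np.zeros((1, max_nspks), dtype=np.float32)
step_logits = step_logits[:, :max_speakers]
```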
## Frontend and runtime assumptions
All four non-quantized exports in this repo use the same frontend settings:
- sample rate: `8000 Hz`
- window length: `200` samples
- hop length: `80` samples
- FFT size: `1024`
- mel bins: `23`
- context receptive field: `7`
- subsampling: `10`
- feature type: `logmel23_cummn`
- output frame rate: `10 Hz`
- compute precision: `float32`
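As an approximation of these settings, a `logmel23_cummn`-style frontend can be sketched with `librosa`. This is not the reference extractor: STFT centering, mel scale, log floor, and padding conventions may differ from the Python/Swift runtimes, so treat it as a shape-and-timing illustration only.

```python
import librosa
import numpy as np

def logmel23_cummn(wav: np.ndarray, sr: int = 8000) -> np.ndarray:
    assert sr == 8000, "these models assume an 8 kHz mono frontend"
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, win_length=200, hop_length=80, n_mels=23
    )
    logmel = np.log(mel.T + 1e-10)  # (frames, 23) at 100 Hz
    # Cumulative mean normalization: subtract the running mean over time.
    cum_mean = np.cumsum(logmel, axis=0) / np.arange(1, len(logmel) + 1)[:, None]
    feats = logmel - cum_mean
    # Context stacking (receptive field 7 => +/-3 frames), then subsample by 10.
    ctx = 3
    padded = np.pad(feats, ((ctx, ctx), (0, 0)), mode="edge")
    stacked = np.stack(
        [padded[i : i + len(feats)] for i in range(2 * ctx + 1)], axis=1
    ).reshape(len(feats), -1)  # (frames, 7 * 23)
    return stacked[::10]       # 100 Hz -> 10 Hz output frames
```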
These are step-wise streaming models. A runtime must maintain and feed the recurrent state tensors between calls:
- `enc_ret_kv`
- `enc_ret_scale`
- `enc_conv_cache`
- `dec_ret_kv`
- `dec_ret_scale`
- `top_buffer`
The CoreML inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.
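A hedged driver-loop sketch in Python with `coremltools`: the six state names above are taken from this section, but the feature input name (`feat`) and the assumption that updated states are returned under the same names are placeholders; confirm the real names and shapes against the variant's metadata JSON and the model spec.

```python
import coremltools as ct
import numpy as np

STATE_NAMES = [
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
]

model = ct.models.MLModel("AMI/ls_eend_ami_step.mlpackage")
spec = model.get_spec()

# Zero-initialize every recurrent state from the shapes in the model spec.
state = {}
for inp in spec.description.input:
    if inp.name in STATE_NAMES:
        shape = [int(d) for d in inp.type.multiArrayType.shape]
        state[inp.name] = np.zeros(shape, dtype=np.float32)

def run_step(feat: np.ndarray, state: dict):
    """Run one LS-EEND step and carry the recurrent state forward."""
    out = model.predict({"feat": feat, **state})  # "feat": assumed input name
    # Assumes updated states come back under the same names; adjust if the
    # export uses distinct output names (see the variant JSON).
    new_state = {k: out[k] for k in STATE_NAMES if k in out}
    return out, new_state
```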
## Intended usage
Use these packages with a runtime that:
1. Resamples audio to mono `8 kHz`
2. Extracts LS-EEND features with the settings above
3. Preserves model state across step calls
4. Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
5. Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the CoreML graph (a sketch follows this list)
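A sketch of step 5, assuming raw per-slot logits at the 10 Hz output rate; the 0.5 threshold and 11-frame median kernel are illustrative defaults, not values prescribed by the upstream project:

```python
import numpy as np
from scipy.signal import medfilt

def to_rttm(logits: np.ndarray, uri: str, frame_hz: float = 10.0,
            threshold: float = 0.5, median: int = 11) -> list[str]:
    """Convert (frames, speakers) logits into RTTM SPEAKER lines."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    active = probs > threshold
    lines = []
    for spk in range(active.shape[1]):
        track = medfilt(active[:, spk].astype(float), median) > 0.5
        # Locate contiguous active runs and convert frame indices to seconds.
        edges = np.diff(np.concatenate(([0], track.astype(int), [0])))
        for on, off in zip(np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)):
            start, dur = on / frame_hz, (off - on) / frame_hz
            lines.append(f"SPEAKER {uri} 1 {start:.3f} {dur:.3f} "
                         f"<NA> <NA> spk{spk} <NA> <NA>")
    return lines
```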
This repository is not a drop-in replacement for generic Hugging Face `transformers` inference. It is meant for custom CoreML runtimes, such as:
- the Python LS-EEND CoreML runtime from the FS-EEND project
- the Swift/macOS runtime used for the LS-EEND CoreML microphone demo
## Minimal metadata example
Each variant ships a sidecar JSON with fields like:
```json
{
"sample_rate": 8000,
"win_length": 200,
"hop_length": 80,
"n_fft": 1024,
"n_mels": 23,
"context_recp": 7,
"subsampling": 10,
"feat_type": "logmel23_cummn",
"frame_hz": 10.0,
"max_speakers": 10,
"max_nspks": 12
}
```
Check the variant-specific `*.json` file for the exact state tensor shapes and output dimensions.
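The timing fields are mutually consistent: `sample_rate / hop_length` gives the 100 Hz STFT frame rate, and dividing by `subsampling` yields the 10 Hz output rate. A quick check (the JSON filename below is hypothetical; use the actual sidecar file in the variant directory):

```python
import json

with open("DIHARD III/ls_eend_dih3_step.json") as f:  # hypothetical filename
    meta = json.load(f)

stft_hz = meta["sample_rate"] / meta["hop_length"]  # 8000 / 80 = 100.0
out_hz = stft_hz / meta["subsampling"]              # 100 / 10 = 10.0
assert out_hz == meta["frame_hz"]
print(f"one output frame covers {1.0 / out_hz:.2f} s of audio")
```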
## Credits
- **Base model**: [LS-EEND](https://github.com/Audio-WestlakeU/FS-EEND) by Di Liang & Xiaofei Li (Westlake University). Paper: [LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction](https://arxiv.org/abs/2410.06670) (IEEE TASLP 2025). The original model is not hosted on Hugging Face; pretrained weights are available on [GitHub](https://github.com/Audio-WestlakeU/FS-EEND).
- **CoreML conversion**: [@GradientDescent2718](https://huggingface.co/GradientDescent2718). Original repo: [GradientDescent2718/ls-eend-coreml](https://huggingface.co/GradientDescent2718/ls-eend-coreml).
## Source project
These CoreML exports were produced from the LS-EEND code in the FS-EEND repository:
- GitHub: [Audio-WestlakeU/FS-EEND](https://github.com/Audio-WestlakeU/FS-EEND)
The export path is based on the LS-EEND CoreML step exporter and variant batch exporter in that project.
## Training and evaluation context
From the source project, the reported real-world diarization error rates are:
| Dataset | DER (%) |
| --- | ---: |
| CALLHOME | 12.11 |
| DIHARD II | 27.58 |
| DIHARD III | 19.61 |
| AMI Dev | 20.97 |
| AMI Eval | 20.76 |
These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.
## Limitations
- These models are exported for Apple CoreML runtimes, not for PyTorch or ONNX consumers.
- They are stateful streaming step models, so they require a custom driver loop.
- They assume an 8 kHz LS-EEND frontend and will not produce matching results if you use a different spectrogram pipeline.
- Speakers are emitted as anonymous activity tracks/slots, so downstream postprocessing and, where appropriate, speaker-slot alignment are still required to obtain final diarization output.
## License and dataset constraints
The upstream LS-EEND model/codebase used for these CoreML exports is MIT-licensed, and this repository is published as MIT accordingly.
The underlying evaluation and fine-tuning datasets still have their own access and usage terms:
- AMI
- CALLHOME
- DIHARD II
- DIHARD III
This repository redistributes CoreML exports of the LS-EEND model variants. Dataset licensing and access requirements remain governed by the original dataset providers.
## Citation
If you use LS-EEND, cite the original paper:
```bibtex
@ARTICLE{11122273,
author={Liang, Di and Li, Xiaofei},
journal={IEEE Transactions on Audio, Speech and Language Processing},
title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
year={2025},
volume={33},
pages={3568-3581},
doi={10.1109/TASLPRO.2025.3597446}
}
```