FluidInference/ls-eend-coreml

 ---
 language:
 - en
+license: other
+library_name: coreml
+pipeline_tag: audio-classification
+tags:
+- speaker-diarization
+- diarization
+- coreml
+- apple
+- streaming
+- audio
+- ls-eend
+- eend
+pretty_name: LS-EEND CoreML Models
+model-index:
+- name: LS-EEND CoreML Models
+  results: []
+---
+# LS-EEND CoreML Models
+CoreML exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.
+This repository contains non-quantized CoreML step models for four LS-EEND variants:
+- `AMI`
+- `CALLHOME`
+- `DIHARD II`
+- `DIHARD III`
+These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.
+## Included files
+Each variant directory contains:
+- `*.mlpackage`: the CoreML model package
+- `*.json`: metadata needed by the runtime
+- `*.mlmodelc`: a compiled CoreML bundle generated locally for convenience
+Variant directories:
+- `AMI/`
+- `CALLHOME/`
+- `DIHARD II/`
+- `DIHARD III/`
+## Variants
+| Variant | Package | Configured max speakers | Model output capacity |
+| --- | --- | ---: | ---: |
+| AMI | `AMI/ls_eend_ami_step.mlpackage` | 4 | 6 |
+| CALLHOME | `CALLHOME/ls_eend_callhome_step.mlpackage` | 7 | 9 |
+| DIHARD II | `DIHARD II/ls_eend_dih2_step.mlpackage` | 10 | 12 |
+| DIHARD III | `DIHARD III/ls_eend_dih3_step.mlpackage` | 10 | 12 |
+The metadata JSON distinguishes between:
+- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
+- `max_nspks`: the exported model's full decode/output capacity
+## Frontend and runtime assumptions
+All four non-quantized exports in this repo use the same frontend settings:
+- sample rate: `8000 Hz`
+- window length: `200` samples
+- hop length: `80` samples
+- FFT size: `1024`
+- mel bins: `23`
+- context receptive field: `7`
+- subsampling: `10`
+- feature type: `logmel23_cummn`
+- output frame rate: `10 Hz`
+- compute precision: `float32`
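The timing implied by these settings can be worked out directly. The following is a minimal sketch of that arithmetic only; the helper `approx_output_frames` is a hypothetical name, and it ignores edge padding and the encoder delay, so treat its result as an estimate rather than the exact frame count a runtime produces.

```python
# Frontend timing implied by the settings listed above.
SAMPLE_RATE = 8000   # Hz
WIN_LENGTH = 200     # samples
HOP_LENGTH = 80      # samples
SUBSAMPLING = 10     # feature frames per output frame

window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE          # 25.0 ms analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE             # 10.0 ms hop
feature_hz = SAMPLE_RATE / HOP_LENGTH                # 100.0 feature frames per second
output_hz = feature_hz / SUBSAMPLING                 # 10.0 output frames per second
samples_per_output_frame = HOP_LENGTH * SUBSAMPLING  # 800 samples


def approx_output_frames(num_samples: int) -> int:
    """Rough output frame count for a clip (hypothetical helper, no padding or delay)."""
    return num_samples // samples_per_output_frame


# One minute of 8 kHz audio
print(window_ms, hop_ms, output_hz, approx_output_frames(60 * SAMPLE_RATE))
# prints: 25.0 10.0 10.0 600
```

Each output frame therefore covers 100 ms of audio, which is the granularity at which diarization decisions are emitted.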
+These are step-wise streaming models. Between calls, the runtime must keep the following recurrent state tensors and feed each call's updated values into the next call:
+- `enc_ret_kv`
+- `enc_ret_scale`
+- `enc_conv_cache`
+- `dec_ret_kv`
+- `dec_ret_scale`
+- `top_buffer`
+The CoreML inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.
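The state handling described above can be sketched as a plain driver loop. This is an illustration under stated assumptions, not the reference runtime: the `step_fn` signature, the `run_stream` helper, and the toy `fake_step` stand-in are all hypothetical, and the real input/output names and tensor shapes should be taken from the model package and its sidecar JSON.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# State tensor names as listed above. How the model names its updated-state
# outputs is not specified here, so this sketch lets `step_fn` hide that detail.
STATE_NAMES = (
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
)

State = Dict[str, object]
# Hypothetical signature: (features for one step, current state) -> (output, next state)
StepFn = Callable[[object, State], Tuple[object, State]]


def run_stream(step_fn: StepFn, chunks: Iterable[object], initial_state: State) -> List[object]:
    """Call the step model once per chunk, feeding each call's state into the next."""
    missing = [name for name in STATE_NAMES if name not in initial_state]
    if missing:
        raise KeyError(f"initial state is missing tensors: {missing}")

    state = dict(initial_state)  # copy so the caller's dict is not modified
    outputs = []
    for chunk in chunks:
        output, next_state = step_fn(chunk, state)
        # Replace every state tensor; reusing a stale one would break the recurrence.
        state = {name: next_state[name] for name in STATE_NAMES}
        outputs.append(output)
    return outputs


# Toy stand-in for the CoreML call: every "tensor" is an int that counts steps.
def fake_step(chunk, state):
    next_state = {name: value + 1 for name, value in state.items()}
    return (chunk, state["top_buffer"]), next_state


zero_state = {name: 0 for name in STATE_NAMES}
print(run_stream(fake_step, ["a", "b", "c"], zero_state))
# prints: [('a', 0), ('b', 1), ('c', 2)]
```

In a real runtime, `step_fn` would wrap the CoreML prediction call and map the model's state outputs back onto the six input names.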
+## Intended usage
+Use these packages with a runtime that:
+1. Resamples audio to mono `8 kHz`
+2. Extracts LS-EEND features with the settings above
+3. Preserves model state across step calls
+4. Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
+5. Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the CoreML graph
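Step 5 can be sketched in pure Python as follows. This is a minimal illustration, not the reference postprocessing: the `frames x slots` logit layout, the `0.5` threshold, the edge-replicating median filter, and the `spk{slot}` labels are assumptions chosen for the example. The RTTM line layout follows the standard `SPEAKER` record format.

```python
import math
from statistics import median
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def median_filter(values: List[int], width: int) -> List[int]:
    """Sliding median over an odd-width window, replicating the edge values."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    if not values:
        return []
    half = width // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(values))]


def to_rttm(logits: List[List[float]], file_id: str, frame_hz: float = 10.0,
            threshold: float = 0.5, median_width: int = 1) -> List[str]:
    """Assumes logits[t][s] is the raw score for speaker slot s at output frame t."""
    lines = []
    num_slots = len(logits[0]) if logits else 0
    for slot in range(num_slots):
        active = [int(sigmoid(frame[slot]) > threshold) for frame in logits]
        active = median_filter(active, median_width)
        start = None
        for t, flag in enumerate(active + [0]):  # sentinel closes a trailing segment
            if flag and start is None:
                start = t
            elif not flag and start is not None:
                onset = start / frame_hz
                duration = (t - start) / frame_hz
                lines.append(
                    f"SPEAKER {file_id} 1 {onset:.2f} {duration:.2f} "
                    f"<NA> <NA> spk{slot} <NA> <NA>"
                )
                start = None
    return lines


demo_logits = [[2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, 2.0]]
for line in to_rttm(demo_logits, "demo"):
    print(line)
# SPEAKER demo 1 0.00 0.20 <NA> <NA> spk0 <NA> <NA>
# SPEAKER demo 1 0.20 0.20 <NA> <NA> spk1 <NA> <NA>
```

Keeping this stage outside the CoreML graph lets the threshold and filter width be tuned per dataset without re-exporting the model.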
+This repository is not a drop-in replacement for generic Hugging Face `transformers` inference. It is meant for custom CoreML runtimes, such as:
+- the Python LS-EEND CoreML runtime from the FS-EEND project
+- the Swift/macOS runtime used for the LS-EEND CoreML microphone demo
+## Minimal metadata example
+Each variant ships a sidecar JSON with fields like:
+```json
+{
+  "sample_rate": 8000,
+  "win_length": 200,
+  "hop_length": 80,
+  "n_fft": 1024,
+  "n_mels": 23,
+  "context_recp": 7,
+  "subsampling": 10,
+  "feat_type": "logmel23_cummn",
+  "frame_hz": 10.0,
+  "max_speakers": 10,
+  "max_nspks": 12
+}
+```
+Check the variant-specific `*.json` file for the exact state tensor shapes and output dimensions.
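A runtime may want to sanity-check the sidecar JSON before building its frontend. The sketch below parses the example fields shown above and cross-checks them; `load_metadata` is a hypothetical helper, and the two checks (that `frame_hz` equals `sample_rate / hop_length / subsampling`, and that `max_nspks` is at least `max_speakers`) are assumptions inferred from the values listed in this README, not rules stated by the exporter.

```python
import json

EXAMPLE_JSON = """
{
  "sample_rate": 8000,
  "win_length": 200,
  "hop_length": 80,
  "n_fft": 1024,
  "n_mels": 23,
  "context_recp": 7,
  "subsampling": 10,
  "feat_type": "logmel23_cummn",
  "frame_hz": 10.0,
  "max_speakers": 10,
  "max_nspks": 12
}
"""


def load_metadata(text: str) -> dict:
    """Parse sidecar metadata and run two consistency checks (assumed relations)."""
    meta = json.loads(text)

    expected_hz = meta["sample_rate"] / meta["hop_length"] / meta["subsampling"]
    if abs(expected_hz - meta["frame_hz"]) > 1e-6:
        raise ValueError(f"frame_hz {meta['frame_hz']} does not match {expected_hz}")

    if meta["max_nspks"] < meta["max_speakers"]:
        raise ValueError("max_nspks is smaller than max_speakers")

    return meta


meta = load_metadata(EXAMPLE_JSON)
print(meta["feat_type"], meta["frame_hz"], meta["max_nspks"])
# prints: logmel23_cummn 10.0 12
```

In practice the text would be read from the variant's own `*.json` file, which also carries the state tensor shapes that this example omits.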
+## Source project
+These CoreML exports were produced from the LS-EEND code in the FS-EEND repository:
+- GitHub: [Audio-WestlakeU/FS-EEND](https://github.com/Audio-WestlakeU/FS-EEND)
+The export path is based on the LS-EEND CoreML step exporter and variant batch exporter in that project.
+## Training and evaluation context
+The source project reports the following diarization error rates (DER) on real-world datasets:
+| Dataset | DER (%) |
+| --- | ---: |
+| CALLHOME | 12.11 |
+| DIHARD II | 27.58 |
+| DIHARD III | 19.61 |
+| AMI Dev | 20.97 |
+| AMI Eval | 20.76 |
+These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.
+## Limitations
+- These models are exported for Apple CoreML runtimes, not for PyTorch or ONNX consumers.
+- They are stateful streaming step models, so they require a custom driver loop.
+- They assume an 8 kHz LS-EEND frontend and will not produce matching results if you use a different spectrogram pipeline.
+- The models output per-slot speaker activity tracks, not named speaker identities. Downstream diarization postprocessing is still required, along with speaker-slot alignment where appropriate.
+## License and dataset constraints
+Please verify the licensing and access conditions for:
+- the upstream FS-EEND / LS-EEND code and weights
+- AMI
+- CALLHOME
+- DIHARD II
+- DIHARD III
+This repository only redistributes CoreML exports of the LS-EEND model variants. Dataset usage rights remain governed by the original datasets and their terms.
+## Citation
+If you use LS-EEND, cite the original paper:
+```bibtex
+@ARTICLE{11122273,
+  author={Liang, Di and Li, Xiaofei},
+  journal={IEEE Transactions on Audio, Speech and Language Processing},
+  title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
+  year={2025},
+  volume={33},
+  pages={3568-3581},
+  doi={10.1109/TASLPRO.2025.3597446}
+}
+```