---
language:
- en
license: mit
library_name: coreml
pipeline_tag: audio-classification
tags:
- speaker-diarization
- diarization
- coreml
- apple
- streaming
- audio
- ls-eend
- eend
pretty_name: LS-EEND CoreML Models
model-index:
- name: LS-EEND CoreML Models
  results: []
---

# LS-EEND CoreML Models

CoreML exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.

This repository contains non-quantized CoreML step models for four LS-EEND variants:

- `AMI`
- `CALLHOME`
- `DIHARD II`
- `DIHARD III`

These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.

## Included files

Each variant directory contains:

- `*.mlpackage`: the CoreML model package
- `*.json`: metadata needed by the runtime
- `*.mlmodelc`: a compiled CoreML bundle, generated locally for convenience

Variant directories:

- `AMI/`
- `CALLHOME/`
- `DIHARD II/`
- `DIHARD III/`

## Variants

| Variant | Package | Configured max speakers | Model output capacity |
| --- | --- | ---: | ---: |
| AMI | `AMI/ls_eend_ami_step.mlpackage` | 4 | 6 |
| CALLHOME | `CALLHOME/ls_eend_callhome_step.mlpackage` | 7 | 9 |
| DIHARD II | `DIHARD II/ls_eend_dih2_step.mlpackage` | 10 | 12 |
| DIHARD III | `DIHARD III/ls_eend_dih3_step.mlpackage` | 10 | 12 |

The metadata JSON distinguishes between:

- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
- `max_nspks`: the exported model's full decode/output capacity

## Frontend and runtime assumptions

All four non-quantized exports in this repo use the same frontend settings:

- sample rate: `8000 Hz`
- window length: `200` samples
- hop length: `80` samples
- FFT size: `1024`
- mel bins: `23`
- context receptive field: `7`
- subsampling: `10`
- feature type: `logmel23_cummn`
- output frame rate: `10 Hz`
- compute precision: `float32`
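
Taken together, these settings imply a 100 Hz log-mel frame rate (8000 / 80) that the 10x subsampling reduces to the 10 Hz model rate. Below is a minimal numpy sketch of the streaming-relevant stages of a `logmel23_cummn` frontend (cumulative mean normalization, 7-frame context splicing, and subsampling), assuming symmetric edge-padded splicing as in other EEND frontends; the STFT/mel stage is omitted, and the FS-EEND code remains authoritative:

```python
import numpy as np

def cummn_splice_subsample(logmel, context=7, subsampling=10):
    """Sketch of the streaming stages of a logmel23_cummn frontend.

    Takes a (frames, 23) log-mel matrix at 100 Hz and returns spliced,
    normalized features at the 10 Hz model rate. Illustrative only; the
    exact upstream implementation lives in the FS-EEND repository.
    """
    T = logmel.shape[0]
    # Cumulative (running) mean normalization: causal, so it works in a
    # streaming setting without seeing the whole utterance.
    cum_mean = np.cumsum(logmel, axis=0) / np.arange(1, T + 1)[:, None]
    normed = logmel - cum_mean
    # Context splicing: concatenate +/- `context` neighboring frames,
    # padding the edges by repetition.
    padded = np.pad(normed, ((context, context), (0, 0)), mode="edge")
    spliced = np.stack(
        [padded[t : t + 2 * context + 1].reshape(-1) for t in range(T)]
    )
    # Subsample from 100 Hz frames down to the 10 Hz model rate.
    return spliced[::subsampling]

feats = cummn_splice_subsample(np.random.randn(1000, 23))
print(feats.shape)  # (100, 345): 1000 frames / 10, 23 mels * 15-frame context
```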

These are step-wise streaming models. A runtime must maintain and feed the recurrent state tensors between calls:

- `enc_ret_kv`
- `enc_ret_scale`
- `enc_conv_cache`
- `dec_ret_kv`
- `dec_ret_scale`
- `top_buffer`
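
The state-threading contract can be illustrated with a short driver sketch. The CoreML call is stubbed out below; input/output names such as `features` and `probs` are placeholders, and with coremltools the step function would be along the lines of `lambda d: mlmodel.predict(d)`, with the real tensor names and shapes taken from the variant's sidecar JSON:

```python
import numpy as np

# State tensors that must round-trip between consecutive step calls.
STATE_NAMES = [
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
]

def run_streaming(step_fn, feature_chunks, init_state):
    """Run one model step per feature chunk, threading recurrent state."""
    state = dict(init_state)
    outputs = []
    for chunk in feature_chunks:
        result = step_fn({"features": chunk, **state})
        # Each updated state tensor becomes an input to the next step.
        state = {name: result[name] for name in STATE_NAMES}
        outputs.append(result["probs"])
    return np.concatenate(outputs, axis=0)

# Stub standing in for the CoreML model, just to demonstrate the loop.
def dummy_step(inputs):
    out = {name: inputs[name] + 1.0 for name in STATE_NAMES}
    out["probs"] = np.zeros((1, 4))
    return out

zeros = {name: np.zeros(2) for name in STATE_NAMES}
probs = run_streaming(dummy_step, [np.zeros(3)] * 5, zeros)
print(probs.shape)  # (5, 4)
```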

The CoreML inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.

## Intended usage

Use these packages with a runtime that:

1. Resamples audio to mono `8 kHz`
2. Extracts LS-EEND features with the settings above
3. Preserves model state across step calls
4. Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
5. Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the CoreML graph
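
Step 5 can be sketched as follows. The threshold, median-filter length, and RTTM field conventions here are illustrative defaults, not values taken from the upstream project:

```python
import numpy as np

def posteriors_to_rttm(logits, uri="rec", threshold=0.5, median=11, frame_hz=10.0):
    """Turn per-frame speaker logits of shape (T, S) into RTTM lines.

    Sketch of sigmoid -> threshold -> median filter -> segment merging.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    active = probs > threshold             # per-frame speaker activity
    if median > 1:                         # majority-vote median filter
        k = median // 2
        padded = np.pad(active, ((k, k), (0, 0)), mode="edge")
        active = np.stack(
            [np.median(padded[t : t + median], axis=0) > 0.5
             for t in range(active.shape[0])]
        )
    lines = []
    for spk in range(active.shape[1]):
        # Find onset/offset edges of each contiguous active run.
        a = np.concatenate([[0], active[:, spk].astype(int), [0]])
        edges = np.flatnonzero(np.diff(a))
        for on, off in zip(edges[::2], edges[1::2]):
            start, dur = on / frame_hz, (off - on) / frame_hz
            lines.append(
                f"SPEAKER {uri} 1 {start:.2f} {dur:.2f} <NA> <NA> spk{spk} <NA> <NA>"
            )
    return lines

# Demo: speaker 0 active from frame 10 to 29 (1.0 s to 3.0 s at 10 Hz).
demo = np.full((50, 2), -10.0)
demo[10:30, 0] = 10.0
print(posteriors_to_rttm(demo))
```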

This repository is not a drop-in replacement for generic Hugging Face `transformers` inference. It is meant for custom CoreML runtimes, such as:

- the Python LS-EEND CoreML runtime from the FS-EEND project
- the Swift/macOS runtime used for the LS-EEND CoreML microphone demo

## Minimal metadata example

Each variant ships a sidecar JSON with fields like:

```json
{
  "sample_rate": 8000,
  "win_length": 200,
  "hop_length": 80,
  "n_fft": 1024,
  "n_mels": 23,
  "context_recp": 7,
  "subsampling": 10,
  "feat_type": "logmel23_cummn",
  "frame_hz": 10.0,
  "max_speakers": 10,
  "max_nspks": 12
}
```

Check the variant-specific `*.json` file for the exact state tensor shapes and output dimensions.
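
As a quick sanity check, the fields above are internally consistent and determine the model's input geometry. A small sketch using the example values; the 23 x 15 spliced input width assumes symmetric 7-frame context splicing, which is an assumption here, not something stated by the sidecar format:

```python
import json

# Example sidecar contents, copied from the snippet above.
meta = json.loads("""{
  "sample_rate": 8000, "win_length": 200, "hop_length": 80,
  "n_fft": 1024, "n_mels": 23, "context_recp": 7,
  "subsampling": 10, "feat_type": "logmel23_cummn",
  "frame_hz": 10.0, "max_speakers": 10, "max_nspks": 12
}""")

# 8000 / 80 = 100 Hz log-mel frames; 10x subsampling -> 10 Hz model steps.
feat_hz = meta["sample_rate"] / meta["hop_length"]
assert feat_hz / meta["subsampling"] == meta["frame_hz"]

# Assumed spliced input width: n_mels * (2 * context + 1) = 23 * 15 = 345.
input_dim = meta["n_mels"] * (2 * meta["context_recp"] + 1)
print(feat_hz, input_dim)  # 100.0 345
```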

## Credits

- **Base model**: [LS-EEND](https://github.com/Audio-WestlakeU/FS-EEND) by Di Liang and Xiaofei Li (Westlake University). Paper: [LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction](https://arxiv.org/abs/2410.06670) (IEEE TASLP 2025). The original model is not hosted on Hugging Face; pretrained weights are available on [GitHub](https://github.com/Audio-WestlakeU/FS-EEND).
- **CoreML conversion**: [@GradientDescent2718](https://huggingface.co/GradientDescent2718). Original repo: [GradientDescent2718/ls-eend-coreml](https://huggingface.co/GradientDescent2718/ls-eend-coreml).

## Source project

These CoreML exports were produced from the LS-EEND code in the FS-EEND repository:

- GitHub: [Audio-WestlakeU/FS-EEND](https://github.com/Audio-WestlakeU/FS-EEND)

The export path is based on the LS-EEND CoreML step exporter and variant batch exporter in that project.

## Training and evaluation context

From the source project, the reported real-world diarization error rates are:

| Dataset | DER (%) |
| --- | ---: |
| CALLHOME | 12.11 |
| DIHARD II | 27.58 |
| DIHARD III | 19.61 |
| AMI Dev | 20.97 |
| AMI Eval | 20.76 |

These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.

## Limitations

- These models are exported for Apple CoreML runtimes, not for PyTorch or ONNX consumers.
- They are stateful streaming step models, so they require a custom driver loop.
- They assume an 8 kHz LS-EEND frontend and will not produce matching results with a different spectrogram pipeline.
- Speaker identities are output as activity tracks/slots and still require downstream postprocessing and speaker-slot alignment where appropriate.

## License and dataset constraints

The upstream LS-EEND model/codebase used for these CoreML exports is MIT-licensed, and this repository is published under MIT accordingly.

The underlying evaluation and fine-tuning datasets still have their own access and usage terms:

- AMI
- CALLHOME
- DIHARD II
- DIHARD III

This repository redistributes CoreML exports of the LS-EEND model variants. Dataset licensing and access requirements remain governed by the original dataset providers.

## Citation

If you use LS-EEND, cite the original paper:

```bibtex
@article{liang2025lseend,
  author={Liang, Di and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
  year={2025},
  volume={33},
  pages={3568-3581},
  doi={10.1109/TASLPRO.2025.3597446}
}
```
|