---
language:
- en
license: mit
library_name: onnx
pipeline_tag: audio-classification
tags:
- speaker-diarization
- diarization
- onnx
- streaming
- audio
- ls-eend
- eend
pretty_name: LS-EEND ONNX Models
model-index:
- name: LS-EEND ONNX Models
  results: []
---
# LS-EEND ONNX Models

ONNX exports of LS-EEND, a long-form streaming end-to-end neural diarization model with online attractor extraction.

This repository contains non-quantized ONNX step models for four LS-EEND variants:

- `AMI`
- `CALLHOME`
- `DIHARD II`
- `DIHARD III`

These models are intended for stateful streaming inference. Each package runs one LS-EEND step at a time with explicit recurrent/cache tensors, rather than processing an entire utterance in a single call.

## Included files

Each variant directory contains:

- `*.onnx`: the ONNX model
- `*.json`: metadata needed by the runtime

Variant directories:

- `AMI/`
- `CALLHOME/`
- `DIHARD II/`
- `DIHARD III/`

## Variants

| Variant    | Package                               | Configured max speakers | Model output capacity |
| ---------- | ------------------------------------- | ----------------------: | --------------------: |
| AMI        | `AMI/ls_eend_ami_step.onnx`           |                       4 |                     6 |
| CALLHOME   | `CALLHOME/ls_eend_callhome_step.onnx` |                       7 |                     9 |
| DIHARD II  | `DIHARD II/ls_eend_dih2_step.onnx`    |                      10 |                    12 |
| DIHARD III | `DIHARD III/ls_eend_dih3_step.onnx`   |                      10 |                    12 |

The metadata JSON distinguishes between:

- `max_speakers`: the dataset/config speaker setting from the LS-EEND infer YAML
- `max_nspks`: the exported model's full decode/output capacity

## Frontend and runtime assumptions

All four non-quantized exports in this repo use the same frontend settings:

- sample rate: `8000 Hz`
- window length: `200` samples
- hop length: `80` samples
- FFT size: `1024`
- mel bins: `23`
- context receptive field: `7`
- subsampling: `10`
- feature type: `logmel23_cummn`
- output frame rate: `10 Hz`
- compute precision: `float32`
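
The timing implied by these settings can be checked with a few lines of arithmetic. The sketch below only restates the values listed above; the variable names are illustrative and are not part of any runtime API.

```python
# Frontend constants copied from the list above.
SAMPLE_RATE = 8000   # Hz
WIN_LENGTH = 200     # samples per analysis window
HOP_LENGTH = 80      # samples between consecutive windows
SUBSAMPLING = 10     # feature frames per model output frame

window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE      # window duration in milliseconds
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE         # hop duration in milliseconds
feature_rate_hz = SAMPLE_RATE / HOP_LENGTH       # feature frames per second
output_rate_hz = feature_rate_hz / SUBSAMPLING   # model output frames per second
samples_per_output = HOP_LENGTH * SUBSAMPLING    # audio samples per output frame

print(window_ms, hop_ms, feature_rate_hz, output_rate_hz, samples_per_output)
# prints: 25.0 10.0 100.0 10.0 800
```

Each output frame therefore covers 100 ms of audio, which is where the `10 Hz` output frame rate comes from.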

These are step-wise streaming models. A runtime must maintain and feed the recurrent state tensors between calls:

- `enc_ret_kv`
- `enc_ret_scale`
- `enc_conv_cache`
- `dec_ret_kv`
- `dec_ret_scale`
- `top_buffer`

The ONNX inputs and outputs follow the LS-EEND step export used by the reference Python and Swift runtimes.
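
A driver loop for such a step model might look like the following minimal sketch. It is not the reference runtime: the `step` callable, its signature, and the `delay` parameter are hypothetical, and the way the `ingest`/`decode` flags (see Intended usage) are scheduled here is an assumption. The stub `fake_step` stands in for an ONNX session call so that the example runs without a model.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# The six state tensor names listed above.
STATE_NAMES = (
    "enc_ret_kv", "enc_ret_scale", "enc_conv_cache",
    "dec_ret_kv", "dec_ret_scale", "top_buffer",
)

State = Dict[str, object]
# Hypothetical step signature: (frame, state, ingest, decode) -> (output, new_state)
StepFn = Callable[[Optional[object], State, bool, bool], Tuple[object, State]]


def run_stream(step: StepFn, frames: Iterable[object],
               state: State, delay: int) -> List[object]:
    """Drive a stateful step model, feeding each call's state into the next."""
    missing = [name for name in STATE_NAMES if name not in state]
    if missing:
        raise KeyError(f"missing state tensors: {missing}")
    outputs: List[object] = []
    n_seen = 0
    for frame in frames:
        # Assumption: outputs only become valid once `delay` frames are ingested.
        decode = n_seen >= delay
        out, state = step(frame, state, True, decode)
        if decode:
            outputs.append(out)
        n_seen += 1
    # Tail flush: no new input, but decode the frames still held by the model.
    for _ in range(min(delay, n_seen)):
        out, state = step(None, state, False, True)
        outputs.append(out)
    return outputs


def fake_step(frame, state, ingest, decode):
    """Stand-in for a model call. It reuses one state slot as a plain FIFO
    purely for illustration; this says nothing about the real tensor contents."""
    queue = list(state["top_buffer"])
    if ingest:
        queue.append(frame)
    out = queue.pop(0) if decode else None
    return out, dict(state, top_buffer=queue)


initial_state = {name: [] for name in STATE_NAMES}
result = run_stream(fake_step, ["f0", "f1", "f2", "f3", "f4"], initial_state, delay=2)
print(result)  # prints: ['f0', 'f1', 'f2', 'f3', 'f4']
```

The point of the sketch is the data flow: every call returns a fresh state that must be passed to the next call, and the stream ends with extra decode-only calls so that delayed frames are not lost. With a real model, the initial state shapes come from the variant's `*.json` file.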

## Intended usage

Use these packages with a runtime that:

1. Resamples audio to mono `8 kHz`
2. Extracts LS-EEND features with the settings above
3. Preserves model state across step calls
4. Uses `ingest`/`decode` control inputs to handle the encoder delay and final tail flush
5. Applies postprocessing such as sigmoid, thresholding, optional median filtering, and RTTM conversion outside the ONNX graph
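
The postprocessing in step 5 can be sketched in pure Python as follows. This is an illustration under assumptions, not the reference implementation: the `0.5` threshold, the median width, the edge-padding strategy, and the `spk{slot}` labels are all placeholder choices, and `logits` is assumed to be a list of per-frame lists with one logit per speaker slot.

```python
import math
from statistics import median
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def median_filter(track: List[int], width: int) -> List[int]:
    """Odd-width median filter; edges are padded by repeating the end values."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    if not track:
        return []
    half = width // 2
    padded = [track[0]] * half + list(track) + [track[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(track))]


def to_rttm(logits: List[List[float]], file_id: str, frame_hz: float = 10.0,
            threshold: float = 0.5, median_width: int = 1) -> List[str]:
    """Convert per-frame, per-slot logits into RTTM SPEAKER lines."""
    lines: List[str] = []
    if not logits:
        return lines
    for slot in range(len(logits[0])):
        active = [1 if sigmoid(frame[slot]) > threshold else 0 for frame in logits]
        active = median_filter(active, median_width)
        start = None
        # The trailing 0 is a sentinel that closes a segment still open at the end.
        for t, is_active in enumerate(active + [0]):
            if is_active and start is None:
                start = t
            elif not is_active and start is not None:
                onset = start / frame_hz
                duration = (t - start) / frame_hz
                lines.append(
                    f"SPEAKER {file_id} 1 {onset:.2f} {duration:.2f} "
                    f"<NA> <NA> spk{slot} <NA> <NA>"
                )
                start = None
    return lines


demo_logits = [[3.0, -3.0], [2.0, -1.0], [-2.0, 4.0], [1.5, 4.0], [-3.0, -3.0]]
rttm = to_rttm(demo_logits, "demo", median_width=3)
print("\n".join(rttm))
# prints:
# SPEAKER demo 1 0.00 0.30 <NA> <NA> spk0 <NA> <NA>
# SPEAKER demo 1 0.20 0.20 <NA> <NA> spk1 <NA> <NA>
```

In the demo, the median filter fills the one-frame gap in slot 0 and drops its isolated final frame, so a single segment is emitted for that slot.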

This repository is not a drop-in replacement for generic Hugging Face `transformers` inference. It is meant for custom ONNX runtimes, such as the Python LS-EEND ONNX runtime in the [example](https://huggingface.co/GradientDescent2718/LS-EEND-ONNX/tree/main/example) directory.

## Microphone demo

Set up the virtual environment:
```bash
# Create and activate virtual environment
python3.10 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r example/requirements.txt
```

To run the Python microphone inference script with the DIHARD III variant:
```bash
python example/ls_eend_onnx_mic_gui.py --onnx-model DIHARD\ III/ls_eend_dih3_step.onnx
```

For the other variants, replace `DIHARD\ III/ls_eend_dih3_step.onnx` with the path to the desired variant:
```bash
# AMI variant
python example/ls_eend_onnx_mic_gui.py --onnx-model AMI/ls_eend_ami_step.onnx

# CALLHOME variant
python example/ls_eend_onnx_mic_gui.py --onnx-model CALLHOME/ls_eend_callhome_step.onnx

# DIHARD II variant
python example/ls_eend_onnx_mic_gui.py --onnx-model DIHARD\ II/ls_eend_dih2_step.onnx
```

## Minimal metadata example

Each variant ships a sidecar JSON with fields like:

```json
{
  "sample_rate": 8000,
  "win_length": 200,
  "hop_length": 80,
  "n_fft": 1024,
  "n_mels": 23,
  "context_recp": 7,
  "subsampling": 10,
  "feat_type": "logmel23_cummn",
  "frame_hz": 10.0,
  "max_speakers": 10,
  "max_nspks": 12
}
```

Check the variant-specific `*.json` file for the exact state tensor shapes and output dimensions.
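
Before wiring a sidecar file into a runtime, it can help to sanity-check it. The sketch below parses the example above and verifies a few internal consistency rules. The `check_metadata` helper and its rules are assumptions made for illustration (for instance, that `max_speakers` never exceeds `max_nspks`); real sidecar files may contain additional fields that are not checked here.

```python
import json

# Field names taken from the example JSON above.
REQUIRED_KEYS = (
    "sample_rate", "win_length", "hop_length", "n_fft", "n_mels",
    "context_recp", "subsampling", "feat_type", "frame_hz",
    "max_speakers", "max_nspks",
)

EXAMPLE = """
{
  "sample_rate": 8000, "win_length": 200, "hop_length": 80, "n_fft": 1024,
  "n_mels": 23, "context_recp": 7, "subsampling": 10,
  "feat_type": "logmel23_cummn", "frame_hz": 10.0,
  "max_speakers": 10, "max_nspks": 12
}
"""


def check_metadata(meta: dict) -> dict:
    """Raise if the metadata is incomplete or internally inconsistent."""
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise KeyError(f"missing metadata fields: {missing}")
    # The output frame rate should follow from the hop length and subsampling.
    derived_hz = meta["sample_rate"] / (meta["hop_length"] * meta["subsampling"])
    if abs(derived_hz - meta["frame_hz"]) > 1e-6:
        raise ValueError(f"frame_hz {meta['frame_hz']} != derived {derived_hz}")
    # Assumption: the configured speaker count fits within the output capacity.
    if meta["max_speakers"] > meta["max_nspks"]:
        raise ValueError("max_speakers exceeds max_nspks")
    # A window longer than the FFT size would be truncated by the transform.
    if meta["win_length"] > meta["n_fft"]:
        raise ValueError("win_length exceeds n_fft")
    return meta


meta = check_metadata(json.loads(EXAMPLE))
print(meta["frame_hz"], meta["max_speakers"], meta["max_nspks"])
# prints: 10.0 10 12
```

To check a real file, replace `json.loads(EXAMPLE)` with `json.load(...)` on the opened sidecar.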

## Credits

- **Base model**: [LS-EEND](https://github.com/Audio-WestlakeU/FS-EEND) by Di Liang & Xiaofei Li (Westlake University). Paper: [LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction](https://arxiv.org/abs/2410.06670) (IEEE TASLP 2025). The original model is not hosted on Hugging Face; pretrained weights are available on [GitHub](https://github.com/Audio-WestlakeU/FS-EEND).

## Source project

These ONNX exports were produced from the LS-EEND code in the FS-EEND repository:

- GitHub: [Audio-WestlakeU/FS-EEND](https://github.com/Audio-WestlakeU/FS-EEND)

The export path is based on the LS-EEND ONNX step exporter and variant batch exporter in that project.

## Training and evaluation context

From the source project, the reported real-world diarization error rates are:

| Dataset    | DER (%) |
| ---------- | ------: |
| CALLHOME   |   12.11 |
| DIHARD II  |   27.58 |
| DIHARD III |   19.61 |
| AMI Dev    |   20.97 |
| AMI Eval   |   20.76 |

These numbers come from the upstream LS-EEND project README and reflect the original training/evaluation setup, not a Hugging Face evaluation pipeline.

## Limitations

- These models are exported for ONNX runtimes.
- They are stateful streaming step models, so they require a custom driver loop.
- They assume an 8 kHz LS-EEND frontend and will not produce matching results if you use a different spectrogram pipeline.
- Speaker identities are output as activity tracks/slots and still require downstream diarization postprocessing and speaker-slot alignment where appropriate.

## License and dataset constraints

The upstream LS-EEND model/codebase used for these ONNX exports is MIT-licensed, and this repository is published as MIT accordingly.

The underlying evaluation and fine-tuning datasets still have their own access and usage terms:

- AMI
- CALLHOME
- DIHARD II
- DIHARD III

This repository redistributes ONNX exports of the LS-EEND model variants. Dataset licensing and access requirements remain governed by the original dataset providers.

## Citation

If you use LS-EEND, cite the original paper:

```bibtex
@ARTICLE{11122273,
  author={Liang, Di and Li, Xiaofei},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  title={LS-EEND: Long-Form Streaming End-to-End Neural Diarization With Online Attractor Extraction},
  year={2025},
  volume={33},
  pages={3568-3581},
  doi={10.1109/TASLPRO.2025.3597446}
}
```