---
license: apache-2.0
library_name: pytorch
tags:
- diarization
- eend
- speaker-diarization
- audio
- audarai
- speech
base_model: audarai/Audar-ASR-Turbo
pipeline_tag: voice-activity-detection
---

# Audar-ASR-Turbo Diarization (EEND v3)

End-to-end neural diarization (EEND) head trained jointly with Sortformer
distillation on top of the frozen Audar3-ASR-1.7B audio_tower
(`audarai/Audar-ASR-Turbo`). Produces frame-level multi-speaker activity
posteriors at 13 fps with K=12 sigmoid output channels, suitable for
diarizing long audio with up to 12 simultaneous speaker tracks. Trained
on 2663 h of synthetic multi-speaker mixtures plus soft-label distillation
from `nvidia/diar_streaming_sortformer_4spk-v2`. This is the best v3
checkpoint (step 2500); beyond this step the model overfits and
leaderboard DER regresses.

## What this model DOES / DOES NOT do

- **DOES**: Frame-level multi-speaker activity detection (K=12 sigmoid
  posteriors at 13 fps). Produces a per-frame, per-track speech /
  non-speech decision. Used downstream for speaker segmentation,
  turn-taking, overlap detection, and as a front-end for speaker
  attribution.
- **DOES NOT**: Audio-to-text transcription. ASR is handled by the base
  model (`audarai/Audar-ASR-Turbo`). This repo contains only the
  diarization head; you still need the base audio_tower to extract the
  2048-dim features the head consumes.

## Audit-grade DER on 8 public leaderboards

All numbers below are **audit-grade**: `Σ_errors / Σ_total_speech`
(audit-correct micro-aggregation), `collar=0.25s`, `threshold=0.9`,
`fps=13`, `K=12`, evaluated on the official held-out splits. The
Sortformer column is `nvidia/diar_streaming_sortformer_4spk-v2`
evaluated under the same protocol, not the numbers reported by NVIDIA,
which use different aggregation and collar.
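
The micro-aggregation above can be sketched as follows. This is a minimal illustration, not the audit script: the `FileStats` container and its field names are hypothetical, and it assumes per-file missed-speech, false-alarm, and speaker-confusion durations have already been computed with the 0.25 s collar applied.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file error durations in seconds (hypothetical container)."""
    missed: float
    false_alarm: float
    confusion: float
    total_speech: float


def micro_der(files):
    """Sum errors over all files, then divide by the summed reference speech."""
    errors = sum(f.missed + f.false_alarm + f.confusion for f in files)
    total = sum(f.total_speech for f in files)
    if total <= 0:
        raise ValueError("total reference speech must be positive")
    return errors / total


def mean_of_file_ders(files):
    """Per-file averaging, shown only for contrast with micro_der."""
    ders = [
        (f.missed + f.false_alarm + f.confusion) / f.total_speech
        for f in files
    ]
    return sum(ders) / len(ders)


# Made-up durations: one short, error-heavy file and one long, cleaner file.
files = [
    FileStats(missed=2.0, false_alarm=1.0, confusion=1.0, total_speech=10.0),
    FileStats(missed=5.0, false_alarm=5.0, confusion=10.0, total_speech=190.0),
]
print(f"micro DER: {micro_der(files):.2%}")
print(f"mean of per-file DERs: {mean_of_file_ders(files):.2%}")
```

With micro-aggregation, long recordings weigh more than short ones, which is one reason numbers computed this way can differ from figures averaged per file.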

| Corpus            |    Audar v3 DER |   Sortformer DER |
|-------------------|----------------:|-----------------:|
| VoxConverse (dev) |          21.11% |           11.65% |
| AliMeeting        |          32.74% |           26.43% |
| ICSI              |          40.32% |           30.81% |
| MSDWild few       |          36.81% |           27.75% |
| AMI               |          46.56% |           37.34% |
| MSDWild many      |          45.64% |           41.98% |
| DipCo             |          47.58% |           38.58% |
| CHiME-6           | **69.65%** ✅   |           71.80% |
| **MACRO avg**     |      **42.55%** |           35.79% |

Audar v3 beats Sortformer on CHiME-6 (the hardest corpus here:
far-field, multi-party dinner-table recordings) by 2.15 points of
absolute DER. Sortformer remains ahead on the other 7 corpora and in the
macro average. The release is intentional despite that gap: v3 is the
first checkpoint in the v3 lineage to beat Sortformer on CHiME-6, and it
is being released as a hardware-friendly, distillation-compatible
baseline for the v4 program.
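
The macro averages and the CHiME-6 gap can be re-derived from the table with a few lines of arithmetic. The values below are copied from the table; the variable names are only illustrative.

```python
# DER values in percent, copied from the table: (Audar v3, Sortformer).
der = {
    "VoxConverse (dev)": (21.11, 11.65),
    "AliMeeting": (32.74, 26.43),
    "ICSI": (40.32, 30.81),
    "MSDWild few": (36.81, 27.75),
    "AMI": (46.56, 37.34),
    "MSDWild many": (45.64, 41.98),
    "DipCo": (47.58, 38.58),
    "CHiME-6": (69.65, 71.80),
}

# Macro average: unweighted mean over the 8 corpora.
audar_macro = sum(a for a, _ in der.values()) / len(der)
sortformer_macro = sum(s for _, s in der.values()) / len(der)

# Absolute DER gap on CHiME-6 (positive means Audar v3 is better).
chime6_gap = der["CHiME-6"][1] - der["CHiME-6"][0]

# Corpora where Audar v3 has the lower DER.
audar_wins = [name for name, (a, s) in der.items() if a < s]

print(f"Audar v3 macro: {audar_macro:.2f}%")
print(f"Sortformer macro: {sortformer_macro:.2f}%")
print(f"CHiME-6 gap: {chime6_gap:.2f} points")
print(f"Audar v3 ahead on: {audar_wins}")
```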

> **Note**: An internal `internal_synthetic_val` validation set tracked
> during training is **NOT** a leaderboard and is not reported here. Only
> public-test-set DER counts.

## Architecture

- **Encoder (frozen)**: `audarai/Audar-ASR-Turbo` audio_tower → 2048-dim
  features at 13 fps.
- **Head (trainable, ~25M params)**:
  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`,
    conv kernel size 15, dropout 0.2.
  - `K_max=12` sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss
    (`sortformer_weight=0.3`).
- **Frame rate**: 13 Hz (≈77 ms hop).
- **Input dtype**: bfloat16.

## Inference convention

- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
- `fps = 13`
- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
- `K_max = 12`
- Sample rate: 16 kHz
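
One way to apply this convention is to threshold the posteriors and merge consecutive active frames into time segments. The sketch below is an assumption about post-processing, not code from the repo: `posteriors_to_segments` is a hypothetical helper, it works on plain nested lists rather than tensors, and the toy input uses 2 tracks instead of 12 for brevity.

```python
def posteriors_to_segments(posteriors, threshold=0.9, fps=13):
    """Convert [T][K] frame posteriors into (track, start_s, end_s) segments."""
    segments = []
    if not posteriors:
        return segments
    num_tracks = len(posteriors[0])
    for k in range(num_tracks):
        start = None  # index of the first frame of the current active run
        for t, frame in enumerate(posteriors):
            active = frame[k] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((k, start / fps, t / fps))
                start = None
        if start is not None:
            # The run extends to the end of the recording.
            segments.append((k, start / fps, len(posteriors) / fps))
    return segments


# Toy input: 6 frames, 2 tracks (made-up values).
toy = [
    [0.95, 0.10],
    [0.97, 0.20],
    [0.92, 0.95],
    [0.50, 0.99],
    [0.10, 0.93],
    [0.05, 0.91],
]
for track, start_s, end_s in posteriors_to_segments(toy):
    print(f"track {track}: {start_s:.3f}s to {end_s:.3f}s")
```

Frame index `t` maps to time `t / fps`, so each frame covers roughly 77 ms. Any smoothing or minimum-duration filtering would be an additional step not described in this card.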

## Training

- **Data**: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers
  per mixture) + Sortformer teacher distillation.
- **Optimizer**: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
- **Schedule**: 8000 steps planned; **step 2500 is the best by audit
  DER**. Past step 2500 the model overfits and macro DER regresses.
- **Distillation teacher**: `nvidia/diar_streaming_sortformer_4spk-v2`,
  weight `0.3`.
- **Distributed**: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
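
This card does not spell out how the distillation term enters the objective. A plausible reading, shown here purely as an assumption, is a hard-label binary cross-entropy plus the teacher's soft-label binary cross-entropy scaled by the weight `0.3`. The function names are hypothetical, and the sketch assumes teacher posteriors have already been aligned to the head's output tracks.

```python
import math


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one probability; target may be soft (0..1)."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def combined_loss(preds, hard_labels, teacher_probs, sortformer_weight=0.3):
    """Assumed objective: hard-label BCE + weight * soft-label (teacher) BCE."""
    n = len(preds)
    hard = sum(bce(p, y) for p, y in zip(preds, hard_labels)) / n
    soft = sum(bce(p, q) for p, q in zip(preds, teacher_probs)) / n
    return hard + sortformer_weight * soft


# Flattened per-frame, per-track values (made up for illustration).
preds = [0.8, 0.1, 0.6]
hard_labels = [1.0, 0.0, 1.0]
teacher_probs = [0.9, 0.2, 0.7]
print(f"loss: {combined_loss(preds, hard_labels, teacher_probs):.4f}")
```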

## Files

- `eend_v3_step2500.pt`: the v3 best checkpoint. PyTorch state dict
  containing `nar` (the EEND head), `ctc` (auxiliary CTC), and
  `speaker_attn` state dicts. ~125 MB.
- `config.json`: head hyperparameters and audit-best operating point.
- `README.md`: this file.

## Inference example

```python
import torch
from huggingface_hub import hf_hub_download

# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

# 2. Construct the head; you need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND
head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])

# 3. Forward
#    The head consumes [B, T, 2048] features from the Audar audio_tower
#    at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active     = posteriors > 0.9                     # binary speaker activity
```

## Citation

If you use this model, please cite the eend_diar repo (audarai
internal) and the Sortformer teacher:

```bibtex
@misc{audar_eend_v3_2026,
  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
  author = {AudarAI},
  year   = {2026},
  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}

@misc{nvidia_sortformer_2024,
  title  = {Streaming Sortformer Diarization (4-spk v2)},
  author = {NVIDIA},
  year   = {2024},
  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
```

## License

Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).