Audar-ASR-Turbo Diarization (EEND v3)
End-to-end neural diarization (EEND) head trained jointly with Sortformer
distillation on top of the frozen Audar3-ASR-1.7B audio_tower
(audarai/Audar-ASR-Turbo). Produces frame-level multi-speaker activity
posteriors at 13 fps with K=12 sigmoid output channels, suitable for
diarizing long audio with up to 12 simultaneous speaker tracks. Trained
with 2663h of synthetic multi-speaker mixtures + soft-label distillation
from nvidia/diar_streaming_sortformer_4spk-v2. This is the best v3
checkpoint (step 2500); beyond this step the model overfits and
leaderboard DER regresses.
What this model DOES / DOES NOT do
- DOES: Frame-level multi-speaker activity detection (K=12 sigmoid posteriors at 13 fps). Produces a per-frame, per-track speech/non-speech decision. Used downstream for speaker segmentation, turn-taking, overlap detection, and as a front-end for speaker attribution.
- DOES NOT: Audio-to-text transcription. ASR is handled by the base model (audarai/Audar-ASR-Turbo). This repo contains only the diarization head; you still need the base audio_tower to extract the 2048-dim features the head consumes.
Audit-grade DER on 8 public leaderboards
All numbers below are audit-grade: DER = Σ_errors / Σ_total_speech
(micro-aggregation: error and speech durations are summed before dividing),
collar=0.25s, threshold=0.9, fps=13, K=12, evaluated on the official
held-out splits. The Sortformer column is
nvidia/diar_streaming_sortformer_4spk-v2 evaluated under the same protocol,
not the numbers reported by NVIDIA, which use different aggregation and collar.
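The micro-aggregation above can be sketched as follows. This is a minimal illustration, not the audit script: the FileScore container and the sample durations are hypothetical, and it assumes the per-file error durations (missed speech, false alarm, speaker confusion) have already been measured with the 0.25 s collar applied.

```python
from dataclasses import dataclass


@dataclass
class FileScore:
    """Per-recording durations in seconds (hypothetical container)."""
    missed: float
    false_alarm: float
    confusion: float
    total_speech: float


def micro_der(scores):
    """Sum errors and reference speech over all files, then divide once."""
    total_error = sum(s.missed + s.false_alarm + s.confusion for s in scores)
    total_speech = sum(s.total_speech for s in scores)
    if total_speech == 0:
        raise ValueError("no reference speech to score against")
    return total_error / total_speech


def mean_of_per_file_der(scores):
    """Average of per-file DERs, shown only for contrast with micro_der."""
    ders = [
        (s.missed + s.false_alarm + s.confusion) / s.total_speech
        for s in scores
    ]
    return sum(ders) / len(ders)


# Made-up numbers: one short, error-heavy file and one long, cleaner file.
scores = [
    FileScore(missed=5.0, false_alarm=3.0, confusion=2.0, total_speech=20.0),
    FileScore(missed=10.0, false_alarm=5.0, confusion=5.0, total_speech=180.0),
]

print(f"micro DER:         {micro_der(scores):.2%}")
print(f"per-file mean DER: {mean_of_per_file_der(scores):.2%}")
```

The two aggregations can differ substantially when file lengths vary, which is one reason numbers computed under different protocols are not directly comparable.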
| Corpus | Audar v3 DER | Sortformer DER |
|---|---|---|
| VoxConverse (dev) | 21.11% | 11.65% |
| AliMeeting | 32.74% | 26.43% |
| ICSI | 40.32% | 30.81% |
| MSDWild few | 36.81% | 27.75% |
| AMI | 46.56% | 37.34% |
| MSDWild many | 45.64% | 41.98% |
| DipCo | 47.58% | 38.58% |
| CHiME-6 | 69.65% | 71.80% |
| MACRO avg | 42.55% | 35.79% |
Audar v3 beats Sortformer on CHiME-6 (the hardest corpus here: far-field, multi-party, dinner-table recordings) by 2.15 absolute DER points. On the other 7 corpora, and in macro-average, Sortformer is still ahead. This is intentional: v3 is the first checkpoint in the v3 lineage to beat Sortformer on CHiME-6, and it is being released as a hardware-friendly, distillation-compatible baseline for the v4 program.
Note: An internal validation set (internal_synthetic_val) tracked during training is NOT a leaderboard and is not reported here. Only public-test-set DER counts.
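As a quick consistency check, the macro averages and the CHiME-6 gap can be recomputed from the per-corpus rows. The snippet below only re-derives numbers already in the table; the variable names are illustrative.

```python
# DER values in percent, copied from the table: (Audar v3, Sortformer).
der = {
    "VoxConverse (dev)": (21.11, 11.65),
    "AliMeeting": (32.74, 26.43),
    "ICSI": (40.32, 30.81),
    "MSDWild few": (36.81, 27.75),
    "AMI": (46.56, 37.34),
    "MSDWild many": (45.64, 41.98),
    "DipCo": (47.58, 38.58),
    "CHiME-6": (69.65, 71.80),
}

# Macro average: unweighted mean of the per-corpus DERs.
audar_macro = sum(a for a, _ in der.values()) / len(der)
sortformer_macro = sum(s for _, s in der.values()) / len(der)

# Absolute gap on CHiME-6, and the corpora where Audar v3 is ahead.
chime6_gap = der["CHiME-6"][1] - der["CHiME-6"][0]
audar_wins = [name for name, (a, s) in der.items() if a < s]

print(f"Audar v3 macro DER:   {audar_macro:.2f}%")
print(f"Sortformer macro DER: {sortformer_macro:.2f}%")
print(f"CHiME-6 gap:          {chime6_gap:.2f} points")
print(f"Audar v3 ahead on:    {audar_wins}")
```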
Architecture
- Encoder (frozen): audarai/Audar-ASR-Turbo audio_tower, 2048-dim features at 13 fps.
- Head (trainable, ~25M params):
  - 4 × Conformer-style blocks, d_model=512, n_heads=8, conv kernel size 15, dropout 0.2.
  - K_max=12 sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss (sortformer_weight=0.3).
- Frame rate: 13 Hz (≈77 ms hop).
- Input dtype: bfloat16.
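The frame rate and feature width above determine how long the feature sequence is and how much memory it takes. The helpers below are a back-of-the-envelope sketch: the function names are illustrative, the 16 kHz sample rate is the one this card specifies for inference, and flooring the frame count is an assumption about how the audio_tower handles a trailing partial frame.

```python
FPS = 13               # frame rate stated in this card
SAMPLE_RATE = 16_000   # sample rate stated in this card
FEATURE_DIM = 2048     # audio_tower feature width
BYTES_PER_BF16 = 2     # bfloat16 is a 16-bit type


def hop_ms(fps=FPS):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def samples_per_frame(sample_rate=SAMPLE_RATE, fps=FPS):
    """Audio samples covered by one hop (not an integer at 13 fps)."""
    return sample_rate / fps


def num_frames(duration_s, fps=FPS):
    """Number of frames for a recording; flooring is an assumption."""
    return int(duration_s * fps)


def frame_to_seconds(frame_index, fps=FPS):
    """Start time of a frame, in seconds."""
    return frame_index / fps


def feature_bytes(duration_s):
    """Size of a [T, 2048] bfloat16 feature matrix for one recording."""
    return num_frames(duration_s) * FEATURE_DIM * BYTES_PER_BF16


one_hour = 3600
print(f"hop: {hop_ms():.1f} ms, {samples_per_frame():.1f} samples")
print(f"frames in one hour: {num_frames(one_hour)}")
print(f"features for one hour: {feature_bytes(one_hour) / 2**20:.1f} MiB")
```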
Inference convention
- threshold = 0.9 (the optimal operating point per the v3 audit sweep)
- fps = 13
- collar = 0.25 s (standard DIHARD / VoxConverse evaluation collar)
- K_max = 12
- Sample rate: 16 kHz
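Applying the inference convention above to the head's output can be sketched in plain Python. This is an illustrative post-processing routine, not code from the eend_diar repo: the function names are hypothetical, posteriors are given as nested lists for simplicity, and no smoothing or minimum-duration filtering is applied.

```python
def posteriors_to_segments(posteriors, threshold=0.9, fps=13):
    """Turn [T][K] posteriors into {track: [(start_s, end_s), ...]}.

    A frame is active for a track when its posterior exceeds the threshold;
    consecutive active frames are merged into one segment.
    """
    segments = {}
    if not posteriors:
        return segments
    n_frames = len(posteriors)
    n_tracks = len(posteriors[0])
    for k in range(n_tracks):
        start = None
        for t in range(n_frames):
            active = posteriors[t][k] > threshold
            if active and start is None:
                start = t                      # segment opens
            elif not active and start is not None:
                segments.setdefault(k, []).append((start / fps, t / fps))
                start = None                   # segment closes
        if start is not None:                  # still open at end of audio
            segments.setdefault(k, []).append((start / fps, n_frames / fps))
    return segments


def overlap_frames(posteriors, threshold=0.9):
    """Indices of frames where two or more tracks are active at once."""
    return [
        t for t, frame in enumerate(posteriors)
        if sum(p > threshold for p in frame) >= 2
    ]


# Toy example with 6 frames and 2 tracks (made-up values).
toy = [
    [0.95, 0.10],
    [0.97, 0.20],
    [0.92, 0.95],
    [0.30, 0.99],
    [0.10, 0.93],
    [0.05, 0.40],
]
segs = posteriors_to_segments(toy)
for track, spans in segs.items():
    for start_s, end_s in spans:
        print(f"track {track}: {start_s:.3f}s to {end_s:.3f}s")
print("overlap frames:", overlap_frames(toy))
```

Tracks that never exceed the threshold simply do not appear in the result, which is how a 12-channel output can describe recordings with fewer speakers.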
Training
- Data: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers per mixture) + Sortformer teacher distillation.
- Optimizer: AdamW, lr=3e-4, 1000 warmup steps, gradient clip 1.0.
- Schedule: 8000 steps planned; step 2500 is the best by audit DER. Past 2500 the model overfits and macro DER regresses.
- Distillation teacher: nvidia/diar_streaming_sortformer_4spk-v2, weight 0.3.
- Distributed: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
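One plausible way the distillation weight enters the objective is as a scaled auxiliary term next to the hard-label loss. The sketch below is an assumption, not the training code: it uses binary cross-entropy for both terms, treats the teacher's soft labels as already aligned to the student's channels (the real pipeline must map 4 teacher tracks onto 12 student channels, which is not shown), and omits any permutation-invariant matching.

```python
import math


def bce(p, target, eps=1e-7):
    """Binary cross-entropy for one probability against a hard or soft target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(student, hard, soft, sortformer_weight=0.3):
    """Hard-label BCE plus a weighted soft-label BCE (assumed form).

    All three inputs are flat lists with one entry per (frame, track) cell.
    """
    if not (len(student) == len(hard) == len(soft)):
        raise ValueError("student, hard and soft must have the same length")
    n = len(student)
    hard_term = sum(bce(p, y) for p, y in zip(student, hard)) / n
    soft_term = sum(bce(p, q) for p, q in zip(student, soft)) / n
    return hard_term + sortformer_weight * soft_term


# Made-up values: an uninformative student versus a confident, correct one.
hard = [1.0, 0.0, 1.0, 0.0]
soft = [0.8, 0.1, 0.6, 0.3]
uninformative = distillation_loss([0.5, 0.5, 0.5, 0.5], hard, soft)
confident = distillation_loss([0.9, 0.1, 0.9, 0.1], hard, soft)
print(f"uninformative student: {uninformative:.4f}")
print(f"confident student:     {confident:.4f}")
```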
Files
- eend_v3_step2500.pt: the v3 best checkpoint. PyTorch state dict containing nar (the EEND head), ctc (auxiliary CTC), and speaker_attn state dicts. ~125 MB.
- config.json: head hyperparameters and audit-best operating point.
- README.md: this file.
Inference example
import torch
from huggingface_hub import hf_hub_download
# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")
# 2. Construct the head: you need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND
head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])
# 3. Forward
# The head consumes [B, T, 2048] features from the Audar audio_tower
# at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
# posteriors = torch.sigmoid(head(audar_features)) # [B, T, 12]
# active = posteriors > 0.9 # binary speaker activity
Citation
If you use this model, please cite the eend_diar repo (audarai internal) and the Sortformer teacher:
@misc{audar_eend_v3_2026,
title = {Audar-ASR-Turbo Diarization (EEND v3)},
author = {AudarAI},
year = {2026},
url = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}
@misc{nvidia_sortformer_2024,
title = {Streaming Sortformer Diarization (4-spk v2)},
author = {NVIDIA},
year = {2024},
url = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
License
Apache 2.0. See LICENSE (Apache-2.0 default for audarai).