Audar-ASR-Turbo Diarization (EEND v3)

End-to-end neural diarization (EEND) head trained jointly with Sortformer distillation on top of the frozen Audar3-ASR-1.7B audio_tower (audarai/Audar-ASR-Turbo). It produces frame-level multi-speaker activity posteriors at 13 fps with K=12 sigmoid output channels, suitable for diarizing long audio with up to 12 simultaneous speaker tracks. It was trained on 2663 h of synthetic multi-speaker mixtures plus soft-label distillation from nvidia/diar_streaming_sortformer_4spk-v2. This is the v3 best checkpoint (step 2500); beyond this step the model overfits and leaderboard DER regresses.

What this model DOES / DOES NOT do

DOES: Frame-level multi-speaker activity detection (K=12 sigmoid posteriors at 13 fps). Produces a per-frame, per-track speech / non-speech decision. Used downstream for speaker segmentation, turn-taking, overlap detection, and as a front-end for speaker attribution.
DOES NOT: Audio-to-text transcription. ASR is handled by the base model (audarai/Audar-ASR-Turbo). This repo only contains the diarization head — you still need the base audio_tower to extract the 2048-dim features the head consumes.

Audit-grade DER on 8 public leaderboards

All numbers below are audit-grade: DER is micro-aggregated as Σ_errors / Σ_total_speech (durations are summed across recordings before dividing, rather than averaging per-recording DER), with collar=0.25 s, threshold=0.9, fps=13, K=12, evaluated on the official held-out splits. The Sortformer column is nvidia/diar_streaming_sortformer_4spk-v2 evaluated under the same protocol; these are not the numbers reported by NVIDIA, which use different aggregation and collar.
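The difference between micro-aggregation and a per-recording average can be sketched as follows. This is a minimal illustration, not the audit code: the `FileScore` container, the function names, and the example durations are hypothetical, and the error term is assumed to follow the usual DER decomposition (missed speech + false alarm + speaker confusion).

```python
from dataclasses import dataclass


@dataclass
class FileScore:
    """Per-recording scoring totals in seconds (hypothetical container)."""
    missed: float
    false_alarm: float
    confusion: float
    total_speech: float

    @property
    def errors(self) -> float:
        # Assumed decomposition of the DER numerator.
        return self.missed + self.false_alarm + self.confusion


def micro_der(scores):
    """Sum errors and speech over all recordings, then divide once."""
    total_errors = sum(s.errors for s in scores)
    total_speech = sum(s.total_speech for s in scores)
    if total_speech <= 0:
        raise ValueError("total reference speech must be positive")
    return total_errors / total_speech


def mean_per_file_der(scores):
    """Average of per-recording DER, shown only for contrast."""
    return sum(s.errors / s.total_speech for s in scores) / len(scores)


# Made-up durations: a short, hard recording and a long, easy one.
scores = [
    FileScore(missed=2.0, false_alarm=1.0, confusion=1.0, total_speech=10.0),
    FileScore(missed=5.0, false_alarm=3.0, confusion=2.0, total_speech=100.0),
]

print(f"micro-aggregated DER: {micro_der(scores):.4f}")
print(f"mean per-file DER:    {mean_per_file_der(scores):.4f}")
```

With micro-aggregation, long recordings weigh more than short ones, so a few short, hard files cannot dominate the corpus-level number.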

Corpus	Audar v3 DER	Sortformer DER
VoxConverse (dev)	21.11%	11.65%
AliMeeting	32.74%	26.43%
ICSI	40.32%	30.81%
MSDWild few	36.81%	27.75%
AMI	46.56%	37.34%
MSDWild many	45.64%	41.98%
DipCo	47.58%	38.58%
CHiME-6	69.65% ✅	71.80%
MACRO avg	42.55%	35.79%

Audar v3 beats Sortformer on CHiME-6 (the hardest corpus: far-field, multi-party, dinner-table recordings) by 2.15 percentage points absolute DER. Sortformer is still ahead on each of the other 7 corpora and in the macro average. This is intentional: v3 is the first checkpoint in the v3 lineage to beat Sortformer on CHiME-6, and it is being released as a hardware-friendly, distillation-compatible baseline for the v4 program.
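The macro average and the CHiME-6 margin can be reproduced from the table with a few lines of Python. The figures are copied from the table above; the variable names are illustrative only, and the macro average is taken to be the unweighted mean over the 8 corpora.

```python
# Per-corpus DER in percent: (Audar v3, Sortformer), copied from the table.
der = {
    "VoxConverse (dev)": (21.11, 11.65),
    "AliMeeting": (32.74, 26.43),
    "ICSI": (40.32, 30.81),
    "MSDWild few": (36.81, 27.75),
    "AMI": (46.56, 37.34),
    "MSDWild many": (45.64, 41.98),
    "DipCo": (47.58, 38.58),
    "CHiME-6": (69.65, 71.80),
}

# Macro average: unweighted mean over corpora.
audar_macro = sum(a for a, _ in der.values()) / len(der)
sortformer_macro = sum(s for _, s in der.values()) / len(der)

# Absolute margin on CHiME-6, in percentage points.
chime6_gap = der["CHiME-6"][1] - der["CHiME-6"][0]

# Corpora where Audar v3 has the lower DER.
audar_wins = [name for name, (a, s) in der.items() if a < s]

print(f"Audar v3 macro DER:   {audar_macro:.2f}%")
print(f"Sortformer macro DER: {sortformer_macro:.2f}%")
print(f"CHiME-6 margin:       {chime6_gap:.2f} points")
print(f"Audar v3 ahead on:    {audar_wins}")
```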

Note: The internal validation set internal_synthetic_val, tracked during training, is NOT a leaderboard and is not reported here. Only public-test-set DER counts.

Architecture

Encoder (frozen): audarai/Audar-ASR-Turbo audio_tower → 2048-dim features at 13 fps.
Head (trainable, ~25M params):
- 4 × Conformer-style blocks, d_model=512, n_heads=8, conv kernel size 15, dropout 0.2.
- K_max=12 sigmoid output channels (per-track speaker activity).
- Soft-target Sortformer distillation auxiliary loss (sortformer_weight=0.3).
Frame rate: 13 Hz (≈77 ms hop).
Input dtype: bfloat16.
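As a rough guide to the tensor shapes involved, the frame-rate arithmetic can be sketched as below. The helper names are hypothetical, and rounding the frame count down is an assumption: the exact number of frames depends on how the audio_tower pads and strides its input, which this card does not specify.

```python
import math

FRAME_RATE_HZ = 13   # frame rate stated in this card
FEATURE_DIM = 2048   # audio_tower feature size stated in this card
K_MAX = 12           # number of output tracks stated in this card


def hop_ms() -> float:
    """Duration of one frame in milliseconds (1000 / 13, about 76.9 ms)."""
    return 1000.0 / FRAME_RATE_HZ


def num_frames(duration_s: float) -> int:
    """Approximate frame count for a clip; floor rounding is an assumption."""
    return math.floor(duration_s * FRAME_RATE_HZ)


def frame_to_time(frame_index: int) -> float:
    """Start time in seconds of a given frame."""
    return frame_index / FRAME_RATE_HZ


def head_shapes(batch: int, duration_s: float):
    """Expected (input, output) shapes of the head for a clip of this length."""
    t = num_frames(duration_s)
    return (batch, t, FEATURE_DIM), (batch, t, K_MAX)


in_shape, out_shape = head_shapes(batch=1, duration_s=60.0)
print(f"hop: {hop_ms():.1f} ms, input {in_shape}, output {out_shape}")
```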

Inference convention

threshold = 0.9 (the optimal operating point per the v3 audit sweep)
fps = 13
collar = 0.25 s (the forgiveness collar used in VoxConverse scoring)
K_max = 12
Sample rate: 16 kHz
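The operating point above can be applied to the head's posteriors roughly as follows. This is a dependency-free sketch, not the post-processing used in the audit: the function name and the toy posteriors are hypothetical, and no smoothing or minimum-duration filtering is applied.

```python
def posteriors_to_segments(posteriors, threshold=0.9, fps=13):
    """Turn frame-level posteriors into (track, start_s, end_s) segments.

    posteriors: list of T frames, each a list of K floats in [0, 1].
    A frame is active on a track when its posterior is strictly above
    the threshold, matching `posteriors > 0.9` in the inference example.
    """
    segments = []
    if not posteriors:
        return segments
    n_frames = len(posteriors)
    n_tracks = len(posteriors[0])
    for k in range(n_tracks):
        start = None  # frame index where the current active run began
        for t in range(n_frames):
            active = posteriors[t][k] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((k, start / fps, t / fps))
                start = None
        if start is not None:
            # The run extends to the end of the clip.
            segments.append((k, start / fps, n_frames / fps))
    return segments


# Toy example with 5 frames and 2 tracks (made-up values).
toy = [
    [0.95, 0.10],
    [0.97, 0.20],
    [0.50, 0.95],
    [0.10, 0.99],
    [0.10, 0.92],
]
for track, start_s, end_s in posteriors_to_segments(toy):
    print(f"track {track}: {start_s:.3f} s to {end_s:.3f} s")
```

Because each track is thresholded independently, overlapping speech shows up naturally as segments on different tracks that share a time range.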

Training

Data: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers per mixture) + Sortformer teacher distillation.
Optimizer: AdamW, lr=3e-4, 1000 warmup steps, gradient clip 1.0.
Schedule: 8000 steps planned; step 2500 is the best by audit DER — past 2500 the model overfits and macro DER regresses.
Distillation teacher: nvidia/diar_streaming_sortformer_4spk-v2, weight 0.3.
Distributed: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
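For orientation, the warmup stated above might look like the following. The card gives only the peak learning rate and the number of warmup steps; the linear ramp and the constant rate afterwards are assumptions, and the function name is hypothetical.

```python
def learning_rate(step: int, peak_lr: float = 3e-4, warmup_steps: int = 1000) -> float:
    """Learning rate at a 0-based optimizer step.

    Assumption: linear ramp to peak_lr over warmup_steps, then constant.
    Any decay after warmup is not described in this card.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 499, 999, 2500):
    print(f"step {step:>4}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the released checkpoint (step 2500) was taken well after warmup had finished.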

Files

eend_v3_step2500.pt — the v3 best checkpoint. PyTorch state dict containing nar (the EEND head), ctc (auxiliary CTC), and speaker_attn state dicts. ~125 MB.
config.json — head hyperparameters and audit-best operating point.
README.md — this file.

Inference example

import torch
from huggingface_hub import hf_hub_download

# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

# 2. Construct the head — you need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND
head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])

# 3. Forward
#    The head consumes [B, T, 2048] features from the Audar audio_tower
#    at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active     = posteriors > 0.9                     # binary speaker activity

Citation

If you use this model please cite the eend_diar repo (audarai internal) and the Sortformer teacher:

@misc{audar_eend_v3_2026,
  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
  author = {AudarAI},
  year   = {2026},
  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}

@misc{nvidia_sortformer_2024,
  title  = {Streaming Sortformer Diarization (4-spk v2)},
  author = {NVIDIA},
  year   = {2024},
  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}

License

Apache 2.0. See LICENSE (Apache-2.0 default for audarai).
