Audar-ASR-Turbo Diarization (EEND v3)

End-to-end neural diarization (EEND) head trained jointly with Sortformer distillation on top of the frozen Audar3-ASR-1.7B audio_tower (audarai/Audar-ASR-Turbo). Produces frame-level multi-speaker activity posteriors at 13 fps with K=12 sigmoid output channels, suitable for diarizing long audio with up to 12 simultaneous speaker tracks. Trained with 2663h of synthetic multi-speaker mixtures + soft-label distillation from nvidia/diar_streaming_sortformer_4spk-v2. This is the v3 best checkpoint (step 2500); beyond this step the model overfits and leaderboard DER regresses.

What this model DOES / DOES NOT do

  • DOES: Frame-level multi-speaker activity detection (K=12 sigmoid posteriors at 13 fps). Produces a per-frame, per-track speech / non-speech decision. Used downstream for speaker segmentation, turn-taking, overlap detection, and as a front-end for speaker attribution.
  • DOES NOT: Audio-to-text transcription. ASR is handled by the base model (audarai/Audar-ASR-Turbo). This repo only contains the diarization head; you still need the base audio_tower to extract the 2048-dim features the head consumes.

Audit-grade DER on 8 public leaderboards

All numbers below are audit-grade: Σ_errors / Σ_total_speech (audit-correct micro-aggregation), collar=0.25s, threshold=0.9, fps=13, K=12, evaluated with the official held-out splits. The Sortformer column is nvidia/diar_streaming_sortformer_4spk-v2 evaluated under the same protocol, not the numbers reported by NVIDIA, which use different aggregation and collar.
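To make the aggregation concrete, here is a minimal sketch of micro-aggregation (Σ_errors / Σ_total_speech) next to a plain mean of per-file DERs. The function names and the per-file numbers are hypothetical and are not taken from the audit tooling; error time is assumed to already combine missed speech, false alarm and speaker confusion after the collar is applied.

```python
def micro_der(per_file):
    """Micro-aggregated DER: total error time over total reference speech time."""
    total_error = sum(err for err, _ in per_file)
    total_speech = sum(speech for _, speech in per_file)
    return total_error / total_speech


def mean_of_file_ders(per_file):
    """Unweighted mean of per-file DERs, shown only for contrast."""
    return sum(err / speech for err, speech in per_file) / len(per_file)


# Hypothetical (error_seconds, reference_speech_seconds) pairs, one per file.
files = [(10.0, 100.0), (30.0, 50.0)]

print(f"micro DER:         {micro_der(files):.4f}")
print(f"mean of file DERs: {mean_of_file_ders(files):.4f}")
```

With micro-aggregation, files with more reference speech carry more weight, so a short, hard recording cannot dominate a corpus-level number.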

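| Corpus            | Audar v3 DER | Sortformer DER |
|-------------------|--------------|----------------|
| VoxConverse (dev) | 21.11%       | 11.65%         |
| AliMeeting        | 32.74%       | 26.43%         |
| ICSI              | 40.32%       | 30.81%         |
| MSDWild few       | 36.81%       | 27.75%         |
| AMI               | 46.56%       | 37.34%         |
| MSDWild many      | 45.64%       | 41.98%         |
| DipCo             | 47.58%       | 38.58%         |
| CHiME-6           | 69.65% ✅    | 71.80%         |
| MACRO avg         | 42.55%       | 35.79%         |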

Audar v3 beats Sortformer on CHiME-6 (the hardest corpus: far-field, multi-party, dinner-table) by 2.15 points of absolute DER. On the other 7 corpora, and in the macro average, Sortformer is still ahead. This is intentional: v3 is the first checkpoint in the v3 lineage that clears the CHiME-6 crossover bar, and it is being released as a hardware-friendly, distillation-compatible baseline for the v4 program.
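The macro averages and the CHiME-6 gap can be re-derived from the per-corpus rows. The sketch below only copies the DER values from the table above; the variable names are illustrative.

```python
# DER values in percent, copied from the table: (Audar v3, Sortformer).
der = {
    "VoxConverse (dev)": (21.11, 11.65),
    "AliMeeting": (32.74, 26.43),
    "ICSI": (40.32, 30.81),
    "MSDWild few": (36.81, 27.75),
    "AMI": (46.56, 37.34),
    "MSDWild many": (45.64, 41.98),
    "DipCo": (47.58, 38.58),
    "CHiME-6": (69.65, 71.80),
}

# Macro average: unweighted mean over the 8 corpora.
audar_macro = sum(a for a, _ in der.values()) / len(der)
sortformer_macro = sum(s for _, s in der.values()) / len(der)

# Absolute DER gap on CHiME-6 (positive means Audar v3 is lower).
chime6_gap = der["CHiME-6"][1] - der["CHiME-6"][0]

# Corpora where Audar v3 has the lower DER.
audar_wins = [name for name, (a, s) in der.items() if a < s]

print(f"Audar v3 macro:   {audar_macro:.2f}%")
print(f"Sortformer macro: {sortformer_macro:.2f}%")
print(f"CHiME-6 gap:      {chime6_gap:.2f}")
print(f"Audar v3 ahead on: {audar_wins}")
```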

Note: The internal validation set tracked during training (internal_synthetic_val) is NOT a leaderboard and is not reported here. Only public-test-set DER counts.

Architecture

  • Encoder (frozen): audarai/Audar-ASR-Turbo audio_tower → 2048-dim features at 13 fps.
  • Head (trainable, ~25M params):
    • 4 × Conformer-style blocks, d_model=512, n_heads=8, conv kernel size 15, dropout 0.2.
    • K_max=12 sigmoid output channels (per-track speaker activity).
    • Soft-target Sortformer distillation auxiliary loss (sortformer_weight=0.3).
  • Frame rate: 13 Hz (≈77 ms hop).
  • Input dtype: bfloat16.

Inference convention

  • threshold = 0.9 (the optimal operating point per the v3 audit sweep)
  • fps = 13
  • collar = 0.25 s (standard DIHARD / VoxConverse evaluation collar)
  • K_max = 12
  • Sample rate: 16 kHz
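As an illustration of this convention, the sketch below turns one track's per-frame posteriors into time-stamped segments by thresholding at 0.9 and converting frame indices to seconds at 13 fps. The helper name `track_to_segments` and the example posteriors are hypothetical; any smoothing or minimum-duration filtering used in the actual audit pipeline is not described here and is therefore left out.

```python
FPS = 13          # frames per second, from the inference convention
THRESHOLD = 0.9   # operating point, from the inference convention


def track_to_segments(posteriors, threshold=THRESHOLD, fps=FPS):
    """Turn one track's per-frame posteriors into (start_s, end_s) segments."""
    segments = []
    start = None  # frame index where the current active run began
    for i, p in enumerate(posteriors):
        active = p > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        # The track is still active at the end of the audio.
        segments.append((start / fps, len(posteriors) / fps))
    return segments


# Hypothetical posteriors for a single track (13 frames = 1 second of audio).
track = [0.1, 0.95, 0.97, 0.99, 0.5, 0.2, 0.92, 0.93, 0.1, 0.1, 0.1, 0.95, 0.96]
segments = track_to_segments(track)
for start_s, end_s in segments:
    print(f"{start_s:.3f}s - {end_s:.3f}s")
```

Running the same function over each of the 12 output channels gives one segment list per speaker track; overlapped speech shows up as segments from different tracks that intersect in time.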

Training

  • Data: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers per mixture) + Sortformer teacher distillation.
  • Optimizer: AdamW, lr=3e-4, 1000 warmup steps, gradient clip 1.0.
  • Schedule: 8000 steps planned; step 2500 is the best by audit DER. Past 2500 the model overfits and macro DER regresses.
  • Distillation teacher: nvidia/diar_streaming_sortformer_4spk-v2, weight 0.3.
  • Distributed: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
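For orientation, here is a minimal sketch of a learning-rate schedule consistent with the optimizer line above (peak lr 3e-4, 1000 warmup steps). The linear warmup shape and the constant rate after warmup are assumptions; this card does not specify either, and `lr_at` is a hypothetical helper.

```python
BASE_LR = 3e-4       # peak learning rate, from the training summary
WARMUP_STEPS = 1000  # warmup length, from the training summary


def lr_at(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warmup to base_lr, then constant. The actual
    warmup shape and post-warmup decay are not stated in this card.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 499, 999, 2500):
    print(step, lr_at(step))
```

Under these assumptions, the released step-2500 checkpoint sits well past warmup, at the full learning rate.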

Files

  • eend_v3_step2500.pt: the v3 best checkpoint. PyTorch state dict containing nar (the EEND head), ctc (auxiliary CTC), and speaker_attn state dicts. ~125 MB.
  • config.json: head hyperparameters and audit-best operating point.
  • README.md: this file.

Inference example

import torch
from huggingface_hub import hf_hub_download

# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

# 2. Construct the head. You need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND
head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])

# 3. Forward
#    The head consumes [B, T, 2048] features from the Audar audio_tower
#    at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active     = posteriors > 0.9                     # binary speaker activity

Citation

If you use this model, please cite the eend_diar repo (audarai internal) and the Sortformer teacher:

@misc{audar_eend_v3_2026,
  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
  author = {AudarAI},
  year   = {2026},
  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}

@misc{nvidia_sortformer_2024,
  title  = {Streaming Sortformer Diarization (4-spk v2)},
  author = {NVIDIA},
  year   = {2024},
  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}

License

Apache 2.0. See LICENSE (Apache-2.0 default for audarai).
