---
license: apache-2.0
library_name: pytorch
tags:
- diarization
- eend
- speaker-diarization
- audio
- audarai
- speech
base_model: audarai/Audar-ASR-Turbo
pipeline_tag: voice-activity-detection
---
# Audar-ASR-Turbo Diarization (EEND v3)
End-to-end neural diarization (EEND) head trained jointly with Sortformer
distillation on top of the frozen Audar3-ASR-1.7B audio_tower
(`audarai/Audar-ASR-Turbo`). Produces frame-level multi-speaker activity
posteriors at 13 fps with K=12 sigmoid output channels, suitable for
diarizing long audio with up to 12 simultaneous speaker tracks. Trained
on 2663 h of synthetic multi-speaker mixtures plus soft-label distillation
from `nvidia/diar_streaming_sortformer_4spk-v2`. This is the best v3
checkpoint (step 2500); beyond this step the model overfits and
leaderboard DER regresses.
## What this model DOES / DOES NOT do
- **DOES**: Frame-level multi-speaker activity detection (K=12 sigmoid
posteriors at 13 fps). Produces a per-frame, per-track speech/non-speech
decision. Used downstream for speaker segmentation, turn-taking,
overlap detection, and as a front-end for speaker attribution.
- **DOES NOT**: Audio-to-text transcription. ASR is handled by the base
model (`audarai/Audar-ASR-Turbo`). This repo contains only the
diarization head; you still need the base audio_tower to extract the
2048-dim features the head consumes.
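As an illustration of one downstream use listed above (overlap detection), the sketch below flags frames in which two or more tracks are active. The function name `overlap_frames` and the toy posteriors are hypothetical; only the 0.9 threshold is taken from this card.

```python
from typing import List


def overlap_frames(posteriors: List[List[float]], threshold: float = 0.9) -> List[bool]:
    """Return one flag per frame: True when two or more tracks are active."""
    flags = []
    for frame in posteriors:
        # Count tracks whose posterior exceeds the operating threshold.
        n_active = sum(1 for p in frame if p > threshold)
        flags.append(n_active >= 2)
    return flags


# Toy [T=4, K=3] posteriors (made-up values, not model output).
toy = [
    [0.95, 0.10, 0.02],  # one active track
    [0.97, 0.93, 0.01],  # two active tracks: overlap
    [0.20, 0.96, 0.05],  # one active track
    [0.05, 0.03, 0.02],  # silence
]
flags = overlap_frames(toy)
```

The same per-frame count of active tracks also gives a simple speech/non-speech decision (count of at least one).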
## Audit-grade DER on 8 public leaderboards
All numbers below are **audit-grade**: `Σ_errors / Σ_total_speech`
(audit-correct micro-aggregation), `collar=0.25s`, `threshold=0.9`,
`fps=13`, `K=12`, evaluated with the official held-out splits.
The Sortformer column is `nvidia/diar_streaming_sortformer_4spk-v2`
evaluated under the same protocol, not the numbers reported by NVIDIA,
which use different aggregation and collar.
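To make the aggregation concrete, here is a minimal sketch of micro-aggregated DER (sum of error time over sum of reference speech time across a corpus) next to a plain mean of per-file DERs. The helper names and the toy durations are hypothetical and are not taken from the audit tooling.

```python
from typing import List, Tuple

# Each entry is (error_seconds, total_speech_seconds) for one recording,
# where error_seconds = missed speech + false alarm + speaker confusion.
FileStats = Tuple[float, float]


def micro_der(files: List[FileStats]) -> float:
    """Sum errors and speech over all files, then divide once."""
    total_err = sum(err for err, _ in files)
    total_speech = sum(speech for _, speech in files)
    return total_err / total_speech


def mean_of_file_der(files: List[FileStats]) -> float:
    """Average the per-file DERs; short files weigh as much as long ones."""
    return sum(err / speech for err, speech in files) / len(files)


# Made-up numbers: a long, easy file and a short, hard file.
toy_files = [(10.0, 100.0), (30.0, 50.0)]
micro = micro_der(toy_files)            # 40 / 150
per_file = mean_of_file_der(toy_files)  # (0.10 + 0.60) / 2
```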
| Corpus | Audar v3 DER | Sortformer DER |
|-------------------|----------------:|-----------------:|
| VoxConverse (dev) | 21.11% | 11.65% |
| AliMeeting | 32.74% | 26.43% |
| ICSI | 40.32% | 30.81% |
| MSDWild few | 36.81% | 27.75% |
| AMI | 46.56% | 37.34% |
| MSDWild many | 45.64% | 41.98% |
| DipCo | 47.58% | 38.58% |
| CHiME-6 | **69.65%** | 71.80% |
| **MACRO avg** | **42.55%** | 35.79% |
Audar v3 beats Sortformer on CHiME-6 (the hardest corpus: far-field,
multi-party, dinner-table recordings) by 2.15 points absolute DER. On
the other 7 corpora, and in the macro average, Sortformer is still
ahead. This is intentional: v3 is the first checkpoint in the v3 lineage
to cross over Sortformer on CHiME-6, and it is released as a
hardware-friendly, distillation-compatible baseline for the v4 program.
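The macro averages and the CHiME-6 gap can be re-derived from the table with a few lines of Python. The per-corpus values are copied from the table above; the variable names are only illustrative.

```python
# Per-corpus DER (%) copied from the table above.
audar = {
    "VoxConverse (dev)": 21.11, "AliMeeting": 32.74, "ICSI": 40.32,
    "MSDWild few": 36.81, "AMI": 46.56, "MSDWild many": 45.64,
    "DipCo": 47.58, "CHiME-6": 69.65,
}
sortformer = {
    "VoxConverse (dev)": 11.65, "AliMeeting": 26.43, "ICSI": 30.81,
    "MSDWild few": 27.75, "AMI": 37.34, "MSDWild many": 41.98,
    "DipCo": 38.58, "CHiME-6": 71.80,
}

# Macro average: unweighted mean over the 8 corpora.
macro_audar = round(sum(audar.values()) / len(audar), 2)
macro_sortformer = round(sum(sortformer.values()) / len(sortformer), 2)

# Absolute gap on CHiME-6 and the list of corpora where Audar v3 is ahead.
chime_gap = round(sortformer["CHiME-6"] - audar["CHiME-6"], 2)
audar_wins = [name for name in audar if audar[name] < sortformer[name]]
```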
> **Note**: An internal `internal_synthetic_val` validation set tracked
> during training is **NOT** a leaderboard and is not reported here. Only
> public-test-set DER counts.
## Architecture
- **Encoder (frozen)**: `audarai/Audar-ASR-Turbo` audio_tower, producing
2048-dim features at 13 fps.
- **Head (trainable, ~25M params)**:
  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`,
    conv kernel size 15, dropout 0.2.
  - `K_max=12` sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss
    (`sortformer_weight=0.3`).
- **Frame rate**: 13 Hz (≈77 ms hop).
- **Input dtype**: bfloat16.
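The frame-rate figures follow from simple arithmetic. The sketch below assumes the 16 kHz sample rate stated in this card's inference convention; `approx_num_frames` is a hypothetical helper, and the true frame count depends on the audio_tower's padding and striding, which this card does not specify.

```python
import math

FPS = 13              # frame rate from this card
SAMPLE_RATE = 16_000  # Hz, from the inference convention

hop_ms = 1000.0 / FPS                  # about 76.9 ms per frame
samples_per_frame = SAMPLE_RATE / FPS  # about 1230.8 samples per frame


def approx_num_frames(duration_s: float, fps: int = FPS) -> int:
    """Rough frame count for a clip; ignores encoder padding details."""
    return math.floor(duration_s * fps)


frames_per_minute = approx_num_frames(60.0)
```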
## Inference convention
- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
- `fps = 13`
- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
- `K_max = 12`
- Sample rate: 16 kHz
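Under this convention, thresholded posteriors can be turned into time-stamped segments. The following is a minimal pure-Python sketch, not the repository's post-processing code; `posteriors_to_segments` and the toy input are hypothetical, and no smoothing or minimum-duration filtering is applied.

```python
from typing import List, Tuple

Segment = Tuple[int, float, float]  # (track index, start seconds, end seconds)


def posteriors_to_segments(
    posteriors: List[List[float]], threshold: float = 0.9, fps: float = 13.0
) -> List[Segment]:
    """Convert [T][K] posteriors into per-track active segments."""
    segments: List[Segment] = []
    if not posteriors:
        return segments
    n_frames = len(posteriors)
    n_tracks = len(posteriors[0])
    for k in range(n_tracks):
        start = None  # frame index where the current active run began
        for t in range(n_frames):
            active = posteriors[t][k] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((k, start / fps, t / fps))
                start = None
        if start is not None:  # run extends to the end of the clip
            segments.append((k, start / fps, n_frames / fps))
    return segments


# Toy [T=4, K=2] posteriors (made-up values, not model output).
toy = [[0.95, 0.10], [0.97, 0.95], [0.30, 0.99], [0.05, 0.20]]
segs = posteriors_to_segments(toy)
```

The collar is not used here: it is applied when scoring DER, not when producing segments.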
## Training
- **Data**: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers
per mixture) + Sortformer teacher distillation.
- **Optimizer**: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
- **Schedule**: 8000 steps planned; **step 2500 is the best by audit
DER**. Past step 2500 the model overfits and macro DER regresses.
- **Distillation teacher**: `nvidia/diar_streaming_sortformer_4spk-v2`,
weight `0.3`.
- **Distributed**: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
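This card gives only the distillation weight (0.3), not the exact loss. As a rough sketch of how a hard-label term and a soft-target teacher term might be combined, the code below assumes binary cross-entropy for both terms. The function names are hypothetical, and the sketch ignores speaker-permutation handling and the mapping from the 4-speaker teacher to the 12 student tracks.

```python
import math
from typing import Sequence


def bce(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; targets may be hard (0/1) or soft."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def combined_loss(
    student: Sequence[float],
    hard_labels: Sequence[float],
    teacher_soft: Sequence[float],
    sortformer_weight: float = 0.3,  # weight stated in this card
) -> float:
    """ASSUMED form: supervised BCE plus weighted BCE against teacher posteriors."""
    return bce(student, hard_labels) + sortformer_weight * bce(student, teacher_soft)


# Toy values for two frame/track cells (made up).
loss = combined_loss([0.5, 0.5], [1.0, 0.0], [0.5, 0.5])
```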
## Files
- `eend_v3_step2500.pt`: the v3 best checkpoint. PyTorch checkpoint
dictionary containing the `nar` (the EEND head), `ctc` (auxiliary CTC),
and `speaker_attn` state dicts. ~125 MB.
- `config.json`: head hyperparameters and audit-best operating point.
- `README.md`: this file.
## Inference example
```python
import torch
from huggingface_hub import hf_hub_download
# 1. Download the checkpoint
ckpt_path = hf_hub_download(
"audarai/Audar-ASR-Turbo_diarization",
"eend_v3_step2500.pt",
)
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")
# 2. Construct the head. You need the NARDiarHeadEEND class from
# https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND
head = (
NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
.cuda()
.bfloat16()
.eval()
)
head.load_state_dict(state["nar"])
# 3. Forward
# The head consumes [B, T, 2048] features from the Audar audio_tower
# at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
# posteriors = torch.sigmoid(head(audar_features)) # [B, T, 12]
# active = posteriors > 0.9 # binary speaker activity
```
## Citation
If you use this model, please cite the eend_diar repo (audarai
internal) and the Sortformer teacher:
```bibtex
@misc{audar_eend_v3_2026,
title = {Audar-ASR-Turbo Diarization (EEND v3)},
author = {AudarAI},
year = {2026},
url = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}
@misc{nvidia_sortformer_2024,
title = {Streaming Sortformer Diarization (4-spk v2)},
author = {NVIDIA},
year = {2024},
url = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
```
## License
Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).