---
license: apache-2.0
library_name: pytorch
tags:
- diarization
- eend
- speaker-diarization
- audio
- audarai
- speech
base_model: audarai/Audar-ASR-Turbo
pipeline_tag: voice-activity-detection
---
# Audar-ASR-Turbo Diarization (EEND v3)
End-to-end neural diarization (EEND) head trained jointly with Sortformer
distillation on top of the frozen Audar3-ASR-1.7B audio_tower
(`audarai/Audar-ASR-Turbo`). Produces frame-level multi-speaker activity
posteriors at 13 fps with K=12 sigmoid output channels, suitable for
diarizing long audio with up to 12 simultaneous speaker tracks. Trained
on 2663 hours of synthetic multi-speaker mixtures with soft-label distillation
from `nvidia/diar_streaming_sortformer_4spk-v2`. This is the v3 best
checkpoint (step 2500); beyond this step the model overfits and DER on
the public leaderboards regresses.
## What this model DOES / DOES NOT do
- **DOES**: Frame-level multi-speaker activity detection (K=12 sigmoid
posteriors at 13 fps). Produces a per-frame, per-track speech / non-speech
decision. Used downstream for speaker segmentation, turn-taking,
overlap detection, and as a front-end for speaker attribution.
- **DOES NOT**: Audio-to-text transcription. ASR is handled by the base
model (`audarai/Audar-ASR-Turbo`). This repo only contains the
diarization head; you still need the base audio_tower to extract the
2048-dim features the head consumes.
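
As an illustration of one downstream use listed above, the following minimal sketch derives per-frame overlap flags from the head's posteriors. It assumes the posteriors are already available as a plain nested list of shape `[T][K]`; the names `overlap_frames` and `overlap_seconds` are hypothetical and not part of this repo.

```python
from typing import List

THRESHOLD = 0.9  # operating point from this model card
FPS = 13         # frame rate from this model card


def overlap_frames(posteriors: List[List[float]],
                   threshold: float = THRESHOLD) -> List[bool]:
    """Return one flag per frame: True when two or more tracks are active."""
    flags = []
    for frame in posteriors:
        n_active = sum(1 for p in frame if p > threshold)
        flags.append(n_active >= 2)
    return flags


def overlap_seconds(posteriors: List[List[float]],
                    threshold: float = THRESHOLD,
                    fps: int = FPS) -> float:
    """Total overlapped duration, counting each flagged frame as 1/fps s."""
    return sum(overlap_frames(posteriors, threshold)) / fps


# Toy input: 4 frames, 3 tracks (the real head emits 12 tracks per frame).
toy = [
    [0.95, 0.10, 0.02],
    [0.97, 0.93, 0.01],
    [0.20, 0.96, 0.99],
    [0.05, 0.30, 0.40],
]
print(overlap_frames(toy))  # [False, True, True, False]
```

Counting active tracks per frame keeps the sketch independent of which track index a speaker lands on.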
## Audit-grade DER on 8 public leaderboards
All numbers below are **audit-grade**: `Σ_errors / Σ_total_speech`
(audit-correct micro-aggregation), `collar=0.25s`, `threshold=0.9`,
`fps=13`, `K=12`, evaluated on the official held-out splits.
The Sortformer column is `nvidia/diar_streaming_sortformer_4spk-v2`
evaluated under the same protocol, not the numbers reported by NVIDIA,
which use a different aggregation and collar.
| Corpus | Audar v3 DER | Sortformer DER |
|-------------------|----------------:|-----------------:|
| VoxConverse (dev) | 21.11% | 11.65% |
| AliMeeting | 32.74% | 26.43% |
| ICSI | 40.32% | 30.81% |
| MSDWild few | 36.81% | 27.75% |
| AMI | 46.56% | 37.34% |
| MSDWild many | 45.64% | 41.98% |
| DipCo | 47.58% | 38.58% |
| CHiME-6 | **69.65%** ✅ | 71.80% |
| **MACRO avg** | **42.55%** | 35.79% |
Audar v3 beats Sortformer on CHiME-6 (the hardest, far-field, multi-party
dinner-table corpus) by 2.15 percentage points of absolute DER. Sortformer
is still ahead on each of the other 7 corpora and in the macro average.
The release is nonetheless intentional: v3 is the first checkpoint in the
v3 lineage to clear the CHiME-6 crossover bar, and it is being released as
a hardware-friendly, distillation-compatible baseline for the v4 program.
> **Note**: An internal `internal_synthetic_val` validation set tracked
> during training is **NOT** a leaderboard and is not reported here. Only
> public-test-set DER counts.
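
The aggregation used for the table can be sketched as follows. Within a corpus, error time and reference speech time are summed over all recordings before dividing once (micro-aggregation); the macro average is then the unweighted mean of the per-corpus DERs. Splitting errors into missed speech, false alarm, and speaker confusion follows the standard DER definition. The `FileStats` container and the toy durations are illustrative assumptions, not the audit tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FileStats:
    """Per-recording error durations in seconds (hypothetical container)."""
    missed: float
    false_alarm: float
    confusion: float
    total_speech: float


def micro_der(files: Iterable[FileStats]) -> float:
    """Sum errors and reference speech over all files, then divide once."""
    files = list(files)
    errors = sum(f.missed + f.false_alarm + f.confusion for f in files)
    total = sum(f.total_speech for f in files)
    return errors / total


def macro_average(per_corpus_der: List[float]) -> float:
    """Unweighted mean of per-corpus DER values."""
    return sum(per_corpus_der) / len(per_corpus_der)


# Toy example: one long and one short recording. Micro-aggregation weights
# them by speech duration instead of averaging the per-file ratios.
toy = [FileStats(10.0, 5.0, 5.0, 100.0), FileStats(2.0, 1.0, 2.0, 10.0)]
print(f"micro DER: {micro_der(toy):.4f}")  # micro DER: 0.2273

# Per-corpus Audar v3 DERs (%) copied from the table in this section.
audar_v3 = [21.11, 32.74, 40.32, 36.81, 46.56, 45.64, 47.58, 69.65]
print(f"macro avg: {macro_average(audar_v3):.2f}%")  # macro avg: 42.55%
```

In the toy case the per-file ratios are 20% and 50%, so a per-file mean would give 35%, while micro-aggregation gives about 22.7% because the long recording dominates.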
## Architecture
- **Encoder (frozen)**: `audarai/Audar-ASR-Turbo` audio_tower → 2048-dim
  features at 13 fps.
- **Head (trainable, ~25M params)**:
  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`,
    conv kernel size 15, dropout 0.2.
  - `K_max=12` sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss
    (`sortformer_weight=0.3`).
- **Frame rate**: 13 Hz (≈77 ms hop).
- **Input dtype**: bfloat16.
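
For sizing buffers, the figures above translate into tensor shapes roughly as in the sketch below. It assumes the frame count is the duration times 13, rounded down; the exact count depends on the audio_tower's padding and striding, which this card does not specify. The helper names are hypothetical.

```python
FPS = 13            # head frame rate (Hz)
FEATURE_DIM = 2048  # audio_tower feature size
K_MAX = 12          # sigmoid output channels


def hop_ms(fps: int = FPS) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def expected_shapes(duration_s: float, batch: int = 1):
    """Return (input feature shape, output posterior shape) for one clip."""
    n_frames = int(duration_s * FPS)  # assumption: floor, no padding
    return (batch, n_frames, FEATURE_DIM), (batch, n_frames, K_MAX)


print(f"hop: {hop_ms():.1f} ms")  # hop: 76.9 ms
print(expected_shapes(60.0))      # ((1, 780, 2048), (1, 780, 12))
```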
## Inference convention
- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
- `fps = 13`
- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
- `K_max = 12`
- Sample rate: 16 kHz
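
The convention above can be applied to turn posteriors into time-stamped segments, as in this minimal sketch. It assumes posteriors are given as a nested list of shape `[T][K]` and that frame `t` covers the interval `[t/fps, (t+1)/fps)`. `frames_to_segments` is a hypothetical helper, and no smoothing or minimum-duration filtering is applied.

```python
from typing import List, Tuple

Segment = Tuple[int, float, float]  # (track index, start in s, end in s)


def frames_to_segments(posteriors: List[List[float]],
                       threshold: float = 0.9,
                       fps: int = 13) -> List[Segment]:
    """Threshold each track and merge consecutive active frames."""
    if not posteriors:
        return []
    n_tracks = len(posteriors[0])
    segments: List[Segment] = []
    for k in range(n_tracks):
        start = None  # frame index where the current active run began
        for t, frame in enumerate(posteriors):
            active = frame[k] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((k, start / fps, t / fps))
                start = None
        if start is not None:  # run still open at the end of the clip
            segments.append((k, start / fps, len(posteriors) / fps))
    return segments


toy = [[0.95, 0.10], [0.97, 0.95], [0.20, 0.99], [0.10, 0.30]]
for track, start, end in frames_to_segments(toy):
    print(f"track {track}: {start:.3f}s - {end:.3f}s")
# track 0: 0.000s - 0.154s
# track 1: 0.077s - 0.231s
```

Track indices are not speaker identities across recordings; they only separate concurrent speakers within one output.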
## Training
- **Data**: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers
per mixture) + Sortformer teacher distillation.
- **Optimizer**: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
- **Schedule**: 8000 steps planned; **step 2500 is the best by audit
DER**. Past step 2500 the model overfits and macro DER regresses.
- **Distillation teacher**: `nvidia/diar_streaming_sortformer_4spk-v2`,
weight `0.3`.
- **Distributed**: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
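
The learning-rate warmup and the distillation weight listed above can be illustrated in plain Python. This is a sketch under assumptions: the card does not state the warmup shape or what follows it, so linear warmup to a constant rate is assumed; it also does not specify the form of the hard-label and distillation losses, so binary cross-entropy is assumed for both. Gradient clipping and the alignment of the 4-speaker teacher outputs to the 12 student tracks are not covered here.

```python
import math

BASE_LR = 3e-4
WARMUP_STEPS = 1000
SORTFORMER_WEIGHT = 0.3


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then constant (assumed schedule)."""
    return BASE_LR * min(1.0, step / WARMUP_STEPS)


def bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction; the target may be soft."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def combined_loss(preds, hard_labels, teacher_probs,
                  weight: float = SORTFORMER_WEIGHT) -> float:
    """Mean hard-label BCE plus weighted mean BCE against teacher posteriors."""
    n = len(preds)
    hard = sum(bce(p, y) for p, y in zip(preds, hard_labels)) / n
    soft = sum(bce(p, q) for p, q in zip(preds, teacher_probs)) / n
    return hard + weight * soft


print(lr_at(500), lr_at(2500))
print(combined_loss([0.9, 0.2], [1.0, 0.0], [0.8, 0.1]))
```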
## Files
- `eend_v3_step2500.pt`: the v3 best checkpoint. PyTorch state dict
containing `nar` (the EEND head), `ctc` (auxiliary CTC), and
`speaker_attn` state dicts. ~125 MB.
- `config.json`: head hyperparameters and audit-best operating point.
- `README.md`: this file.
## Inference example
```python
import torch
from huggingface_hub import hf_hub_download

# The NARDiarHeadEEND class comes from
# https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND

# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints from a source you trust.
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

# 2. Construct the head and load the EEND weights stored under "nar"
head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])

# 3. Forward
# The head consumes [B, T, 2048] features from the Audar audio_tower
# at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active = posteriors > 0.9  # binary speaker activity
```
## Citation
If you use this model, please cite the eend_diar repo (audarai
internal) and the Sortformer teacher:
```bibtex
@misc{audar_eend_v3_2026,
title = {Audar-ASR-Turbo Diarization (EEND v3)},
author = {AudarAI},
year = {2026},
url = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}
@misc{nvidia_sortformer_2024,
title = {Streaming Sortformer Diarization (4-spk v2)},
author = {NVIDIA},
year = {2024},
url = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
```
## License
Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).