---
license: apache-2.0
library_name: pytorch
tags:
- diarization
- eend
- speaker-diarization
- audio
- audarai
- speech
base_model: audarai/Audar-ASR-Turbo
pipeline_tag: voice-activity-detection
---

# Audar-ASR-Turbo Diarization (EEND v3)

End-to-end neural diarization (EEND) head trained jointly with Sortformer
distillation on top of the frozen Audar3-ASR-1.7B audio_tower
(`audarai/Audar-ASR-Turbo`). It produces frame-level multi-speaker
activity posteriors at 13 fps through K=12 sigmoid output channels, which
makes it suitable for diarizing long audio with up to 12 simultaneous
speaker tracks. It was trained on 2663 h of synthetic multi-speaker
mixtures with soft-label distillation from
`nvidia/diar_streaming_sortformer_4spk-v2`. This is the best v3
checkpoint (step 2500); beyond this step the model overfits and
leaderboard DER regresses.

## What this model DOES / DOES NOT do

- **DOES**: Frame-level multi-speaker activity detection (K=12 sigmoid
  posteriors at 13 fps). Produces a per-frame, per-track speech /
  non-speech decision, used downstream for speaker segmentation,
  turn-taking, overlap detection, and as a front-end for speaker
  attribution.
- **DOES NOT**: Audio-to-text transcription. ASR is handled by the base
  model (`audarai/Audar-ASR-Turbo`). This repo contains only the
  diarization head, so you still need the base audio_tower to extract
  the 2048-dim features the head consumes.
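
As a rough illustration of the overlap-detection use listed above, the sketch below counts active tracks per frame and flags frames with two or more active tracks. The function names and the toy posteriors are hypothetical, and plain nested lists stand in for the head's output.

```python
from typing import List, Sequence


def active_counts(
    posteriors: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Number of tracks whose posterior exceeds the threshold, per frame."""
    return [sum(p > threshold for p in frame) for frame in posteriors]


def overlap_frames(
    posteriors: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[bool]:
    """True for frames where at least two tracks are active at once."""
    return [n >= 2 for n in active_counts(posteriors, threshold)]


# Toy input: 4 frames x 3 tracks (the real head emits 12 tracks per frame).
toy = [
    [0.95, 0.10, 0.02],  # one active track
    [0.97, 0.93, 0.05],  # two active tracks -> overlap
    [0.20, 0.30, 0.10],  # no active track
    [0.50, 0.99, 0.91],  # two active tracks -> overlap
]
counts = active_counts(toy)
overlaps = overlap_frames(toy)
print(counts)    # [1, 2, 0, 2]
print(overlaps)  # [False, True, False, True]
```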

## Audit-grade DER on 8 public leaderboards

All numbers below are **audit-grade**: each corpus DER is
`Σ_errors / Σ_total_speech` pooled over all files (micro-aggregation),
with `collar=0.25s`, `threshold=0.9`, `fps=13`, `K=12`, evaluated on the
official held-out splits. The Sortformer column is
`nvidia/diar_streaming_sortformer_4spk-v2` evaluated under the same
protocol, not the numbers reported by NVIDIA, which use different
aggregation and collar.
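
The micro-aggregation described above pools error time and reference speech time across files before dividing, so long files weigh more than short ones. A minimal sketch, assuming the per-file durations (in seconds) have already been produced by a scoring tool; the `FileScore` type and the toy numbers are hypothetical and are not results from this card:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FileScore:
    missed: float        # seconds of missed speech
    false_alarm: float   # seconds of false-alarm speech
    confusion: float     # seconds attributed to the wrong speaker
    total_speech: float  # seconds of reference speech


def micro_der(scores: Iterable[FileScore]) -> float:
    """Sum of errors over sum of reference speech, pooled across files."""
    scores = list(scores)
    errors = sum(s.missed + s.false_alarm + s.confusion for s in scores)
    total = sum(s.total_speech for s in scores)
    if total <= 0:
        raise ValueError("total reference speech must be positive")
    return errors / total


def mean_of_file_ders(scores: Iterable[FileScore]) -> float:
    """Unweighted mean of per-file DER, shown only for contrast."""
    scores = list(scores)
    ders = [(s.missed + s.false_alarm + s.confusion) / s.total_speech for s in scores]
    return sum(ders) / len(ders)


toy = [
    FileScore(missed=10.0, false_alarm=5.0, confusion=5.0, total_speech=100.0),  # 20% DER
    FileScore(missed=1.0, false_alarm=1.0, confusion=3.0, total_speech=10.0),    # 50% DER
]
micro = micro_der(toy)                  # (20 + 5) / (100 + 10)
per_file_mean = mean_of_file_ders(toy)  # (0.20 + 0.50) / 2
print(f"{micro:.4f} {per_file_mean:.4f}")  # 0.2273 0.3500
```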

| Corpus            |    Audar v3 DER |   Sortformer DER |
|-------------------|----------------:|-----------------:|
| VoxConverse (dev) |          21.11% |           11.65% |
| AliMeeting        |          32.74% |           26.43% |
| ICSI              |          40.32% |           30.81% |
| MSDWild few       |          36.81% |           27.75% |
| AMI               |          46.56% |           37.34% |
| MSDWild many      |          45.64% |           41.98% |
| DipCo             |          47.58% |           38.58% |
| CHiME-6           | **69.65%** ✅   |           71.80% |
| **MACRO avg**     |      **42.55%** |           35.79% |

Audar v3 beats Sortformer on CHiME-6 (the hardest, far-field, multi-
party dinner-table corpus) by 2.15 absolute DER points. Sortformer
remains ahead on each of the other 7 corpora and in the macro average.
This is intentional: this is the first checkpoint in the v3 lineage to
beat Sortformer on CHiME-6, and it is released as a hardware-friendly,
distillation-compatible baseline for the v4 program.
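
The macro averages and the CHiME-6 gap can be re-derived from the table. The snippet below reuses only the numbers printed above and assumes the macro average is the unweighted mean over the 8 corpora.

```python
# DER values (percent) copied from the table above.
audar_v3 = {
    "VoxConverse (dev)": 21.11,
    "AliMeeting": 32.74,
    "ICSI": 40.32,
    "MSDWild few": 36.81,
    "AMI": 46.56,
    "MSDWild many": 45.64,
    "DipCo": 47.58,
    "CHiME-6": 69.65,
}
sortformer = {
    "VoxConverse (dev)": 11.65,
    "AliMeeting": 26.43,
    "ICSI": 30.81,
    "MSDWild few": 27.75,
    "AMI": 37.34,
    "MSDWild many": 41.98,
    "DipCo": 38.58,
    "CHiME-6": 71.80,
}


def macro_average(der_by_corpus: dict) -> float:
    """Unweighted mean of per-corpus DER."""
    return sum(der_by_corpus.values()) / len(der_by_corpus)


audar_macro = round(macro_average(audar_v3), 2)
sortformer_macro = round(macro_average(sortformer), 2)
chime6_gap = round(sortformer["CHiME-6"] - audar_v3["CHiME-6"], 2)
audar_wins = [name for name in audar_v3 if audar_v3[name] < sortformer[name]]

print(audar_macro, sortformer_macro)  # 42.55 35.79
print(chime6_gap, audar_wins)         # 2.15 ['CHiME-6']
```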

> **Note**: An internal `internal_synthetic_val` validation set tracked
> during training is **NOT** a leaderboard and is not reported here. Only
> public-test-set DER counts.

## Architecture

- **Encoder (frozen)**: `audarai/Audar-ASR-Turbo` audio_tower → 2048-dim
  features at 13 fps.
- **Head (trainable, ~25M params)**:
  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`,
    conv kernel size 15, dropout 0.2.
  - `K_max=12` sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss
    (`sortformer_weight=0.3`).
- **Frame rate**: 13 Hz (≈77 ms hop).
- **Input dtype**: bfloat16.
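
To make the role of `sortformer_weight=0.3` concrete, here is a minimal sketch of a weighted two-term objective. It assumes both terms are frame-level binary cross-entropy and that the teacher posteriors have already been aligned to the student's tracks. The actual loss form, any permutation handling, and the mapping from the 4-speaker teacher to the 12 output channels are not specified in this card, so every name below is hypothetical.

```python
import math
from typing import Sequence


def bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction; target may be soft (0..1)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def mean_bce(preds: Sequence[float], targets: Sequence[float]) -> float:
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal in length")
    return sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)


def combined_loss(
    student: Sequence[float],
    hard_labels: Sequence[float],
    teacher_soft: Sequence[float],
    sortformer_weight: float = 0.3,
) -> float:
    """Supervised term plus a weighted soft-target distillation term (assumed form)."""
    supervised = mean_bce(student, hard_labels)
    distill = mean_bce(student, teacher_soft)
    return supervised + sortformer_weight * distill


# Toy values: two (frame, track) cells flattened into lists.
student = [0.9, 0.2]
hard = [1.0, 0.0]
teacher = [0.8, 0.1]
loss = combined_loss(student, hard, teacher)
```

With a weight of 0 the objective reduces to the supervised term alone, which is one quick way to sanity-check an implementation of this shape.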

## Inference convention

- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
- `fps = 13`
- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
- `K_max = 12`
- Sample rate: 16 kHz
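
The frame rate and sample rate above imply the following conversions. This is a small arithmetic sketch; the helper names are hypothetical, and flooring the frame count is an assumption, since the audio_tower's exact padding behaviour is not described here.

```python
import math

FPS = 13               # frames per second, from the convention above
SAMPLE_RATE = 16_000   # Hz, from the convention above


def hop_seconds(fps: float = FPS) -> float:
    """Time between consecutive frames."""
    return 1.0 / fps


def frame_to_seconds(frame_index: int, fps: float = FPS) -> float:
    """Start time of a frame, in seconds."""
    return frame_index / fps


def num_frames(duration_s: float, fps: float = FPS) -> int:
    """Whole frames in a clip (assumes partial frames are dropped)."""
    return math.floor(duration_s * fps)


def samples_per_frame(sample_rate: int = SAMPLE_RATE, fps: float = FPS) -> float:
    """Average audio samples covered by one frame (not an integer at 13 fps)."""
    return sample_rate / fps


print(round(hop_seconds() * 1000, 1))  # 76.9 (ms)
print(num_frames(60.0))                # 780
print(round(samples_per_frame(), 2))   # 1230.77
```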

## Training

- **Data**: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers
  per mixture) + Sortformer teacher distillation.
- **Optimizer**: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
- **Schedule**: 8000 steps planned; **step 2500 is the best by audit
  DER**. Past step 2500 the model overfits and macro DER regresses.
- **Distillation teacher**: `nvidia/diar_streaming_sortformer_4spk-v2`,
  weight `0.3`.
- **Distributed**: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
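
The optimizer settings above can be illustrated with a small schedule sketch. The card gives only the peak learning rate, the warmup length, and the clip value; the linear warmup shape, the constant rate afterwards, and the helper names are assumptions. In a PyTorch training loop these roles would typically be filled by `torch.optim.AdamW` and `torch.nn.utils.clip_grad_norm_`.

```python
import math
from typing import List, Sequence


def lr_at(step: int, base_lr: float = 3e-4, warmup_steps: int = 1000) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then constant (assumed)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


halfway = lr_at(499)       # half of the peak rate under linear warmup
at_best_step = lr_at(2500)  # step of the released checkpoint
clipped = clip_by_global_norm([3.0, 4.0])  # norm 5.0 rescaled to norm 1.0
```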

## Files

- `eend_v3_step2500.pt` — the v3 best checkpoint. PyTorch state dict
  containing `nar` (the EEND head), `ctc` (auxiliary CTC), and
  `speaker_attn` state dicts. ~125 MB.
- `config.json` — head hyperparameters and audit-best operating point.
- `README.md` — this file.
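
To see what the checkpoint holds before building the head, a sketch like the following can list the top-level entries and their parameter counts. It assumes the layout described above (a dict whose `nar`, `ctc`, and `speaker_attn` values are state dicts of tensors); `count_parameters`, `summarize_checkpoint`, and `main` are hypothetical helpers, and `main` is not called automatically because it downloads the file.

```python
from typing import Any, Dict, Mapping


def count_parameters(state_dict: Mapping[str, Any]) -> int:
    """Sum element counts over entries that expose .numel() (e.g. torch tensors)."""
    return sum(v.numel() for v in state_dict.values() if hasattr(v, "numel"))


def summarize_checkpoint(state: Mapping[str, Any]) -> Dict[str, int]:
    """Parameter count per top-level state dict; non-mapping entries are skipped."""
    return {
        name: count_parameters(sub)
        for name, sub in state.items()
        if isinstance(sub, Mapping)
    }


def main() -> None:
    # Imported here so the helpers above stay usable without torch installed.
    import torch
    from huggingface_hub import hf_hub_download

    ckpt_path = hf_hub_download(
        "audarai/Audar-ASR-Turbo_diarization",
        "eend_v3_step2500.pt",
    )
    # weights_only=False unpickles arbitrary objects; use trusted files only.
    state = torch.load(ckpt_path, weights_only=False, map_location="cpu")
    for name, n_params in summarize_checkpoint(state).items():
        print(f"{name}: {n_params / 1e6:.1f}M parameters")
```

Call `main()` explicitly to run it against the real checkpoint.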

## Inference example

```python
import torch
from huggingface_hub import hf_hub_download

# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
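# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints from a source you trust.
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")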

# 2. Construct the head — you need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND
head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])

# 3. Forward
#    The head consumes [B, T, 2048] features from the Audar audio_tower
#    at 13 fps; applying a sigmoid to its output gives [B, T, 12]
#    posteriors. `audar_features` must come from the base model's
#    audio_tower, which is not shown here.
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active     = posteriors > 0.9                     # binary speaker activity
```
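
Once posteriors are available, they can be turned into timed segments per track. The following is a minimal sketch under the inference convention above (`threshold = 0.9`, `fps = 13`); `posteriors_to_segments` is a hypothetical helper, it works on nested lists for one batch item (for a tensor, something like `posteriors[0].float().cpu().tolist()`), and it applies no smoothing or minimum-duration filtering.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, float, float]  # (track index, start seconds, end seconds)


def posteriors_to_segments(
    posteriors: Sequence[Sequence[float]],
    threshold: float = 0.9,
    fps: float = 13.0,
) -> List[Segment]:
    """Merge consecutive active frames of each track into (track, start, end)."""
    segments: List[Segment] = []
    if not posteriors:
        return segments
    n_frames = len(posteriors)
    n_tracks = len(posteriors[0])
    for k in range(n_tracks):
        start = None  # frame index where the current run began
        for t in range(n_frames):
            active = posteriors[t][k] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((k, start / fps, t / fps))
                start = None
        if start is not None:  # run still open at the end of the clip
            segments.append((k, start / fps, n_frames / fps))
    return segments


# Toy input: 5 frames x 2 tracks (the real head emits 12 tracks).
toy = [
    [0.95, 0.10],
    [0.97, 0.10],
    [0.20, 0.95],
    [0.10, 0.99],
    [0.10, 0.96],
]
segs = posteriors_to_segments(toy)
for track, start_s, end_s in segs:
    print(f"track {track}: {start_s:.3f}s to {end_s:.3f}s")
```

Track indices are output channels, not speaker identities; mapping them to named speakers is a separate attribution step.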

## Citation

If you use this model, please cite the eend_diar repo (audarai
internal) and the Sortformer teacher:

```bibtex
@misc{audar_eend_v3_2026,
  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
  author = {AudarAI},
  year   = {2026},
  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}

@misc{nvidia_sortformer_2024,
  title  = {Streaming Sortformer Diarization (4-spk v2)},
  author = {NVIDIA},
  year   = {2024},
  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
```

## License

Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).