---
license: apache-2.0
library_name: pytorch
tags:
- diarization
- eend
- speaker-diarization
- audio
- audarai
- speech
base_model: audarai/Audar-ASR-Turbo
pipeline_tag: voice-activity-detection
---

# Audar-ASR-Turbo Diarization (EEND v3)

End-to-end neural diarization (EEND) head trained jointly with Sortformer
distillation on top of the frozen Audar3-ASR-1.7B audio_tower
(`audarai/Audar-ASR-Turbo`). It produces frame-level multi-speaker activity
posteriors at 13 fps with K=12 sigmoid output channels, suitable for
diarizing long audio with up to 12 simultaneous speaker tracks. It was
trained on 2663 h of synthetic multi-speaker mixtures plus soft-label
distillation from `nvidia/diar_streaming_sortformer_4spk-v2`. This is the
best v3 checkpoint (step 2500); beyond this step the model overfits and
leaderboard DER regresses.

## What this model DOES / DOES NOT do

- **DOES**: Frame-level multi-speaker activity detection (K=12 sigmoid
  posteriors at 13 fps). Produces a per-frame, per-track speech /
  non-speech decision. Used downstream for speaker segmentation,
  turn-taking, overlap detection, and as a front-end for speaker
  attribution.
- **DOES NOT**: Audio-to-text transcription. ASR is handled by the base
  model (`audarai/Audar-ASR-Turbo`). This repo contains only the
  diarization head; you still need the base audio_tower to extract the
  2048-dim features the head consumes.
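
Overlap detection, one of the downstream uses listed above, can be read directly off the thresholded per-track mask. The following is a minimal sketch, assuming the mask is available as a plain list of per-frame boolean lists; the helper names `speakers_per_frame` and `overlap_ratio` are illustrative and not part of this repo.

```python
def speakers_per_frame(active):
    """active: T frames, each a list of K booleans (posterior > threshold)."""
    return [sum(1 for is_on in frame if is_on) for frame in active]


def overlap_ratio(active):
    """Fraction of speech frames in which two or more tracks are active."""
    counts = speakers_per_frame(active)
    speech = sum(1 for c in counts if c >= 1)
    overlapped = sum(1 for c in counts if c >= 2)
    return overlapped / speech if speech else 0.0


# Toy mask: 4 frames, 3 tracks (the real head emits 12 tracks).
toy_active = [
    [True, False, False],   # one speaker
    [True, True, False],    # two speakers overlap
    [False, False, False],  # silence
    [False, True, True],    # two speakers overlap
]
print(speakers_per_frame(toy_active))  # [1, 2, 0, 2]
print(overlap_ratio(toy_active))       # 2 of the 3 speech frames overlap
```

Counting active tracks per frame keeps overlap detection independent of which track index a speaker lands on.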

## Audit-grade DER on 8 public leaderboards

All numbers below are **audit-grade**: `Σ_errors / Σ_total_speech`
(audit-correct micro-aggregation), `collar=0.25s`, `threshold=0.9`,
`fps=13`, `K=12`, evaluated on the official held-out splits.
The Sortformer column is `nvidia/diar_streaming_sortformer_4spk-v2`
evaluated under the same protocol. These are not the numbers reported by
NVIDIA, which use different aggregation and collar.
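
To make the aggregation concrete, here is a minimal sketch of micro-aggregated DER next to a per-file average. It assumes the standard DER decomposition (missed speech + false alarm + speaker confusion, in seconds) and hypothetical per-file counts; the `FileErrors` type and the numbers are illustrative and do not come from the audit.

```python
from dataclasses import dataclass


@dataclass
class FileErrors:
    missed: float        # seconds of missed speech
    false_alarm: float   # seconds of false-alarm speech
    confusion: float     # seconds attributed to the wrong speaker
    total_speech: float  # seconds of reference speech


def micro_der(files):
    """Sum errors and reference speech over all files, then divide once."""
    errors = sum(f.missed + f.false_alarm + f.confusion for f in files)
    total = sum(f.total_speech for f in files)
    return errors / total


def mean_file_der(files):
    """Average of per-file DERs, shown only for contrast."""
    per_file = [
        (f.missed + f.false_alarm + f.confusion) / f.total_speech for f in files
    ]
    return sum(per_file) / len(per_file)


# Hypothetical counts: a short clean file and a long noisy one.
files = [
    FileErrors(2.0, 1.0, 1.0, 40.0),
    FileErrors(30.0, 10.0, 20.0, 200.0),
]
print(micro_der(files))      # long files carry more weight
print(mean_file_der(files))  # every file carries equal weight
```

With these toy counts the two aggregations disagree (64/240 versus 0.2), which is why the aggregation scheme has to match before two systems' DERs are compared.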

| Corpus            | Audar v3 DER | Sortformer DER |
|-------------------|-------------:|---------------:|
| VoxConverse (dev) |       21.11% |         11.65% |
| AliMeeting        |       32.74% |         26.43% |
| ICSI              |       40.32% |         30.81% |
| MSDWild few       |       36.81% |         27.75% |
| AMI               |       46.56% |         37.34% |
| MSDWild many      |       45.64% |         41.98% |
| DipCo             |       47.58% |         38.58% |
| CHiME-6           |   **69.65%** |         71.80% |
| **MACRO avg**     |   **42.55%** |         35.79% |

Audar v3 beats Sortformer on CHiME-6 (the hardest corpus: far-field,
multi-party, dinner-table recordings) by 2.15 absolute DER. Sortformer is
still ahead on the other 7 corpora and in macro-average. This is
intentional: v3 is the first checkpoint in the v3 lineage that crosses the
CHiME-6 crossover bar, and it is being released as a hardware-friendly,
distillation-compatible baseline for the v4 program.
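
The summary figures can be recomputed from the table. The sketch below copies the per-corpus DER values from the table above and recomputes the macro averages and the CHiME-6 gap; the variable and function names are illustrative.

```python
# Per-corpus DER (%) copied from the table above.
audar_der = {
    "VoxConverse (dev)": 21.11, "AliMeeting": 32.74, "ICSI": 40.32,
    "MSDWild few": 36.81, "AMI": 46.56, "MSDWild many": 45.64,
    "DipCo": 47.58, "CHiME-6": 69.65,
}
sortformer_der = {
    "VoxConverse (dev)": 11.65, "AliMeeting": 26.43, "ICSI": 30.81,
    "MSDWild few": 27.75, "AMI": 37.34, "MSDWild many": 41.98,
    "DipCo": 38.58, "CHiME-6": 71.80,
}


def macro_avg(der_by_corpus):
    """Unweighted mean over corpora: every corpus counts equally."""
    return sum(der_by_corpus.values()) / len(der_by_corpus)


# Corpora where the Audar head has the lower (better) DER.
audar_wins = [c for c in audar_der if audar_der[c] < sortformer_der[c]]
chime_gap = sortformer_der["CHiME-6"] - audar_der["CHiME-6"]

print(round(macro_avg(audar_der), 2))       # 42.55
print(round(macro_avg(sortformer_der), 2))  # 35.79
print(audar_wins)                           # ['CHiME-6']
print(round(chime_gap, 2))                  # 2.15
```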

> **Note**: An internal `internal_synthetic_val` validation set tracked
> during training is **NOT** a leaderboard and is not reported here. Only
> public-test-set DER counts.

## Architecture

- **Encoder (frozen)**: `audarai/Audar-ASR-Turbo` audio_tower, producing
  2048-dim features at 13 fps.
- **Head (trainable, ~25M params)**:
  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`,
    conv kernel size 15, dropout 0.2.
  - `K_max=12` sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss
    (`sortformer_weight=0.3`).
- **Frame rate**: 13 Hz (≈77 ms hop).
- **Input dtype**: bfloat16.
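
The frame-rate and feature-size figures above reduce to simple arithmetic. The following is a small sketch, assuming the frame count rounds up for a partial trailing frame (this card does not state the rounding); the helper names are illustrative.

```python
import math

FPS = 13            # frames per second, from this card
FEATURE_DIM = 2048  # audio_tower feature width, from this card
BYTES_BF16 = 2      # a bfloat16 value occupies 2 bytes


def hop_ms(fps=FPS):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def num_frames(duration_s, fps=FPS):
    # Assumption: a partial trailing frame is rounded up.
    return math.ceil(duration_s * fps)


def feature_bytes(duration_s):
    """Size of the [T, 2048] bfloat16 feature tensor for one recording."""
    return num_frames(duration_s) * FEATURE_DIM * BYTES_BF16


print(round(hop_ms(), 1))   # 76.9
print(num_frames(3600))     # 46800
print(feature_bytes(3600))  # 191692800
```

Under these assumptions, one hour of audio yields 46,800 frames, and its feature tensor takes roughly 192 MB before batching.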

## Inference convention

- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
- `fps = 13`
- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
- `K_max = 12`
- Sample rate: 16 kHz
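
As an illustration of how these settings combine, the sketch below turns a `[T, K]` grid of posteriors into per-track time segments using `threshold = 0.9` and `fps = 13`. It assumes frame `t` covers the interval from `t / fps` to `(t + 1) / fps` seconds and applies no smoothing or minimum-duration filtering; the function name `posteriors_to_segments` is hypothetical and not part of this repo.

```python
def posteriors_to_segments(posteriors, threshold=0.9, fps=13):
    """posteriors: T frames, each a list of K floats in [0, 1].

    Returns (track, start_s, end_s) tuples for each run of active frames.
    """
    if not posteriors:
        return []
    segments = []
    n_tracks = len(posteriors[0])
    for k in range(n_tracks):
        start = None  # frame index where the current active run began
        for t, frame in enumerate(posteriors):
            is_active = frame[k] > threshold
            if is_active and start is None:
                start = t
            elif not is_active and start is not None:
                segments.append((k, start / fps, t / fps))
                start = None
        if start is not None:  # run extends to the end of the audio
            segments.append((k, start / fps, len(posteriors) / fps))
    return segments


# Toy input: 5 frames, 2 tracks (the real head emits 12 tracks).
toy = [
    [0.95, 0.10],
    [0.97, 0.20],
    [0.50, 0.95],
    [0.10, 0.99],
    [0.10, 0.30],
]
for track, start_s, end_s in posteriors_to_segments(toy):
    print(f"track {track}: {start_s:.3f}s to {end_s:.3f}s")
```

In the toy input, track 0 is active for frames 0 to 1 and track 1 for frames 2 to 3, so each track yields one segment two frames long.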

## Training

- **Data**: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers
  per mixture) plus Sortformer teacher distillation.
- **Optimizer**: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
- **Schedule**: 8000 steps planned; **step 2500 is the best by audit
  DER**. Past step 2500 the model overfits and macro DER regresses.
- **Distillation teacher**: `nvidia/diar_streaming_sortformer_4spk-v2`,
  weight `0.3`.
- **Distributed**: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
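
The distillation weight can be pictured as a weighted sum of two losses. This is only a sketch: it assumes both terms are binary cross-entropy, the first against hard 0/1 labels and the second against the teacher's soft posteriors. It ignores speaker-permutation handling and the mapping from the 4-speaker teacher onto 12 tracks, neither of which this card describes. The function names are illustrative.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def combined_loss(student, hard_labels, teacher_soft, sortformer_weight=0.3):
    """Assumed form: supervised BCE plus weighted soft-target BCE."""
    supervised = bce(student, hard_labels)
    distillation = bce(student, teacher_soft)
    return supervised + sortformer_weight * distillation


# Toy values for a single track over two frames.
student = [0.5, 0.5]
hard_labels = [1.0, 0.0]
teacher_soft = [0.5, 0.5]
print(combined_loss(student, hard_labels, teacher_soft))
```

In this toy case both terms equal ln 2, so the total is 1.3 × ln 2; the 0.3 weight keeps the teacher signal auxiliary to the supervised labels.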

## Files

- `eend_v3_step2500.pt`: the v3 best checkpoint. PyTorch state dict
  containing `nar` (the EEND head), `ctc` (auxiliary CTC), and
  `speaker_attn` state dicts. ~125 MB.
- `config.json`: head hyperparameters and audit-best operating point.
- `README.md`: this file.

## Inference example

```python
import torch
from huggingface_hub import hf_hub_download

# 1. Download the checkpoint
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints you trust.
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

# 2. Construct the head. You need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND

head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])  # "nar" holds the EEND head weights

# 3. Forward
# The head consumes [B, T, 2048] features from the Audar audio_tower
# at 13 fps and emits [B, T, 12] sigmoid posteriors.
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active = posteriors > 0.9  # binary speaker activity
```

## Citation

If you use this model, please cite the eend_diar repo (audarai
internal) and the Sortformer teacher:

```bibtex
@misc{audar_eend_v3_2026,
  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
  author = {AudarAI},
  year   = {2026},
  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}

@misc{nvidia_sortformer_2024,
  title  = {Streaming Sortformer Diarization (4-spk v2)},
  author = {NVIDIA},
  year   = {2024},
  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
```

## License

Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).