---
license: apache-2.0
library_name: pytorch
tags:
- diarization
- eend
- speaker-diarization
- audio
- audarai
- speech
base_model: audarai/Audar-ASR-Turbo
pipeline_tag: voice-activity-detection
---

# Audar-ASR-Turbo Diarization (EEND v3)

End-to-end neural diarization (EEND) head trained jointly with Sortformer distillation on top of the frozen Audar3-ASR-1.7B audio_tower (`audarai/Audar-ASR-Turbo`). It produces frame-level multi-speaker activity posteriors at 13 fps with K=12 sigmoid output channels, suitable for diarizing long audio with up to 12 simultaneous speaker tracks.

Trained on 2663 h of synthetic multi-speaker mixtures plus soft-label distillation from `nvidia/diar_streaming_sortformer_4spk-v2`. This is the best v3 checkpoint (step 2500); beyond this step the model overfits and leaderboard DER regresses.

## What this model DOES / DOES NOT do

- **DOES**: Frame-level multi-speaker activity detection (K=12 sigmoid posteriors at 13 fps). Produces a per-frame, per-track speech / non-speech decision. Used downstream for speaker segmentation, turn-taking, overlap detection, and as a front-end for speaker attribution.
- **DOES NOT**: Audio-to-text transcription. ASR is handled by the base model (`audarai/Audar-ASR-Turbo`). This repo contains only the diarization head; you still need the base audio_tower to extract the 2048-dim features the head consumes.

## Audit-grade DER on 8 public leaderboards

All numbers below are **audit-grade**: `Σ_errors / Σ_total_speech` (micro-aggregation: errors and speech time are each summed before dividing), `collar=0.25s`, `threshold=0.9`, `fps=13`, `K=12`, evaluated with the official held-out splits. The Sortformer column is `nvidia/diar_streaming_sortformer_4spk-v2` evaluated under the same protocol, not the numbers reported by NVIDIA, which use a different aggregation and collar.

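The aggregation described above can be sketched as follows. This is a minimal illustration, not the audit script: the per-file numbers and the function names `micro_der` and `mean_of` are hypothetical, and it assumes the usual decomposition of DER errors into missed speech, false alarm, and speaker confusion. The last lines check that the `MACRO avg` row of the table is consistent with an unweighted mean of the eight per-corpus DER values.

```python
def micro_der(files):
    """Micro-aggregated DER: sum errors and speech over all files, then divide.

    `files` is a list of (missed_s, false_alarm_s, confusion_s, speech_s)
    tuples in seconds. The tuple layout is an assumption for this sketch.
    """
    total_errors = sum(miss + fa + conf for miss, fa, conf, _ in files)
    total_speech = sum(speech for _, _, _, speech in files)
    if total_speech <= 0:
        raise ValueError("no reference speech to score against")
    return total_errors / total_speech


def mean_of(values):
    """Unweighted mean, as used for a macro average across corpora."""
    return sum(values) / len(values)


# Hypothetical per-file error/speech durations (seconds), for illustration only.
toy_files = [
    (10.0, 5.0, 15.0, 100.0),  # long file, 30% DER on its own
    (2.0, 1.0, 2.0, 10.0),     # short file, 50% DER on its own
]
micro = micro_der(toy_files)                                  # 35 / 110
per_file_mean = mean_of([(m + f + c) / s for m, f, c, s in toy_files])  # 0.40

# Per-corpus DER values copied from the table in this README.
audar_v3 = [21.11, 32.74, 40.32, 36.81, 46.56, 45.64, 47.58, 69.65]
sortformer = [11.65, 26.43, 30.81, 27.75, 37.34, 41.98, 38.58, 71.80]
audar_macro = round(mean_of(audar_v3), 2)
sortformer_macro = round(mean_of(sortformer), 2)
```

With micro-aggregation the long file dominates the result, so the toy micro DER (about 31.8%) sits closer to the long file's 30% than the per-file mean of 40% does.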

| Corpus            | Audar v3 DER | Sortformer DER |
|-------------------|-------------:|---------------:|
| VoxConverse (dev) | 21.11%       | 11.65%         |
| AliMeeting        | 32.74%       | 26.43%         |
| ICSI              | 40.32%       | 30.81%         |
| MSDWild few       | 36.81%       | 27.75%         |
| AMI               | 46.56%       | 37.34%         |
| MSDWild many      | 45.64%       | 41.98%         |
| DipCo             | 47.58%       | 38.58%         |
| CHiME-6           | **69.65%** ✅ | 71.80%         |
| **MACRO avg**     | **42.55%**   | 35.79%         |

Audar v3 beats Sortformer on CHiME-6 (the hardest, far-field, multi-party dinner-table corpus) by 2.15 absolute DER points. Sortformer remains ahead on the other 7 corpora and in the macro average. This is intentional: v3 is the first checkpoint in its lineage to beat Sortformer on CHiME-6, and it is released as a hardware-friendly, distillation-compatible baseline for the v4 program.

> **Note**: The internal `internal_synthetic_val` validation set tracked
> during training is **NOT** a leaderboard and is not reported here. Only
> public-test-set DER counts.

## Architecture

- **Encoder (frozen)**: `audarai/Audar-ASR-Turbo` audio_tower → 2048-dim features at 13 fps.
- **Head (trainable, ~25M params)**:
  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`, conv kernel size 15, dropout 0.2.
  - `K_max=12` sigmoid output channels (per-track speaker activity).
  - Soft-target Sortformer distillation auxiliary loss (`sortformer_weight=0.3`).
- **Frame rate**: 13 Hz (≈77 ms hop).
- **Input dtype**: bfloat16.

## Inference convention

- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
- `fps = 13`
- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
- `K_max = 12`
- Sample rate: 16 kHz

## Training

- **Data**: 2663 hours of synthetic multi-speaker mixtures (2–12 speakers per mixture) + Sortformer teacher distillation.
- **Optimizer**: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
- **Schedule**: 8000 steps planned; **step 2500 is the best by audit DER**. Past step 2500 the model overfits and macro DER regresses.
- **Distillation teacher**: `nvidia/diar_streaming_sortformer_4spk-v2`, weight `0.3`.
- **Distributed**: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.

## Files

- `eend_v3_step2500.pt`: the v3 best checkpoint. PyTorch checkpoint containing the `nar` (the EEND head), `ctc` (auxiliary CTC), and `speaker_attn` state dicts. ~125 MB.
- `config.json`: head hyperparameters and audit-best operating point.
- `README.md`: this file.

## Inference example

```python
import torch
from huggingface_hub import hf_hub_download

# 1. Download the checkpoint.
ckpt_path = hf_hub_download(
    "audarai/Audar-ASR-Turbo_diarization",
    "eend_v3_step2500.pt",
)
# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

# 2. Construct the head. You need the NARDiarHeadEEND class from
#    https://github.com/audarai/eend_diar
from nar_diar_head_eend import NARDiarHeadEEND

head = (
    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
    .cuda()
    .bfloat16()
    .eval()
)
head.load_state_dict(state["nar"])  # only the EEND head weights are needed here

# 3. Forward
#    The head consumes [B, T, 2048] features from the Audar audio_tower
#    at 13 fps and emits [B, T, 12] sigmoid posteriors.
#    `audar_features` is not defined in this snippet: extract it with the
#    base audio_tower first (bfloat16, on the same device as the head).
# with torch.no_grad():
#     posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
#     active = posteriors > 0.9  # binary speaker activity
```

## Citation

If you use this model, please cite the eend_diar repo (audarai internal) and the Sortformer teacher:

```bibtex
@misc{audar_eend_v3_2026,
  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
  author = {AudarAI},
  year   = {2026},
  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
}

@misc{nvidia_sortformer_2024,
  title  = {Streaming Sortformer Diarization (4-spk v2)},
  author = {NVIDIA},
  year   = {2024},
  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
}
```

## License

Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).
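
## Appendix: from posteriors to segments (sketch)

The inference convention above stops at a binary `[T, 12]` activity matrix. One plausible way to turn that into per-track time segments is sketched below, in plain Python so it runs without the model. The function name `posteriors_to_segments` is hypothetical and not part of the `eend_diar` repo, and the sketch assumes frame `t` covers the interval `[t / fps, (t + 1) / fps)`. It applies no smoothing or minimum-duration filtering.

```python
FPS = 13         # frame rate from the inference convention
THRESHOLD = 0.9  # operating point from the inference convention


def posteriors_to_segments(posteriors, threshold=THRESHOLD, fps=FPS):
    """Convert per-frame posteriors into (track, start_s, end_s) segments.

    `posteriors` is a list of T frames, each a list of K floats in [0, 1]
    for a single utterance. A track is active in a frame when its
    posterior is strictly above `threshold`.
    """
    segments = []
    if not posteriors:
        return segments
    n_frames = len(posteriors)
    n_tracks = len(posteriors[0])
    for track in range(n_tracks):
        start = None  # index of the first frame of the current active run
        for t in range(n_frames):
            active = posteriors[t][track] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((track, start / fps, t / fps))
                start = None
        if start is not None:
            # The run is still open at the end of the audio: close it there.
            segments.append((track, start / fps, n_frames / fps))
    return segments


# Toy input with T=5 frames and K=2 tracks (illustrative values only).
toy = [
    [0.95, 0.10],
    [0.97, 0.20],
    [0.50, 0.95],
    [0.10, 0.99],
    [0.10, 0.99],
]
toy_segments = posteriors_to_segments(toy)
```

For a real batch, move the thresholded tensor to CPU and call the function once per utterance, for example on `posteriors[b].float().tolist()`.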