	---
	license: apache-2.0
	library_name: pytorch
	tags:
	- diarization
	- eend
	- speaker-diarization
	- audio
	- audarai
	- speech
	base_model: audarai/Audar-ASR-Turbo
	pipeline_tag: voice-activity-detection
	---

	# Audar-ASR-Turbo Diarization (EEND v3)

	End-to-end neural diarization (EEND) head trained jointly with Sortformer
	distillation on top of the frozen Audar3-ASR-1.7B audio_tower
	(`audarai/Audar-ASR-Turbo`). It produces frame-level multi-speaker
	activity posteriors at 13 fps on K=12 sigmoid output channels, which
	makes it suitable for diarizing long audio with up to 12 simultaneous
	speaker tracks. It was trained on 2663 hours of synthetic multi-speaker
	mixtures with soft-label distillation from
	`nvidia/diar_streaming_sortformer_4spk-v2`. This is the best v3
	checkpoint (step 2500); beyond this step the model overfits and DER on
	the public leaderboards regresses.

	## What this model DOES / DOES NOT do

	- DOES: Frame-level multi-speaker activity detection (K=12 sigmoid
	posteriors at 13 fps). After thresholding, this gives a per-frame,
	per-track speech / non-speech decision. Used downstream for speaker
	segmentation, turn-taking, overlap detection, and as a front-end for
	speaker attribution.
	- DOES NOT: Audio-to-text transcription. ASR is handled by the base
	model (`audarai/Audar-ASR-Turbo`). This repo contains only the
	diarization head; you still need the base audio_tower to extract the
	2048-dim features the head consumes.

	## Audit-grade DER on 8 public leaderboards

	All numbers below are audit-grade. Each per-corpus DER is
	micro-aggregated as `Σ_errors / Σ_total_speech`, with `collar=0.25s`,
	`threshold=0.9`, `fps=13`, `K=12`, on the official held-out splits. The
	MACRO avg row is the unweighted mean of the eight per-corpus DERs. The
	Sortformer column is `nvidia/diar_streaming_sortformer_4spk-v2`
	evaluated under the same protocol. It does not repeat the numbers
	reported by NVIDIA, which use a different aggregation and collar.

	| Corpus            | Audar v3 DER | Sortformer DER |
	|-------------------|-------------:|---------------:|
	| VoxConverse (dev) |       21.11% |         11.65% |
	| AliMeeting        |       32.74% |         26.43% |
	| ICSI              |       40.32% |         30.81% |
	| MSDWild few       |       36.81% |         27.75% |
	| AMI               |       46.56% |         37.34% |
	| MSDWild many      |       45.64% |         41.98% |
	| DipCo             |       47.58% |         38.58% |
	| CHiME-6           |    69.65% ✅ |         71.80% |
	| MACRO avg         |       42.55% |         35.79% |

	Audar v3 beats Sortformer on CHiME-6 (the hardest corpus here: far-field,
	multi-party dinner-table recordings) by 2.15 percentage points of
	absolute DER. Sortformer is still ahead on each of the other 7 corpora
	and in the macro average. Releasing at this point is intentional: this
	is the first checkpoint in the v3 lineage that beats Sortformer on
	CHiME-6, and it is being released as a hardware-friendly,
	distillation-compatible baseline for the v4 program.
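
As a sanity check on the aggregation, the sketch below recomputes the MACRO avg row and the CHiME-6 gap from the table. It assumes the macro average is the unweighted mean of the eight per-corpus DERs. `micro_der` and `macro_average` are illustrative helper names, not functions from the eend_diar code.

```python
from statistics import mean


def micro_der(files):
    """Micro-aggregated DER: summed error time over summed reference speech.

    `files` is an iterable of (error_seconds, speech_seconds) pairs,
    one pair per recording in a corpus.
    """
    files = list(files)
    total_errors = sum(err for err, _ in files)
    total_speech = sum(speech for _, speech in files)
    if total_speech <= 0:
        raise ValueError("total reference speech must be positive")
    return total_errors / total_speech


def macro_average(per_corpus_der):
    """Unweighted mean of per-corpus DER values (in percent)."""
    return mean(per_corpus_der.values())


# Per-corpus DER (%) copied from the table above.
AUDAR_V3 = {
    "VoxConverse (dev)": 21.11, "AliMeeting": 32.74, "ICSI": 40.32,
    "MSDWild few": 36.81, "AMI": 46.56, "MSDWild many": 45.64,
    "DipCo": 47.58, "CHiME-6": 69.65,
}
SORTFORMER = {
    "VoxConverse (dev)": 11.65, "AliMeeting": 26.43, "ICSI": 30.81,
    "MSDWild few": 27.75, "AMI": 37.34, "MSDWild many": 41.98,
    "DipCo": 38.58, "CHiME-6": 71.80,
}

audar_macro = round(macro_average(AUDAR_V3), 2)
sortformer_macro = round(macro_average(SORTFORMER), 2)
chime6_gap = round(SORTFORMER["CHiME-6"] - AUDAR_V3["CHiME-6"], 2)
print(audar_macro, sortformer_macro, chime6_gap)
```

Because the macro average weights every corpus equally, a small corpus moves it as much as a large one. The micro-aggregated per-corpus figures weight each recording by its amount of reference speech.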

	> Note: An internal `internal_synthetic_val` validation set tracked
	> during training is NOT a leaderboard and is not reported here. Only
	> public-test-set DER counts.

	## Architecture

	- Encoder (frozen): `audarai/Audar-ASR-Turbo` audio_tower → 2048-dim
	features at 13 fps.
	- Head (trainable, ~25M params):
	  - 4 × Conformer-style blocks, `d_model=512`, `n_heads=8`,
	    conv kernel size 15, dropout 0.2.
	  - `K_max=12` sigmoid output channels (per-track speaker activity).
	  - Soft-target Sortformer distillation auxiliary loss
	    (`sortformer_weight=0.3`).
	- Frame rate: 13 Hz (≈77 ms hop).
	- Input dtype: bfloat16.
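
To make the frame rate concrete, here is a minimal sketch of the shape arithmetic implied by the numbers above. The constants come from this card. The helper names are illustrative, and the floor rounding in `num_frames` is an assumption, since the exact frame count depends on how the audio_tower pads its input.

```python
FPS = 13            # feature frame rate of the audio_tower
FEATURE_DIM = 2048  # encoder feature size consumed by the head
K_MAX = 12          # sigmoid output channels


def hop_ms(fps=FPS):
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def num_frames(duration_s, fps=FPS):
    """Approximate number of frames for a clip (floor rounding assumed)."""
    return int(duration_s * fps)


def head_shapes(batch, duration_s):
    """Return (input_shape, output_shape) for a batch of equal-length clips."""
    t = num_frames(duration_s)
    return (batch, t, FEATURE_DIM), (batch, t, K_MAX)


print(round(hop_ms(), 1))   # hop in ms
print(head_shapes(1, 60))   # shapes for one 60 s clip
```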

	## Inference convention

	- `threshold = 0.9` (the optimal operating point per the v3 audit sweep)
	- `fps = 13`
	- `collar = 0.25 s` (standard DIHARD / VoxConverse evaluation collar)
	- `K_max = 12`
	- Sample rate: 16 kHz
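
One way to apply this convention is to threshold the posteriors and merge runs of consecutive active frames into time segments. The sketch below assumes the posteriors for one recording are available as a nested list of shape `[T][K]`. `posteriors_to_segments` is a hypothetical helper, not part of the released code. The collar is a scoring tolerance, so it is not applied here.

```python
def posteriors_to_segments(posteriors, threshold=0.9, fps=13):
    """Turn frame-level posteriors into (track, start_s, end_s) segments.

    posteriors: list of T frames, each a list of K per-track probabilities.
    A frame is active for a track when its posterior exceeds `threshold`.
    """
    if not posteriors:
        return []
    num_tracks = len(posteriors[0])
    segments = []
    for track in range(num_tracks):
        start = None  # index of the first frame of the current active run
        for t, frame in enumerate(posteriors):
            active = frame[track] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((track, start / fps, t / fps))
                start = None
        if start is not None:  # run continues to the end of the audio
            segments.append((track, start / fps, len(posteriors) / fps))
    return segments


# Tiny synthetic example: 5 frames, 2 tracks.
demo = [
    [0.10, 0.20],
    [0.95, 0.30],
    [0.97, 0.10],
    [0.50, 0.92],
    [0.20, 0.99],
]
for track, start_s, end_s in posteriors_to_segments(demo):
    print(f"track {track}: {start_s:.3f}s to {end_s:.3f}s")
```

Segments are grouped by track rather than sorted by time, and overlapping speech simply shows up as segments on different tracks that intersect.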

	## Training

	- Data: 2663 hours of synthetic multi-speaker mixtures (2-12 speakers
	per mixture) + Sortformer teacher distillation.
	- Optimizer: AdamW, `lr=3e-4`, 1000 warmup steps, gradient clip 1.0.
	- Schedule: 8000 steps planned; **step 2500 is the best by audit
	DER** — past 2500 the model overfits and macro DER regresses.
	- Distillation teacher: `nvidia/diar_streaming_sortformer_4spk-v2`,
	weight `0.3`.
	- Distributed: 8 × A100 / H100 nodes, DDP, batch size 8 per GPU.
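
The warmup can be pictured with a small schedule function. This is only a sketch: the card gives the peak learning rate and the warmup length, but not the warmup shape or any decay afterwards, so linear warmup followed by a constant rate is assumed here. `lr_at` is an illustrative name.

```python
PEAK_LR = 3e-4       # from the optimizer settings above
WARMUP_STEPS = 1000  # from the optimizer settings above


def lr_at(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 0-indexed step.

    Assumption: linear warmup to `peak_lr`, then constant. The real
    schedule after warmup is not documented in this card.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 499, 999, 2500):
    print(step, lr_at(step))
```

Under this assumption the released checkpoint (step 2500) was taken well after warmup had finished.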

	## Files

	- `eend_v3_step2500.pt` — the v3 best checkpoint. PyTorch state dict
	containing `nar` (the EEND head), `ctc` (auxiliary CTC), and
	`speaker_attn` state dicts. ~125 MB.
	- `config.json` — head hyperparameters and audit-best operating point.
	- `README.md` — this file.
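
Before building the head it can be worth checking that a loaded checkpoint has the layout described above. The sketch below works on any dict-like object, for example the result of `torch.load`. `missing_checkpoint_keys` and `select_head_state` are hypothetical helpers, not part of the released code.

```python
EXPECTED_KEYS = ("nar", "ctc", "speaker_attn")  # sub-dicts listed above


def missing_checkpoint_keys(state, expected=EXPECTED_KEYS):
    """Return the expected top-level keys that are absent from `state`."""
    return [key for key in expected if key not in state]


def select_head_state(state):
    """Return the EEND head state dict, or raise with a clear message."""
    if "nar" not in state:
        raise KeyError(
            f"checkpoint has no 'nar' entry; found keys: {sorted(state)}"
        )
    return state["nar"]


# Stand-in for a loaded checkpoint, used only to demonstrate the check.
fake_state = {"nar": {"weight": 1}, "ctc": {}}
print(missing_checkpoint_keys(fake_state))
```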

	## Inference example

	```python
	import torch
	from huggingface_hub import hf_hub_download

	# The NARDiarHeadEEND class comes from
	# https://github.com/audarai/eend_diar
	from nar_diar_head_eend import NARDiarHeadEEND

	# 1. Download the checkpoint
	ckpt_path = hf_hub_download(
	    "audarai/Audar-ASR-Turbo_diarization",
	    "eend_v3_step2500.pt",
	)
	# weights_only=False unpickles arbitrary Python objects, so only load
	# checkpoints from a source you trust.
	state = torch.load(ckpt_path, weights_only=False, map_location="cpu")

	# 2. Construct the head and load the EEND weights
	head = (
	    NARDiarHeadEEND(K_max=12, n_blocks=4, hidden_dim=512)
	    .cuda()
	    .bfloat16()
	    .eval()
	)
	head.load_state_dict(state["nar"])

	# 3. Forward
	# The head consumes [B, T, 2048] features from the Audar audio_tower
	# at 13 fps. Random features stand in here for real audio_tower output.
	audar_features = torch.randn(
	    1, 130, 2048, device="cuda", dtype=torch.bfloat16
	)
	with torch.no_grad():
	    # Assumes the head returns logits, hence the explicit sigmoid.
	    posteriors = torch.sigmoid(head(audar_features))  # [B, T, 12]
	active = posteriors > 0.9  # binary speaker activity
	```

	## Citation

	If you use this model, please cite the eend_diar repo (audarai
	internal) and the Sortformer teacher:

	```bibtex
	@misc{audar_eend_v3_2026,
	  title  = {Audar-ASR-Turbo Diarization (EEND v3)},
	  author = {AudarAI},
	  year   = {2026},
	  url    = {https://huggingface.co/audarai/Audar-ASR-Turbo_diarization}
	}

	@misc{nvidia_sortformer_2024,
	  title  = {Streaming Sortformer Diarization (4-spk v2)},
	  author = {NVIDIA},
	  year   = {2024},
	  url    = {https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2}
	}
	```

	## License

	Apache 2.0. See `LICENSE` (Apache-2.0 default for audarai).