# DiCoW v3.2-SF — Sortformer Fine-Tuned Diarization-Conditioned Whisper
This repository hosts DiCoW v3.2-SF, a variant of DiCoW v3.2 developed by BUT Speech@FIT. It is fine-tuned to accept Sortformer soft speaker-activity masks as diarization input, with the goal of adapting the FDDT conditioning layers to the noise patterns of the Sortformer diarization system.
## What is this model?
Standard DiCoW models are trained with ground-truth (GT) binary diarization masks. At inference time, however, real-world diarization systems such as Sortformer produce continuous soft probabilities that can differ substantially from clean binary signals.
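To make this mismatch concrete, here is a toy illustration (all values invented) of a GT binary mask versus a Sortformer-style soft mask over the same six frames:

```python
import numpy as np

# Ground-truth (GT) binary diarization: each frame is exactly 0 or 1.
gt_mask = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# A Sortformer-style soft mask: continuous activity probabilities that
# hover near the decision boundary instead of snapping to 0/1.
sf_mask = np.array([0.05, 0.30, 0.85, 0.95, 0.60, 0.10])

# Binarizing at 0.5 recovers the same segmentation here, but a model
# trained only on GT masks never saw the intermediate values themselves.
assert (sf_mask >= 0.5).astype(float).tolist() == gt_mask.tolist()

# Mean absolute gap between soft probabilities and hard decisions:
# a rough measure of the "noise" the FDDT layers must tolerate.
gap = np.abs(sf_mask - (sf_mask >= 0.5)).mean()
```

Fine-tuning on such soft inputs lets the conditioning layers see these intermediate values during training rather than only at inference time.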
DiCoW v3.2-SF addresses this mismatch by fine-tuning the FDDT (Frame-Level Diarization-Dependent Transformation) layers of DiCoW v3.2 exclusively on Sortformer soft masks, while keeping the Whisper decoder frozen.
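The DiCoW paper describes FDDT as class-conditioned affine transforms applied to encoder frames, using a silence/target/non-target/overlap (STNO) decomposition of speaker activity. The sketch below is a minimal toy version of that idea; the dimensions, activities, and identity-plus-noise weights are invented for illustration and are not the released parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 4  # toy hidden size and number of frames

h = rng.standard_normal((T, d))          # encoder hidden states
s_tgt = np.array([0.9, 0.8, 0.1, 0.5])   # target-speaker activity
s_oth = np.array([0.1, 0.7, 0.2, 0.5])   # any-other-speaker activity

# STNO class probabilities per frame: silence, target-only,
# non-target-only, overlap. They sum to 1 by construction.
p = np.stack([
    (1 - s_tgt) * (1 - s_oth),  # silence
    s_tgt * (1 - s_oth),        # target only
    (1 - s_tgt) * s_oth,        # non-target only
    s_tgt * s_oth,              # overlap
], axis=-1)                      # shape (T, 4)

# One affine transform per class (toy init: identity plus small noise).
W = np.stack([np.eye(d) + 0.01 * rng.standard_normal((d, d)) for _ in range(4)])
b = np.zeros((4, d))

# FDDT output: probability-weighted sum of class-specific transforms.
h_out = np.einsum('tc,cde,te->td', p, W, h) + p @ b
```

With soft diarization input, the class probabilities `p` become blurred mixtures of the four classes; fine-tuning these transforms on Sortformer outputs adapts them to exactly that blurring.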
## Quick Usage

### Load in Python

```python
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_2_SF",
    trust_remote_code=True,
)
```
## Performance (tcpWER, 5 s collar)
Comparison between this model and its baseline DiCoW v3.2 under both ground-truth (GT) and Sortformer (SF) diarization.
| Dataset | Baseline — GT diar. | Baseline — SF diar. | This model — GT diar. | This model — SF diar. |
|---|---|---|---|---|
| NOTSOFAR-1 eval-small | 16.4 | 57.6 | 20.2 | 58.7 |
| AMI-SDM | 15.3 | 81.3 | 17.6 | 77.3 |
| Libri2Mix clean | 4.7 | 8.9 | 7.9 | 5.2 |
| Libri2Mix noisy | 11.3 | 21.5 | 16.1 | 12.3 |
| Libri3Mix clean | 28.8 | 38.9 | 36.7 | 38.2 |
| Libri3Mix noisy | 39.6 | 53.9 | 47.1 | 47.3 |
### Interpretation
Fine-tuning on Sortformer soft masks teaches the FDDT layers to be more tolerant of continuous probability values, yielding substantial gains on simulated LibriMix conditions: Libri2Mix-clean drops from 8.9 % to 5.2 % and Libri2Mix-noisy from 21.5 % to 12.3 % under SF diarization.
This adaptation carries a trade-off: GT diarization performance degrades across all datasets, because the FDDT layers have adjusted their learned transformations to expect noisy soft inputs rather than clean binary masks.
On real-world meeting data the SF-diarization gains are modest: AMI-SDM improves from 81.3 % to 77.3 %, while NOTSOFAR-1 is largely unchanged.
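The relative tcpWER reductions under SF diarization can be checked directly against the table above:

```python
# tcpWER (%) under Sortformer diarization: baseline DiCoW v3.2 vs this model.
baseline_sf = {"Libri2Mix clean": 8.9, "Libri2Mix noisy": 21.5, "AMI-SDM": 81.3}
this_sf     = {"Libri2Mix clean": 5.2, "Libri2Mix noisy": 12.3, "AMI-SDM": 77.3}

# Relative reduction in percent: large on LibriMix, modest on AMI-SDM.
rel_reduction = {
    k: round(100 * (baseline_sf[k] - this_sf[k]) / baseline_sf[k], 1)
    for k in baseline_sf
}
# → roughly 41.6 % (Libri2Mix clean), 42.8 % (Libri2Mix noisy), 4.9 % (AMI-SDM)
```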
Recommendation: Use DiCoW v3.2-SF when your inference pipeline uses Sortformer-style diarization on relatively clean or simulated mixtures. For real-world meeting transcription with the best overall accuracy, prefer DiCoW v3.3 or SE-DiCoW.
## Limitations
- GT mask degradation: Compared to DiCoW v3.2, oracle (GT) diarization performance is meaningfully worse. Do not use this model when clean binary masks are available.
- Meeting-domain gap: Sortformer mask adaptation on LibriMix data does not fully transfer to spontaneous meeting corpora such as NOTSOFAR-1.
## Citations
If you use this model, please cite the original papers:
```bibtex
@article{POLOK2026101841,
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  author  = {Alexander Polok et al.},
  journal = {Computer Speech \& Language},
  volume  = {95},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841}
}

@inproceedings{10887683,
  title     = {Target Speaker ASR with Whisper},
  author    = {Polok, Alexander et al.},
  booktitle = {ICASSP 2025},
  year      = {2025},
  doi       = {10.1109/ICASSP49660.2025.10887683}
}
```
## Contact
- Email: xbohatd00@stud.fit.vut.cz
## Model Tree

- Base model: BUT-FIT/DiCoW_v3_2