DiCoW v3.2-SF — Sortformer Fine-Tuned Diarization-Conditioned Whisper

This repository hosts DiCoW v3.2-SF, a variant of DiCoW v3.2 developed by BUT Speech@FIT. It is fine-tuned to accept Sortformer soft speaker-activity masks as diarization input, with the goal of adapting the FDDT conditioning layers to the noise patterns of the Sortformer diarization system.

What is this model?

Standard DiCoW models are trained with ground-truth (GT) binary diarization masks. At inference time, however, real-world diarization systems such as Sortformer produce continuous soft probabilities that can differ substantially from clean binary signals.
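
To make the mismatch concrete, here is a toy contrast between the two input types for one speaker over five frames (the numbers are illustrative, not taken from any dataset):

gt_mask   = [0.0,  1.0,  1.0,  1.0,  0.0]    # clean binary GT activity
soft_mask = [0.08, 0.61, 0.97, 0.54, 0.22]   # Sortformer-style soft probabilities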

DiCoW v3.2-SF addresses this mismatch by fine-tuning the FDDT (Frame-Level Diarization-Dependent Transformation) layers of DiCoW v3.2 exclusively on Sortformer soft masks, while keeping the Whisper decoder frozen.
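
Concretely, DiCoW's FDDT conditioning operates on per-frame STNO probabilities (silence, target, non-target, overlap) derived from the diarization output. Below is a minimal sketch of that decomposition for soft masks, assuming speaker activities can be treated as independent; the function name and array layout are illustrative, not the repository's actual API:

import numpy as np

def stno_probs(activity, target_idx):
    # activity: (num_speakers, num_frames) soft speaker-activity probabilities
    p_t = activity[target_idx]                        # P(target speaks)
    others = np.delete(activity, target_idx, axis=0)
    p_o = 1.0 - np.prod(1.0 - others, axis=0)         # P(some non-target speaks)
    silence    = (1.0 - p_t) * (1.0 - p_o)
    target     = p_t * (1.0 - p_o)
    non_target = (1.0 - p_t) * p_o
    overlap    = p_t * p_o
    return np.stack([silence, target, non_target, overlap])   # (4, num_frames)

With GT binary masks these four channels are one-hot per frame; with Sortformer soft masks they become fractional, which is exactly the input distribution this model is fine-tuned on.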

Quick Usage

Load in Python

from transformers import AutoModelForSpeechSeq2Seq

# trust_remote_code is required: the FDDT conditioning lives in custom
# modeling code shipped with the checkpoint.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_2_SF",
    trust_remote_code=True
)
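
The snippet below continues the example with a standard Whisper-style processor. This is a hedged sketch: the processor call follows the ordinary transformers Whisper interface, but the keyword for passing Sortformer masks into generation is defined by this repository's custom modeling code, so consult the repo for the actual signature before relying on it.

import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "BUT-FIT/DiCoW_v3_2_SF",
    trust_remote_code=True
)

# 30 s of 16 kHz audio; replace the zeros with a real recording.
audio = np.zeros(16_000 * 30, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Diarization conditioning (the Sortformer soft masks) goes through the
# custom generate()/forward() signature shipped with the checkpoint; this
# plain call exercises only the Whisper backbone.
pred_ids = model.generate(inputs.input_features)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))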

Performance (tcpWER [%], 5 s collar)

Comparison between this model and its baseline DiCoW v3.2 under both ground-truth (GT) and Sortformer (SF) diarization. Lower is better.

Dataset                  Baseline (GT)   Baseline (SF)   This model (GT)   This model (SF)
NOTSOFAR-1 eval-small        16.4            57.6             20.2              58.7
AMI-SDM                      15.3            81.3             17.6              77.3
Libri2Mix clean               4.7             8.9              7.9               5.2
Libri2Mix noisy              11.3            21.5             16.1              12.3
Libri3Mix clean              28.8            38.9             36.7              38.2
Libri3Mix noisy              39.6            53.9             47.1              47.3

Interpretation

Fine-tuning on Sortformer soft masks teaches the FDDT layers to be more tolerant of continuous probability values, yielding substantial gains on simulated LibriMix conditions: Libri2Mix-clean drops from 8.9 % to 5.2 % and Libri2Mix-noisy from 21.5 % to 12.3 % under SF diarization.

This adaptation carries a trade-off: GT diarization performance degrades across all datasets, because the FDDT layers have adjusted their learned transformations to expect noisy soft inputs rather than clean binary masks.
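
For reference, the transformation being re-learned is, per the DiCoW paper, a per-class affine map on encoder frame embeddings, blended by the STNO probabilities; with soft masks the blend weights become fractional instead of one-hot. A toy numpy version (shapes and names are illustrative):

import numpy as np

def apply_fddt(h, stno, W, b):
    # h: (T, d) encoder frames; stno: (4, T) STNO class probabilities;
    # W: (4, d, d) and b: (4, d) are the learned per-class parameters.
    affine = np.einsum('cij,tj->cti', W, h) + b[:, None, :]   # (4, T, d)
    return np.einsum('ct,cti->ti', stno, affine)              # (T, d)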

On real-world meeting data the SF-diarization gains are modest: AMI-SDM improves from 81.3 % to 77.3 %, while NOTSOFAR-1 stays essentially flat (57.6 % vs. 58.7 %, marginally worse).

Recommendation: Use DiCoW v3.2-SF when your inference pipeline uses Sortformer-style diarization on relatively clean or simulated mixtures. For real-world meeting transcription with the best overall accuracy, prefer DiCoW v3.3 or SE-DiCoW.

Limitations

  • GT mask degradation: Compared to DiCoW v3.2, oracle (GT) diarization performance is meaningfully worse. Do not use this model when clean binary masks are available.
  • Meeting-domain gap: Sortformer mask adaptation on LibriMix data does not fully transfer to spontaneous meeting corpora such as NOTSOFAR-1.

Citations

If you use this model, please cite the original papers:

@article{POLOK2026101841,
    title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech \& Language},
    volume  = {95},
    year    = {2026},
    doi     = {10.1016/j.csl.2025.101841},
    author  = {Polok, Alexander and others}
}

@inproceedings{10887683,
    title     = {Target Speaker ASR with Whisper},
    author    = {Polok, Alexander and others},
    booktitle = {ICASSP 2025},
    year      = {2025},
    doi       = {10.1109/ICASSP49660.2025.10887683}
}

Model details

  • Format: Safetensors
  • Model size: 0.9B parameters
  • Tensor type: F32