🧠 DiCoW v3.3: Target-Speaker ASR

This repository hosts DiCoW v3.3, a Target-Speaker ASR (TS-ASR) model developed by BUT Speech@FIT. It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.

*Figure: DiCoW architecture overview.*

🔧 What's New in v3.3?

This version represents a significant stabilization and enhancement over the original DiCoW (v1):

  • ⚑ Improved Conditioning: Introduces FDDT (Frame-Level Diarization Dependent Transformation) layers before positional embeddings for better signal modulation.
  • πŸ“‰ Reduced Error: achieved ~50% relative reduction in tcpWER on Libri3Mix compared to v1.
  • πŸ› οΈ Training Stability: Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
  • πŸ”„ Robustness: Trained with STNO noise injection and SpecAugment to handle imperfect diarization.

⚡ Quick Usage

1. Run Interactive Demo (Gradio)

The easiest way to use this model is via the DiCoW inference repository. We provide a Gradio app that handles diarization and STNO mask generation automatically:

python app.py

2. Load in Python

If you want to download and load the model manually for your own scripts:

from transformers import AutoModelForSpeechSeq2Seq

# Load the model (requires remote code for custom FDDT layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_3", 
    trust_remote_code=True
)

# Note: The model expects specific STNO conditioning inputs. 
# See inference.py in the GitHub repo for the full pipeline.
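
For orientation, a hypothetical end-to-end call could look like the sketch below. The use of the base Whisper processor, the frame count, and the `stno_mask` keyword are assumptions for illustration; inference.py in the GitHub repo remains the authoritative pipeline:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, WhisperProcessor

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_3", trust_remote_code=True
)
# Assumption: feature extraction and decoding follow the base Whisper model.
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

audio = torch.zeros(16000 * 30)  # placeholder: 30 s of 16 kHz audio
features = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

# Hypothetical conditioning input: a (batch, frames, 4) soft STNO mask from a
# diarizer; here every frame is marked "target". The real keyword and shape
# may differ -- check inference.py.
stno_mask = torch.zeros(1, 1500, 4)
stno_mask[..., 1] = 1.0

generated = model.generate(
    input_features=features.input_features,
    stno_mask=stno_mask,
)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```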

🧬 Want to build your own DiCoW?

It's all yours with just two commands! This model is fully open-source and reproducible using our toolkit.

1. Data Preparation

Clone the mt-asr-data-prep repository and run the setup script to generate the required manifests:

./prepare.sh --single-mic-only --root-dir /path/to/workdir

2. Training

Clone the TS-ASR-Whisper training repository and launch the experiment using the pre-configured dicow_v3 recipe:

sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3

πŸ† Performance Snapshot (tcpWER)

Metric: Time-Constrained Minimum Permutation WER (5s collar)

| Dataset | DiCoW v1 (Baseline) | DiCoW v3.3 (This Model) |
|---|---|---|
| Libri2Mix (Both) | 21.6% | 9.7% |
| LibriSpeechMix (2) | 17.9% | 3.1% |
| AMI (SDM) | 21.4% | 18.7% |
| NOTSOFAR-1 (Small-SC) | 29.8% | 26.6% |

Scores are based on DiariZen diarization; see the paper for results with real diarization. 🔗 View Full Leaderboard
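
To score your own outputs with the same metric, one option is the meeteval toolkit, which implements tcpWER. A minimal sketch, assuming STM-formatted reference and hypothesis files (the exact API may vary across meeteval versions):

```python
import meeteval

ref = meeteval.io.load("ref.stm")  # reference segments with speaker labels
hyp = meeteval.io.load("hyp.stm")  # system output in the same format

# 5-second collar, matching the table above.
print(meeteval.wer.tcpwer(reference=ref, hypothesis=hyp, collar=5))
```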


βš™οΈ Model Details

  • Base Architecture: Whisper large-v3-turbo
  • Conditioning: Frame-Level Diarization-Dependent Transformations (FDDT)
  • Input: 30s Audio + 4-channel STNO Mask
  • Training Data: AMI, NOTSOFAR-1, LibriMix (2/3 spk), Synthetic LibriSpeech Mixtures.

⚠️ Limitations

- Diarization Dependent: Performance is heavily dependent on the quality of the input diarization.
- Ambiguity: In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in our upcoming SE-DiCoW model).

📚 Citations

If you use this model, please cite our CS&L 2026 and ICASSP 2025 papers:

@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech \& Language},
    volume = {95},
    year = {2026},
    doi = {10.1016/j.csl.2025.101841},
    author = {Alexander Polok and others}
}

@inproceedings{10887683,
    title={Target Speaker ASR with Whisper},
    author={Polok, Alexander and others},
    booktitle={ICASSP 2025}, 
    year={2025},
    doi={10.1109/ICASSP49660.2025.10887683}
}

📬 Contact
