🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

This repository hosts SE-DiCoW, the state-of-the-art Target-Speaker ASR model developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI.

[Figure: SE-DiCoW Architecture]

πŸ”§ Key Innovations

  • πŸ” Self-Enrollment (SE): Automatically selects the most informative segment of the target speaker from the conversation and integrates it via cross-attention. This solves the ambiguity problem in fully overlapped regions.
  • ⚑ Improved Conditioning: Introduces FDDT (Frame-Level Diarization Dependent Transformation) layers before positional embeddings for better signal modulation.
  • πŸ“‰ Reduced Error: achieved ~75% relative reduction in tcpWER on Libri3Mix compared to v1.
  • πŸ› οΈ Training Stability: Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
  • πŸ”„ Robustness: Trained with STNO noise injection and SpecAugment to handle imperfect diarization.

⚑ Quick Usage

1. Run Interactive Demo (Gradio)

The easiest way to use this model is via the DiCoW inference repository. We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:

git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
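The self-enrollment step the app performs automatically boils down to picking an enrollment span for the target speaker from the conversation itself. A minimal sketch of one plausible selection rule (longest target segment with no cross-speaker overlap; the model's actual criterion is learned and more informed, and all names here are assumptions):

```python
def select_enrollment(segments, target):
    """Illustrative self-enrollment selection: from (speaker, start, end)
    diarization segments, pick the target's longest interval that does not
    overlap any other speaker's segment."""
    others = [(s, e) for spk, s, e in segments if spk != target]

    def overlaps(s, e):
        return any(s < oe and os_ < e for os_, oe in others)

    candidates = [(e - s, s, e) for spk, s, e in segments
                  if spk == target and not overlaps(s, e)]
    if not candidates:  # fall back to the target's longest segment overall
        candidates = [(e - s, s, e) for spk, s, e in segments if spk == target]
    _, s, e = max(candidates)
    return s, e

segs = [("A", 0.0, 4.0), ("B", 3.0, 6.0), ("A", 7.0, 9.5), ("B", 10.0, 12.0)]
print(select_enrollment(segs, "A"))  # (7.0, 9.5): longest overlap-free "A" span
```

The selected span is then cropped from the recording and attended to via the cross-attention enrollment path.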

2. Load in Python

If you want to load the model manually (e.g., for custom scripts):

from transformers import AutoModelForSpeechSeq2Seq

# 1. Load the model (requires remote code for custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW", 
    trust_remote_code=True
)

# Note: This model requires specific conditioning (STNO masks + Enrollment Audio).
# It cannot be run with standard Whisper pipelines.
# See inference code in the GitHub repo for details.

🧬 Reproducibility & Training

This model is fully open source and can be reproduced end-to-end using our toolkit.

1. Data Preparation

Clone the mt-asr-data-prep repository and run the setup script:

./prepare.sh --single-mic-only --root-dir /path/to/workdir

2. Training

Clone the training repository TS-ASR-Whisper and launch the experiment using the se_dicow recipe:

# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow

πŸ† Performance Snapshot (tcpWER)

Metric: Time-Constrained Minimum-Permutation WER (tcpWER, 5 s collar), using DiariZen diarization.

| Dataset | DiCoW v1 (Baseline) | SE-DiCoW (This Model) |
| --- | --- | --- |
| Libri2Mix (Both) | 21.6% | 9.7% |
| LibriSpeechMix (2) | 17.9% | 3.1% |
| AMI (SDM) | 21.4% | 18.5% |
| NOTSOFAR-1 (Small-SC) | 29.8% | 26.2% |
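As a sanity check, the relative tcpWER reductions implied by the numbers above can be computed with plain arithmetic:

```python
# tcpWER pairs from the table: (DiCoW v1, SE-DiCoW)
results = {
    "Libri2Mix (Both)": (21.6, 9.7),
    "LibriSpeechMix (2)": (17.9, 3.1),
    "AMI (SDM)": (21.4, 18.5),
    "NOTSOFAR-1 (Small-SC)": (29.8, 26.2),
}
for name, (v1, se) in results.items():
    rel = 100 * (v1 - se) / v1  # relative reduction in percent
    print(f"{name}: {rel:.1f}% relative reduction")
```

The gains are largest on the heavily overlapped LibriMix-style sets and smaller, but still consistent, on real meeting data (AMI, NOTSOFAR-1).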

πŸ”— View Full Leaderboard


πŸ“¦ Model Details

  • Base Model: Whisper large-v3-turbo
  • Training Datasets: NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
  • Mechanism: Diarization-Conditioned + Self-Enrollment Cross-Attention.

πŸ“š Citations

If you use this model, please cite our ICASSP 2026 and CS&L 2026 papers:

@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, LukÑő},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, 
  year={2026},
}

@article{POLOK2026101841,
  title={DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal={Computer Speech & Language},
  volume={95},
  year={2026},
  doi={10.1016/j.csl.2025.101841},
  author={Alexander Polok et al.}
}

πŸ“¬ Contact
