# SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
This repository hosts SE-DiCoW, the state-of-the-art Target-Speaker ASR model developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI.
## Key Innovations
- **Self-Enrollment (SE)**: Automatically selects the most informative segment of the target speaker from the conversation and integrates it via cross-attention, resolving target-speaker ambiguity in fully overlapped regions.
- **Improved Conditioning**: Introduces FDDT (Frame-Level Diarization-Dependent Transformation) layers before the positional embeddings for better signal modulation.
- **Reduced Error**: Achieves a ~75% relative reduction in tcpWER on Libri3Mix compared to DiCoW v1.
- **Training Stability**: Uses a less suppressive initialization and flexible data segmentation (no forced end-timestamps).
- **Robustness**: Trained with STNO noise injection and SpecAugment to handle imperfect diarization.
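The frame-level conditioning idea can be sketched in a few lines. This is an illustrative toy, not the model's actual code: shapes, the `stno` posterior format, and the per-class affine maps are assumptions, with an identity-like ("less suppressive") initialization as mentioned above.

```python
# Hedged sketch of FDDT-style conditioning: each frame's hidden vector is
# transformed by a convex combination of four class-specific affine maps,
# weighted by that frame's STNO (Silence/Target/Non-target/Overlap)
# posteriors. All names and shapes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4                                # frames, hidden size (toy values)
hidden = rng.standard_normal((T, D))

# Per-class affine parameters; identity weights and zero biases mimic a
# "less suppressive" initialization that leaves features untouched at start.
W = np.stack([np.eye(D) for _ in range(4)])  # (4, D, D)
b = np.zeros((4, D))                         # (4, D)

# STNO posteriors per frame (rows sum to 1), e.g. from a diarizer.
stno = rng.dirichlet(np.ones(4), size=T)     # (T, 4)

# Mix the class transforms per frame, then apply (before pos. embeddings).
W_t = np.einsum("tc,cij->tij", stno, W)      # (T, D, D)
b_t = stno @ b                               # (T, D)
out = np.einsum("tij,tj->ti", W_t, hidden) + b_t
print(out.shape)
```

With the identity initialization the transform is a no-op; training then learns class-dependent deviations from identity.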
## Quick Usage
### 1. Run the Interactive Demo (Gradio)
The easiest way to use this model is via the DiCoW inference repository. We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:
```bash
git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
```
### 2. Load in Python
If you want to load the model manually (e.g., for custom scripts):
```python
from transformers import AutoModelForSpeechSeq2Seq

# Load the model (requires remote code for the custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW",
    trust_remote_code=True,
)

# Note: this model requires specific conditioning (STNO masks + enrollment
# audio) and cannot be run with standard Whisper pipelines.
# See the inference code in the GitHub repo for details.
```
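To make the STNO conditioning concrete, here is a minimal sketch of turning diarization output into a per-frame STNO label sequence for one target speaker. The segment format, frame rate, and function name are illustrative assumptions, not the repository's actual interface:

```python
# Hedged sketch: per-frame STNO labels from diarization segments.
# Classes: S = silence, T = target only, N = non-target only, O = overlap.
# FRAME_HOP and the (speaker, start, end) segment tuples are assumptions.
FRAME_HOP = 0.02  # 50 frames/s, a Whisper-encoder-like rate

def stno_mask(segments, target, num_frames, hop=FRAME_HOP):
    """segments: list of (speaker, start_sec, end_sec) tuples."""
    mask = []
    for i in range(num_frames):
        t = i * hop
        active = {spk for spk, s, e in segments if s <= t < e}
        if not active:
            mask.append("S")          # nobody speaks
        elif active == {target}:
            mask.append("T")          # target speaks alone
        elif target in active:
            mask.append("O")          # target overlapped by others
        else:
            mask.append("N")          # only non-target speakers
    return mask

segs = [("A", 0.0, 0.10), ("B", 0.06, 0.16)]
print("".join(stno_mask(segs, "A", 10)))
```

In the real pipeline these labels come from the diarizer (e.g. DiariZen) and are consumed by the conditioning layers; see the DiCoW inference repo for the actual mask construction.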
## Reproducibility & Training
This model is fully open-source and can be easily reproduced using our toolkit.
### 1. Data Preparation

Clone the mt-asr-data-prep repository and run the setup script:

```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```
### 2. Training

Clone the training repository TS-ASR-Whisper and launch the experiment using the se_dicow recipe:

```bash
# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow
```
## Performance Snapshot (tcpWER)

Metric: Time-Constrained Minimum-Permutation WER (tcpWER, 5 s collar), with DiariZen diarization.
| Dataset | DiCoW v1 (Baseline) | SE-DiCoW (This Model) |
|---|---|---|
| Libri2Mix (Both) | 21.6% | 9.7% |
| LibriSpeechMix (2) | 17.9% | 3.1% |
| AMI (SDM) | 21.4% | 18.5% |
| NOTSOFAR-1 (Small-SC) | 29.8% | 26.2% |
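The relative improvements implied by the table can be computed directly from the reported numbers (a quick arithmetic check, using only the values above):

```python
# Relative tcpWER reduction per dataset, from the table's numbers.
baseline = {"Libri2Mix (Both)": 21.6, "LibriSpeechMix (2)": 17.9,
            "AMI (SDM)": 21.4, "NOTSOFAR-1 (Small-SC)": 29.8}
se_dicow = {"Libri2Mix (Both)": 9.7, "LibriSpeechMix (2)": 3.1,
            "AMI (SDM)": 18.5, "NOTSOFAR-1 (Small-SC)": 26.2}

for name, base in baseline.items():
    rel = 100 * (base - se_dicow[name]) / base
    print(f"{name}: {rel:.1f}% relative reduction")
```

The gains are largest on the fully synthetic mixtures and smaller, but still consistent, on the real-meeting corpora (AMI, NOTSOFAR-1).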
## Model Details
- Base Model: Whisper large-v3-turbo
- Training Datasets: NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
- Mechanism: Diarization-Conditioned + Self-Enrollment Cross-Attention.
## Citations
If you use this model, please cite our ICASSP 2026 and CS&L 2026 papers:
```bibtex
@inproceedings{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year={2026},
}

@article{POLOK2026101841,
  title={DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal={Computer Speech & Language},
  volume={95},
  year={2026},
  doi={10.1016/j.csl.2025.101841},
  author={Alexander Polok et al.}
}
```
## Contact
- Issues: GitHub Issues
- Email: ipoloka@fit.vut.cz