---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---

# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

This repository hosts **SE-DiCoW**, the state-of-the-art Target-Speaker ASR model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**.

*Figure: SE-DiCoW architecture.*

## 🔧 Key Innovations

* **🔍 Self-Enrollment (SE):** Automatically selects the most informative segment of the target speaker from the conversation and integrates it via **cross-attention**. This solves the ambiguity problem in fully overlapped regions.
* **⚡ Improved Conditioning:** Introduces **FDDT (Frame-Level Diarization Dependent Transformation)** layers *before* positional embeddings for better signal modulation.
* **📉 Reduced Error:** Achieved **~75% relative reduction** in tcpWER on Libri3Mix compared to v1.
* **🛠️ Training Stability:** Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
* **🔄 Robustness:** Trained with **STNO noise injection** and **SpecAugment** to handle imperfect diarization.

---

## ⚡ Quick Usage

### 1. Run Interactive Demo (Gradio)

The easiest way to use this model is via the [**DiCoW inference repository**](https://github.com/BUTSpeechFIT/DiCoW). We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:

```bash
git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
```

### 2. Load in Python

If you want to load the model manually (e.g., for custom scripts):

```python
from transformers import AutoModelForSpeechSeq2Seq

# 1. Load the model (requires remote code for custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW",
    trust_remote_code=True
)

# Note: This model requires specific conditioning (STNO masks + Enrollment Audio).
# It cannot be run with standard Whisper pipelines.
# See inference code in the GitHub repo for details.
```

---

## 🧬 Reproducibility & Training

This model is fully open-source and can be easily reproduced using our toolkit.

**1. Data Preparation**

Clone the **[mt-asr-data-prep](https://github.com/BUTSpeechFIT/mt-asr-data-prep)** repository and run the setup script:

```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```

**2. Training**

Clone the training repository **[TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)** and launch the experiment using the `se_dicow` recipe:

```bash
# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow
```

---

## 🏆 Performance Snapshot (tcpWER)

*Metric: Time-Constrained Minimum Permutation WER (5s collar) - DiariZen Diarization*

| Dataset                   | DiCoW v1 (Baseline) | **SE-DiCoW (This Model)** |
|---------------------------|---------------------|---------------------------|
| **Libri2Mix (Both)**      | 21.6%               | **9.7%**                  |
| **LibriSpeechMix (2)**    | 17.9%               | **3.1%**                  |
| **AMI (SDM)**             | 21.4%               | **18.5%**                 |
| **NOTSOFAR-1 (Small-SC)** | 29.8%               | **26.2%**                 |

🔗 **[View Full Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)**

---

## 📦 Model Details

* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:** NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
* **Mechanism:** Diarization-Conditioned + Self-Enrollment Cross-Attention.

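
The diarization conditioning listed above is supplied to the model as STNO masks. As a rough illustration only, and assuming STNO stands for silence / target / non-target / overlap as described in the DiCoW paper, the sketch below converts frame-level speaker-activity probabilities into those four per-frame probabilities, treating speakers as independent. The function name `stno_masks` and the toy input are hypothetical; the actual mask generation lives in the DiCoW inference repository and may differ.

```python
from math import prod


def stno_masks(diar, target):
    """Return per-frame (silence, target, non_target, overlap) probabilities.

    diar:   list of per-speaker lists; diar[s][t] in [0, 1] is the
            activity probability of speaker s at frame t.
    target: index of the target speaker in `diar`.
    """
    n_frames = len(diar[0])
    masks = []
    for t in range(n_frames):
        d_target = diar[target][t]
        # Nobody is speaking in this frame.
        p_sil = prod(1.0 - spk[t] for spk in diar)
        # All speakers other than the target are silent.
        others_silent = prod(
            1.0 - spk[t] for s, spk in enumerate(diar) if s != target
        )
        # Target speaks alone.
        p_tgt = d_target * others_silent
        # Target speaks while at least one other speaker is active.
        p_ovl = d_target - p_tgt
        # Someone speaks, but not the target.
        p_non = (1.0 - p_sil) - d_target
        masks.append((p_sil, p_tgt, p_non, p_ovl))
    return masks


# Toy example: two speakers, four frames, hard 0/1 diarization decisions.
diar = [
    [0.0, 1.0, 0.0, 1.0],  # speaker 0 (treated as the target)
    [0.0, 0.0, 1.0, 1.0],  # speaker 1
]
for frame in stno_masks(diar, target=0):
    print(frame)
```

The four values in each frame sum to 1, so they can be read as a distribution over the four classes. With soft diarization outputs the same formulas yield fractional values instead of one-hot frames.
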

---

## 📚 Citations

If you use this model, please cite our **ICASSP 2026** and **CS&L 2026** papers:

```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year={2026},
}

@article{POLOK2026101841,
  title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume = {95},
  year = {2026},
  doi = {10.1016/j.csl.2025.101841},
  author = {Alexander Polok et al.}
}
```

## 📬 Contact

* **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/DiCoW/issues)
* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)