---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---
# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
This repository hosts **SE-DiCoW**, the state-of-the-art Target-Speaker ASR model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**.
<div align="center">
<img src="https://huggingface.co/BUT-FIT/SE-DiCoW/resolve/main/SE-DiCoW.png" alt="SE-DiCoW Architecture" width="800"/>
</div>
## 🔧 Key Innovations
* **🔍 Self-Enrollment (SE):** Automatically selects the most informative segment of the target speaker from the conversation and integrates it via **cross-attention**. This solves the ambiguity problem in fully overlapped regions.
* **⚡ Improved Conditioning:** Introduces **FDDT (Frame-Level Diarization Dependent Transformation)** layers *before* positional embeddings for better signal modulation.
* **📉 Reduced Error:** Achieves a **~75% relative reduction** in tcpWER on Libri3Mix compared to DiCoW v1.
* **🛠️ Training Stability:** Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
* **🔄 Robustness:** Trained with **STNO noise injection** and **SpecAugment** to handle imperfect diarization.
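To make the conditioning idea above concrete, here is a minimal, hypothetical sketch of frame-level diarization-dependent conditioning (not the released implementation — function names, shapes, and the diagonal-scaling simplification are all assumptions for illustration): each encoder frame carries soft STNO posteriors (silence / target / non-target / overlap), and a class-specific affine transform is blended per frame before positional embeddings are added.

```python
import numpy as np

def fddt_condition(features, stno_probs, weights, biases):
    """Blend per-class affine transforms frame-by-frame (illustrative sketch).

    features:   (T, D) encoder frame features
    stno_probs: (T, 4) soft silence/target/non-target/overlap posteriors
    weights:    (4, D) per-class diagonal scaling (simplification for this sketch)
    biases:     (4, D) per-class bias
    """
    # Per-frame effective scale/bias = posterior-weighted mix of class parameters
    scale = stno_probs @ weights   # (T, D)
    bias = stno_probs @ biases     # (T, D)
    return features * scale + bias

T, D = 6, 8
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
# Hard STNO labels as one-hot posteriors: frames 0-2 target, frames 3-5 overlap
probs = np.zeros((T, 4))
probs[:3, 1] = 1.0
probs[3:, 3] = 1.0
# "Less suppressive" initialization: identity-like scales and zero biases,
# so conditioning starts as a no-op and is learned during training
W = np.ones((4, D))
b = np.zeros((4, D))
out = fddt_condition(feats, probs, W, b)
print(np.allclose(out, feats))  # identity init leaves features unchanged
```

The identity-like initialization in the toy example mirrors the "less suppressive initialization" mentioned above: the conditioning layers begin as a near-identity mapping rather than zeroing out non-target frames.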
---
## ⚡ Quick Usage
### 1. Run Interactive Demo (Gradio)
The easiest way to use this model is via the [**DiCoW inference repository**](https://github.com/BUTSpeechFIT/DiCoW). We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:
```bash
git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
```
### 2. Load in Python
If you want to load the model manually (e.g., for custom scripts):
```python
from transformers import AutoModelForSpeechSeq2Seq
# 1. Load the model (requires remote code for custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW",
    trust_remote_code=True,
)
# Note: This model requires specific conditioning (STNO masks + Enrollment Audio).
# It cannot be run with standard Whisper pipelines.
# See inference code in the GitHub repo for details.
```
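Conceptually, self-enrollment picks the enrollment audio for the target speaker from the conversation itself. The heuristic below is a simplified, hypothetical sketch of that idea (not the model's actual selection criterion): choose the longest interval where the target speaker is active and no other speaker overlaps. It only trims overlap at segment edges, which is enough to illustrate the selection problem.

```python
def pick_enrollment_segment(segments, target):
    """Return the longest interval where only `target` speaks.

    segments: list of (speaker, start, end) diarization segments in seconds
    """
    best = None
    for spk, start, end in segments:
        if spk != target:
            continue
        # Trim away portions overlapped by other speakers (edge overlaps only)
        for other, o_start, o_end in segments:
            if other == spk:
                continue
            if o_start <= start < o_end:
                start = min(o_end, end)
            if o_start < end <= o_end:
                end = max(o_start, start)
        if end > start and (best is None or end - start > best[1] - best[0]):
            best = (start, end)
    return best

diar = [("A", 0.0, 4.0), ("B", 3.0, 6.0), ("A", 7.0, 12.0), ("B", 11.0, 13.0)]
print(pick_enrollment_segment(diar, "A"))  # (7.0, 11.0): longest overlap-free region
```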
---
## 🧬 Reproducibility & Training
This model is fully open-source and can be easily reproduced using our toolkit.
**1. Data Preparation**
Clone the **[mt-asr-data-prep](https://github.com/BUTSpeechFIT/mt-asr-data-prep)** repository and run the setup script:
```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```
**2. Training**
Clone the training repository **[TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)** and launch the experiment using the `se_dicow` recipe:
```bash
# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow
```
---
## 🏆 Performance Snapshot (tcpWER)
*Metric: Time-Constrained Minimum Permutation WER (5s collar) - DiariZen Diarization*
| Dataset | DiCoW v1 (Baseline) | **SE-DiCoW (This Model)** |
|---------------------------|---------------------|---------------------------|
| **Libri2Mix (Both)** | 21.6% | **9.7%** |
| **LibriSpeechMix (2)** | 17.9% | **3.1%** |
| **AMI (SDM)** | 21.4% | **18.5%** |
| **NOTSOFAR-1 (Small-SC)** | 29.8% | **26.2%** |
🔗 **[View Full Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)**
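The relative tcpWER reductions implied by the table can be checked directly from the reported numbers:

```python
# tcpWER (%) from the table above
baseline = {"Libri2Mix": 21.6, "LibriSpeechMix": 17.9, "AMI": 21.4, "NOTSOFAR-1": 29.8}
sedicow  = {"Libri2Mix": 9.7,  "LibriSpeechMix": 3.1,  "AMI": 18.5, "NOTSOFAR-1": 26.2}

for name in baseline:
    rel = 100 * (baseline[name] - sedicow[name]) / baseline[name]
    print(f"{name}: {rel:.1f}% relative tcpWER reduction")
# Libri2Mix: 55.1%, LibriSpeechMix: 82.7%, AMI: 13.6%, NOTSOFAR-1: 12.1%
```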
---
## 📦 Model Details
* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:** NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
* **Mechanism:** Diarization-Conditioned + Self-Enrollment Cross-Attention.
---
## 📚 Citations
If you use this model, please cite our **ICASSP 2026** and **CS&L 2026** papers:
```bibtex
@INPROCEEDINGS{polok2026sedicow,
author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
year={2026},
}
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech & Language},
volume = {95},
year = {2026},
doi = {10.1016/j.csl.2025.101841},
  author = {Polok, Alexander and others}
}
```
## 📬 Contact
* **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/DiCoW/issues)
* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)