---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---

# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper

This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.

## 🔧 Key Innovations

* **Self-Enrollment (SE):**  
  Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer.
* **Improved Initialization & Segmentation:**  
  Refined FDDT initialization and corrected data segmentation for more stable training.
* **Augmentations:**  
  - Gaussian noise injection into STNO (silence/target/non-target/overlap) masks  
  - Segment-wise flipping of dominant STNO classes  
  - Joint **SpecAugment** on input + STNO  
  - **MUSAN** noise mixing  

➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with over **70% relative improvement** on the heavily overlapped Libri3Mix subset.
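As an illustration of the self-enrollment idea, the sketch below picks the window where the target speaker is most active and least overlapped, given frame-level STNO posteriors. The dominance-based scoring criterion and window length are assumptions for illustration, not the exact selection rule used by SE-DiCoW:

```python
import numpy as np

def select_enrollment_window(stno: np.ndarray, win: int = 100) -> int:
    """Return the start frame of the best enrollment window.

    `stno` is a (T, 4) frame-level posterior over the
    Silence / Target / Non-target / Overlap classes.

    Scoring (an illustrative assumption, not the paper's criterion):
    reward target-speaker frames, penalize non-target and overlap.
    """
    # Per-frame score: Target minus Non-target minus Overlap.
    score = stno[:, 1] - stno[:, 2] - stno[:, 3]
    # Sliding-window sums computed via a cumulative sum.
    csum = np.concatenate([[0.0], np.cumsum(score)])
    window_scores = csum[win:] - csum[:-win]
    return int(np.argmax(window_scores))
```

The selected window could then serve as the enrollment segment whose encoder features are attended to via cross-attention.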

![SE-DiCoW Architecture](./SE-DiCoW_figure.png)
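The first two STNO augmentations listed above can be sketched as follows. The noise scale and segment length are made-up hyperparameters, and the flip rule is a simplified stand-in for the paper's exact scheme:

```python
import numpy as np

def augment_stno(stno: np.ndarray, noise_std: float = 0.05,
                 flip_len: int = 50, rng=None) -> np.ndarray:
    """Augment a (T, 4) STNO mask:
      1) inject Gaussian noise, clip, and renormalize so each row
         remains a valid probability distribution;
      2) flip the dominant class to a different class inside one
         randomly placed segment.
    Hyperparameter values are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, C = stno.shape

    # (1) Gaussian noise injection.
    noisy = np.clip(stno + rng.normal(0.0, noise_std, stno.shape), 0.0, None)
    noisy /= noisy.sum(axis=1, keepdims=True) + 1e-8

    # (2) Segment-wise flip of the dominant class.
    start = int(rng.integers(0, max(1, T - flip_len)))
    seg = noisy[start:start + flip_len]
    dominant = seg.argmax(axis=1)
    shifted = (dominant + int(rng.integers(1, C))) % C  # guaranteed different
    flipped = np.zeros_like(seg)
    flipped[np.arange(len(seg)), shifted] = 1.0
    noisy[start:start + flip_len] = flipped
    return noisy
```

Such perturbations make the model robust to imperfect diarization at inference time.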

---

## 🛠️ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

➡️ Training and inference pipelines:

* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)

---

## 🏆 Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
* Achieves **state-of-the-art** or comparable performance to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

---

## 📦 Model Details

* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:**

  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
  * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
  * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures

---

## 🧬 Source Repositories

* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)

---

## 📚 Related Publications

* 📰 **ICASSP 2026:**
  *SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper*
  [IEEE ICASSP 2026]

* 📰 **Journal Paper (CSL 2026):**
  *DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR*
  [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)

* 📰 **ICASSP 2025:**
  *Target Speaker ASR with Whisper*
  [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)

---

## 📝 Citation

If you use this model, please cite the following works:

```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, 
  year={2026},
  pages={1-5},
}

@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech & Language},
    volume = {95},
    pages = {101841},
    year = {2026},
    doi = {10.1016/j.csl.2025.101841},
    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
}

@INPROCEEDINGS{10887683,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Target Speaker ASR with Whisper}, 
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887683}
}
```

---

## 📬 Contact

For questions or collaboration inquiries:

📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)

🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology

🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)