---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---

# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper

This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.

## 🔧 Key Innovations

* **Self-Enrollment (SE):**  
  Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer.
* **Improved Initialization & Segmentation:**  
  Refined FDDT initialization and corrected data segmentation for more stable training.
* **Augmentations:**  
  - Gaussian noise injection into STNO (silence/target/non-target/overlap) masks  
  - Segment-wise flipping of dominant STNO classes  
  - Joint **SpecAugment** on input + STNO  
  - **MUSAN** noise mixing  

➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with over **70% relative improvement** on the heavily overlapped Libri3Mix subset.
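As an illustration of the self-enrollment idea, the sketch below picks the window where the target speaker is most active and least overlapped, given frame-level STNO posteriors. The dominance-based scoring criterion and window length are assumptions for illustration, not the exact selection rule used by SE-DiCoW:

```python
import numpy as np

def select_enrollment_window(stno: np.ndarray, win: int = 100) -> int:
    """Return the start frame of the best enrollment window.

    `stno` is a (T, 4) frame-level posterior over the
    Silence / Target / Non-target / Overlap classes.

    Scoring (an illustrative assumption, not the paper's criterion):
    reward target-speaker frames, penalize non-target and overlap.
    """
    # Per-frame score: Target minus Non-target minus Overlap.
    score = stno[:, 1] - stno[:, 2] - stno[:, 3]
    # Sliding-window sums computed via a cumulative sum.
    csum = np.concatenate([[0.0], np.cumsum(score)])
    window_scores = csum[win:] - csum[:-win]
    return int(np.argmax(window_scores))
```

The selected window could then serve as the enrollment segment whose encoder features are attended to via cross-attention.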

![SE-DiCoW Architecture](./SE-DiCoW_figure.png)
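The first two STNO augmentations listed above can be sketched as follows. The noise scale and segment length are made-up hyperparameters, and the flip rule is a simplified stand-in for the paper's exact scheme:

```python
import numpy as np

def augment_stno(stno: np.ndarray, noise_std: float = 0.05,
                 flip_len: int = 50, rng=None) -> np.ndarray:
    """Augment a (T, 4) STNO mask:
      1) inject Gaussian noise, clip, and renormalize so each row
         remains a valid probability distribution;
      2) flip the dominant class to a different class inside one
         randomly placed segment.
    Hyperparameter values are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, C = stno.shape

    # (1) Gaussian noise injection.
    noisy = np.clip(stno + rng.normal(0.0, noise_std, stno.shape), 0.0, None)
    noisy /= noisy.sum(axis=1, keepdims=True) + 1e-8

    # (2) Segment-wise flip of the dominant class.
    start = int(rng.integers(0, max(1, T - flip_len)))
    seg = noisy[start:start + flip_len]
    dominant = seg.argmax(axis=1)
    shifted = (dominant + int(rng.integers(1, C))) % C  # guaranteed different
    flipped = np.zeros_like(seg)
    flipped[np.arange(len(seg)), shifted] = 1.0
    noisy[start:start + flip_len] = flipped
    return noisy
```

Such perturbations make the model robust to imperfect diarization at inference time.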

---

## 🛠️ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

➡️ Training and inference pipelines:

* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)

---

## 🏆 Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
* Achieves **state-of-the-art** or comparable performance to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

---

## 📦 Model Details

* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:**

  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
  * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
  * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures

---

## 🧬 Source Repositories

* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)

---

## 📚 Related Publications

* 📰 **ICASSP 2026:**
  *SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper*
  [IEEE ICASSP 2026]

* 📰 **Journal Paper (CSL 2026):**
  *DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR*
  [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)

* 📰 **ICASSP 2025:**
  *Target Speaker ASR with Whisper*
  [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)

---

## 📝 Citation

If you use this model, please cite the following works:

```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, 
  year={2026},
  pages={1-5},
}

@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech & Language},
    volume = {95},
    pages = {101841},
    year = {2026},
    doi = {10.1016/j.csl.2025.101841},
    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
}

@INPROCEEDINGS{10887683,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Target Speaker ASR with Whisper}, 
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887683}
}
```

---

## 📬 Contact

For questions or collaboration inquiries:

📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)

🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology

🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)