---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---

# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

This repository hosts **SE-DiCoW**, the state-of-the-art Target-Speaker ASR model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**. 


<div align="center">
  <img src="https://huggingface.co/BUT-FIT/SE-DiCoW/resolve/main/SE-DiCoW.png" alt="SE-DiCoW Architecture" width="800"/>
</div>

## 🔧 Key Innovations

* **🔍 Self-Enrollment (SE):** Automatically selects the most informative segment of the target speaker from the conversation and integrates it via **cross-attention**. This solves the ambiguity problem in fully overlapped regions.
* **⚡ Improved Conditioning:** Introduces **FDDT (Frame-Level Diarization Dependent Transformation)** layers *before* positional embeddings for better signal modulation.
* **📉 Reduced Error:** Achieves a **~75% relative reduction** in tcpWER on Libri3Mix compared to DiCoW v1.
* **🛠️ Training Stability:** Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
* **🔄 Robustness:** Trained with **STNO noise injection** and **SpecAugment** to handle imperfect diarization.

---

## ⚡ Quick Usage

### 1. Run Interactive Demo (Gradio)
The easiest way to use this model is via the [**DiCoW inference repository**](https://github.com/BUTSpeechFIT/DiCoW). We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:

```bash
git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
```
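
For intuition about the self-enrollment selection step the app performs, here is a minimal sketch: given frame-level STNO posteriors, pick the window where the target speaks with the least interference. The scoring function, window handling, and fixed window length are illustrative assumptions, not SE-DiCoW's actual criterion:

```python
import numpy as np

def select_enrollment(stno, win=50):
    """Return (start, end) of the most target-dominant window.

    stno: (T, 4) frame posteriors for Silence/Target/Non-target/Overlap.
    win : window length in frames (illustrative fixed size).
    """
    # Reward target speech, penalize non-target speech and overlap.
    score = stno[:, 1] - stno[:, 2] - stno[:, 3]
    # Sliding-window sums via a cumulative sum; best window wins.
    csum = np.concatenate([[0.0], np.cumsum(score)])
    window_scores = csum[win:] - csum[:-win]
    start = int(np.argmax(window_scores))
    return start, start + win
```

The selected segment is then fed to the model through the cross-attention enrollment pathway, which is why no external enrollment recording is needed.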

### 2. Load in Python

If you want to load the model manually (e.g., for custom scripts):

```python
from transformers import AutoModelForSpeechSeq2Seq

# 1. Load the model (requires remote code for custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW", 
    trust_remote_code=True
)

# Note: This model requires specific conditioning (STNO masks + Enrollment Audio).
# It cannot be run with standard Whisper pipelines.
# See inference code in the GitHub repo for details.
```

---

## 🧬 Reproducibility & Training

This model is fully open-source and can be easily reproduced using our toolkit.

**1. Data Preparation**
Clone the **[mt-asr-data-prep](https://github.com/BUTSpeechFIT/mt-asr-data-prep)** repository and run the setup script:

```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```

**2. Training**
Clone the training repository **[TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)** and launch the experiment using the `se_dicow` recipe:

```bash
# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow
```

---

## 🏆 Performance Snapshot (tcpWER)

*Metric: Time-Constrained Minimum Permutation WER (5s collar) - DiariZen Diarization*

| Dataset                   | DiCoW v1 (Baseline) | **SE-DiCoW (This Model)** |
|---------------------------|---------------------|---------------------------|
| **Libri2Mix (Both)**      | 21.6%               | **9.7%**                  |
| **LibriSpeechMix (2)**    | 17.9%               | **3.1%**                  |
| **AMI (SDM)**             | 21.4%               | **18.5%**                 |
| **NOTSOFAR-1 (Small-SC)** | 29.8%               | **26.2%**                 |
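
The relative tcpWER reductions implied by the table can be computed directly from the two columns:

```python
# Baseline vs. SE-DiCoW tcpWER (%) taken from the table above.
pairs = {
    "Libri2Mix (Both)": (21.6, 9.7),
    "LibriSpeechMix (2)": (17.9, 3.1),
    "AMI (SDM)": (21.4, 18.5),
    "NOTSOFAR-1 (Small-SC)": (29.8, 26.2),
}
for name, (base, se) in pairs.items():
    rel = 100 * (base - se) / base
    print(f"{name}: {rel:.1f}% relative reduction")
# → Libri2Mix (Both): 55.1% relative reduction
```

The gains are largest on the fully overlapped synthetic mixtures, where self-enrollment disambiguates the target speaker.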

🔗 **[View Full Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)**

---

## 📦 Model Details

* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:** NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
* **Mechanism:** Diarization-Conditioned + Self-Enrollment Cross-Attention.

---

## 📚 Citations

If you use this model, please cite our **ICASSP 2026** and **CS&L 2026** papers:

```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, 
  year={2026},
}

@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech & Language},
    volume = {95},
    year = {2026},
    doi = {10.1016/j.csl.2025.101841},
    author = {Alexander Polok et al.}
}
```

## 📬 Contact

* **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/DiCoW/issues)
* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)