---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---
# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
This repository hosts **SE-DiCoW**, the state-of-the-art Target-Speaker ASR model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**.
<div align="center">
<img src="https://huggingface.co/BUT-FIT/SE-DiCoW/resolve/main/SE-DiCoW.png" alt="SE-DiCoW Architecture" width="800"/>
</div>
## 🔧 Key Innovations
* **🔍 Self-Enrollment (SE):** Automatically selects the most informative segment of the target speaker from the conversation and integrates it via **cross-attention**. This solves the ambiguity problem in fully overlapped regions.
* **⚡ Improved Conditioning:** Introduces **FDDT (Frame-Level Diarization Dependent Transformation)** layers *before* positional embeddings for better signal modulation.
* **📉 Reduced Error:** Achieves a **~75% relative reduction** in tcpWER on Libri3Mix compared to DiCoW v1.
* **🛠️ Training Stability:** Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
* **🔄 Robustness:** Trained with **STNO noise injection** and **SpecAugment** to handle imperfect diarization.
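To make the conditioning idea above concrete, here is a minimal, hypothetical sketch of frame-level diarization-dependent conditioning (not the released implementation — function names, shapes, and the diagonal-scaling simplification are all assumptions for illustration): each encoder frame carries soft STNO posteriors (silence / target / non-target / overlap), and a class-specific affine transform is blended per frame before positional embeddings are added.

```python
import numpy as np

def fddt_condition(features, stno_probs, weights, biases):
    """Blend per-class affine transforms frame-by-frame (illustrative sketch).

    features:   (T, D) encoder frame features
    stno_probs: (T, 4) soft silence/target/non-target/overlap posteriors
    weights:    (4, D) per-class diagonal scaling (simplification for this sketch)
    biases:     (4, D) per-class bias
    """
    # Per-frame effective scale/bias = posterior-weighted mix of class parameters
    scale = stno_probs @ weights   # (T, D)
    bias = stno_probs @ biases     # (T, D)
    return features * scale + bias

T, D = 6, 8
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
# Hard STNO labels as one-hot posteriors: frames 0-2 target, frames 3-5 overlap
probs = np.zeros((T, 4))
probs[:3, 1] = 1.0
probs[3:, 3] = 1.0
# "Less suppressive" initialization: identity-like scales and zero biases,
# so conditioning starts as a no-op and is learned during training
W = np.ones((4, D))
b = np.zeros((4, D))
out = fddt_condition(feats, probs, W, b)
print(np.allclose(out, feats))  # identity init leaves features unchanged
```

The identity-like initialization in the toy example mirrors the "less suppressive initialization" mentioned above: the conditioning layers begin as a near-identity mapping rather than zeroing out non-target frames.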
---
## ⚡ Quick Usage
### 1. Run Interactive Demo (Gradio)
The easiest way to use this model is via the [**DiCoW inference repository**](https://github.com/BUTSpeechFIT/DiCoW). We provide a Gradio app that handles diarization, self-enrollment selection, and mask generation automatically:
```bash
git clone https://github.com/BUTSpeechFIT/DiCoW
cd DiCoW
python app.py
```
### 2. Load in Python
If you want to load the model manually (e.g., for custom scripts):
```python
from transformers import AutoModelForSpeechSeq2Seq
# 1. Load the model (requires remote code for custom Self-Enrollment layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/SE-DiCoW",
    trust_remote_code=True,
)
# Note: This model requires specific conditioning (STNO masks + Enrollment Audio).
# It cannot be run with standard Whisper pipelines.
# See inference code in the GitHub repo for details.
```
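Conceptually, self-enrollment picks the enrollment audio for the target speaker from the conversation itself. The heuristic below is a simplified, hypothetical sketch of that idea (not the model's actual selection criterion): choose the longest interval where the target speaker is active and no other speaker overlaps. It only trims overlap at segment edges, which is enough to illustrate the selection problem.

```python
def pick_enrollment_segment(segments, target):
    """Return the longest interval where only `target` speaks.

    segments: list of (speaker, start, end) diarization segments in seconds
    """
    best = None
    for spk, start, end in segments:
        if spk != target:
            continue
        # Trim away portions overlapped by other speakers (edge overlaps only)
        for other, o_start, o_end in segments:
            if other == spk:
                continue
            if o_start <= start < o_end:
                start = min(o_end, end)
            if o_start < end <= o_end:
                end = max(o_start, start)
        if end > start and (best is None or end - start > best[1] - best[0]):
            best = (start, end)
    return best

diar = [("A", 0.0, 4.0), ("B", 3.0, 6.0), ("A", 7.0, 12.0), ("B", 11.0, 13.0)]
print(pick_enrollment_segment(diar, "A"))  # (7.0, 11.0): longest overlap-free region
```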
---
## 🧬 Reproducibility & Training
This model is fully open-source and can be easily reproduced using our toolkit.
**1. Data Preparation**
Clone the **[mt-asr-data-prep](https://github.com/BUTSpeechFIT/mt-asr-data-prep)** repository and run the setup script:
```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```
**2. Training**
Clone the training repository **[TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)** and launch the experiment using the `se_dicow` recipe:
```bash
# Run this from the root of the TS-ASR-Whisper repository
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=se_dicow
```
---
## 🏆 Performance Snapshot (tcpWER)
*Metric: Time-Constrained Minimum Permutation WER (5s collar) - DiariZen Diarization*
| Dataset | DiCoW v1 (Baseline) | **SE-DiCoW (This Model)** |
|---------------------------|---------------------|---------------------------|
| **Libri2Mix (Both)** | 21.6% | **9.7%** |
| **LibriSpeechMix (2)** | 17.9% | **3.1%** |
| **AMI (SDM)** | 21.4% | **18.5%** |
| **NOTSOFAR-1 (Small-SC)** | 29.8% | **26.2%** |
🔗 **[View Full Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)**
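The relative tcpWER reductions implied by the table can be checked directly from the reported numbers:

```python
# tcpWER (%) from the table above
baseline = {"Libri2Mix": 21.6, "LibriSpeechMix": 17.9, "AMI": 21.4, "NOTSOFAR-1": 29.8}
sedicow  = {"Libri2Mix": 9.7,  "LibriSpeechMix": 3.1,  "AMI": 18.5, "NOTSOFAR-1": 26.2}

for name in baseline:
    rel = 100 * (baseline[name] - sedicow[name]) / baseline[name]
    print(f"{name}: {rel:.1f}% relative tcpWER reduction")
# Libri2Mix: 55.1%, LibriSpeechMix: 82.7%, AMI: 13.6%, NOTSOFAR-1: 12.1%
```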
---
## 📦 Model Details
* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:** NOTSOFAR-1, AMI, LibriMix (2/3 spk), Synthetic LibriSpeech.
* **Mechanism:** Diarization-Conditioned + Self-Enrollment Cross-Attention.
---
## 📚 Citations
If you use this model, please cite our **ICASSP 2026** and **CS&L 2026** papers:
```bibtex
@INPROCEEDINGS{polok2026sedicow,
author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
year={2026},
}
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech & Language},
volume = {95},
year = {2026},
doi = {10.1016/j.csl.2025.101841},
  author = {Polok, Alexander and others}
}
```
## 📬 Contact
* **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/DiCoW/issues)
* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)