---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- SE-DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriSpeechMix
- LibriMix
---
# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper
This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**.
## 🔧 Key Innovations
* **Self-Enrollment (SE):**
Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer (see the first sketch below).
* **Improved Initialization & Segmentation:**
Refined **FDDT** (frame-level diarization-dependent transformations) initialization and corrected data segmentation for more stable training.
* **Augmentations** (see the second sketch below):
  - Gaussian noise injection into **STNO** (silence/target/non-target/overlap) masks
- Segment-wise flipping of dominant STNO classes
- Joint **SpecAugment** on input + STNO
- **MUSAN** noise mixing
➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, and over **70% improvement** on the heavily overlapped Libri3Mix.
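
To make the self-enrollment integration concrete, here is a minimal, hypothetical sketch of fusing an enrollment segment into an encoder layer via cross-attention. The class name, dimensions, and head count are illustrative assumptions (roughly matching the Whisper large-v3-turbo width), not the exact SE-DiCoW implementation:

```python
import torch
import torch.nn as nn

class EnrollmentCrossAttention(nn.Module):
    """Hypothetical sketch: inject a self-enrolled target-speaker segment
    into an encoder layer via cross-attention (dims are assumptions)."""

    def __init__(self, d_model: int = 1280, n_heads: int = 20):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, enroll: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) encoder frames of the full conversation
        # enroll: (B, T_e, D) frames of the selected enrollment segment
        fused, _ = self.attn(query=hidden, key=enroll, value=enroll)
        return self.norm(hidden + fused)  # residual keeps the original stream
```

In SE-DiCoW this fusion is applied at every encoder layer, and the enrollment segment itself is selected automatically from within the same conversation.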
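
And a hedged sketch of the first two STNO-mask augmentations (Gaussian noise injection and segment-wise flipping of the dominant classes); the function name, noise scale, and flip probability are illustrative assumptions:

```python
import torch

def augment_stno(stno: torch.Tensor, noise_std: float = 0.1,
                 flip_p: float = 0.2) -> torch.Tensor:
    # stno: (T, 4) soft frame-level probabilities over
    # {silence, target, non-target, overlap}
    noisy = (stno + noise_std * torch.randn_like(stno)).clamp(min=0.0)
    noisy = noisy / noisy.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # renormalize

    # Segment-wise flip: with probability flip_p, swap the target and
    # non-target channels inside one random contiguous span.
    if torch.rand(()).item() < flip_p:
        T = noisy.shape[0]
        start = int(torch.randint(0, T, (1,)))
        end = int(torch.randint(start + 1, T + 1, (1,)))
        noisy[start:end, [1, 2]] = noisy[start:end, [2, 1]]
    return noisy
```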

---
## 🛠️ Model Usage
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# trust_remote_code=True is required: the repository ships custom DiCoW modeling code
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
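
For completeness, a hedged sketch of preparing input features with the standard Whisper processor (an assumption: the repository ships the usual processor files). The diarization-conditioned inputs themselves, such as the STNO masks, are produced by the inference pipeline linked below:

```python
import torch
from transformers import AutoProcessor

# Assumption: standard Whisper processor files are available in the repo.
processor = AutoProcessor.from_pretrained("BUT-FIT/SE_DiCoW", trust_remote_code=True)

# Whisper expects 16 kHz mono audio; 10 s of silence as a stand-in waveform.
waveform = torch.zeros(16_000 * 10)
inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
print(inputs.input_features.shape)  # e.g. (1, 128, 3000) log-mels for large-v3-based models
```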
➡️ Training and inference pipelines:
* [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW)
---
## 🏆 Performance
**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)
* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix).
* Achieves **state-of-the-art** performance, or performance comparable to domain-tuned systems, on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.
🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)
---
## 📦 Model Details
* **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Training Datasets:**
* [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
* [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
* [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
* [LibriSpeech](https://www.openslr.org/12) synthetic mixtures
---
## 🧬 Source Repositories
* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)
---
## 📚 Related Publications
* 📰 **ICASSP 2026:**
*SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper*
[IEEE ICASSP 2026]
* 📰 **Journal Paper (CSL 2026):**
*DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR*
[Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)
* 📰 **ICASSP 2025:**
*Target Speaker ASR with Whisper*
[IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)
---
## 📝 Citation
If you use this model, please cite the following works:
```bibtex
@INPROCEEDINGS{polok2026sedicow,
author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
year={2026},
pages={1--5},
}
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech \& Language},
volume = {95},
pages = {101841},
year = {2026},
doi = {10.1016/j.csl.2025.101841},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker ASR with Whisper},
year={2025},
doi={10.1109/ICASSP49660.2025.10887683}
}
```
---
## 📬 Contact
For questions or collaboration inquiries:
📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology
🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)