Link model to SE-DiCoW paper and update metadata
This PR improves the model card by:
- Linking the latest paper associated with this work: [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194), which describes the stabilization and enhancements implemented in version 3.3.
- Adding `base_model: openai/whisper-large-v3-turbo` to the metadata to improve discoverability and correctly attribute the architecture.
- Updating the citation section to include the SE-DiCoW paper.
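For reviewers' convenience, the metadata block that results from applying this PR (assembled from the two front-matter hunks in the diff below; tag lines not touched by the diff are elided) is:

```yaml
---
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
library_name: transformers
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3-turbo
tags:
- speech
- automatic-speech-recognition
# … tags unchanged by this PR, elided in the diff …
- target-speaker-asr
- DiCoW
- BUT-FIT
---
```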
**README.md** (changed):

````diff
@@ -1,5 +1,11 @@
 ---
+datasets:
+- microsoft/NOTSOFAR
+- edinburghcstr/ami
 library_name: transformers
+license: cc-by-4.0
+pipeline_tag: automatic-speech-recognition
+base_model: openai/whisper-large-v3-turbo
 tags:
 - speech
 - automatic-speech-recognition
@@ -10,17 +16,14 @@ tags:
 - target-speaker-asr
 - DiCoW
 - BUT-FIT
-pipeline_tag: automatic-speech-recognition
-license: cc-by-4.0
-datasets:
-- microsoft/NOTSOFAR
-- edinburghcstr/ami
 ---
 
 # 🧠 DiCoW v3.3 — Target-Speaker ASR
 
 This repository hosts **DiCoW v3.3**, a Target-Speaker ASR (TS-ASR) model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT). It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.
 
+This model version incorporates the refinements and training strategies described in the paper [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194).
+
 <div align="center">
 <img src="https://huggingface.co/BUT-FIT/DiCoW_v3_3/resolve/main/DiCoW_v3_3.png" alt="DiCoW Architecture" width="700"/>
 </div>
@@ -110,15 +113,22 @@ sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3
 ## ⚠️ Limitations
 
 * **Diarization Dependent:** Performance is heavily dependent on the quality of the input diarization.
-* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in
+* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in the **SE-DiCoW** model).
 
 ---
 
 ## 📚 Citations
 
-If you use this model, please cite our **CS&L 2026** and **ICASSP 2025** papers:
+If you use this model, please cite the following papers:
 
 ```bibtex
+@article{polok2026sedicow,
+  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
+  author={Alexander Polok and Dominik Klement and Samuele Cornell and Matthew Wiesner and Jan Černocký and Sanjeev Khudanpur and Lukáš Burget},
+  journal={arXiv preprint arXiv:2601.19194},
+  year={2026}
+}
+
 @article{POLOK2026101841,
   title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
   journal = {Computer Speech & Language},
@@ -135,10 +145,9 @@ If you use this model, please cite our **CS&L 2026** and **ICASSP 2025** papers:
   year={2025},
   doi={10.1109/ICASSP49660.2025.10887683}
 }
-
 ```
 
 ## 📬 Contact
 
 * **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/TS-ASR-Whisper/issues)
-* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
+* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
````