nielsr (HF Staff) committed
Commit 139de11 · verified · 1 Parent(s): 0aba92c

Link model to SE-DiCoW paper and update metadata


This PR improves the model card by:
- Linking the latest paper associated with this work: [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194), which describes the stabilization and enhancements implemented in version 3.3.
- Adding `base_model: openai/whisper-large-v3-turbo` to the metadata to improve discoverability and correctly attribute the architecture.
- Updating the citation section to include the SE-DiCoW paper.
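
For reference, the model card's YAML front matter after this change should look roughly as follows — a sketch assembled from the diff hunks, with only the keys this PR touches shown and the middle of the tag list abridged:

```yaml
---
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
library_name: transformers
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3-turbo
tags:
- speech
- automatic-speech-recognition
# ... remaining tags unchanged ...
---
```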

Files changed (1)
  1. README.md +18 -9
README.md

````diff
--- a/README.md
+++ b/README.md
@@ -1,5 +1,11 @@
 ---
+datasets:
+- microsoft/NOTSOFAR
+- edinburghcstr/ami
 library_name: transformers
+license: cc-by-4.0
+pipeline_tag: automatic-speech-recognition
+base_model: openai/whisper-large-v3-turbo
 tags:
 - speech
 - automatic-speech-recognition
@@ -10,17 +16,14 @@ tags:
 - target-speaker-asr
 - DiCoW
 - BUT-FIT
-pipeline_tag: automatic-speech-recognition
-license: cc-by-4.0
-datasets:
-- microsoft/NOTSOFAR
-- edinburghcstr/ami
 ---
 
 # 🧠 DiCoW v3.3 — Target-Speaker ASR
 
 This repository hosts **DiCoW v3.3**, a Target-Speaker ASR (TS-ASR) model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT). It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.
 
+This model version incorporates the refinements and training strategies described in the paper [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194).
+
 <div align="center">
 <img src="https://huggingface.co/BUT-FIT/DiCoW_v3_3/resolve/main/DiCoW_v3_3.png" alt="DiCoW Architecture" width="700"/>
 </div>
@@ -110,15 +113,22 @@ sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3
 ## ⚠️ Limitations
 
 * **Diarization Dependent:** Performance is heavily dependent on the quality of the input diarization.
-* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in our upcoming **SE-DiCoW** model).
+* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in the **SE-DiCoW** model).
 
 ---
 
 ## 📚 Citations
 
-If you use this model, please cite our **CS&L 2026** and **ICASSP 2025** papers:
+If you use this model, please cite the following papers:
 
 ```bibtex
+@article{polok2026sedicow,
+title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
+author={Alexander Polok and Dominik Klement and Samuele Cornell and Matthew Wiesner and Jan Černocký and Sanjeev Khudanpur and Lukáš Burget},
+journal={arXiv preprint arXiv:2601.19194},
+year={2026}
+}
+
 @article{POLOK2026101841,
 title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
 journal = {Computer Speech & Language},
@@ -135,10 +145,9 @@ If you use this model, please cite our **CS&L 2026** and **ICASSP 2025** papers:
 year={2025},
 doi={10.1109/ICASSP49660.2025.10887683}
 }
-
 ```
 
 ## 📬 Contact
 
 * **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/TS-ASR-Whisper/issues)
 * **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
````