Link model to SE-DiCoW paper and update metadata
This PR improves the model card by:
- Linking the latest paper associated with this work: [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194), which describes the stabilization and enhancements implemented in version 3.3.
- Adding `base_model: openai/whisper-large-v3-turbo` to the metadata to improve discoverability and correctly attribute the architecture.
- Updating the citation section to include the SE-DiCoW paper.
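For reviewers' convenience, the metadata block that results from applying this PR (assembled from the two front-matter hunks in the diff below; tag lines not touched by the diff are elided) is:

```yaml
---
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
library_name: transformers
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3-turbo
tags:
- speech
- automatic-speech-recognition
# … tags unchanged by this PR, elided in the diff …
- target-speaker-asr
- DiCoW
- BUT-FIT
---
```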
**README.md** (changed):

````diff
@@ -1,5 +1,11 @@
 ---
+datasets:
+- microsoft/NOTSOFAR
+- edinburghcstr/ami
 library_name: transformers
+license: cc-by-4.0
+pipeline_tag: automatic-speech-recognition
+base_model: openai/whisper-large-v3-turbo
 tags:
 - speech
 - automatic-speech-recognition
@@ -10,17 +16,14 @@ tags:
 - target-speaker-asr
 - DiCoW
 - BUT-FIT
-pipeline_tag: automatic-speech-recognition
-license: cc-by-4.0
-datasets:
-- microsoft/NOTSOFAR
-- edinburghcstr/ami
 ---
 
 # 🧠 DiCoW v3.3 — Target-Speaker ASR
 
 This repository hosts **DiCoW v3.3**, a Target-Speaker ASR (TS-ASR) model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT). It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.
 
+This model version incorporates the refinements and training strategies described in the paper [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194).
+
 <div align="center">
 <img src="https://huggingface.co/BUT-FIT/DiCoW_v3_3/resolve/main/DiCoW_v3_3.png" alt="DiCoW Architecture" width="700"/>
 </div>
@@ -110,15 +113,22 @@ sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3
 ## ⚠️ Limitations
 
 * **Diarization Dependent:** Performance is heavily dependent on the quality of the input diarization.
-* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in
+* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in the **SE-DiCoW** model).
 
 ---
 
 ## 📚 Citations
 
-If you use this model, please cite our **CS&L 2026** and **ICASSP 2025** papers:
+If you use this model, please cite the following papers:
 
 ```bibtex
+@article{polok2026sedicow,
+  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
+  author={Alexander Polok and Dominik Klement and Samuele Cornell and Matthew Wiesner and Jan Černocký and Sanjeev Khudanpur and Lukáš Burget},
+  journal={arXiv preprint arXiv:2601.19194},
+  year={2026}
+}
+
 @article{POLOK2026101841,
   title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
   journal = {Computer Speech & Language},
@@ -135,10 +145,9 @@ If you use this model, please cite our **CS&L 2026** and **ICASSP 2025** papers:
   year={2025},
   doi={10.1109/ICASSP49660.2025.10887683}
 }
-
 ```
 
 ## 📬 Contact
 
 * **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/TS-ASR-Whisper/issues)
-* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
+* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
````