|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- speech |
|
|
- automatic-speech-recognition |
|
|
- whisper |
|
|
- multilingual |
|
|
- fine-tuned |
|
|
- mlc-slm |
|
|
- speaker-diarization |
|
|
- meeting-transcription |
|
|
- DiCoW |
|
|
- BUT-FIT |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
license: cc-by-4.0 |
|
|
datasets: |
|
|
- microsoft/NOTSOFAR |
|
|
- edinburghcstr/ami |
|
|
--- |
|
|
|
|
|
# DiCoW\_v3\_MLC — BUT-FIT Model for MLC-SLM Challenge |
|
|
|
|
|
This repository contains the **DiCoW\_v3\_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm). |
|
|
Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. |
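
Conceptually, DiCoW conditions Whisper by transforming the encoder's frame-level representations according to per-frame diarization information, i.e. whether a frame contains silence, the target speaker, a non-target speaker, or overlapped speech. The snippet below is only a minimal, self-contained sketch of that idea (the module name, identity initialization, and tensor shapes are illustrative, not this repository's implementation); the actual Frame-Level Diarization-Dependent Transformations are described in the DiCoW journal paper and implemented in the DiCoW GitHub repository.

```python
import torch
import torch.nn as nn

class ToyFrameConditioning(nn.Module):
    """Illustrative frame-level diarization conditioning (not DiCoW's actual code).

    Each encoder frame carries soft diarization probabilities for four classes
    (silence, target, non-target, overlap). A separate affine transform is
    learned per class, and every frame is mapped through the probability-weighted
    combination of those transforms.
    """

    def __init__(self, dim: int, num_classes: int = 4):
        super().__init__()
        self.weights = nn.Parameter(torch.stack([torch.eye(dim)] * num_classes))  # (C, D, D)
        self.biases = nn.Parameter(torch.zeros(num_classes, dim))                 # (C, D)

    def forward(self, hidden: torch.Tensor, stno_probs: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, frames, dim)     -- encoder hidden states
        # stno_probs: (batch, frames, classes) -- per-frame diarization probabilities
        transformed = torch.einsum("btd,cde->btce", hidden, self.weights) + self.biases
        return (stno_probs.unsqueeze(-1) * transformed).sum(dim=2)


# Example: 2 utterances, 10 frames, 8-dim features, 4 diarization classes
h = torch.randn(2, 10, 8)
probs = torch.softmax(torch.randn(2, 10, 4), dim=-1)
print(ToyFrameConditioning(dim=8)(h, probs).shape)  # torch.Size([2, 10, 8])
```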
|
|
|
|
|
This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data. |
|
|
|
|
|
The model is described in detail in the following papers: |
|
|
|
|
|
* 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY) |
|
|
* 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683) |
|
|
* 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414) |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
The model is based on **Whisper large-v3-turbo** and was initially trained on:
|
|
|
|
|
* **NOTSOFAR-1** |
|
|
* **AMI** Meeting Corpus |
|
|
* **Libri2Mix** dataset |
|
|
|
|
|
It was then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge.
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Developed by:** BUT Speech\@FIT, Brno University of Technology |
|
|
* **Model type:** Whisper large-v3-turbo with DiCoW diarization conditioning
|
|
* **Language(s):** Multilingual (English varieties plus French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, and Vietnamese)
|
|
* **License:** CC BY 4.0
|
|
* **Fine-tuned from:** openai/whisper-large-v3-turbo |
|
|
* **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model) |
|
|
|
|
|
## Model Sources |
|
|
|
|
|
* **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) |
|
|
* **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW) |
|
|
|
|
|
|
|
|
## Getting Started |
|
|
|
|
|
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"

# trust_remote_code=True is required to load DiCoW's custom Whisper-based model class.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
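
The checkpoint ships with the standard Whisper feature extractor and tokenizer, so audio can be pre-processed with the usual `transformers` processor API. The snippet below is a minimal sketch of that step, assuming 16 kHz mono input and that `AutoProcessor` resolves for this repository; the per-frame diarization inputs that actually condition the model are produced by the full pipeline in the DiCoW repository linked below and are not shown here.

```python
import numpy as np
from transformers import AutoProcessor

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"

# Standard Whisper-style pre-processing; the diarization-conditioning inputs are
# prepared separately by the DiCoW inference pipeline (see the repository below).
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

audio = np.zeros(16_000 * 5, dtype=np.float32)  # placeholder: 5 s of 16 kHz mono audio
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_features.shape)  # e.g. (1, 128, 3000) for large-v3-style models
```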
|
|
|
|
|
For detailed inference and full pipelines, refer to: |
|
|
👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW) |
|
|
|
|
|
|
|
|
## tcpWER/CER (%) on the MLC-SLM development set
|
|
|
|
|
| Language | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|----------------|---------------|------------|---------|-----------------------|-------------------|----------------|
| American En. | 14.1 | 20.6 | 11.1 | 53.7 | 36.5 | 22.5 |
| Australian En. | 11.7 | 19.4 | 7.4 | 52.6 | 23.6 | 13.0 |
| British En. | 10.1 | 16.7 | 7.7 | 71.9 | 26.1 | 17.6 |
| Filipino En. | 9.2 | 17.7 | 7.5 | 50.4 | 25.5 | 15.2 |
| Indian En. | 14.0 | 14.3 | 13.3 | 70.7 | 14.9 | 14.0 |
| French | 28.1 | 27.7 | 16.1 | 96.0 | 37.8 | 27.5 |
| German | 20.7 | 21.2 | 23.9 | 86.7 | 30.1 | 27.3 |
| Italian | 17.9 | 16.2 | 12.3 | 83.3 | 19.8 | 16.4 |
| Japanese (\*) | 21.6 | 19.2 | 13.7 | 71.3 | 25.8 | 23.3 |
| Korean (\*) | 13.8 | 12.8 | 8.5 | 59.6 | 24.5 | 22.8 |
| Portuguese | 21.2 | 24.5 | 19.5 | 118.8 | 33.1 | 29.7 |
| Russian | 17.7 | 17.6 | 11.6 | 69.2 | 22.5 | 16.7 |
| Spanish | 12.3 | 11.6 | 8.7 | 75.6 | 18.2 | 16.3 |
| Thai (\*) | 14.5 | 31.9 | 14.2 | 83.6 | 34.4 | 20.1 |
| Vietnamese | 27.2 | 30.0 | 15.3 | 82.8 | 33.8 | 24.7 |
| **Overall** | **16.8** | **22.0** | **12.9** | **76.1** | **28.4** | **20.8** |
|
|
|
|
|
> *Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.*
|
|
|
|
|
**Notes:** |
|
|
|
|
|
- GT = Ground-Truth Segmentation
- Real diar = Real Diarization
- FT = DiCoW after fine-tuning on the MLC-SLM training data (this model)
- Baseline uses Whisper large-v3 with chunked inference and a fine-tuned Pyannote diarization model.
- DiCoW uses a fine-tuned DiariZen diarization model.
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{POLOK2026101841, |
|
|
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition}, |
|
|
journal = {Computer Speech & Language}, |
|
|
volume = {95}, |
|
|
pages = {101841}, |
|
|
year = {2026}, |
|
|
issn = {0885-2308}, |
|
|
doi = {10.1016/j.csl.2025.101841},
|
|
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X}, |
|
|
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}, |
|
|
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation}, |
|
|
} |
|
|
|
|
|
@INPROCEEDINGS{10887683, |
|
|
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš}, |
|
|
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, |
|
|
title={Target Speaker ASR with Whisper}, |
|
|
year={2025}, |
|
|
volume={}, |
|
|
number={}, |
|
|
pages={1-5}, |
|
|
keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper}, |
|
|
doi={10.1109/ICASSP49660.2025.10887683} |
|
|
} |
|
|
|
|
|
@misc{polok2025mlcslmchallenge, |
|
|
title={BUT System for the MLC-SLM Challenge}, |
|
|
author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget}, |
|
|
year={2025}, |
|
|
eprint={2506.13414}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={eess.AS}, |
|
|
url={https://arxiv.org/abs/2506.13414}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz) |
|
|
**BUT Speech@FIT, Brno University of Technology** |
|
|
GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT) |