---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- fine-tuned
- mlc-slm
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---
# DiCoW_v3_MLC — BUT-FIT Model for the MLC-SLM Challenge
This repository contains the **DiCoW_v3_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm).
Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information.
This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data.
The model is described in detail in the following papers:
* 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY)
* 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683)
* 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414)
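To give a flavor of the approach, the sketch below derives a per-frame silence/target/non-target/overlap (STNO) mask from diarization segments, the kind of frame-level conditioning signal described in the DiCoW papers. The helper name, frame rate, and layout are illustrative, not the model's actual code:

```python
import numpy as np

def stno_mask(segments, target, num_frames, frame_dur=0.02):
    """Illustrative frame-level conditioning mask (hypothetical helper).

    segments: list of (speaker, start_s, end_s) diarization segments.
    Returns an array of shape (num_frames, 4) with one-hot rows:
    [silence, target-only, non-target, overlap-with-target].
    """
    active = np.zeros(num_frames, dtype=int)   # active-speaker count per frame
    tgt = np.zeros(num_frames, dtype=bool)     # is the target speaking?
    for spk, start, end in segments:
        i = int(start / frame_dur)
        j = min(int(np.ceil(end / frame_dur)), num_frames)
        active[i:j] += 1
        if spk == target:
            tgt[i:j] = True
    mask = np.zeros((num_frames, 4))
    mask[active == 0, 0] = 1                   # silence
    mask[(active == 1) & tgt, 1] = 1           # target speaking alone
    mask[(active >= 1) & ~tgt, 2] = 1          # only non-target speakers
    mask[(active >= 2) & tgt, 3] = 1           # target overlapped by others
    return mask

# Two speakers, 0.5 s of overlap, 2 s of audio at 20 ms frames:
segs = [("A", 0.0, 1.0), ("B", 0.5, 1.5)]
m = stno_mask(segs, target="A", num_frames=100)
```

In the actual model, such per-frame probabilities (taken from a diarization system rather than hard segments) modulate Whisper's encoder representations so that decoding focuses on the target speaker.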
## Model Summary
The model is based on **Whisper large-v3-turbo** and was first fine-tuned on:
* **NOTSOFAR-1**
* **AMI** Meeting Corpus
* **Libri2Mix** dataset
It is then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge.
## Model Details
* **Developed by:** BUT Speech@FIT, Brno University of Technology
* **Model type:** Whisper large-v3-turbo with DiCoW diarization conditioning
* **Language(s):** Multilingual (primarily English, with support for the other MLC-SLM languages)
* **License:** CC BY 4.0
* **Fine-tuned from:** openai/whisper-large-v3-turbo
* **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model)
## Model Sources
* **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW)
## Getting Started
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"

# trust_remote_code=True is required: the repository ships custom
# DiCoW modeling code on top of Whisper large-v3-turbo.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
For detailed inference and full pipelines, refer to:
👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW)
## tcpWER/CER (%) on the MLC-SLM Development Set
| Language | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|----------------|---------------|------------|---------|-----------------------|-------------------|----------------|
| American En. | 14.1 | 20.6 | 11.1 | 53.7 | 36.5 | 22.5 |
| Australian En. | 11.7 | 19.4 | 7.4 | 52.6 | 23.6 | 13.0 |
| British En. | 10.1 | 16.7 | 7.7 | 71.9 | 26.1 | 17.6 |
| Filipino En. | 9.2 | 17.7 | 7.5 | 50.4 | 25.5 | 15.2 |
| Indian En. | 14.0 | 14.3 | 13.3 | 70.7 | 14.9 | 14.0 |
| French | 28.1 | 27.7 | 16.1 | 96.0 | 37.8 | 27.5 |
| German | 20.7 | 21.2 | 23.9 | 86.7 | 30.1 | 27.3 |
| Italian | 17.9 | 16.2 | 12.3 | 83.3 | 19.8 | 16.4 |
| Japanese (\*) | 21.6 | 19.2 | 13.7 | 71.3 | 25.8 | 23.3 |
| Korean (\*) | 13.8 | 12.8 | 8.5 | 59.6 | 24.5 | 22.8 |
| Portuguese | 21.2 | 24.5 | 19.5 | 118.8 | 33.1 | 29.7 |
| Russian | 17.7 | 17.6 | 11.6 | 69.2 | 22.5 | 16.7 |
| Spanish | 12.3 | 11.6 | 8.7 | 75.6 | 18.2 | 16.3 |
| Thai (\*) | 14.5 | 31.9 | 14.2 | 83.6 | 34.4 | 20.1 |
| Vietnamese | 27.2 | 30.0 | 15.3 | 82.8 | 33.8 | 24.7 |
| **Overall** | **16.8** | **22.0** | **12.9**| **76.1** | **28.4** | **20.8** |
> Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.
**Notes:**
- GT = Ground-Truth Segmentation
- Real diar = Real Diarization
- Baseline uses Whisper large-v3 with chunked inference and a fine-tuned Pyannote diarization model.
- DiCoW uses fine-tuned DiariZen diarization.
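tcpWER/tcpCER are time-constrained metrics (typically computed with the MeetEval toolkit) that additionally penalize words assigned to the wrong time or speaker stream. Setting the time constraint aside, the WER-versus-CER distinction is simply word-level versus character-level tokenization, as in this plain edit-distance sketch (illustrative only, not the official scoring code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev = cur
    return d[-1]

def error_rate(ref, hyp, unit="word"):
    """WER (unit='word') or CER (unit='char'), in percent."""
    tok = str.split if unit == "word" else list
    r, h = tok(ref), tok(hyp)
    return 100.0 * edit_distance(r, h) / len(r)

wer = error_rate("good morning everyone", "good morning everybody")
cer = error_rate("good morning everyone", "good morning everybody", unit="char")
```

Here one of three words is wrong (WER ~33%), while only a few characters differ (much lower CER), which is why character-level scoring is used for the languages without whitespace word boundaries.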
## Citation
If you use this model, please cite:
```bibtex
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech & Language},
volume = {95},
pages = {101841},
year = {2026},
issn = {0885-2308},
doi = {10.1016/j.csl.2025.101841},
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker ASR with Whisper},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
doi={10.1109/ICASSP49660.2025.10887683}
}
@misc{polok2025mlcslmchallenge,
title={BUT System for the MLC-SLM Challenge},
author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
year={2025},
eprint={2506.13414},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2506.13414},
}
```
## Contact
For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
**BUT Speech@FIT, Brno University of Technology**
GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT)