---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- fine-tuned
- mlc-slm
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---

# DiCoW\_v3\_MLC — BUT-FIT Model for MLC-SLM Challenge

This repository contains the **DiCoW\_v3\_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm). 
Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information.
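As described in the papers below, DiCoW conditions each audio frame on four diarization-derived classes: silence, target-speaker-only, non-target-speaker-only, and overlapped speech (STNO). The following is an illustrative sketch of how such frame-level masks can be derived from per-speaker activity; the official implementation lives in the TS-ASR-Whisper and DiCoW repositories linked below, and the function here is purely hypothetical.

```python
import numpy as np

def stno_masks(diarization, target, num_frames):
    """Turn per-speaker diarization activity into the four frame-level
    classes DiCoW conditions on: silence, target-only, non-target-only,
    and overlap. `diarization` maps speaker id -> boolean activity array
    of length `num_frames`. Illustrative sketch only."""
    tgt = diarization[target].astype(bool)
    others = np.zeros(num_frames, dtype=bool)
    for spk, activity in diarization.items():
        if spk != target:
            others |= activity.astype(bool)
    silence = ~tgt & ~others       # nobody speaks
    target_only = tgt & ~others    # only the target speaker
    non_target = ~tgt & others     # only other speakers
    overlap = tgt & others         # target overlapped by others
    # One-hot over the four classes, shape (4, num_frames)
    return np.stack([silence, target_only, non_target, overlap]).astype(np.float32)
```

Per frame, exactly one of the four masks is active, so the stacked output sums to one along the class axis.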

This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data.

The model is described in detail in the following papers:

* 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY)
* 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683)
* 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414)

## Model Summary

The model is based on **Whisper large-v3-turbo**, initially trained on:

* **NOTSOFAR-1**
* **AMI** Meeting Corpus
* **Libri2Mix** dataset

It is then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge.


## Model Details

* **Developed by:** BUT Speech\@FIT, Brno University of Technology
* **Model type:** Whisper large-v3-turbo with DiCoW diarization conditioning
* **Language(s):** Multilingual (primarily English, but supports multiple languages)
* **License:** CC BY 4.0
* **Fine-tuned from:** openai/whisper-large-v3-turbo
* **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model)

## Model Sources

* **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW)


## Getting Started

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

For detailed inference and full pipelines, refer to:
👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW)


## Results

### tcpWER/CER (%) on the MLC-SLM development set

| Language       | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|----------------|---------------|------------|---------|-----------------------|-------------------|----------------|
| American En.   | 14.1          | 20.6       | 11.1    | 53.7                  | 36.5              | 22.5           |
| Australian En. | 11.7          | 19.4       | 7.4     | 52.6                  | 23.6              | 13.0           |
| British En.    | 10.1          | 16.7       | 7.7     | 71.9                  | 26.1              | 17.6           |
| Filipino En.   | 9.2           | 17.7       | 7.5     | 50.4                  | 25.5              | 15.2           |
| Indian En.     | 14.0          | 14.3       | 13.3    | 70.7                  | 14.9              | 14.0           |
| French         | 28.1          | 27.7       | 16.1    | 96.0                  | 37.8              | 27.5           |
| German         | 20.7          | 21.2       | 23.9    | 86.7                  | 30.1              | 27.3           |
| Italian        | 17.9          | 16.2       | 12.3    | 83.3                  | 19.8              | 16.4           |
| Japanese (\*)   | 21.6          | 19.2       | 13.7    | 71.3                  | 25.8              | 23.3           |
| Korean (\*)     | 13.8          | 12.8       | 8.5     | 59.6                  | 24.5              | 22.8           |
| Portuguese     | 21.2          | 24.5       | 19.5    | 118.8                 | 33.1              | 29.7           |
| Russian        | 17.7          | 17.6       | 11.6    | 69.2                  | 22.5              | 16.7           |
| Spanish        | 12.3          | 11.6       | 8.7     | 75.6                  | 18.2              | 16.3           |
| Thai (\*)       | 14.5          | 31.9       | 14.2    | 83.6                  | 34.4              | 20.1           |
| Vietnamese     | 27.2          | 30.0       | 15.3    | 82.8                  | 33.8              | 24.7           |
| **Overall**    | **16.8**      | **22.0**   | **12.9**| **76.1**              | **28.4**          | **20.8**       |

> Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.

**Notes:**

- GT = Ground-Truth Segmentation  
- Real diar = Real Diarization
- Baseline uses Whisper large-v3 with chunked inference and fine-tuned Pyannote diarization.
- DiCoW uses fine-tuned DiariZen diarization.
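As a quick way to read the overall row of the table: the fine-tuned model reduces tcpWER relative to the baseline by roughly 73% with real diarization and 23% with ground-truth segmentation.

```python
# Relative tcpWER reduction (%) between two systems,
# using the overall row of the table above.
def rel_reduction(baseline, system):
    return 100.0 * (baseline - system) / baseline

print(round(rel_reduction(76.1, 20.8), 1))  # real diarization -> 72.7
print(round(rel_reduction(16.8, 12.9), 1))  # ground-truth segmentation -> 23.2
```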


## Citation

If you use this model, please cite:

```bibtex
@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech & Language},
    volume = {95},
    pages = {101841},
    year = {2026},
    issn = {0885-2308},
    doi = {10.1016/j.csl.2025.101841},
    url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
    keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@INPROCEEDINGS{10887683,
    author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
    booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
    title={Target Speaker ASR with Whisper}, 
    year={2025},
    volume={},
    number={},
    pages={1-5},
    keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
    doi={10.1109/ICASSP49660.2025.10887683}
}

@misc{polok2025mlcslmchallenge,
    title={BUT System for the MLC-SLM Challenge}, 
    author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
    year={2025},
    eprint={2506.13414},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2506.13414}, 
}
```

## Contact

For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)  
**BUT Speech@FIT, Brno University of Technology**  
GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT)