---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- fine-tuned
- mlc-slm
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---

# DiCoW\_v3\_MLC — BUT-FIT Model for MLC-SLM Challenge

This repository contains the **DiCoW\_v3\_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm). 
Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information.
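As described in the papers below, DiCoW conditions each audio frame on four diarization-derived classes: silence, target-speaker-only, non-target-speaker-only, and overlapped speech (STNO). The following is an illustrative sketch of how such frame-level masks can be derived from per-speaker activity; the official implementation lives in the TS-ASR-Whisper and DiCoW repositories linked below, and the function here is purely hypothetical.

```python
import numpy as np

def stno_masks(diarization, target, num_frames):
    """Turn per-speaker diarization activity into the four frame-level
    classes DiCoW conditions on: silence, target-only, non-target-only,
    and overlap. `diarization` maps speaker id -> boolean activity array
    of length `num_frames`. Illustrative sketch only."""
    tgt = diarization[target].astype(bool)
    others = np.zeros(num_frames, dtype=bool)
    for spk, activity in diarization.items():
        if spk != target:
            others |= activity.astype(bool)
    silence = ~tgt & ~others       # nobody speaks
    target_only = tgt & ~others    # only the target speaker
    non_target = ~tgt & others     # only other speakers
    overlap = tgt & others         # target overlapped by others
    # One-hot over the four classes, shape (4, num_frames)
    return np.stack([silence, target_only, non_target, overlap]).astype(np.float32)
```

Per frame, exactly one of the four masks is active, so the stacked output sums to one along the class axis.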

This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data.

The model is described in detail in the following papers:

* 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY)
* 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683)
* 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414)

## Model Summary

The model is based on **Whisper large-v3-turbo**, initially trained on:

* **NOTSOFAR-1**
* **AMI** Meeting Corpus
* **Libri2Mix** dataset

It is then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge.


## Model Details

* **Developed by:** BUT Speech\@FIT, Brno University of Technology
* **Model type:** Whisper large-v3-turbo with DiCoW diarization conditioning
* **Language(s):** Multilingual (primarily English, but supports multiple languages)
* **License:** CC BY 4.0
* **Fine-tuned from:** openai/whisper-large-v3-turbo
* **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model)

## Model Sources

* **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW)


## Getting Started

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

For detailed inference and full pipelines, refer to:
👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW)


## Results

### tcpWER/CER (%) on the MLC-SLM development set

| Language       | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|----------------|---------------|------------|---------|-----------------------|-------------------|----------------|
| American En.   | 14.1          | 20.6       | 11.1    | 53.7                  | 36.5              | 22.5           |
| Australian En. | 11.7          | 19.4       | 7.4     | 52.6                  | 23.6              | 13.0           |
| British En.    | 10.1          | 16.7       | 7.7     | 71.9                  | 26.1              | 17.6           |
| Filipino En.   | 9.2           | 17.7       | 7.5     | 50.4                  | 25.5              | 15.2           |
| Indian En.     | 14.0          | 14.3       | 13.3    | 70.7                  | 14.9              | 14.0           |
| French         | 28.1          | 27.7       | 16.1    | 96.0                  | 37.8              | 27.5           |
| German         | 20.7          | 21.2       | 23.9    | 86.7                  | 30.1              | 27.3           |
| Italian        | 17.9          | 16.2       | 12.3    | 83.3                  | 19.8              | 16.4           |
| Japanese (\*)   | 21.6          | 19.2       | 13.7    | 71.3                  | 25.8              | 23.3           |
| Korean (\*)     | 13.8          | 12.8       | 8.5     | 59.6                  | 24.5              | 22.8           |
| Portuguese     | 21.2          | 24.5       | 19.5    | 118.8                 | 33.1              | 29.7           |
| Russian        | 17.7          | 17.6       | 11.6    | 69.2                  | 22.5              | 16.7           |
| Spanish        | 12.3          | 11.6       | 8.7     | 75.6                  | 18.2              | 16.3           |
| Thai (\*)       | 14.5          | 31.9       | 14.2    | 83.6                  | 34.4              | 20.1           |
| Vietnamese     | 27.2          | 30.0       | 15.3    | 82.8                  | 33.8              | 24.7           |
| **Overall**    | **16.8**      | **22.0**   | **12.9**| **76.1**              | **28.4**          | **20.8**       |

> Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.

**Notes:**

- GT = Ground-Truth Segmentation  
- Real diar = Real Diarization
- Baseline uses Whisper large-v3 with chunked inference and fine-tuned Pyannote diarization.
- DiCoW uses fine-tuned DiariZen diarization.
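As a quick way to read the overall row of the table: the fine-tuned model reduces tcpWER relative to the baseline by roughly 73% with real diarization and 23% with ground-truth segmentation.

```python
# Relative tcpWER reduction (%) between two systems,
# using the overall row of the table above.
def rel_reduction(baseline, system):
    return 100.0 * (baseline - system) / baseline

print(round(rel_reduction(76.1, 20.8), 1))  # real diarization -> 72.7
print(round(rel_reduction(16.8, 12.9), 1))  # ground-truth segmentation -> 23.2
```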


## Citation

If you use this model, please cite:

```bibtex
@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech & Language},
    volume = {95},
    pages = {101841},
    year = {2026},
    issn = {0885-2308},
    doi = {10.1016/j.csl.2025.101841},
    url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
    keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@INPROCEEDINGS{10887683,
    author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
    booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
    title={Target Speaker ASR with Whisper}, 
    year={2025},
    volume={},
    number={},
    pages={1-5},
    keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
    doi={10.1109/ICASSP49660.2025.10887683}
}

@misc{polok2025mlcslmchallenge,
    title={BUT System for the MLC-SLM Challenge}, 
    author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
    year={2025},
    eprint={2506.13414},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2506.13414}, 
}
```

## Contact

For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)  
**BUT Speech@FIT, Brno University of Technology**  
GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT)