|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- speech |
|
|
- automatic-speech-recognition |
|
|
- whisper |
|
|
- multilingual |
|
|
- fine-tuned |
|
|
- mlc-slm |
|
|
- speaker-diarization |
|
|
- meeting-transcription |
|
|
- DiCoW |
|
|
- BUT-FIT |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
license: cc-by-4.0 |
|
|
datasets: |
|
|
- microsoft/NOTSOFAR |
|
|
- edinburghcstr/ami |
|
|
--- |
|
|
|
|
|
# DiCoW\_v3\_MLC — BUT-FIT Model for MLC-SLM Challenge |
|
|
|
|
|
This repository contains the **DiCoW\_v3\_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm). |
|
|
Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. |
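
Conceptually, DiCoW conditions Whisper by transforming the encoder's frame-level representations according to per-frame diarization information, i.e. whether a frame contains silence, the target speaker, a non-target speaker, or overlapped speech. The snippet below is only a minimal, self-contained sketch of that idea (the module name, identity initialization, and tensor shapes are illustrative, not this repository's implementation); the actual Frame-Level Diarization-Dependent Transformations are described in the DiCoW journal paper and implemented in the DiCoW GitHub repository.

```python
import torch
import torch.nn as nn

class ToyFrameConditioning(nn.Module):
    """Illustrative frame-level diarization conditioning (not DiCoW's actual code).

    Each encoder frame carries soft diarization probabilities for four classes
    (silence, target, non-target, overlap). A separate affine transform is
    learned per class, and every frame is mapped through the probability-weighted
    combination of those transforms.
    """

    def __init__(self, dim: int, num_classes: int = 4):
        super().__init__()
        self.weights = nn.Parameter(torch.stack([torch.eye(dim)] * num_classes))  # (C, D, D)
        self.biases = nn.Parameter(torch.zeros(num_classes, dim))                 # (C, D)

    def forward(self, hidden: torch.Tensor, stno_probs: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, frames, dim)     -- encoder hidden states
        # stno_probs: (batch, frames, classes) -- per-frame diarization probabilities
        transformed = torch.einsum("btd,cde->btce", hidden, self.weights) + self.biases
        return (stno_probs.unsqueeze(-1) * transformed).sum(dim=2)


# Example: 2 utterances, 10 frames, 8-dim features, 4 diarization classes
h = torch.randn(2, 10, 8)
probs = torch.softmax(torch.randn(2, 10, 4), dim=-1)
print(ToyFrameConditioning(dim=8)(h, probs).shape)  # torch.Size([2, 10, 8])
```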
|
|
|
|
|
This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data. |
|
|
|
|
|
The model is described in detail in the following papers: |
|
|
|
|
|
* 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY) |
|
|
* 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683) |
|
|
* 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414) |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
The model is based on **Whisper large-v3-turbo** and was initially trained on:
|
|
|
|
|
* **NOTSOFAR-1** |
|
|
* **AMI** Meeting Corpus |
|
|
* **Libri2Mix** dataset |
|
|
|
|
|
It was then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge.
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Developed by:** BUT Speech\@FIT, Brno University of Technology |
|
|
* **Model type:** Whisper large-v3-turbo with DiCoW diarization conditioning
|
|
* **Language(s):** Multilingual (English varieties plus French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, and Vietnamese)
|
|
* **License:** CC BY 4.0
|
|
* **Fine-tuned from:** openai/whisper-large-v3-turbo |
|
|
* **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model) |
|
|
|
|
|
## Model Sources |
|
|
|
|
|
* **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) |
|
|
* **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW) |
|
|
|
|
|
|
|
|
## Getting Started |
|
|
|
|
|
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"

# trust_remote_code=True is required to load DiCoW's custom Whisper-based model class.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
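
The checkpoint ships with the standard Whisper feature extractor and tokenizer, so audio can be pre-processed with the usual `transformers` processor API. The snippet below is a minimal sketch of that step, assuming 16 kHz mono input and that `AutoProcessor` resolves for this repository; the per-frame diarization inputs that actually condition the model are produced by the full pipeline in the DiCoW repository linked below and are not shown here.

```python
import numpy as np
from transformers import AutoProcessor

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"

# Standard Whisper-style pre-processing; the diarization-conditioning inputs are
# prepared separately by the DiCoW inference pipeline (see the repository below).
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

audio = np.zeros(16_000 * 5, dtype=np.float32)  # placeholder: 5 s of 16 kHz mono audio
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_features.shape)  # e.g. (1, 128, 3000) for large-v3-style models
```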
|
|
|
|
|
For detailed inference and full pipelines, refer to: |
|
|
👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW) |
|
|
|
|
|
|
|
|
## tcpWER/CER (%) on the MLC-SLM development set
|
|
|
|
|
| Language | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|----------------|---------------|------------|---------|-----------------------|-------------------|----------------|
| American En. | 14.1 | 20.6 | 11.1 | 53.7 | 36.5 | 22.5 |
| Australian En. | 11.7 | 19.4 | 7.4 | 52.6 | 23.6 | 13.0 |
| British En. | 10.1 | 16.7 | 7.7 | 71.9 | 26.1 | 17.6 |
| Filipino En. | 9.2 | 17.7 | 7.5 | 50.4 | 25.5 | 15.2 |
| Indian En. | 14.0 | 14.3 | 13.3 | 70.7 | 14.9 | 14.0 |
| French | 28.1 | 27.7 | 16.1 | 96.0 | 37.8 | 27.5 |
| German | 20.7 | 21.2 | 23.9 | 86.7 | 30.1 | 27.3 |
| Italian | 17.9 | 16.2 | 12.3 | 83.3 | 19.8 | 16.4 |
| Japanese (\*) | 21.6 | 19.2 | 13.7 | 71.3 | 25.8 | 23.3 |
| Korean (\*) | 13.8 | 12.8 | 8.5 | 59.6 | 24.5 | 22.8 |
| Portuguese | 21.2 | 24.5 | 19.5 | 118.8 | 33.1 | 29.7 |
| Russian | 17.7 | 17.6 | 11.6 | 69.2 | 22.5 | 16.7 |
| Spanish | 12.3 | 11.6 | 8.7 | 75.6 | 18.2 | 16.3 |
| Thai (\*) | 14.5 | 31.9 | 14.2 | 83.6 | 34.4 | 20.1 |
| Vietnamese | 27.2 | 30.0 | 15.3 | 82.8 | 33.8 | 24.7 |
| **Overall** | **16.8** | **22.0** | **12.9** | **76.1** | **28.4** | **20.8** |
|
|
|
|
|
> *Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.*
|
|
|
|
|
**Notes:** |
|
|
|
|
|
- GT = Ground-Truth Segmentation
- Real diar = Real Diarization
- FT = DiCoW after fine-tuning on the MLC-SLM training data (this model)
- Baseline uses Whisper large-v3 with chunked inference and a fine-tuned Pyannote diarization model.
- DiCoW uses a fine-tuned DiariZen diarization model.
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{POLOK2026101841, |
|
|
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition}, |
|
|
journal = {Computer Speech & Language}, |
|
|
volume = {95}, |
|
|
pages = {101841}, |
|
|
year = {2026}, |
|
|
issn = {0885-2308}, |
|
|
doi = {10.1016/j.csl.2025.101841},
|
|
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X}, |
|
|
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}, |
|
|
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation}, |
|
|
} |
|
|
|
|
|
@INPROCEEDINGS{10887683, |
|
|
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš}, |
|
|
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, |
|
|
title={Target Speaker ASR with Whisper}, |
|
|
year={2025}, |
|
|
volume={}, |
|
|
number={}, |
|
|
pages={1-5}, |
|
|
keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper}, |
|
|
doi={10.1109/ICASSP49660.2025.10887683} |
|
|
} |
|
|
|
|
|
@misc{polok2025mlcslmchallenge, |
|
|
title={BUT System for the MLC-SLM Challenge}, |
|
|
author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget}, |
|
|
year={2025}, |
|
|
eprint={2506.13414}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={eess.AS}, |
|
|
url={https://arxiv.org/abs/2506.13414}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz) |
|
|
**BUT Speech@FIT, Brno University of Technology** |
|
|
GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT) |