---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---

# 🧠 DiCoW_v3.2 — BUT-FIT Model for MT-ASR

This repository hosts the **DiCoW_v3.2** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), tailored for **multi-talker automatic speech recognition (MT-ASR)**.

This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data.

## 🔧 Key Improvements over DiCoW v1

* **FDDT (Frame-Level Diarization Dependent Transformation)** before positional embeddings
* **Less strict suppressive initialization** to ease early training dynamics
* **Enhanced sequential decoding** with fallback seeking
* **Frozen decoder** during fine-tuning to retain language modeling capabilities

### 🧪 Augmentations

* Random **STNO** noise injection
* Segment-wise random class flipping of **STNO tokens**
* **SpecAugment**
* **MUSAN** noise mixing

### ⚙️ Optimization & Inference Enhancements

* Updated **learning schedule**
* Improved **hallucination detection & mitigation** during inference

---

## 🛠️ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_2"

dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

➡️ For detailed inference pipelines, see:
[**DiCoW GitHub (Inference)**](https://github.com/BUTSpeechFIT/DiCoW)
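As a quick smoke test, the snippet below runs the checkpoint through the standard `transformers` generation interface on a short 16 kHz mono clip. This is a minimal sketch, not the official pipeline: it assumes the custom DiCoW code accepts plain Whisper-style inputs when no diarization conditioning is supplied, and it reuses the `openai/whisper-large-v3-turbo` processor as a stand-in feature extractor/tokenizer.

```python
# Minimal sketch (not the official DiCoW pipeline): plain Whisper-style decoding.
# Assumptions: (1) the openai/whisper-large-v3-turbo processor matches this
# checkpoint's preprocessing, (2) the custom remote code falls back to standard
# Whisper generation when no diarization inputs are provided.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL_NAME = "BUT-FIT/DiCoW_v3_2"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3-turbo")
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
dicow.eval()

# Stand-in input: replace with a real 16 kHz mono waveform (e.g. loaded via torchaudio or soundfile).
waveform = torch.zeros(16_000 * 10)
inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = dicow.generate(inputs.input_features)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```

Note that DiCoW's FDDT layers are conditioned on frame-level speaker-activity (STNO) masks; without them, decoding may degrade on multi-talker audio. The DiCoW inference repository linked above shows how to supply diarization and decode each target speaker.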
---

## 🏆 Performance

See how **DiCoW_v3.2** performs on our multi-talker ASR benchmark:

- 🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

---

## 📦 Model Details

* **Base Model:** Whisper large-v3-turbo
* **Training Datasets:**
  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
  * [Libri2Mix](https://github.com/JorisCos/LibriMix)

---

## 🧬 Source Repositories

* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference](https://github.com/BUTSpeechFIT/DiCoW)

---

## 📚 Related Publications

* 📰 **Journal Paper:** *DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition*
  [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)
* 📰 **ICASSP 2025:** *Target Speaker ASR with Whisper*
  [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)
* 📰 **CHiME-8 System Description:** *BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge*
  [CHiME 2024 Proceedings](https://doi.org/10.21437/CHiME.2024-4)
* 📰 **MLC-SLM Challenge Submission:** *BUT System for the MLC-SLM Challenge*
  [arXiv:2506.13414](https://arxiv.org/abs/2506.13414)

---

## 📝 Citation

If you use this model, please cite the following works:

```bibtex
@article{POLOK2026101841,
  title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume = {95},
  pages = {101841},
  year = {2026},
  issn = {0885-2308},
  doi = {https://doi.org/10.1016/j.csl.2025.101841},
  url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
  author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@INPROCEEDINGS{10887683,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Target Speaker ASR with Whisper},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
  doi={10.1109/ICASSP49660.2025.10887683}
}
```

---

## 📬 Contact

For questions or collaboration inquiries:

📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)

🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology

🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)