---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---
# 🧠 DiCoW\_v3.2 — BUT-FIT Model for MT-ASR
This repository hosts the **DiCoW\_v3.2** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), tailored for **multi-talker automatic speech recognition (MT-ASR)**.
The model is released under the terms of CC BY 4.0; it builds on an MIT-licensed base model and was trained on CC BY 4.0-licensed data.
## 🔧 Key Improvements over DiCoW v1
* **FDDT (Frame-Level Diarization Dependent Transformation)** before positional embeddings
* **Less strict suppressive initialization** to ease early training dynamics
* **Enhanced sequential decoding** with fallback seeking
* **Frozen decoder** during fine-tuning to retain language modeling capabilities
### 🧪 Augmentations
* Random **STNO** noise injection
* Segment-wise random class flipping of **STNO tokens**
* **SpecAugment**
* **MUSAN** noise mixing
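
The STNO tokens referenced above encode a four-way frame-level diarization class: **S**ilence, **T**arget speaker, **N**on-target speaker, or **O**verlap. As a rough illustration (not the official implementation; the function name, segment format, and frame rate here are assumptions), such labels could be derived from diarization segments like this:

```python
# Illustrative sketch: derive frame-level STNO
# (Silence / Target / Non-target / Overlap) labels from diarization
# segments for one target speaker. Names and frame rate are
# assumptions for illustration, not the DiCoW API.

def stno_labels(segments, target, num_frames, frame_rate=50.0):
    """segments: list of (speaker, start_sec, end_sec) tuples."""
    labels = []
    for i in range(num_frames):
        t = i / frame_rate
        active = {spk for spk, s, e in segments if s <= t < e}
        if not active:
            labels.append("S")   # silence: no one is speaking
        elif active == {target}:
            labels.append("T")   # target speaker alone
        elif target in active:
            labels.append("O")   # target overlapping with others
        else:
            labels.append("N")   # non-target speaker(s) only
    return labels

segs = [("spk0", 0.0, 1.0), ("spk1", 0.5, 1.5)]
print(stno_labels(segs, "spk0", num_frames=4, frame_rate=2.0))
# frames at t = 0.0, 0.5, 1.0, 1.5 s → ['T', 'O', 'N', 'S']
```

The augmentations listed above then act on exactly this kind of sequence, e.g. randomly flipping a segment's class or injecting noise into the class probabilities.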
### ⚙️ Optimization & Inference Enhancements
* Updated **learning schedule**
* Improved **hallucination detection & mitigation** during inference
---
## 🛠️ Model Usage
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_2"

# trust_remote_code is required: DiCoW ships custom modeling code
# (diarization-conditioned Whisper) alongside the weights.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
➡️ For detailed inference pipelines, see: [**DiCoW GitHub (Inference)**](https://github.com/BUTSpeechFIT/DiCoW)
---
## 🏆 Performance
See how **DiCoW_v3.2** performs on our multi-talker ASR benchmark:
- 🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)
---
## 📦 Model Details
* **Base Model:** Whisper large-v3-turbo
* **Training Datasets:**
* [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
* [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
* [Libri2Mix](https://github.com/JorisCos/LibriMix)
---
## 🧬 Source Repositories
* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* 🚀 [Inference](https://github.com/BUTSpeechFIT/DiCoW)
---
## 📚 Related Publications
* 📰 **Journal Paper:**
*DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition*
[Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)
* 📰 **ICASSP 2025:**
*Target Speaker ASR with Whisper*
[IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)
* 📰 **CHiME-8 System Description:**
*BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge*
[CHiME 2024 Proceedings](https://doi.org/10.21437/CHiME.2024-4)
* 📰 **MLC-SLM Challenge Submission:**
*BUT System for the MLC-SLM Challenge*
[arXiv:2506.13414](https://arxiv.org/abs/2506.13414)
---
## 📝 Citation
If you use this model, please cite the following works:
```bibtex
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech \& Language},
volume = {95},
pages = {101841},
year = {2026},
issn = {0885-2308},
doi = {10.1016/j.csl.2025.101841},
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker ASR with Whisper},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
doi={10.1109/ICASSP49660.2025.10887683}
}
```
---
## 📬 Contact
For questions or collaboration inquiries:
📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology
🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)