---
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
library_name: transformers
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3-turbo
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- DiCoW
- BUT-FIT
---

# 🧠 DiCoW v3.3 — Target-Speaker ASR

This repository hosts **DiCoW v3.3**, a Target-Speaker ASR (TS-ASR) model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT). It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.

This model version incorporates the refinements and training strategies described in the paper [SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper](https://huggingface.co/papers/2601.19194).

*Figure: DiCoW architecture.*

## 🔧 What's New in v3.3?

This version represents a significant stabilization and enhancement over the original DiCoW (v1):

* **⚡ Improved Conditioning:** Introduces **FDDT (Frame-Level Diarization Dependent Transformation)** layers *before* positional embeddings for better signal modulation.
* **📉 Reduced Error:** Achieves a **~50% relative reduction** in tcpWER on Libri3Mix compared to v1.
* **🛠️ Training Stability:** Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
* **🔄 Robustness:** Trained with **STNO noise injection** and **SpecAugment** to handle imperfect diarization.

---

## ⚡ Quick Usage

### 1. Run Interactive Demo (Gradio)

The easiest way to use this model is via the [**DiCoW inference repository**](https://github.com/BUTSpeechFIT/DiCoW). We provide a Gradio app that handles diarization and STNO mask generation automatically:

```bash
python app.py
```

### 2. Load in Python

If you want to download and load the model manually for your own scripts:

```python
from transformers import AutoModelForSpeechSeq2Seq

# Load the model (requires remote code for custom FDDT layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_3",
    trust_remote_code=True
)

# Note: The model expects specific STNO conditioning inputs.
# See inference.py in the GitHub repo for the full pipeline.
```

---

## 🧬 Want to build your own DiCoW? It's all yours with just two commands!

This model is fully open-source and reproducible using our toolkit.

**1. Data Preparation**

Clone the [**mt-asr-data-prep**](https://github.com/BUTSpeechFIT/mt-asr-data-prep) repository and run the setup script to generate the required manifests:

```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```

**2. Training**

Clone the training repository **[TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)** and launch the experiment using the pre-configured `dicow_v3` recipe:

```bash
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3
```

---

## 🏆 Performance Snapshot (tcpWER)

*Metric: Time-Constrained Minimum Permutation WER (5s collar)*

| Dataset                   | DiCoW v1 (Baseline) | **DiCoW v3.3 (This Model)** |
|---------------------------|---------------------|-----------------------------|
| **Libri2Mix (Both)**      | 21.6%               | **9.7%**                    |
| **LibriSpeechMix (2)**    | 17.9%               | **3.1%**                    |
| **AMI (SDM)**             | 21.4%               | **18.7%**                   |
| **NOTSOFAR-1 (Small-SC)** | 29.8%               | **26.6%**                   |

*Scores based on DiariZen Diarization. See paper for Real Diarization results.*

🔗 **[View Full Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)**

---

## ⚙️ Model Details

* **Base Architecture:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
* **Conditioning:** Frame-Level Diarization-Dependent Transformations (FDDT)
* **Input:** 30s Audio + 4-channel STNO Mask
* **Training Data:** AMI, NOTSOFAR-1, LibriMix (2/3 spk), Synthetic LibriSpeech Mixtures.

## ⚠️ Limitations

* **Diarization Dependent:** Performance is heavily dependent on the quality of the input diarization.
* **Ambiguity:** In scenarios with >2 fully overlapping speakers, the model may struggle to distinguish the target (addressed in the **SE-DiCoW** model).
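
To make the "4-channel STNO Mask" input more concrete, here is a minimal sketch of how such a mask could be derived from diarization output. It assumes the four channels are per-frame probabilities of Silence, Target, Non-target, and Overlap, computed from per-speaker activity probabilities under an independence assumption. The function name `stno_mask`, the list-of-lists input, and the frame-major output layout are hypothetical choices for illustration and are not the repository's API; the real pipeline lives in `inference.py` in the GitHub repo and may differ.

```python
from math import prod


def stno_mask(diar, target):
    """Sketch: per-frame [silence, target, non_target, overlap] probabilities.

    diar   -- diar[s][t] is the probability that speaker s is active in frame t
              (hypothetical layout; hard 0/1 decisions also work).
    target -- index of the target speaker in `diar`.
    """
    num_speakers = len(diar)
    num_frames = len(diar[0])
    mask = []
    for t in range(num_frames):
        d_target = diar[target][t]
        # Probability that every non-target speaker is silent in this frame.
        others_silent = prod(
            1.0 - diar[s][t] for s in range(num_speakers) if s != target
        )
        silence = (1.0 - d_target) * others_silent
        target_only = d_target * others_silent
        non_target = (1.0 - d_target) * (1.0 - others_silent)
        overlap = d_target * (1.0 - others_silent)
        mask.append([silence, target_only, non_target, overlap])
    return mask


# Two speakers, four frames, hard decisions; speaker 0 is the target.
example = stno_mask([[0, 1, 0, 1], [0, 0, 1, 1]], target=0)
```

With hard 0/1 diarization each frame falls into exactly one class, and with soft probabilities the four values in a frame always sum to 1. This also shows why diarization quality matters: any error in `diar` flows directly into the conditioning signal.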

---

## 📚 Citations

If you use this model, please cite the following papers:

```bibtex
@article{polok2026sedicow,
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  author={Alexander Polok and Dominik Klement and Samuele Cornell and Matthew Wiesner and Jan Černocký and Sanjeev Khudanpur and Lukáš Burget},
  journal={arXiv preprint arXiv:2601.19194},
  year={2026}
}

@article{POLOK2026101841,
  title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume = {95},
  year = {2026},
  doi = {10.1016/j.csl.2025.101841},
  author = {Alexander Polok et al.}
}

@INPROCEEDINGS{10887683,
  title={Target Speaker ASR with Whisper},
  author={Polok, Alexander et al.},
  booktitle={ICASSP 2025},
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887683}
}
```

## 📬 Contact

* **Issues:** [GitHub Issues](https://github.com/BUTSpeechFIT/TS-ASR-Whisper/issues)
* **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)