# 🎧 DiCoW v3.3 – Target-Speaker ASR
This repository hosts DiCoW v3.3, a Target-Speaker ASR (TS-ASR) model developed by BUT Speech@FIT. It is designed to transcribe the speech of a specific speaker within a multi-talker mixture by conditioning on speaker diarization outputs.
## 🔧 What's New in v3.3?

This version represents a significant stabilization and enhancement over the original DiCoW (v1):

- ⚡ **Improved Conditioning**: Introduces FDDT (Frame-Level Diarization-Dependent Transformation) layers before the positional embeddings for better signal modulation.
- 📉 **Reduced Error**: Achieves a ~50% relative reduction in tcpWER on Libri3Mix compared to v1.
- 🛠️ **Training Stability**: Uses less suppressive initialization and flexible data segmentation (no forced end-timestamps).
- 💪 **Robustness**: Trained with STNO noise injection and SpecAugment to handle imperfect diarization.
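To give an intuition for the noise-injection idea, the sketch below corrupts a one-hot frame-level conditioning mask by randomly reassigning a fraction of frames, simulating diarization errors at training time. The function name, parameters, and perturbation scheme are illustrative assumptions, not DiCoW's exact training recipe:

```python
import numpy as np

def perturb_stno(mask, flip_prob=0.1, seed=None):
    """Illustrative STNO noise injection: with probability `flip_prob`,
    replace a frame's one-hot class with a random one. This is a sketch
    of the idea, not the exact scheme used for DiCoW."""
    rng = np.random.default_rng(seed)
    noisy = mask.copy()
    n_classes, n_frames = noisy.shape
    flip = rng.random(n_frames) < flip_prob          # which frames to corrupt
    rand_cls = rng.integers(0, n_classes, size=n_frames)
    for f in np.flatnonzero(flip):
        noisy[:, f] = 0.0                            # clear the frame...
        noisy[rand_cls[f], f] = 1.0                  # ...and assign a random class
    return noisy
```

Exposing the model to such corrupted conditioning during training is what makes it tolerant of imperfect diarization at inference time.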
## ⚡ Quick Usage

### 1. Run the Interactive Demo (Gradio)

The easiest way to use this model is via the DiCoW inference repository. We provide a Gradio app that handles diarization and STNO mask generation automatically:

```bash
python app.py
```
### 2. Load in Python

If you want to download and load the model manually in your own scripts:

```python
from transformers import AutoModelForSpeechSeq2Seq

# Load the model (requires remote code for the custom FDDT layers)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_3",
    trust_remote_code=True,
)

# Note: the model expects specific STNO conditioning inputs.
# See inference.py in the GitHub repo for the full pipeline.
```
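For context, STNO is a frame-level Silence / Target / Non-target / Overlap decomposition derived from diarization. Below is a minimal sketch of building such a 4-channel mask from per-frame speaker activity; the function name and channel ordering are assumptions, so consult the inference repository for the exact convention:

```python
import numpy as np

def stno_mask(target_active, other_active):
    """Build a 4-channel STNO (Silence/Target/Non-target/Overlap) mask
    from per-frame boolean speaker activity. Channel ordering here is
    an assumption; check the DiCoW inference code for the real layout."""
    t = np.asarray(target_active, dtype=bool)
    o = np.asarray(other_active, dtype=bool)
    mask = np.zeros((4, len(t)), dtype=np.float32)
    mask[0] = ~t & ~o  # S: nobody speaks
    mask[1] = t & ~o   # T: target speaks alone
    mask[2] = ~t & o   # N: only non-target speakers
    mask[3] = t & o    # O: target overlapped by others
    return mask

# Target active in frames 1-3, another speaker active in frames 3-4
mask = stno_mask([0, 1, 1, 1, 0, 0], [0, 0, 0, 1, 1, 0])
```

The four channels form a per-frame one-hot assignment, which is what lets the model modulate its encoder features differently for target, non-target, and overlapped speech.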
## 🧬 Want to build your own DiCoW?

It's all yours with just two commands! This model is fully open-source and reproducible using our toolkit.

### 1. Data Preparation

Clone the `mt-asr-data-prep` repository and run the setup script to generate the required manifests:

```bash
./prepare.sh --single-mic-only --root-dir /path/to/workdir
```

### 2. Training

Clone the training repository `TS-ASR-Whisper` and launch the experiment using the pre-configured `dicow_v3` recipe:

```bash
sbatch --export SRC_ROOT=$PWD scripts/submit_slurm.sh +train=dicow_v3
```
## 📊 Performance Snapshot (tcpWER)

Metric: Time-Constrained Minimum Permutation WER (5 s collar)
| Dataset | DiCoW v1 (Baseline) | DiCoW v3.3 (This Model) |
|---|---|---|
| Libri2Mix (Both) | 21.6% | 9.7% |
| LibriSpeechMix (2) | 17.9% | 3.1% |
| AMI (SDM) | 21.4% | 18.7% |
| NOTSOFAR-1 (Small-SC) | 29.8% | 26.6% |
Scores are based on DiariZen diarization; see the paper for results with real diarization. 🏆 View the Full Leaderboard
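As a quick sanity check, the relative improvements implied by the table can be computed directly (values are copied from the table above):

```python
# tcpWER from the table: (DiCoW v1 baseline, DiCoW v3.3)
results = {
    "Libri2Mix (Both)": (21.6, 9.7),
    "LibriSpeechMix (2)": (17.9, 3.1),
    "AMI (SDM)": (21.4, 18.7),
    "NOTSOFAR-1 (Small-SC)": (29.8, 26.6),
}
for name, (v1, v33) in results.items():
    rel = 100 * (v1 - v33) / v1
    print(f"{name}: {rel:.1f}% relative tcpWER reduction")
```

Note that the gains are largest on the synthetic mixtures (Libri2Mix, LibriSpeechMix) and more modest on real meeting data (AMI, NOTSOFAR-1).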
## ⚙️ Model Details

- **Base Architecture**: Whisper large-v3-turbo
- **Conditioning**: Frame-Level Diarization-Dependent Transformations (FDDT)
- **Input**: 30 s audio + 4-channel STNO mask
- **Training Data**: AMI, NOTSOFAR-1, LibriMix (2/3 spk), synthetic LibriSpeech mixtures
## ⚠️ Limitations

- **Diarization Dependent**: Performance is heavily dependent on the quality of the input diarization.
- **Ambiguity**: In scenarios with more than two fully overlapping speakers, the model may struggle to distinguish the target speaker (addressed in our upcoming SE-DiCoW model).
## 📚 Citations

If you use this model, please cite our CS&L 2026 and ICASSP 2025 papers:

```bibtex
@article{POLOK2026101841,
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume  = {95},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841},
  author  = {Alexander Polok et al.}
}

@inproceedings{10887683,
  title     = {Target Speaker ASR with Whisper},
  author    = {Polok, Alexander et al.},
  booktitle = {ICASSP 2025},
  year      = {2025},
  doi       = {10.1109/ICASSP49660.2025.10887683}
}
```
## 📬 Contact

- **Issues**: GitHub Issues
- **Email**: ipoloka@fit.vut.cz