This repository provides the checkpoints and the datasets used in the paper:
J-E. Ayilo, M. Sadeghi, R. Serizel, and X. Alameda-Pineda, "Diffusion-based Unsupervised Audio-visual Speech Enhancement," accepted at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025.
Please also read the main GitHub repository for the full details.
About the checkpoints:
- aonly_tcd_speech_modeling_default_28M.ckpt : the audio-only diffusion model trained on TCD-TIMIT clean speech
- av_tcd_speech_modeling_concat_attn_masking_light_avhubert_p0_28M_enc_dec.ckpt : the audio-visual diffusion model trained on TCD-TIMIT clean speech
About the datasets:
- CROPPED_MOUTH_ldmark_28_68_size_112_112.tar.gz : lip-region videos at 112×112 resolution, extracted from TCD-TIMIT (train/valid/test) and the LRS3 test set
- CROPPED_MOUTH_ldmark_48_68_size_88_88.tar.gz : lip-region videos at 88×88 resolution, extracted from TCD-TIMIT (train/valid/test) and the LRS3 test set
- LRS3_NTCD.tar.gz : test set of noisy speech obtained by mixing LRS3 clean speech with noise from NTCD
- TCD_DEMAND.tar.gz : test set of noisy speech obtained by mixing TCD-TIMIT clean speech with noise from DEMAND
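The dataset files listed above are gzip-compressed tar archives, so they can be unpacked with Python's standard library. The following is a minimal sketch; the helper name `extract_archive` and the `data` destination folder are illustrative choices and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="data"):
    """Unpack a .tar.gz archive into dest_dir and return the destination path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        # Use the safer "data" extraction filter when this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


# Example (assumes the archive was downloaded to the working directory):
# extract_archive("LRS3_NTCD.tar.gz")
```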
Structure of the video folder
LRS3
CROPPED_MOUTH_ldmark_48_68_size_88_88/LRS3
└── test
    ├── 0Fi83BHQsMA
    │   ├── 00002_mouthcrop.mp4
    │   ├── 00004_mouthcrop.mp4
    │   ├── 00005_mouthcrop.mp4
    │   └── 00006_mouthcrop.mp4
    ├── 0gks6ceq4eQ
    ...
    └── zuYzOn0U2PY
        ├── 00001_mouthcrop.mp4
        ├── 00002_mouthcrop.mp4
        ├── 00003_mouthcrop.mp4
        └── 00005_mouthcrop.mp4
TCD-TIMIT
CROPPED_MOUTH_ldmark_48_68_size_88_88/TCD-TIMIT
├── test
│   ├── 09F
│   │   └── straightcam
│   │       ├── sa1_mouthcrop.mp4
│   │       ├── sa2_mouthcrop.mp4
│   │       ...
│   ...
│   └── 56M
│       └── straightcam
│           ├── sa1_mouthcrop.mp4
│           ├── sa2_mouthcrop.mp4
│           ├── si1055_mouthcrop.mp4
│           ...
├── train
│   ├── 01M
│   │   └── straightcam
│   │       ├── sa1_mouthcrop.mp4
│   │       ├── sa2_mouthcrop.mp4
│   │       ...
│   ...
│   └── 58F
│       └── straightcam
│           ├── sa1_mouthcrop.mp4
│           ├── sa2_mouthcrop.mp4
│           ...
└── valid
    ├── 06M
    │   └── straightcam
    │       ├── sa1_mouthcrop.mp4
    │       ├── sa2_mouthcrop.mp4
    │       ...
    ...
    └── 59F
        └── straightcam
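Given the two layouts above, the mouth-crop videos of one split can be collected with a recursive glob. This is a sketch only: the function name `list_mouth_crops` is hypothetical, and it assumes the archive was extracted so that the `LRS3` and `TCD-TIMIT` folders sit directly under the video root.

```python
from pathlib import Path


def list_mouth_crops(video_root, dataset, split):
    """Return all *_mouthcrop.mp4 files of one split, sorted by path.

    video_root: extracted folder, e.g. CROPPED_MOUTH_ldmark_48_68_size_88_88
    dataset:    "LRS3" or "TCD-TIMIT"
    split:      "test" for LRS3; "train", "valid" or "test" for TCD-TIMIT
    """
    split_dir = Path(video_root) / dataset / split
    # rglob covers both layouts: LRS3 stores <video_id>/<clip>_mouthcrop.mp4,
    # TCD-TIMIT stores <speaker>/straightcam/<utterance>_mouthcrop.mp4.
    return sorted(split_dir.rglob("*_mouthcrop.mp4"))
```

For example, `list_mouth_crops("CROPPED_MOUTH_ldmark_48_68_size_88_88", "TCD-TIMIT", "test")` would return the test videos of every speaker.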
Structure of the noisy speech dataset
- LRS3_NTCD
LRS3_NTCD
├── new_lrs3_ntcd_test.pkl
└── test
    ├── 0Fi83BHQsMA
    │   ├── 00002.wav
    │   ├── 00004.wav
    │   ├── 00005.wav
    │   └── 00006.wav
    ├── 0gks6ceq4eQ
    ├── 0iIh5YYDR2o
    ...
    └── zuYzOn0U2PY
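The LRS3_NTCD audio tree uses the same video-ID folders and clip numbers as the LRS3 mouth-crop tree, so a noisy file can be matched to its video by path alone. The sketch below relies on that naming correspondence, which is inferred from the two listings; the function name `lrs3_video_for_wav` is hypothetical, and the `.pkl` index may well be the intended way to pair files (its contents are not described here).

```python
from pathlib import Path


def lrs3_video_for_wav(wav_path, audio_root, video_root):
    """Map <audio_root>/test/<id>/<clip>.wav to
    <video_root>/LRS3/test/<id>/<clip>_mouthcrop.mp4."""
    # Path relative to the audio root, e.g. test/0Fi83BHQsMA/00002.wav
    rel = Path(wav_path).relative_to(audio_root)
    return Path(video_root) / "LRS3" / rel.parent / f"{rel.stem}_mouthcrop.mp4"
```

Checking that the returned path exists before loading it is advisable, since the one-to-one correspondence is an assumption.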
- TCD_DEMAND
TCD_DEMAND
├── new_tcd_demand_test.pkl
├── new_tcd_demand_train.pkl
├── new_tcd_demand_val.pkl
├── test
│   ├── 09F
│   │   ├── si1094_OOFFICE_-5.wav
│   │   ├── si1094_OOFFICE_5.wav
│   │   ├── si1094_SPSQUARE_-5.wav
│   │   ...
│   └── 59F
├── valid
└── train
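The TCD_DEMAND file names appear to follow the pattern `<utterance>_<noise>_<snr>.wav` inside a per-speaker folder. The sketch below splits a path on that pattern. Reading the last field as an SNR in dB is an assumption based on the example names, and `NoisyUtterance` and `parse_tcd_demand_path` are hypothetical names.

```python
from pathlib import Path
from typing import NamedTuple


class NoisyUtterance(NamedTuple):
    speaker: str    # parent folder name, e.g. "09F"
    utterance: str  # e.g. "si1094"
    noise: str      # e.g. "OOFFICE"
    snr_db: int     # assumed to be the SNR in dB, e.g. -5


def parse_tcd_demand_path(wav_path):
    """Split .../<speaker>/<utterance>_<noise>_<snr>.wav into its fields."""
    wav_path = Path(wav_path)
    # Split from the right so the last two fields are always noise and SNR.
    utterance, noise, snr = wav_path.stem.rsplit("_", 2)
    return NoisyUtterance(wav_path.parent.name, utterance, noise, int(snr))
```

Under the same naming assumption, the utterance field would point to `<speaker>/straightcam/<utterance>_mouthcrop.mp4` in the TCD-TIMIT video tree.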