This repository provides the checkpoints and the datasets used in the paper:

J-E. Ayilo, M. Sadeghi, R. Serizel and X. Alameda-Pineda, "Diffusion-based Unsupervised Audio-visual Speech Enhancement", accepted at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025.

Please also read the main GitHub repository for a complete understanding.

About the checkpoints:

aonly_tcd_speech_modeling_default_28M.ckpt : the audio-only diffusion model trained on TCD-TIMIT clean speech
av_tcd_speech_modeling_concat_attn_masking_light_avhubert_p0_28M_enc_dec.ckpt : the audio-visual diffusion model trained on TCD-TIMIT clean speech

About the datasets:

CROPPED_MOUTH_ldmark_28_68_size_112_112.tar.gz : lip videos of size 112×112 extracted from TCD-TIMIT (train/valid/test) and the LRS3 test set
CROPPED_MOUTH_ldmark_48_68_size_88_88.tar.gz : lip videos of size 88×88 extracted from TCD-TIMIT (train/valid/test) and the LRS3 test set
LRS3_NTCD.tar.gz : test set of noisy speech obtained by mixing LRS3 clean speech with noise from NTCD
TCD_DEMAND.tar.gz : test set of noisy speech obtained by mixing TCD-TIMIT clean speech with noise from DEMAND
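
The archives listed above are gzip-compressed tarballs, so they can be unpacked with Python's standard library. The following is a minimal sketch; the helper name `extract_archive` and the destination folder are illustrative, and the assumption that each archive unpacks to a single top-level folder is not confirmed here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir.

    Returns the sorted top-level names found in the archive, which is a
    quick way to check what folder(s) the archive unpacked to.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    top_level = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])
        # Only extract archives obtained from a source you trust.
        tar.extractall(dest)
    return sorted(top_level)


if __name__ == "__main__":
    # Hypothetical local paths; adjust to where the files were downloaded.
    archive = Path("LRS3_NTCD.tar.gz")
    if archive.exists():
        print(extract_archive(archive, "data"))
```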

Structure of the video folder

LRS3

 CROPPED_MOUTH_ldmark_48_68_size_88_88/LRS3
 └── test
     ├── 0Fi83BHQsMA
     │   ├── 00002_mouthcrop.mp4
     │   ├── 00004_mouthcrop.mp4
     │   ├── 00005_mouthcrop.mp4
     │   └── 00006_mouthcrop.mp4
     ├── 0gks6ceq4eQ
     ...
     └── zuYzOn0U2PY
         ├── 00001_mouthcrop.mp4
         ├── 00002_mouthcrop.mp4
         ├── 00003_mouthcrop.mp4
         └── 00005_mouthcrop.mp4

TCD-TIMIT

CROPPED_MOUTH_ldmark_48_68_size_88_88/TCD-TIMIT
├── test
│   ├── 09F
│   │   └── straightcam
│   │       ├── sa1_mouthcrop.mp4
│   │       ├── sa2_mouthcrop.mp4
│   │       ...
│   ...
│   └── 56M
│       └── straightcam
│           ├── sa1_mouthcrop.mp4
│           ├── sa2_mouthcrop.mp4
│           ├── si1055_mouthcrop.mp4
│           ...
├── train
│   ├── 01M
│   │   └── straightcam
│   │       ├── sa1_mouthcrop.mp4
│   │       ├── sa2_mouthcrop.mp4
│   │       ...
│   ...
│   └── 58F
│       └── straightcam
│           ├── sa1_mouthcrop.mp4
│           ├── sa2_mouthcrop.mp4
│           ...
└── valid
    ├── 06M
    │   └── straightcam
    │       ├── sa1_mouthcrop.mp4
    │       ├── sa2_mouthcrop.mp4
    │       ...
    ...
    └── 59F
        └── straightcam
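
Given the TCD-TIMIT layout above (split, then speaker, then `straightcam`, then `*_mouthcrop.mp4`), the videos of one split can be collected per speaker with a glob. This is only a sketch: the function name `list_tcd_mouth_crops` is illustrative and not part of the released code.

```python
from pathlib import Path


def list_tcd_mouth_crops(root, split):
    """Return {speaker: [video paths]} for one TCD-TIMIT split.

    root  : extracted CROPPED_MOUTH_* folder
    split : "train", "valid" or "test"
    """
    split_dir = Path(root) / "TCD-TIMIT" / split
    videos = {}
    # Pattern mirrors the tree: <speaker>/straightcam/<utterance>_mouthcrop.mp4
    for path in sorted(split_dir.glob("*/straightcam/*_mouthcrop.mp4")):
        speaker = path.parts[-3]
        videos.setdefault(speaker, []).append(path)
    return videos


if __name__ == "__main__":
    # Hypothetical location of the extracted 88x88 archive.
    crops = list_tcd_mouth_crops("CROPPED_MOUTH_ldmark_48_68_size_88_88", "test")
    for speaker, paths in crops.items():
        print(speaker, len(paths))
```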

Structure of the noisy speech dataset

LRS3_NTCD

  LRS3_NTCD
  ├── new_lrs3_ntcd_test.pkl
  └── test
      ├── 0Fi83BHQsMA
      │   ├── 00002.wav
      │   ├── 00004.wav
      │   ├── 00005.wav
      │   └── 00006.wav
      ├── 0gks6ceq4eQ
      ├── 0iIh5YYDR2o
       ...
      └── zuYzOn0U2PY
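
The LRS3_NTCD tree mirrors the LRS3 video tree (`test/<video id>/<utterance>`), so a noisy file can plausibly be paired with its mouth-crop video by path alone. The sketch below assumes this one-to-one naming; the helper name is illustrative, and the pairing actually used may instead be defined in `new_lrs3_ntcd_test.pkl`, whose contents are not described here.

```python
from pathlib import Path


def lrs3_video_for_wav(wav_path, noisy_root, video_root):
    """Map a noisy LRS3_NTCD wav file to the matching mouth-crop video path.

    Assumes <noisy_root>/test/<id>/<utt>.wav corresponds to
    <video_root>/LRS3/test/<id>/<utt>_mouthcrop.mp4.
    """
    rel = Path(wav_path).relative_to(noisy_root)  # test/<id>/<utt>.wav
    video_rel = rel.with_name(rel.stem + "_mouthcrop.mp4")
    return Path(video_root) / "LRS3" / video_rel


if __name__ == "__main__":
    print(
        lrs3_video_for_wav(
            "LRS3_NTCD/test/0Fi83BHQsMA/00002.wav",
            "LRS3_NTCD",
            "CROPPED_MOUTH_ldmark_48_68_size_88_88",
        )
    )
```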

TCD_DEMAND

  TCD_DEMAND
  ├── new_tcd_demand_test.pkl
  ├── new_tcd_demand_train.pkl
  ├── new_tcd_demand_val.pkl
  ├── test
  │   ├── 09F
  │   │   ├── si1094_OOFFICE_-5.wav
  │   │   ├── si1094_OOFFICE_5.wav
  │   │   ├── si1094_SPSQUARE_-5.wav
  │   │   ...
  │   └── 59F
  ├── valid
  └── train
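
File names such as `si1094_OOFFICE_-5.wav` appear to follow the pattern `<utterance>_<noise type>_<SNR>`. The sketch below parses a name under that assumption; the pattern, the interpretation of the last field as an SNR value, and the names `NoisyUtterance` and `parse_tcd_demand_name` are inferred from the listing and are not confirmed by the source.

```python
from pathlib import Path
from typing import NamedTuple


class NoisyUtterance(NamedTuple):
    utterance: str  # e.g. "si1094"
    noise: str      # e.g. "OOFFICE"
    snr: int        # e.g. -5 (assumed to be the SNR, presumably in dB)


def parse_tcd_demand_name(wav_path):
    """Split '<utterance>_<noise>_<snr>.wav' into its three fields."""
    stem = Path(wav_path).stem
    # Split from the right so the utterance part may itself contain "_".
    utterance, noise, snr = stem.rsplit("_", 2)
    return NoisyUtterance(utterance, noise, int(snr))


if __name__ == "__main__":
    print(parse_tcd_demand_name("TCD_DEMAND/test/09F/si1094_OOFFICE_-5.wav"))
    # NoisyUtterance(utterance='si1094', noise='OOFFICE', snr=-5)
```

The utterance field (`si1094`) matches the stem used by the mouth-crop videos (`si1094_mouthcrop.mp4`), which suggests it can serve as the key for pairing audio with video.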
