This repository provides the checkpoints and the datasets used in the paper:

J-E. Ayilo, M. Sadeghi, R. Serizel and X. Alameda-Pineda, "Diffusion-based Unsupervised Audio-visual Speech Enhancement," accepted at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025.

Please also read the main GitHub repository for a complete understanding.

About the checkpoints:

  • aonly_tcd_speech_modeling_default_28M.ckpt : the audio-only diffusion model trained on TCD-TIMIT clean speech
  • av_tcd_speech_modeling_concat_attn_masking_light_avhubert_p0_28M_enc_dec.ckpt : the audio-visual diffusion model trained on TCD-TIMIT clean speech

About the datasets:

  • CROPPED_MOUTH_ldmark_28_68_size_112_112.tar.gz : lip videos at 112×112 resolution, extracted from TCD-TIMIT (train/valid/test) and the LRS3 test set
  • CROPPED_MOUTH_ldmark_48_68_size_88_88.tar.gz : lip videos at 88×88 resolution, extracted from TCD-TIMIT (train/valid/test) and the LRS3 test set
  • LRS3_NTCD.tar.gz : test set of noisy speech obtained by mixing LRS3 clean speech with noise from NTCD
  • TCD_DEMAND.tar.gz : noisy speech obtained by mixing TCD-TIMIT clean speech with noise from DEMAND
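
The dataset files are gzip-compressed tarballs, so they can be unpacked with Python's standard library. The following is a minimal sketch, assuming the archives have already been downloaded to the current directory; the helper name `extract_archive` and the `data` destination folder are illustrative and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Use the safer "data" extraction filter when this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Hypothetical local paths: adjust to where the archives were downloaded.
    for name in ["LRS3_NTCD.tar.gz", "TCD_DEMAND.tar.gz"]:
        if Path(name).exists():
            print(name, "->", extract_archive(name, "data"))
```

The video archives can be unpacked the same way; they are considerably larger, so extracting them to a separate destination folder may be convenient.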

Structure of the video folder

  • LRS3

     CROPPED_MOUTH_ldmark_48_68_size_88_88/LRS3
     └── test
         ├── 0Fi83BHQsMA
         │   ├── 00002_mouthcrop.mp4
         │   ├── 00004_mouthcrop.mp4
         │   ├── 00005_mouthcrop.mp4
         │   └── 00006_mouthcrop.mp4
         ├── 0gks6ceq4eQ
         ...
         └── zuYzOn0U2PY
             ├── 00001_mouthcrop.mp4
             ├── 00002_mouthcrop.mp4
             ├── 00003_mouthcrop.mp4
             └── 00005_mouthcrop.mp4
    
  • TCD-TIMIT

    CROPPED_MOUTH_ldmark_48_68_size_88_88/TCD-TIMIT
    ├── test
    │   ├── 09F
    │   │   └── straightcam
    │   │       ├── sa1_mouthcrop.mp4
    │   │       ├── sa2_mouthcrop.mp4
    │   │       ...
    │   ...
    │   └── 56M
    │       └── straightcam
    │           ├── sa1_mouthcrop.mp4
    │           ├── sa2_mouthcrop.mp4
    │           ├── si1055_mouthcrop.mp4
    │           ...
    ├── train
    │   ├── 01M
    │   │   └── straightcam
    │   │       ├── sa1_mouthcrop.mp4
    │   │       ├── sa2_mouthcrop.mp4
    │   │       ...
    │   ...
    │   └── 58F
    │       └── straightcam
    │           ├── sa1_mouthcrop.mp4
    │           ├── sa2_mouthcrop.mp4
    │           ...
    └── valid
        ├── 06M
        │   └── straightcam
        │       ├── sa1_mouthcrop.mp4
        │       ├── sa2_mouthcrop.mp4
        │       ...
        ...
        └── 59F
            └── straightcam
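
Given the two layouts above, the mouth-crop videos of a split can be enumerated with a short directory walk. The sketch below is only an illustration under the assumption that the archive was extracted unchanged; the function name `list_mouth_crops` is hypothetical and not part of this repository.

```python
from pathlib import Path

SUFFIX = "_mouthcrop.mp4"


def list_mouth_crops(video_root, dataset, split):
    """Return (group_id, utterance_id, path) for every mouth crop in a split.

    group_id is the first folder below the split: a video id for LRS3,
    a speaker id for TCD-TIMIT.
    """
    split_dir = Path(video_root) / dataset / split
    items = []
    # rglob covers both layouts: LRS3 stores clips directly under the video id,
    # while TCD-TIMIT adds an intermediate "straightcam" folder.
    for path in sorted(split_dir.rglob("*" + SUFFIX)):
        group_id = path.relative_to(split_dir).parts[0]
        utterance_id = path.name[: -len(SUFFIX)]
        items.append((group_id, utterance_id, path))
    return items


if __name__ == "__main__":
    root = "CROPPED_MOUTH_ldmark_48_68_size_88_88"
    for group_id, utterance_id, path in list_mouth_crops(root, "TCD-TIMIT", "test"):
        print(group_id, utterance_id, path)
```

The same call with `"LRS3"` and `"test"` lists the LRS3 clips, since only the depth of the folder hierarchy differs between the two datasets.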
    

Structure of the noisy speech dataset

  • LRS3_NTCD

    LRS3_NTCD
    ├── new_lrs3_ntcd_test.pkl
    └── test
        ├── 0Fi83BHQsMA
        │   ├── 00002.wav
        │   ├── 00004.wav
        │   ├── 00005.wav
        │   └── 00006.wav
        ├── 0gks6ceq4eQ
        ├── 0iIh5YYDR2o
        ...
        └── zuYzOn0U2PY

  • TCD_DEMAND

    TCD_DEMAND
    ├── new_tcd_demand_test.pkl
    ├── new_tcd_demand_train.pkl
    ├── new_tcd_demand_val.pkl
    ├── test
    │   ├── 09F
    │   │   ├── si1094_OOFFICE_-5.wav
    │   │   ├── si1094_OOFFICE_5.wav
    │   │   ├── si1094_SPSQUARE_-5.wav
    │   │   ...
    │   └── 59F
    ├── valid
    └── train
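
The TCD_DEMAND file names shown above appear to follow the pattern `<utterance>_<noise>_<level>.wav`. The sketch below splits such a name into its fields and builds the path of the mouth-crop video with the same speaker and utterance. It rests on two assumptions that this README does not state: that the trailing number is the SNR in dB, and that a noisy file pairs with the video of the same name under `straightcam`. The names `NoisyUtterance`, `parse_tcd_demand_path` and `matching_mouth_crop` are illustrative only.

```python
from pathlib import Path
from typing import NamedTuple


class NoisyUtterance(NamedTuple):
    speaker: str
    utterance: str
    noise_type: str
    level: int  # assumed to be the SNR in dB


def parse_tcd_demand_path(wav_path):
    """Split e.g. .../09F/si1094_OOFFICE_-5.wav into its name fields."""
    wav_path = Path(wav_path)
    # Split from the right so that only the last two underscores are used.
    utterance, noise_type, level = wav_path.stem.rsplit("_", 2)
    return NoisyUtterance(wav_path.parent.name, utterance, noise_type, int(level))


def matching_mouth_crop(wav_path, video_root, split="test"):
    """Build the path of the mouth-crop video assumed to match a noisy file."""
    info = parse_tcd_demand_path(wav_path)
    return (Path(video_root) / "TCD-TIMIT" / split / info.speaker
            / "straightcam" / f"{info.utterance}_mouthcrop.mp4")


if __name__ == "__main__":
    wav = "TCD_DEMAND/test/09F/si1094_OOFFICE_-5.wav"
    print(parse_tcd_demand_path(wav))
    # NoisyUtterance(speaker='09F', utterance='si1094', noise_type='OOFFICE', level=-5)
    print(matching_mouth_crop(wav, "CROPPED_MOUTH_ldmark_48_68_size_88_88"))
```

The `.pkl` files next to the split folders are not parsed here, because their contents are not described in this README.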