DimasMP3's picture
Upload folder using huggingface_hub
5c69097 verified

TalkSet Generation

You can check the 'train.txt' and 'test.txt' to generate TalkSet by your own.

This generate_TalkSet.py code is just used for your reference, I did not check it recently.

Input the LRS3, VoxCeleb2, 3 list files in lists_in Output TalkSet, train.txt, test.txt (Here the test set is the validation set actually)

Usage:

Set the following parser based on the location of your data:

out_path: the output TalkSet location Vox_audio: Location of the Vox2, training set, audio location Vox_video: Location of the Vox2, training set, video location lrs3_audio: Location of the LRS3, audio location lrs3_video: Location of the LRS3, video location task: The part of the TalkSet you want to generate, eg: TAudio num_cpu: The num of the threads, higher will be faster, based on your PC performance, eg: 10

python TalkSet/generate_TalkSet.py --task 'TAudio'
python TalkSet/generate_TalkSet.py --task 'FAudio'
python TalkSet/generate_TalkSet.py --task 'TFAudio'
python TalkSet/generate_TalkSet.py --task 'TSilence'
python TalkSet/generate_TalkSet.py --task 'FSilence'
python TalkSet/generate_TalkSet.py --task 'Fusion'

Description:

For lists_out\*.txt files:

  • The 1st row is the face clips data type,
    • TAudio: audio is active, lip is moving, audio and lip are sync
    • FAudio: audio is active, lip is moving, audio and lip are not sync (Speech from others)
    • TFAudio: one part is 'TAudio', the other part is 'FAudio'
    • TSilence: one part is 'TAudio', in the other part, audio is non-active, lip is not moving
    • FSilence: one part is 'silence'(audio is non-active, lip is not moving), in the other part, audio is active, lip is not moving (Speech from others)
  • The 2nd row is the path for the audio file (filename started from 'silence' is the data from LRS3, filename started from 'id.....' is the data from VoxCeleb2)
  • The 3rd row is the path for the video file
  • The 4th row is the length(seconds) of this data
  • The 5th row is the start of 'active' clip (in FSilence, it presents the 'silence' part)
  • The 6th row is the end of 'active' clip
  • The 7th row is the start of 'non-active' clip (in FSilence, it presents the 'speech from others' part)
  • The 8th row is the end of 'non-active' clip
  • The 9th row is the file ID

The dataset generated will not be fixed each time because we randomly select FSlience data, and the change point is the random number. We believe the result will be similar. The whole time to generate the TalkSet will use about 3 to 6 hours in our experiments.