# Dataset Preparation Guide
This guide provides step-by-step instructions for preparing datasets to train models in this repository.
## 0. Pre-requisites
Ensure the following checkpoint files exist in the `ckpts/` directory before continuing:
* `ckpts/vae.ckpt`
* `ckpts/synchformer_state_dict.pth`
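This check can be scripted; a minimal sketch (the `missing_checkpoints` helper is illustrative, not part of the repository):

```python
from pathlib import Path

# Checkpoints required before any feature extraction (see the list above).
REQUIRED_CKPTS = [
    "ckpts/vae.ckpt",
    "ckpts/synchformer_state_dict.pth",
]

def missing_checkpoints(repo_root="."):
    """Return the required checkpoint files that are not present under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_CKPTS if not (root / p).is_file()]
```

Running `missing_checkpoints()` from the repository root returns an empty list when both files are in place.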
## 1. Preparing Video-Text Datasets
To convert raw videos and their CoT annotations into training features, use the following command:
```bash
torchrun --nproc_per_node=8 data_utils/extract_training_video.py \
--root <video_path> \
--tsv_path <csv_path> \
--save-dir <feature_output_dir> \
--duration_sec <uniform_video_duration_in_seconds> \
    --audio_samples <duration_sec * 44100>
```
* `<video_path>`: Path to the root directory containing all .mp4 videos to be processed (all videos must be of equal duration).
* `<csv_path>`: Path to the TSV/CSV file that lists video-text pairs (see `demo_test.csv` for the expected format).
* `<feature_output_dir>`: Directory where extracted video features will be saved.
* `<uniform_video_duration_in_seconds>`: Duration to which all videos will be uniformly trimmed or padded.
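The `--audio_samples` argument is the sample count for the chosen duration at the scripts' 44.1 kHz rate, i.e. `duration_sec * 44100`. A trivial helper to compute it (the 9-second duration below is only an illustrative value):

```python
SAMPLE_RATE = 44100  # 44.1 kHz rate used by the extraction commands above

def audio_samples(duration_sec: int) -> int:
    """Number of audio samples for duration_sec seconds at 44.1 kHz."""
    return duration_sec * SAMPLE_RATE

# e.g. 9-second clips: pass --duration_sec 9 --audio_samples 396900
```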
## 2. Preparing Audio-Text Datasets
You can also include audio-text pairs for training. Use the following command to extract features:
```bash
torchrun --nproc_per_node=8 data_utils/extract_training_audio.py \
--root <audio_path> \
--tsv_path <csv_path> \
--save-dir <feature_output_dir> \
--duration_sec <uniform_audio_duration_in_seconds> \
    --audio_samples <duration_sec * 44100>
```
* `<audio_path>`: Path to the root directory containing the raw audio files to be processed.
* `<csv_path>`: Path to the TSV/CSV file that lists audio-text pairs.
* `<feature_output_dir>`: Directory where extracted audio features will be saved.
* `<uniform_audio_duration_in_seconds>`: Duration to which all audio clips will be uniformly trimmed or padded.
* Note: the audio input for feature extraction must be trimmed or padded to match the duration used for the video-text datasets.
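Every clip must therefore end up at exactly the same number of samples. A minimal trim-or-pad sketch in pure Python for illustration (real pipelines typically operate on tensor/array types):

```python
def trim_or_pad(samples, target_len, pad_value=0.0):
    """Trim a sequence of audio samples to target_len, or pad it with silence."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return list(samples) + [pad_value] * (target_len - len(samples))
```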
## 3. Organizing Feature Files
For each dataset (video or audio), create a `.txt` file listing all feature file names (one per line), for example:
```
item1.pth
item2.pth
item3.pth
...
```
This file acts as the training split and will be referenced in the dataset config.
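Such a split file can be generated by listing the extracted `.pth` files; a small sketch (the `write_split` helper is hypothetical, not a repository utility):

```python
from pathlib import Path

def write_split(feature_dir, split_path):
    """Write one feature file name per line, sorted for reproducibility."""
    names = sorted(p.name for p in Path(feature_dir).glob("*.pth"))
    Path(split_path).write_text("\n".join(names) + "\n")
    return names
```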
## 4. Creating the Dataset Configuration JSON
Create a JSON file following the structure below (adapted from `ThinkSound/configs/multimodal_dataset_demo.json`):
```json
{
"dataset_type": "multimodal_dir",
"video_datasets": [
{
"id": "video_dataset_id",
"path": "path_to_video_feature_dir",
"split_path": "path_to_video_split_txt"
}
],
"audio_datasets": [
{
"id": "audio_dataset_id",
"path": "path_to_audio_feature_dir",
"split_path": "path_to_audio_split_txt"
}
],
"val_datasets": [
{
"id": "val_dataset_id",
"path": "path_to_val_feature_dir",
"split_path": "path_to_val_split_txt"
}
],
"random_crop": true,
"input_type": "prompt"
}
```
You can include multiple datasets under `video_datasets` and `audio_datasets` by appending additional dictionary blocks to each list. Providing `val_datasets` is encouraged; it must be a video-text dataset.
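To avoid hand-editing JSON, the config can also be assembled programmatically; a sketch using the placeholder IDs and paths from the example above (`make_dataset_config` is illustrative, not a repository function):

```python
import json

def make_dataset_config(video_datasets, audio_datasets, val_datasets,
                        random_crop=True, input_type="prompt"):
    """Build a dataset config dict matching the JSON structure above."""
    return {
        "dataset_type": "multimodal_dir",
        "video_datasets": video_datasets,
        "audio_datasets": audio_datasets,
        "val_datasets": val_datasets,
        "random_crop": random_crop,
        "input_type": input_type,
    }

config = make_dataset_config(
    video_datasets=[{"id": "video_dataset_id",
                     "path": "path_to_video_feature_dir",
                     "split_path": "path_to_video_split_txt"}],
    audio_datasets=[{"id": "audio_dataset_id",
                     "path": "path_to_audio_feature_dir",
                     "split_path": "path_to_audio_split_txt"}],
    val_datasets=[{"id": "val_dataset_id",
                   "path": "path_to_val_feature_dir",
                   "split_path": "path_to_val_split_txt"}],
)
# json.dumps(config, indent=2) produces a file equivalent to the example above.
```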
## 5. Proceed to Training
Refer to [`docs/ThinkSound/Training.md`](./Training.md) for detailed training instructions once the dataset configuration is complete.