Preparing and Training with Video Metadata

This guide walks you through preparing your video metadata, splitting it for efficient training, and running the training scripts.

1. Prepare Your Data in `.jsonl` Format

Your video metadata should be organized in JSON Lines (.jsonl) format, where each line is a valid JSON object representing one video.

Example (pretty-printed here for readability; in the actual file, each record occupies a single line):

{
  "video_path": "data/infinitystar_toy_data/videos/e06b8ca5dbc6.mp4",
  "begin_frame_id": 0,
  "end_frame_id": 120,
  "tarsier2_caption": "The video features an animated character with long light orange hair and brown eyes.",
  "width": 1280,
  "height": 720,
  "h_div_w": 0.5625,
  "fps": 24
}
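
Before splitting or training, it can help to sanity-check each record. The following is a minimal sketch, not part of the repository: `check_record` and `to_jsonl_line` are hypothetical helper names, the assumption that every key in the example is required is ours, and the rule that `h_div_w` equals `height / width` is inferred from the example values (720 / 1280 = 0.5625).

```python
import json
import math

# Keys taken from the example record; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "video_path", "begin_frame_id", "end_frame_id",
    "tarsier2_caption", "width", "height", "h_div_w", "fps",
}


def check_record(line: str) -> dict:
    """Parse one .jsonl line and run basic consistency checks."""
    record = json.loads(line)  # raises ValueError if the line is not valid JSON
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Assumption: a clip must span at least one frame.
    if record["begin_frame_id"] >= record["end_frame_id"]:
        raise ValueError("begin_frame_id must be smaller than end_frame_id")
    # Inferred from the example: h_div_w is the height-to-width ratio.
    expected = record["height"] / record["width"]
    if not math.isclose(record["h_div_w"], expected, rel_tol=1e-3):
        raise ValueError(f"h_div_w is {record['h_div_w']}, expected about {expected:.4f}")
    return record


def to_jsonl_line(record: dict) -> str:
    """Serialize a record onto a single line, as JSON Lines requires."""
    return json.dumps(record, ensure_ascii=False)
```

Calling `check_record` on every line of a metadata file surfaces malformed JSON and inconsistent aspect ratios before they reach feature extraction.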

2. Split Metadata for Training

Split large .jsonl files into smaller chunks for efficient training. Replace JSONL_DIR with the folder containing your .jsonl files and SAVE_DIR with the output folder:

python3 data/infinitystar_toy_data/split_jsonls_for_training.py --jsonl_folder_list JSONL_DIR --save_dir SAVE_DIR --chunk_size 100
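
To illustrate what chunking a metadata file involves, here is a minimal sketch. It is not the implementation of `split_jsonls_for_training.py`: the function name `split_jsonl`, the output naming scheme, and the reading of `chunk_size` as "records per output file" are all assumptions made for illustration.

```python
from __future__ import annotations

from itertools import islice
from pathlib import Path


def split_jsonl(src: Path, save_dir: Path, chunk_size: int = 100) -> list[Path]:
    """Write src into files of at most chunk_size records each (hypothetical helper)."""
    save_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with src.open("r", encoding="utf-8") as f:
        # Skip blank lines and make sure every record ends with exactly one newline.
        lines = (ln.rstrip("\n") + "\n" for ln in f if ln.strip())
        index = 0
        while True:
            chunk = list(islice(lines, chunk_size))
            if not chunk:
                break
            # Assumed naming scheme: <original stem>_<4-digit index>.jsonl
            out = save_dir / f"{src.stem}_{index:04d}.jsonl"
            out.write_text("".join(chunk), encoding="utf-8")
            written.append(out)
            index += 1
    return written
```

Because the source file is read lazily, this approach keeps memory use bounded by one chunk regardless of the input size.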

3. Extract Video Features

Before extracting video features, edit scripts/extract_video_features.sh: set video_data_path and choose the pn value that matches the desired resolution and duration:

480p (5s): pn=0.40M
480p (10s): pn=0.40M with video_frames=161
720p (5s): pn=0.90M

Then, run the script:

bash scripts/extract_video_features.sh

4. Run Training Scripts

Once your metadata is prepared and features are extracted, you can start training.

480p Training (5s or 10s):

bash scripts/train_480p.sh

720p Training (only 5s):

bash scripts/train_720p.sh

The 480p configuration supports both 5-second and 10-second video training. For 10-second training, ensure that video_frames is set to 161 in extract_video_features.sh and train_480p.sh.
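
The supported resolution and duration combinations can be summarized as a small lookup table. This is only an illustrative sketch: `PRESETS` and `get_preset` are hypothetical names that do not exist in the repository, and the values are copied from this guide. The frame count for 5-second clips is not stated here, so it is left out rather than guessed.

```python
# Settings copied from this guide; the structure itself is hypothetical.
PRESETS = {
    ("480p", 5): {"pn": "0.40M", "train_script": "scripts/train_480p.sh"},
    ("480p", 10): {
        "pn": "0.40M",
        "video_frames": 161,  # must be set in both the extraction and training scripts
        "train_script": "scripts/train_480p.sh",
    },
    ("720p", 5): {"pn": "0.90M", "train_script": "scripts/train_720p.sh"},
}


def get_preset(resolution: str, seconds: int) -> dict:
    """Return the settings for a supported combination, or raise ValueError."""
    try:
        return PRESETS[(resolution, seconds)]
    except KeyError:
        supported = ", ".join(f"{r} ({s}s)" for r, s in PRESETS)
        raise ValueError(
            f"unsupported combination {resolution} ({seconds}s); supported: {supported}"
        ) from None
```

For example, `get_preset("480p", 10)` returns the entry with `video_frames` set to 161, while `get_preset("720p", 10)` raises a `ValueError` because 720p training only covers 5-second clips.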
