BryanW's picture
Upload folder using huggingface_hub
3d1c0e1 verified

Preparing and Training with Video Metadata

This guide walks you through preparing your video metadata, splitting it for efficient training, and running the training scripts.

1. Prepare Your Data in .jsonl Format

Your video metadata should be organized in JSON Lines (.jsonl) format, where each line is a valid JSON object representing one video.

Example:

{
  "video_path": "data/infinitystar_toy_data/videos/e06b8ca5dbc6.mp4",
  "begin_frame_id": 0,
  "end_frame_id": 120,
  "tarsier2_caption": "The video features an animated character with long light orange hair and brown eyes.",
  "width": 1280,
  "height": 720,
  "h_div_w": 0.5625,
  "fps": 24
}

2. Split Metadata for Training

For efficient training, large .jsonl files can be split into smaller chunks.

python3 data/infinitystar_toy_data/split_jsonls_for_training.py --jsonl_folder_list JSONL_DIR --save_dir SAVE_DIR --chunk_size 100

3. Extract Video Features

To extract video features, modify the extract_video_features.sh script. Set the video_data_path and choose the desired resolution.

  • 480p (5s): pn=0.40M
  • 480p (10s): pn=0.40M with video_frames=161
  • 720p (5s): pn=0.90M

Then, run the script:

bash scripts/extract_video_features.sh

4. Run Training Scripts

Once your metadata is prepared and features are extracted, you can start training.

480p Training (5s or 10s):

bash scripts/train_480p.sh

720p Training (only 5s):

bash scripts/train_720p.sh

The 480p configuration supports both 5-second and 10-second video training. For 10-second training, ensure that video_frames is set to 161 in extract_video_features.sh and train_480p.sh.