Preparing and Training with Video Metadata
This guide walks you through preparing your video metadata, splitting it for efficient training, and running the training scripts.
1. Prepare Your Data in .jsonl Format
Your video metadata should be organized in JSON Lines (.jsonl) format, where each line is a valid JSON object representing one video.
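Example (pretty-printed here for readability; in the actual .jsonl file each record must occupy a single line):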
{
"video_path": "data/infinitystar_toy_data/videos/e06b8ca5dbc6.mp4",
"begin_frame_id": 0,
"end_frame_id": 120,
"tarsier2_caption": "The video features an animated character with long light orange hair and brown eyes.",
"width": 1280,
"height": 720,
"h_div_w": 0.5625,
"fps": 24
}
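Before splitting or training, it can help to sanity-check the metadata file. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes that every key shown in the example is required, that end_frame_id is greater than begin_frame_id, and that h_div_w equals height / width (as in the example, where 720 / 1280 = 0.5625).

```python
import json
import math

# Keys taken from the example record above; treating all of them as
# required is an assumption, not something the training code is known to enforce.
REQUIRED_KEYS = {
    "video_path", "begin_frame_id", "end_frame_id", "tarsier2_caption",
    "width", "height", "h_div_w", "fps",
}


def validate_record(record):
    """Return a list of problems found in one metadata record (empty if none)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    missing = REQUIRED_KEYS - set(record)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if record["end_frame_id"] <= record["begin_frame_id"]:
        problems.append("end_frame_id must be greater than begin_frame_id")
    if record["width"] <= 0 or record["height"] <= 0:
        problems.append("width and height must be positive")
    elif not math.isclose(record["h_div_w"],
                          record["height"] / record["width"],
                          rel_tol=1e-3):
        problems.append("h_div_w does not match height / width")
    return problems


def validate_jsonl(path):
    """Map 1-based line numbers to the problems found on that line."""
    report = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                report[line_no] = [f"invalid JSON: {exc}"]
                continue
            problems = validate_record(record)
            if problems:
                report[line_no] = problems
    return report
```

Calling validate_jsonl("path/to/metadata.jsonl") returns an empty dict when every line passes these checks.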
2. Split Metadata for Training
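For efficient training, large .jsonl files can be split into smaller chunks. Replace JSONL_DIR with the folder containing your .jsonl files and SAVE_DIR with the output folder: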
python3 data/infinitystar_toy_data/split_jsonls_for_training.py --jsonl_folder_list JSONL_DIR --save_dir SAVE_DIR --chunk_size 100
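To illustrate the general idea of chunking, here is a minimal sketch. It is not the repository's split_jsonls_for_training.py; the function name and output file naming are hypothetical, and it assumes that chunk_size means the number of records per output file, which the guide does not confirm.

```python
from pathlib import Path


def split_jsonl(src, save_dir, chunk_size):
    """Write the records of one .jsonl file into files of at most chunk_size lines.

    Illustrative only: the output naming scheme here is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    src = Path(src)
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    # Keep non-empty lines and normalize each to end with exactly one newline.
    with src.open(encoding="utf-8") as f:
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]

    out_paths = []
    for start in range(0, len(lines), chunk_size):
        out = save_dir / f"{src.stem}_{start // chunk_size:04d}.jsonl"
        out.write_text("".join(lines[start:start + chunk_size]), encoding="utf-8")
        out_paths.append(out)
    return out_paths
```

With chunk_size=100, a 250-record file would produce three files holding 100, 100, and 50 records.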
3. Extract Video Features
To extract video features, edit scripts/extract_video_features.sh: set video_data_path and choose the settings for the desired resolution:
- 480p (5s): pn=0.40M
- 480p (10s): pn=0.40M with video_frames=161
- 720p (5s): pn=0.90M
Then, run the script:
bash scripts/extract_video_features.sh
4. Run Training Scripts
Once your metadata is prepared and features are extracted, you can start training.
480p Training (5s or 10s):
bash scripts/train_480p.sh
720p Training (only 5s):
bash scripts/train_720p.sh
The 480p configuration supports both 5-second and 10-second video training; the 720p configuration supports 5-second videos only. For 10-second training, ensure that video_frames is set to 161 in both scripts/extract_video_features.sh and scripts/train_480p.sh.
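A mismatch between the two scripts is easy to miss, so a quick check can help. The sketch below is hypothetical and not part of the repository: it assumes the scripts set the value either as video_frames=161 or as a --video_frames 161 argument, and it does not skip commented-out lines.

```python
import re
from pathlib import Path

# Matches "video_frames=161" or "--video_frames 161"; the exact form used
# inside the scripts is an assumption.
VIDEO_FRAMES_RE = re.compile(r"video_frames(?:=|\s+)(\d+)")


def video_frames_values(script_text):
    """Return the set of integer video_frames values found in a script's text."""
    return {int(value) for value in VIDEO_FRAMES_RE.findall(script_text)}


def scripts_agree(script_texts, expected):
    """True if every script sets video_frames, and only to the expected value."""
    return all(video_frames_values(text) == {expected} for text in script_texts)


def check_scripts(paths, expected=161):
    """Read the given script files and check that they all use `expected`."""
    texts = [Path(p).read_text(encoding="utf-8") for p in paths]
    return scripts_agree(texts, expected)
```

For example, check_scripts(["scripts/extract_video_features.sh", "scripts/train_480p.sh"]) returns True only if both files set video_frames to 161 and to no other value.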